https://github.com/apache/spark
Revision d7af1d20f06412f80798c53d8588356ee1490afe authored by Angerszhuuuu on 12 August 2022, 02:52:33 UTC, committed by Wenchen Fan on 12 August 2022, 02:54:09 UTC
### What changes were proposed in this pull request? `ArrayInterscet` miss judge if null contains in right expression's hash set. ``` >>> a = [1, 2, 3] >>> b = [3, None, 5] >>> df = spark.sparkContext.parallelize(data).toDF(["a","b"]) >>> df.show() +---------+------------+ | a| b| +---------+------------+ |[1, 2, 3]|[3, null, 5]| +---------+------------+ >>> df.selectExpr("array_intersect(a,b)").show() +---------------------+ |array_intersect(a, b)| +---------------------+ | [3]| +---------------------+ >>> df.selectExpr("array_intersect(b,a)").show() +---------------------+ |array_intersect(b, a)| +---------------------+ | [3, null]| +---------------------+ ``` In origin code gen's code path, when handle `ArrayIntersect`'s array1, it use the below code ``` def withArray1NullAssignment(body: String) = if (left.dataType.asInstanceOf[ArrayType].containsNull) { if (right.dataType.asInstanceOf[ArrayType].containsNull) { s""" |if ($array1.isNullAt($i)) { | if ($foundNullElement) { | $nullElementIndex = $size; | $foundNullElement = false; | $size++; | $builder.$$plus$$eq($nullValueHolder); | } |} else { | $body |} """.stripMargin } else { s""" |if (!$array1.isNullAt($i)) { | $body |} """.stripMargin } } else { body } ``` We have a flag `foundNullElement` to indicate if array2 really contains a null value. But when implement https://issues.apache.org/jira/browse/SPARK-36829, misunderstand the meaning of `ArrayType.containsNull`, so when implement `SQLOpenHashSet.withNullCheckCode()` ``` def withNullCheckCode( arrayContainsNull: Boolean, setContainsNull: Boolean, array: String, index: String, hashSet: String, handleNotNull: (String, String) => String, handleNull: String): String = { if (arrayContainsNull) { if (setContainsNull) { s""" |if ($array.isNullAt($index)) { | if (!$hashSet.containsNull()) { | $hashSet.addNull(); | $handleNull | } |} else { | ${handleNotNull(array, index)} |} """.stripMargin } else { s""" |if (!$array.isNullAt($index)) { | ${handleNotNull(array, index)} |} """.stripMargin } } else { handleNotNull(array, index) } } ``` The code path of ` if (arrayContainsNull && setContainsNull) ` is misinterpreted that array's openHashSet really have a null value. In this pr we add a new parameter `additionalCondition ` to complements the previous implementation of `foundNullElement`. Also refactor the method's parameter name. ### Why are the changes needed? Fix data correct issue ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #37436 from AngersZhuuuu/SPARK-39776-FOLLOW_UP. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit dff5c2f2e9ce233e270e0e5cde0a40f682ba9534) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
1 parent 21d9db3
Tip revision: d7af1d20f06412f80798c53d8588356ee1490afe authored by Angerszhuuuu on 12 August 2022, 02:52:33 UTC
[SPARK-39976][SQL] ArrayIntersect should handle null in left expression correctly
[SPARK-39976][SQL] ArrayIntersect should handle null in left expression correctly
Tip revision: d7af1d2
File | Mode | Size |
---|---|---|
.github | ||
.idea | ||
R | ||
assembly | ||
bin | ||
binder | ||
build | ||
common | ||
conf | ||
core | ||
data | ||
dev | ||
docs | ||
examples | ||
external | ||
graphx | ||
hadoop-cloud | ||
launcher | ||
licenses | ||
licenses-binary | ||
mllib | ||
mllib-local | ||
project | ||
python | ||
repl | ||
resource-managers | ||
sbin | ||
sql | ||
streaming | ||
tools | ||
.asf.yaml | -rw-r--r-- | 1.1 KB |
.gitattributes | -rw-r--r-- | 130 bytes |
.gitignore | -rw-r--r-- | 2.0 KB |
CONTRIBUTING.md | -rw-r--r-- | 997 bytes |
LICENSE | -rw-r--r-- | 13.1 KB |
LICENSE-binary | -rw-r--r-- | 22.4 KB |
NOTICE | -rw-r--r-- | 2.0 KB |
NOTICE-binary | -rw-r--r-- | 56.5 KB |
README.md | -rw-r--r-- | 4.4 KB |
appveyor.yml | -rw-r--r-- | 2.7 KB |
pom.xml | -rw-r--r-- | 137.5 KB |
scalastyle-config.xml | -rw-r--r-- | 22.0 KB |
Computing file changes ...