https://github.com/apache/spark
Revision d7af1d20f06412f80798c53d8588356ee1490afe authored by Angerszhuuuu on 12 August 2022, 02:52:33 UTC, committed by Wenchen Fan on 12 August 2022, 02:54:09 UTC
### What changes were proposed in this pull request?
`ArrayInterscet` miss judge if null contains in right expression's hash set.

```
>>> a = [1, 2, 3]
>>> b = [3, None, 5]
>>> df = spark.sparkContext.parallelize(data).toDF(["a","b"])
>>> df.show()
+---------+------------+
|        a|           b|
+---------+------------+
|[1, 2, 3]|[3, null, 5]|
+---------+------------+

>>> df.selectExpr("array_intersect(a,b)").show()
+---------------------+
|array_intersect(a, b)|
+---------------------+
|                  [3]|
+---------------------+

>>> df.selectExpr("array_intersect(b,a)").show()
+---------------------+
|array_intersect(b, a)|
+---------------------+
|            [3, null]|
+---------------------+
```

In origin code gen's code path, when handle `ArrayIntersect`'s array1, it use the below code
```
        def withArray1NullAssignment(body: String) =
          if (left.dataType.asInstanceOf[ArrayType].containsNull) {
            if (right.dataType.asInstanceOf[ArrayType].containsNull) {
              s"""
                 |if ($array1.isNullAt($i)) {
                 |  if ($foundNullElement) {
                 |    $nullElementIndex = $size;
                 |    $foundNullElement = false;
                 |    $size++;
                 |    $builder.$$plus$$eq($nullValueHolder);
                 |  }
                 |} else {
                 |  $body
                 |}
               """.stripMargin
            } else {
              s"""
                 |if (!$array1.isNullAt($i)) {
                 |  $body
                 |}
               """.stripMargin
            }
          } else {
            body
          }
```
We have a flag `foundNullElement` to indicate if array2 really contains a null value. But when implement https://issues.apache.org/jira/browse/SPARK-36829, misunderstand the meaning of `ArrayType.containsNull`,
so when implement  `SQLOpenHashSet.withNullCheckCode()`
```
  def withNullCheckCode(
      arrayContainsNull: Boolean,
      setContainsNull: Boolean,
      array: String,
      index: String,
      hashSet: String,
      handleNotNull: (String, String) => String,
      handleNull: String): String = {
    if (arrayContainsNull) {
      if (setContainsNull) {
        s"""
           |if ($array.isNullAt($index)) {
           |  if (!$hashSet.containsNull()) {
           |    $hashSet.addNull();
           |    $handleNull
           |  }
           |} else {
           |  ${handleNotNull(array, index)}
           |}
         """.stripMargin
      } else {
        s"""
           |if (!$array.isNullAt($index)) {
           | ${handleNotNull(array, index)}
           |}
         """.stripMargin
      }
    } else {
      handleNotNull(array, index)
    }
  }
```
The code path of `  if (arrayContainsNull && setContainsNull) `  is misinterpreted that array's openHashSet really have a null value.

In this pr we add a new parameter `additionalCondition ` to complements the previous implementation of `foundNullElement`. Also refactor the method's parameter name.

### Why are the changes needed?
Fix data correct issue

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #37436 from AngersZhuuuu/SPARK-39776-FOLLOW_UP.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dff5c2f2e9ce233e270e0e5cde0a40f682ba9534)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
1 parent 21d9db3
History
Tip revision: d7af1d20f06412f80798c53d8588356ee1490afe authored by Angerszhuuuu on 12 August 2022, 02:52:33 UTC
[SPARK-39976][SQL] ArrayIntersect should handle null in left expression correctly
Tip revision: d7af1d2

README.md

back to top