https://github.com/apache/spark
Revision 36fc73e7c42b84e05b15b2caecc0f804610dce20 authored by tianlzhang on 14 July 2022, 04:49:57 UTC, committed by Wenchen Fan on 14 July 2022, 04:49:57 UTC
### What changes were proposed in this pull request?
Add more checks to `removeProjectBeforeFilter` in `ColumnPruning` so that the Project is kept (not removed) when both of the following hold:
1. the filter condition contains a correlated subquery
2. the same attribute exists in both the output of the Project's child and the output of the subquery
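
The second condition is a set-intersection test between the subquery's output and the output of the Project's child. Below is a minimal sketch in Python, not Spark's Scala implementation: attributes are modeled as plain strings, and the function name `has_conflicting_attrs` is hypothetical. When the test returns `True`, removing the Project would expose the shared attribute to the Filter.

```python
from typing import FrozenSet, Iterable


def has_conflicting_attrs(
    subquery_outputs: Iterable[FrozenSet[str]],
    child_output: FrozenSet[str],
) -> bool:
    """Return True if any subquery in the filter condition outputs an
    attribute that the Project's child also outputs.

    Hypothetical helper: attributes are opaque strings such as "a#266".
    """
    return any(sub & child_output for sub in subquery_outputs)


# Illustrative inputs (made up for this sketch).
no_overlap = has_conflicting_attrs(
    [frozenset({"x#1", "y#2"})], frozenset({"p#3", "q#4"})
)
overlap = has_conflicting_attrs(
    [frozenset({"x#1", "y#2"})], frozenset({"p#3", "y#2"})
)
```

Here `no_overlap` is `False` and `overlap` is `True`; a filter condition without any subquery yields an empty list and therefore `False`.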

### Why are the changes needed?

The following is a legitimate self-join query and should not throw an exception when de-duplicating attributes between the subquery and the outer values.

```sql
select * from
(
select v1.a, v1.b, v2.c
from v1
inner join v2
on v1.a=v2.a) t3
where not exists (
  select 1
  from v2
  where t3.a=v2.a and t3.b=v2.b and t3.c=v2.c
)
```

Here's what happens with the current code. The above query is analyzed into the following `LogicalPlan` before `ColumnPruning`.
```
Project [a#250, b#251, c#268]
+- Filter NOT exists#272 [(a#250 = a#266) && (b#251 = b#267) && (c#268 = c#268#277)]
   :  +- Project [1 AS 1#273, _1#259 AS a#266, _2#260 AS b#267, _3#261 AS c#268#277]
   :     +- LocalRelation [_1#259, _2#260, _3#261]
   +- Project [a#250, b#251, c#268]
      +- Join Inner, (a#250 = a#266)
         :- Project [a#250, b#251]
         :  +- Project [_1#243 AS a#250, _2#244 AS b#251]
         :     +- LocalRelation [_1#243, _2#244, _3#245]
         +- Project [a#266, c#268]
            +- Project [_1#259 AS a#266, _3#261 AS c#268]
               +- LocalRelation [_1#259, _2#260, _3#261]
```

Then in `ColumnPruning`, the Project between the Filter and the Join is removed, so the Join becomes the Filter's child. This changes the `outputSet` of the Filter's child, which now contains `a#266`, an attribute that also appears in the subquery's output. Later, when `RewritePredicateSubquery` de-duplicates conflicting attributes, it fails with `Found conflicting attributes a#266 in the condition joining outer plan`.
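
The effect can be replayed with the attribute sets from the plan above. This is only an illustrative Python sketch: the `name#id` strings are copied from the plan and treated as opaque identifiers, and the variable and function names are hypothetical rather than part of Spark.

```python
# Output of the subquery's top-level Project in the plan above.
subquery_output = {"1#273", "a#266", "b#267", "c#268#277"}

# Filter's child before ColumnPruning: Project [a#250, b#251, c#268].
project_output = {"a#250", "b#251", "c#268"}

# Filter's child after the Project is removed: the Join, whose output is
# the union of its two sides, [a#250, b#251] and [a#266, c#268].
join_output = {"a#250", "b#251"} | {"a#266", "c#268"}


def conflicts(outer: set, subquery: set) -> list:
    """Attributes present in both the outer plan's output and the subquery's output."""
    return sorted(outer & subquery)


before = conflicts(project_output, subquery_output)
after = conflicts(join_output, subquery_output)
print(before)  # []
print(after)   # ['a#266']
```

With the Project in place the two sets are disjoint; once it is removed, `a#266` appears on both sides, which matches the attribute named in the error message.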

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a unit test.

Closes #37074 from manuzhang/spark-39672.

Lead-authored-by: tianlzhang <tianlzhang@ebay.com>
Co-authored-by: Manu Zhang <OwenZhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
1 parent 07f5926
[SPARK-39672][SQL][3.1] Fix removing project before filter with correlated subquery