https://github.com/apache/spark
Revision 2fe16015cdd701f395693b4e6bfa72cd101a8b8c authored by tianlzhang on 14 July 2022, 04:49:57 UTC, committed by Wenchen Fan on 14 July 2022, 05:03:45 UTC
Add more checks to `removeProjectBeforeFilter` in `ColumnPruning` so that the Project is kept (not removed) when both of the following hold:

1. the filter condition contains a correlated subquery
2. the same attribute exists in both the output of the Project's child and the subquery

The following is a legitimate self-join query and should not throw an exception when de-duplicating attributes in the subquery and outer values.

```sql
select * from
(
  select v1.a, v1.b, v2.c
  from v1
  inner join v2
  on v1.a=v2.a) t3
where not exists (
  select 1
  from v2
  where t3.a=v2.a and t3.b=v2.b and t3.c=v2.c
)
```

Here's what happens with the current code. The above query is analyzed into the following `LogicalPlan` before `ColumnPruning`.

```
Project [a#250, b#251, c#268]
+- Filter NOT exists#272 [(a#250 = a#266) && (b#251 = b#267) && (c#268 = c#268#277)]
   :  +- Project [1 AS 1#273, _1#259 AS a#266, _2#260 AS b#267, _3#261 AS c#268#277]
   :     +- LocalRelation [_1#259, _2#260, _3#261]
   +- Project [a#250, b#251, c#268]
      +- Join Inner, (a#250 = a#266)
         :- Project [a#250, b#251]
         :  +- Project [_1#243 AS a#250, _2#244 AS b#251]
         :     +- LocalRelation [_1#243, _2#244, _3#245]
         +- Project [a#266, c#268]
            +- Project [_1#259 AS a#266, _3#261 AS c#268]
               +- LocalRelation [_1#259, _2#260, _3#261]
```

Then in `ColumnPruning`, the Project before the Filter (between Filter and Join) is removed. This changes the `outputSet` of the Filter's child, which now contains an attribute (`a#266`) that also exists in the subquery. Later, when `RewritePredicateSubquery` de-duplicates conflicting attributes, it fails with `Found conflicting attributes a#266 in the condition joining outer plan`.

Does this PR introduce any user-facing change? No.

How was this patch tested? Added a unit test.

Closes #37074 from manuzhang/spark-39672.

Lead-authored-by: tianlzhang <tianlzhang@ebay.com>
Co-authored-by: Manu Zhang <OwenZhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 36fc73e7c42b84e05b15b2caecc0f804610dce20)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
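The failure described above comes down to a set-overlap check, which can be sketched outside Spark. The following is a toy model only, not the Catalyst implementation: the `Subquery` class and the functions `has_conflicting_attrs` and `can_remove_project` are hypothetical names introduced for illustration, and attributes are represented as plain strings taken from the plan in the commit message. It assumes the intended rule is to leave the Project in place whenever the Filter condition holds a correlated subquery whose output shares an attribute with the output of the Project's child.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Subquery:
    """Hypothetical stand-in for a subquery expression in a Filter condition."""
    output: FrozenSet[str]      # attributes produced by the subquery plan
    outer_refs: FrozenSet[str]  # outer attributes it references; non-empty means correlated


def has_conflicting_attrs(subqueries: Iterable[Subquery], child_output: FrozenSet[str]) -> bool:
    # A conflict exists when a correlated subquery produces an attribute
    # that the child of the Project also outputs.
    return any(bool(sq.outer_refs) and bool(sq.output & child_output) for sq in subqueries)


def can_remove_project(project_output: FrozenSet[str],
                       child_output: FrozenSet[str],
                       subqueries: Iterable[Subquery]) -> bool:
    # The Project is redundant only if it selects a subset of its child's
    # output AND removing it would not expose a conflicting attribute.
    return project_output <= child_output and not has_conflicting_attrs(subqueries, child_output)


# Attribute sets copied from the plan in the commit message.
project_output = frozenset({"a#250", "b#251", "c#268"})
join_output = frozenset({"a#250", "b#251", "a#266", "c#268"})  # output of the inner Join
exists_subquery = Subquery(
    output=frozenset({"1#273", "a#266", "b#267", "c#268#277"}),
    outer_refs=frozenset({"a#250", "b#251", "c#268"}),
)

# a#266 appears in both the Join output and the subquery, so the Project stays.
keep_project = not can_remove_project(project_output, join_output, [exists_subquery])
```

With these inputs `keep_project` is `True`: dropping the Project would make `a#266` part of the Filter child's output while the subquery also produces `a#266`, which is the conflict reported by `RewritePredicateSubquery`.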
1 parent acf8f66
[SPARK-39672][SQL][3.1] Fix removing project before filter with correlated subquery
File | Mode | Size |
---|---|---|
.github | ||
.idea | ||
R | ||
assembly | ||
bin | ||
binder | ||
build | ||
common | ||
conf | ||
core | ||
data | ||
dev | ||
docs | ||
examples | ||
external | ||
graphx | ||
hadoop-cloud | ||
launcher | ||
licenses | ||
licenses-binary | ||
mllib | ||
mllib-local | ||
project | ||
python | ||
repl | ||
resource-managers | ||
sbin | ||
sql | ||
streaming | ||
tools | ||
.asf.yaml | -rw-r--r-- | 1.1 KB |
.gitattributes | -rw-r--r-- | 130 bytes |
.gitignore | -rw-r--r-- | 2.0 KB |
CONTRIBUTING.md | -rw-r--r-- | 997 bytes |
LICENSE | -rw-r--r-- | 13.1 KB |
LICENSE-binary | -rw-r--r-- | 22.4 KB |
NOTICE | -rw-r--r-- | 2.0 KB |
NOTICE-binary | -rw-r--r-- | 56.5 KB |
README.md | -rw-r--r-- | 4.4 KB |
appveyor.yml | -rw-r--r-- | 2.7 KB |
pom.xml | -rw-r--r-- | 137.4 KB |
scalastyle-config.xml | -rw-r--r-- | 22.0 KB |