Revision - 68d9f35 - [SPARK-46779][SQL] `InMemoryRelation` instances of the same [...] - origin: https://github.com/apache/spark

visit type:

https://github.com/apache/spark

05 April 2024, 20:24:39 UTC

Revision 68d9f353300ed7de0b47c26cb30236bada896d25 authored by Bruce Robbins on 22 January 2024, 19:09:01 UTC, committed by Dongjoon Hyun on 22 January 2024, 19:09:44 UTC

[SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent

When canonicalizing `output` in `InMemoryRelation`, use `output` itself as the schema for determining the ordinals, rather than `cachedPlan.output`.

`InMemoryRelation.output` and `InMemoryRelation.cachedPlan.output` don't necessarily use the same exprIds. E.g.:
```
+- InMemoryRelation [c1#340, c2#341], StorageLevel(disk, memory, deserialized, 1 replicas)
      +- LocalTableScan [c1#254, c2#255]

```
Because of this, `InMemoryRelation` will sometimes fail to fully canonicalize, resulting in cases where two semantically equivalent `InMemoryRelation` instances appear to be semantically nonequivalent.

Example:
```
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(3, 7),
(4, 5);

cache table data;

select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all;
```
If plan change validation checking is on (i.e., `spark.sql.planChangeValidation=true`), the failure is:
```
[PLAN_VALIDATION_FAILED_RULE_EXECUTOR] The input plan of org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 is invalid: Aggregate: Aggregate [c1#78, scalar-subquery#77 [c1#78]], [c1#78, scalar-subquery#77 [c1#78] AS scalarsubquery(c1)#90L, count(c2#79) AS count(c2)#83L]
...
is not a valid aggregate expression: [SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION] The correlated scalar subquery '"scalarsubquery(c1)"' is neither present in GROUP BY, nor in an aggregate function.
```
If plan change validation checking is off, the failure is more mysterious:
```
[INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
```
If you remove the cache command, the query succeeds.

The above failures happen because the subquery in the aggregate expressions and the subquery in the grouping expressions seem semantically nonequivalent since the `InMemoryRelation` in one of the subquery plans failed to completely canonicalize.

In `CacheManager#useCachedData`, two lookups for the same cached plan may create `InMemoryRelation` instances that have different exprIds in `output`. That's because the plan fragments used as lookup keys  may have been deduplicated by `DeduplicateRelations`, and thus have different exprIds in their respective output schemas. When `CacheManager#useCachedData` creates an `InMemoryRelation` instance, it borrows the output schema of the plan fragment used as the lookup key.

The failure to fully canonicalize has other effects. For example, this query fails to reuse the exchange:
```
create or replace temp view data(c1, c2) as values
(1, 2),
(1, 3),
(2, 4),
(3, 7),
(7, 22);

cache table data;

set spark.sql.autoBroadcastJoinThreshold=-1;
set spark.sql.adaptive.enabled=false;

select *
from data l
join data r
on l.c1 = r.c1;
```

No.

New tests.

No.

Closes #44806 from bersprockets/plan_validation_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit b80e8cb4552268b771fc099457b9186807081c4a)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

1 parent 04d3249

Files
Changes

Permalinks

Tip revision: 68d9f353300ed7de0b47c26cb30236bada896d25 authored by Bruce Robbins on 22 January 2024, 19:09:01 UTC
[SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent

Tip revision: 68d9f35

File	Mode	Size
.github
R
assembly
bin
binder
build
common
conf
connector
core
data
dev
docs
examples
graphx
hadoop-cloud
launcher
licenses
licenses-binary
mllib
mllib-local
project
python
repl
resource-managers
sbin
sql
streaming
tools
.asf.yaml	-rw-r--r--	1.3 KB
.gitattributes	-rw-r--r--	130 bytes
.gitignore	-rw-r--r--	1.8 KB
CONTRIBUTING.md	-rw-r--r--	997 bytes
LICENSE	-rw-r--r--	13.0 KB
LICENSE-binary	-rw-r--r--	22.4 KB
NOTICE	-rw-r--r--	2.0 KB
NOTICE-binary	-rw-r--r--	56.5 KB
README.md	-rw-r--r--	4.5 KB
appveyor.yml	-rw-r--r--	2.8 KB
pom.xml	-rw-r--r--	139.1 KB
scalastyle-config.xml	-rw-r--r--	23.7 KB

Showing with 0 additions and 0 deletions (0 / 0 diffs computed)

Computing file changes ...

https://github.com/apache/spark

[SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent

README.md