https://github.com/apache/spark
Revision 8d957d7724d36ce415029d454740352699bcc862 authored by Gengliang Wang on 25 January 2019, 02:24:49 UTC, committed by gatorsmile on 25 January 2019, 02:25:56 UTC
## What changes were proposed in this pull request?

When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:
```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
-- range(1, 1) produces zero rows, so partition p1=5 is created but stays empty
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```
The result is supposed to be `null`. However, with the optimization enabled, the result is `5`.
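The failure mode can be sketched outside Spark. The following is a minimal Python model (all names hypothetical, not Spark APIs) of a metadata-only `MAX(p1)` versus a row-scanning `MAX(p1)` over a table whose only partition is empty:

```python
from typing import Optional

# Hypothetical model of a partitioned table: partition metadata maps each
# partition value to the rows stored under it. Inserting zero rows still
# registers the partition, as "INSERT ... PARTITION (p1 = 5)" does above.
table = {5: []}  # partition p1=5 exists but holds no data rows

def max_p1_metadata_only(partitions: dict) -> Optional[int]:
    # What the metadata-only rewrite effectively does: answer MAX(p1) from
    # the partition listing alone, without reading any data files.
    return max(partitions) if partitions else None

def max_p1_full_scan(partitions: dict) -> Optional[int]:
    # The correct semantics: MAX over the p1 value of every actual row.
    values = [p for p, rows in partitions.items() for _ in rows]
    return max(values) if values else None

print(max_p1_metadata_only(table))  # 5 (wrong: the partition is empty)
print(max_p1_full_scan(table))      # None (matches SQL's NULL result)
```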

The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule was later disabled by default (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.

It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only: Spark can't tell whether a table is empty without actually reading its data. This PR disables the optimization by default.
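For workloads where the queried partitions are known to be non-empty, the rewrite can presumably still be turned back on through its SQLConf flag (assuming the `spark.sql.optimizer.metadataOnly` key this rule has used since it was introduced):

```sql
-- Re-enable the metadata-only optimization (off by default after this change);
-- only safe when no queried partition can be empty.
SET spark.sql.optimizer.metadataOnly = true;
```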

## How was this patch tested?

Unit test

Closes #23635 from gengliangwang/optimizeMetadata.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(cherry picked from commit f5b9370da2745a744f8b2f077f1690e0e7035140)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
[SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly