https://github.com/apache/spark
Revision 8d957d7724d36ce415029d454740352699bcc862 authored by Gengliang Wang on 25 January 2019, 02:24:49 UTC, committed by gatorsmile on 25 January 2019, 02:25:56 UTC
## What changes were proposed in this pull request?

When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:

```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```

The result is supposed to be `null`. However, with the optimization the result is `5`.

The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule was disabled by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.

It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only: Spark can't tell whether a source is empty without actually reading it. This PR disables the optimization by default.

## How was this patch tested?

Unit test

Closes #23635 from gengliangwang/optimizeMetadata.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(cherry picked from commit f5b9370da2745a744f8b2f077f1690e0e7035140)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
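The discrepancy described in the commit message can be illustrated with a toy model. The sketch below is not Spark's implementation: the `Table` type and both function names are hypothetical, and the table is reduced to a mapping from partition value to stored rows. It only shows why answering `MAX(p1)` from partition metadata differs from answering it by reading rows when a partition exists but is empty.

```python
from typing import Dict, List, Optional

# Toy model (an assumption for illustration, not Spark code):
# partition value of p1 -> rows of col1 stored in that partition.
Table = Dict[int, List[int]]


def max_p1_metadata_only(table: Table) -> Optional[int]:
    """Answer MAX(p1) from the partition list alone, never reading rows."""
    return max(table.keys(), default=None)


def max_p1_by_scanning(table: Table) -> Optional[int]:
    """Answer MAX(p1) by reading rows; empty partitions contribute nothing."""
    values = [p1 for p1, rows in table.items() for _ in rows]
    return max(values, default=None)


# Partition p1=5 is registered but holds zero rows, mirroring the
# INSERT ... SELECT ID FROM range(1, 1) statement in the example.
table: Table = {5: []}

print(max_p1_metadata_only(table))  # 5
print(max_p1_by_scanning(table))    # None
```

In this model the two functions agree whenever every partition holds at least one row, and diverge only when an empty partition is present, which matches the case the commit message describes.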
1 parent e8e9b11
Tip revision: 8d957d7724d36ce415029d454740352699bcc862 authored by Gengliang Wang on 25 January 2019, 02:24:49 UTC
[SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly
File | Mode | Size |
---|---|---|
.github | ||
R | ||
assembly | ||
bin | ||
build | ||
common | ||
conf | ||
core | ||
data | ||
dev | ||
docs | ||
examples | ||
external | ||
graphx | ||
hadoop-cloud | ||
launcher | ||
licenses | ||
licenses-binary | ||
mllib | ||
mllib-local | ||
project | ||
python | ||
repl | ||
resource-managers | ||
sbin | ||
sql | ||
streaming | ||
tools | ||
.gitattributes | -rw-r--r-- | 40 bytes |
.gitignore | -rw-r--r-- | 1.3 KB |
CONTRIBUTING.md | -rw-r--r-- | 995 bytes |
LICENSE | -rw-r--r-- | 13.0 KB |
LICENSE-binary | -rw-r--r-- | 20.9 KB |
NOTICE | -rw-r--r-- | 1.5 KB |
NOTICE-binary | -rw-r--r-- | 41.9 KB |
README.md | -rw-r--r-- | 3.9 KB |
appveyor.yml | -rw-r--r-- | 2.2 KB |
pom.xml | -rw-r--r-- | 100.5 KB |
scalastyle-config.xml | -rw-r--r-- | 18.0 KB |