https://github.com/apache/spark
Revision 8d957d7724d36ce415029d454740352699bcc862 authored by Gengliang Wang on 25 January 2019, 02:24:49 UTC, committed by gatorsmile on 25 January 2019, 02:25:56 UTC
## What changes were proposed in this pull request?

When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results:
```
sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
-- range(1, 1) produces zero rows, so partition p1=5 is created but stays empty
sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
sql("SELECT MAX(p1) FROM t")
```
The result is supposed to be `null`. However, with the optimization enabled, the result is `5`.
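The failure mode can be sketched outside Spark. The following is a minimal Python model (all names hypothetical, not Spark APIs) of a metadata-only `MAX(p1)` versus a row-scanning `MAX(p1)` over a table whose only partition is empty:

```python
from typing import Optional

# Hypothetical model of a partitioned table: partition metadata maps each
# partition value to the rows stored under it. Inserting zero rows still
# registers the partition, as "INSERT ... PARTITION (p1 = 5)" does above.
table = {5: []}  # partition p1=5 exists but holds no data rows

def max_p1_metadata_only(partitions: dict) -> Optional[int]:
    # What the metadata-only rewrite effectively does: answer MAX(p1) from
    # the partition listing alone, without reading any data files.
    return max(partitions) if partitions else None

def max_p1_full_scan(partitions: dict) -> Optional[int]:
    # The correct semantics: MAX over the p1 value of every actual row.
    values = [p for p, rows in partitions.items() for _ in rows]
    return max(values) if values else None

print(max_p1_metadata_only(table))  # 5 (wrong: the partition is empty)
print(max_p1_full_scan(table))      # None (matches SQL's NULL result)
```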

The rule was originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule was later disabled by default (https://issues.apache.org/jira/browse/HIVE-15397) due to the same problem.

It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only: Spark can't tell whether a table is empty without actually reading its data. This PR disables the optimization by default.
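For workloads where the queried partitions are known to be non-empty, the rewrite can presumably still be turned back on through its SQLConf flag (assuming the `spark.sql.optimizer.metadataOnly` key this rule has used since it was introduced):

```sql
-- Re-enable the metadata-only optimization (off by default after this change);
-- only safe when no queried partition can be empty.
SET spark.sql.optimizer.metadataOnly = true;
```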

## How was this patch tested?

Unit test

Closes #23635 from gengliangwang/optimizeMetadata.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(cherry picked from commit f5b9370da2745a744f8b2f077f1690e0e7035140)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
[SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly