https://github.com/apache/spark
Revision e8866f9fc62095b78421d461549f7eaf8e9070b3 authored by Reynold Xin on 14 December 2016, 20:22:49 UTC, committed by Herman van Hovell on 14 December 2016, 20:23:01 UTC

[SPARK-18853][SQL] Project (UnaryNode) is way too aggressive in estimating statistics
## What changes were proposed in this pull request?
This patch reduces the default element count estimate for arrays and maps from 100 to 1. The problem with 100 is that the defaults compound when types are nested: for an array of maps, 100 * 100 elements would be assumed for a single column. That sounds like a harmless overestimate (overestimating is usually safer than underestimating), but Project estimates its output size by scaling the child's size by the ratio of the new estimated column sizes to the old estimated column sizes. An inflated column size in the denominator makes that ratio far too small, so the overestimate turns into a severe underestimate of the Project's output. In this case it is generally safer to assume 1 element by default.
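
To make the ratio effect concrete, here is a minimal, self-contained Scala sketch of the arithmetic described above. The per-entry sizes, row counts, and all names in it are illustrative assumptions for this example, not Spark's actual internals.

```scala
// A minimal sketch of the size-estimation arithmetic described above.
// Numbers and names are illustrative; this is not Spark's actual code.
object ProjectSizeEstimationSketch {

  // Hypothetical per-entry size for a map<int, long>: 4 + 8 bytes.
  val mapEntrySize: Long = 4 + 8

  // Default size of an array<map<int, long>> column when every collection is
  // assumed to hold `n` elements: n maps, each with n entries.
  def nestedDefaultSize(n: Long): Long = n * n * mapEntrySize

  def main(args: Array[String]): Unit = {
    val keptColumnSize = 8L                     // a long column the Project keeps
    val droppedOld     = nestedDefaultSize(100) // 100 * 100 * 12 = 120,000 bytes
    val droppedNew     = nestedDefaultSize(1)   //   1 *   1 * 12 =      12 bytes

    // Suppose the scan below the Project reports 10 MB of real data.
    val childSizeInBytes = 10L * 1000 * 1000

    // Project scales the child's size by (kept row width / child row width),
    // so an inflated width for the dropped nested column shrinks the result.
    def projectEstimate(droppedColumnSize: Long): Long =
      childSizeInBytes * keptColumnSize / (keptColumnSize + droppedColumnSize)

    println(projectEstimate(droppedOld)) // ~666 bytes: a drastic underestimate
    println(projectEstimate(droppedNew)) // 4,000,000 bytes: far more plausible
  }
}
```

With the old 100-element default, the dropped array-of-map column dominates the assumed row width and the Project's output is estimated at a few hundred bytes even though 10 MB of real data flows through it; with a default of 1, the estimate stays in a plausible range.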

## How was this patch tested?
This should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #16274 from rxin/SPARK-18853.

(cherry picked from commit 5d799473696a15fddd54ec71a93b6f8cb169810c)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>