Revision - a3fef2c - [SPARK-6052][SQL]In JSON schema inference, we should always set [...] - origin: https://github.com/apache/spark

https://github.com/apache/spark

17 June 2024, 00:28:55 UTC

Revision a3fef2c02f93b48c15feec21515567d6fded19f1 authored by Yin Huai on 02 March 2015, 15:18:07 UTC, committed by Cheng Lian on 02 March 2015, 15:18:31 UTC

[SPARK-6052][SQL]In JSON schema inference, we should always set containsNull of an ArrayType to true

Always set `containsNull = true` when inferring the schema of JSON datasets. If we set `containsNull` based only on the records we scanned, we may miss arrays with null values when the schema is inferred from a sample. Also, because future data can contain arrays with null values, always setting `containsNull = true` is the more robust choice when converting JSON data to Parquet.

JIRA: https://issues.apache.org/jira/browse/SPARK-6052

Author: Yin Huai <yhuai@databricks.com>

Closes #4806 from yhuai/jsonArrayContainsNull and squashes the following commits:

05eab9d [Yin Huai] Change containsNull to true.

(cherry picked from commit 3efd8bb6cf139ce094ff631c7a9c1eb93fdcd566)
Signed-off-by: Cheng Lian <lian@databricks.com>
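
The sampling problem described in the commit message can be illustrated with a small, self-contained sketch. This is plain Python, not Spark's implementation: the helper `infer_contains_null`, the field name `tags`, and the three-record dataset are all hypothetical and exist only to show how a sample can miss a null array element.

```python
import json


def infer_contains_null(records, field):
    """Return True if any scanned record has a null element in the array `field`.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    return any(
        element is None
        for record in records
        for element in (record.get(field) or [])
    )


# Hypothetical dataset: only the last record has a null inside the array.
lines = [
    '{"tags": ["a", "b"]}',
    '{"tags": ["c"]}',
    '{"tags": ["d", null]}',
]
records = [json.loads(line) for line in lines]

# A sample that happens to skip the last record never sees the null element.
sample = records[:2]
sampled_contains_null = infer_contains_null(sample, "tags")

# Scanning every record does find it.
full_contains_null = infer_contains_null(records, "tags")

# The approach taken by the commit: do not rely on the scanned data at all.
always_contains_null = True

print(sampled_contains_null, full_contains_null, always_contains_null)
```

Under these assumptions the sampled inference would mark the array as not containing nulls even though the full dataset does, which is the mismatch that a fixed `containsNull = true` avoids.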

1 parent c59871c

File	Mode	Size
assembly
bagel
bin
build
conf
core
data
dev
docker
docs
ec2
examples
external
extras
graphx
mllib
network
project
python
repl
sbin
sbt
sql
streaming
tools
yarn
.gitattributes	-rw-r--r--	40 bytes
.gitignore	-rw-r--r--	962 bytes
.rat-excludes	-rw-r--r--	985 bytes
CONTRIBUTING.md	-rw-r--r--	663 bytes
LICENSE	-rw-r--r--	45.0 KB
NOTICE	-rw-r--r--	22.0 KB
README.md	-rw-r--r--	3.5 KB
make-distribution.sh	-rwxr-xr-x	8.6 KB
pom.xml	-rw-r--r--	58.3 KB
scalastyle-config.xml	-rw-r--r--	7.6 KB
tox.ini	-rw-r--r--	838 bytes
