https://github.com/apache/spark
Revision 0f2e3ecb9943aec91204c168b6402f3e5de53ca2 authored by Hyukjin Kwon on 05 May 2022, 07:23:28 UTC, committed by Hyukjin Kwon on 05 May 2022, 07:23:43 UTC
### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/33436 and adds a legacy configuration. It was found that the original change can break a valid use case (https://github.com/apache/spark/pull/33436/files#r863271189):

```scala
import org.apache.spark.sql.types._
val ds = Seq("a,", "a,b").toDS
spark.read.schema(
  StructType(
    StructField("f1", StringType, nullable = false) ::
    StructField("f2", StringType, nullable = false) :: Nil)
  ).option("mode", "DROPMALFORMED").csv(ds).show()
```

**Before:**

```
+---+---+
| f1| f2|
+---+---+
|  a|  b|
+---+---+
```

**After:**

```
+---+----+
| f1|  f2|
+---+----+
|  a|null|
|  a|   b|
+---+----+
```

This PR adds a configuration to restore **Before** behaviour.

### Why are the changes needed?

To avoid breaking valid use cases.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) to respect the nullability in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)` when the user-specified schema is provided.
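With the snippet above, enabling this flag before the read restores the **Before** output (a sketch; assumes a running `spark` session, e.g. in `spark-shell`):

```scala
// Legacy behaviour: respect the user-specified nullability, so a non-nullable
// column (f2) receiving null makes the row malformed and DROPMALFORMED drops it.
spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
```

The flag can equivalently be set via `--conf` at submit time or in `spark-defaults.conf`.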

### How was this patch tested?

Unit tests were added.

Closes #36435 from HyukjinKwon/SPARK-35912.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 6689b97ec76abe5bab27f02869f8f16b32530d1a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
[SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)