https://github.com/apache/spark
Revision 0f2e3ecb9943aec91204c168b6402f3e5de53ca2 authored by Hyukjin Kwon on 05 May 2022, 07:23:28 UTC, committed by Hyukjin Kwon on 05 May 2022, 07:23:43 UTC
### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/33436 and adds a legacy configuration. It was found that the original change can break a valid use case (https://github.com/apache/spark/pull/33436/files#r863271189):

```scala
import org.apache.spark.sql.types._
val ds = Seq("a,", "a,b").toDS
spark.read.schema(
  StructType(
    StructField("f1", StringType, nullable = false) ::
    StructField("f2", StringType, nullable = false) :: Nil)
  ).option("mode", "DROPMALFORMED").csv(ds).show()
```

**Before:**

```
+---+---+
| f1| f2|
+---+---+
|  a|  b|
+---+---+
```

**After:**

```
+---+----+
| f1|  f2|
+---+----+
|  a|null|
|  a|   b|
+---+----+
```

This PR adds a configuration to restore **Before** behaviour.

### Why are the changes needed?

To avoid breaking valid use cases.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) to respect the nullability in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)` when the user-specified schema is provided.
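With the snippet above, enabling this flag before the read restores the **Before** output (a sketch; assumes a running `spark` session, e.g. in `spark-shell`):

```scala
// Legacy behaviour: respect the user-specified nullability, so a non-nullable
// column (f2) receiving null makes the row malformed and DROPMALFORMED drops it.
spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
```

The flag can equivalently be set via `--conf` at submit time or in `spark-defaults.conf`.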

### How was this patch tested?

Unit tests were added.

Closes #36435 from HyukjinKwon/SPARK-35912.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 6689b97ec76abe5bab27f02869f8f16b32530d1a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
[SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds)