https://github.com/apache/spark
Revision ee8cafbd0ff36116a212ac99fdf65b24c486cae8 authored by Kris Mok on 27 July 2022, 23:49:33 UTC, committed by Jungtaek Lim on 27 July 2022, 23:49:45 UTC
### What changes were proposed in this pull request?

Update the `UnsafeRow` structural integrity check in `UnsafeRowUtils.validateStructuralIntegrity` to handle a special case with null variable-length DecimalType value.

### Why are the changes needed?

The check should follow the format that `UnsafeRowWriter` produces. In general, `UnsafeRowWriter` zeroes out a field when it is set to null; cf. `UnsafeRowWriter.setNullAt(ordinal)` and `UnsafeRow.setNullAt(ordinal)`.
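The clearing behavior above can be sketched with a minimal model of the fixed-width region of a row (illustrative only; `MiniRow` is a hypothetical name, not Spark's actual implementation):

```java
import java.util.BitSet;

// Minimal model of an UnsafeRow's fixed-width region, showing how
// setNullAt both flips the null bit and zeroes the field's 8-byte slot.
class MiniRow {
    final BitSet nullBits = new BitSet();
    final long[] fields;

    MiniRow(int numFields) {
        fields = new long[numFields];
    }

    void setNullAt(int ordinal) {
        nullBits.set(ordinal);
        fields[ordinal] = 0L; // slot is cleared, like UnsafeRow.setNullAt
    }
}
```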

But there's a special case for `DecimalType` values: it is the only type that is both:
- fixed-length or variable-length, depending on the precision, and
- mutable in `UnsafeRow`.
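The precision threshold that decides the layout can be sketched as follows (the constant mirrors Spark's `Decimal.MAX_LONG_DIGITS`; the helper name is hypothetical):

```java
// Illustrative sketch: how precision decides a Decimal's layout in UnsafeRow.
class DecimalLayout {
    // Mirrors Decimal.MAX_LONG_DIGITS in Spark.
    static final int MAX_LONG_DIGITS = 18;

    // Precisions up to 18 fit in a long and live inline in the 8-byte
    // fixed-width slot; anything larger spills into the variable-length
    // section at the tail end of the row.
    static boolean isFixedLength(int precision) {
        return precision <= MAX_LONG_DIGITS;
    }
}
```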

To keep a variable-length `DecimalType` mutable in `UnsafeRow`, the `UnsafeRowWriter` always reserves a 16-byte space in the variable-length section of the `UnsafeRow` (tail end of the row), regardless of whether the `Decimal` value being written is null or not. The fixed-length part of the field holds an "OffsetAndSize": the `offset` part always points to the start offset of the field's variable-length part, while the `size` part is either `0` for a null value, or `1` to at most `16` for non-null values.
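The packing of that fixed-slot word can be sketched like this (offset in the high 32 bits, size in the low 32 bits, as in `UnsafeRowWriter.setOffsetAndSize`; the class name is hypothetical):

```java
// Illustrative sketch of the "OffsetAndSize" word stored in the
// fixed-width slot of a variable-length field.
class OffsetAndSize {
    static long pack(int offset, int size) {
        return ((long) offset << 32) | (long) size;
    }

    static int offset(long word) {
        return (int) (word >>> 32); // high 32 bits
    }

    static int size(long word) {
        return (int) word; // low 32 bits
    }
}
```

For a null decimal written through `write(int, Decimal, int, int)`, the writer emits the equivalent of `pack(offset, 0)`: the size half is zero, but the word itself is non-zero because the offset still points at the reserved 16-byte slot.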
When `setNullAt(ordinal)` is called instead of passing a null value to `write(int, Decimal, int, int)`, however, the `offset` part gets zeroed out and the field stops being mutable. A comment on `UnsafeRow.setDecimal` mentions that, to keep the field updatable, `setNullAt(ordinal)` must not be called, but nothing in the code enforces that.

So the structural integrity check needs to recognize this case and allow a null variable-length `DecimalType` field to carry a non-zero offsetAndSize.
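The relaxed condition can be sketched as a hypothetical helper (the real logic lives in `UnsafeRowUtils.validateStructuralIntegrity`; names here are illustrative):

```java
// A null variable-length DecimalType field is valid as long as the size
// half of its offsetAndSize word is zero, whether the offset half is
// zero (field cleared via setNullAt) or non-zero (offset preserved for
// mutability by UnsafeRowWriter).
class NullDecimalCheck {
    static boolean isValidNullVarLenDecimal(long offsetAndSize) {
        int size = (int) offsetAndSize; // low 32 bits
        return size == 0;               // offset may be non-zero
    }
}
```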

Note that for non-null values, the existing check does conform to the format produced by `UnsafeRowWriter`. Only a null value in a variable-length `DecimalType` field triggers the bug, which can affect Structured Streaming checkpoint file reads where this check is applied.

### Does this PR introduce _any_ user-facing change?

Yes. Previously, the `UnsafeRow` structural integrity validation returned a false positive for correct data when a variable-length `DecimalType` field contained a null value; with the fix, it no longer does.
Because Structured Streaming checkpoint file validation uses this check, a good checkpoint file could previously be rejected, and the only workaround was to disable the check; with the fix, such a checkpoint file loads correctly.

### How was this patch tested?

Added a new test case in `UnsafeRowUtilsSuite`.

Closes #37252 from rednaxelafx/fix-unsaferow-validation.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit c608ae2fc6a3a50f2e67f2a3dad8d4e4be1aaf9f)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
[SPARK-39839][SQL] Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check