https://github.com/apache/spark
Revision f3baf086acdf166445aef81181d13d4884d44e92 authored by Deepayan Patra on 17 November 2023, 21:17:43 UTC, committed by Dongjoon Hyun on 17 November 2023, 21:17:43 UTC

[SPARK-43393][SQL][3.5] Address sequence expression overflow bug
### What changes were proposed in this pull request?
Spark has a (long-standing) overflow bug in the `sequence` expression.

Consider the following operations:
```
spark.sql("CREATE TABLE foo (l LONG);")
spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});")
spark.sql("SELECT sequence(0, l) FROM foo;").collect()
```

The result of these operations will be:
```
Array[org.apache.spark.sql.Row] = Array([WrappedArray()])
```
an empty array, an unintended consequence of overflow.

The sequence is applied to values `0` and `Long.MaxValue` with a step size of `1`, which triggers the length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451), effectively `len = (stop - start) / step + 1`. With `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary, looks like:

```
  0111111111111111111111111111111111111111111111111111111111111111
- 0000000000000000000000000000000000000000000000000000000000000000
------------------------------------------------------------------
  0111111111111111111111111111111111111111111111111111111111111111
/ 0000000000000000000000000000000000000000000000000000000000000001
------------------------------------------------------------------
  0111111111111111111111111111111111111111111111111111111111111111
+ 0000000000000000000000000000000000000000000000000000000000000001
------------------------------------------------------------------
  1000000000000000000000000000000000000000000000000000000000000000
```
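
The same wrap-around can be reproduced directly in plain Scala; the following is a minimal illustration of the arithmetic above, not the Spark code itself:

```
// Minimal illustration of the length computation that overflows:
// len = (stop - start) / step + 1
val start = 0L
val stop = Long.MaxValue
val step = 1L

val len = (stop - start) / step + 1
// (Long.MaxValue - 0) / 1 == Long.MaxValue; adding 1 wraps to Long.MinValue.
assert(len == Long.MinValue)
```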

The subsequent [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes because the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The cast to `Int` that follows then [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457), resulting in a computed length of `0` and hence the empty array.
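
Both failure points are easy to see in plain Scala (assuming `MAX_ROUNDED_ARRAY_LENGTH` is `Int.MaxValue - 15`, its value in `ByteArrayMethods`):

```
val len = Long.MinValue
val maxRoundedArrayLength = Int.MaxValue - 15 // ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH

// The bound check passes, since any negative Long compares as <=.
assert(len <= maxRoundedArrayLength)
// The narrowing cast keeps only the low 32 bits of 0x8000000000000000L, i.e. 0.
assert(len.toInt == 0)
```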

Other overflows are similarly problematic.

This PR addresses the issue by checking numeric operations in the length computation for overflow.
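
A minimal sketch of that approach uses the JDK's exact-arithmetic helpers (`Math.subtractExact`/`Math.addExact`), which throw `ArithmeticException` on overflow instead of silently wrapping; the actual patch in `collectionOperations.scala` differs in detail:

```
import java.lang.Math.{addExact, subtractExact}

// Sketch of an overflow-checked length computation. Assumes the caller has
// already validated that step is non-zero and points from start towards stop.
def checkedSequenceLength(start: Long, stop: Long, step: Long): Int = {
  // subtractExact/addExact throw ArithmeticException on overflow.
  // (A fully robust version would also guard the Long.MinValue / -1 division.)
  val len = addExact(subtractExact(stop, start) / step, 1L)
  val maxLength = Int.MaxValue - 15 // ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH
  require(len <= maxLength, s"Too long sequence: $len. Should be <= $maxLength")
  len.toInt
}
```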

### Why are the changes needed?
There is a correctness bug from overflow in the `sequence` expression.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tests added in `CollectionExpressionsSuite.scala`.
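
For reference, a hypothetical test in the spirit of those additions (names and assertions here are illustrative, not the exact suite code) could check that the overflowing case now raises an error instead of returning an empty array:

```
test("SPARK-43393: sequence with Long.MaxValue bound should not overflow silently") {
  // Before the fix this evaluated to an empty array; it should now fail.
  intercept[Exception] {
    spark.sql(s"SELECT sequence(0L, ${Long.MaxValue}L)").collect()
  }
}
```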

Closes #43820 from thepinetree/spark-sequence-overflow-3.5.

Authored-by: Deepayan Patra <deepayan.patra@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>