Revision - 6b021fe - [SPARK-39184][SQL][3.1] Handle undersized result array in date [...] - origin: https://github.com/apache/spark

Revision 6b021fe72e91e0b6421cc61330ac169a201c2d39 authored by Bruce Robbins on 29 August 2022, 01:45:21 UTC, committed by Hyukjin Kwon on 29 August 2022, 01:45:21 UTC

[SPARK-39184][SQL][3.1] Handle undersized result array in date and timestamp sequences

Backport of #37513 and its follow-up #37542.

### What changes were proposed in this pull request?

Add code to defensively check if the pre-allocated result array is big enough to handle the next element in a date or timestamp sequence.

### Why are the changes needed?

`InternalSequenceBase.getSequenceLength` is a fast method for estimating the size of the result array. It uses an estimated step size in micros, which is not always accurate for the date/time/time-zone combination. As a result, `getSequenceLength` occasionally overestimates and occasionally underestimates the size of the result array.

`getSequenceLength` sometimes overestimates the size of the result array when the step size is in months (because `InternalSequenceBase` assumes 28 days per month). This case is handled: `InternalSequenceBase` will slice the array, if needed.
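To illustrate the overestimation case, here is a minimal Java sketch of the idea, not Spark's actual code. The class `MonthStepSliceDemo` and its methods are hypothetical names; the sketch assumes plain calendar dates (no time zone) and models the estimate as one element per 28 days per month of step.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;

// Hypothetical, simplified model of "estimate with 28-day months, then slice".
final class MonthStepSliceDemo {

    // Estimate the element count assuming every month has 28 days.
    static int estimateLength(LocalDate start, LocalDate stop, int stepMonths) {
        long spanDays = ChronoUnit.DAYS.between(start, stop);
        return (int) (1 + spanDays / (28L * stepMonths));
    }

    // Generate the sequence with real calendar months, then trim unused slots.
    static LocalDate[] sequence(LocalDate start, LocalDate stop, int stepMonths) {
        LocalDate[] result = new LocalDate[estimateLength(start, stop, stepMonths)];
        int i = 0;
        LocalDate t = start;
        while (!t.isAfter(stop)) {
            result[i] = t;
            i++;
            t = start.plusMonths((long) i * stepMonths);
        }
        // The estimate was too large if i < result.length: slice the array.
        return i == result.length ? result : Arrays.copyOf(result, i);
    }

    public static void main(String[] args) {
        LocalDate start = LocalDate.of(2022, 1, 1);
        LocalDate stop = LocalDate.of(2023, 1, 1);
        System.out.println(estimateLength(start, stop, 1));  // 14
        System.out.println(sequence(start, stop, 1).length); // 13
    }
}
```

For a one-month step from 2022-01-01 to 2023-01-01, the span is 365 days, so the 28-day assumption yields an estimate of 14 slots, while only 13 elements are produced; the final slice removes the unused slot.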

`getSequenceLength` sometimes underestimates the size of the result array when the sequence crosses a DST "spring forward" without a corresponding "fall backward". This case is not handled (thus, this PR).

For example:
```
select sequence(
  timestamp'2022-03-13 00:00:00',
  timestamp'2022-03-14 00:00:00',
  interval 1 day) as x;
```
In the America/Los_Angeles time zone, this results in the following error:
```
java.lang.ArrayIndexOutOfBoundsException: 1
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:77)
```
This happens because `InternalSequenceBase` calculates an estimated step size of 24 hours. If you add 24 hours to 2022-03-13 00:00:00 in the America/Los_Angeles time zone, you get 2022-03-14 01:00:00 (because 2022-03-13 has only 23 hours due to "spring forward"). Since 2022-03-14 01:00:00 is later than the specified stop value, `getSequenceLength` assumes the stop value is not included in the result. Therefore, `getSequenceLength` estimates an array size of 1.

However, when actually creating the sequence, `InternalSequenceBase` does not use a step of 24 hours, but of 1 day. When you add 1 day to 2022-03-13 00:00:00, you get 2022-03-14 00:00:00. Now the stop value *is* included, and we overrun the end of the result array.
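The difference between the two steps can be reproduced outside Spark with `java.time`. The following is only an illustrative sketch (the class `DstStepDemo` and its methods are hypothetical names): adding a `Duration` of 24 hours moves along the instant timeline, while adding a `Period` of 1 day moves along the local calendar.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.Period;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical demo: 24 hours versus 1 calendar day across "spring forward".
final class DstStepDemo {
    static final ZoneId LA = ZoneId.of("America/Los_Angeles");

    static ZonedDateTime start() {
        return LocalDateTime.of(2022, 3, 13, 0, 0).atZone(LA);
    }

    // Fixed-length step, comparable to the estimated step size in micros.
    static ZonedDateTime plus24Hours() {
        return start().plus(Duration.ofHours(24));
    }

    // Calendar step, comparable to the step used when building the sequence.
    static ZonedDateTime plusOneDay() {
        return start().plus(Period.ofDays(1));
    }

    public static void main(String[] args) {
        System.out.println(plus24Hours().toLocalDateTime()); // 2022-03-14T01:00
        System.out.println(plusOneDay().toLocalDateTime());  // 2022-03-14T00:00
    }
}
```

The first result falls after the stop value of 2022-03-14 00:00:00 and the second lands exactly on it, which is the mismatch between the estimate and the generated sequence.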

The new unit test includes examples of problematic date sequences.

This PR adds code to handle the underestimation case: it checks if we're about to overrun the array, and if so, gets a new array that's larger by 1.
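The defensive check can be sketched as follows. This is a simplified Java model under stated assumptions, not the code in the patch: the class `GrowOnOverrunDemo` and its methods are hypothetical, the estimate is modelled as a fixed 24-hour step, and the sequence is generated with calendar days.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Arrays;

// Hypothetical model of "grow the result array by 1 when about to overrun".
final class GrowOnOverrunDemo {

    // Estimate the element count assuming each day is exactly 24 hours.
    static int estimateLength(ZonedDateTime start, ZonedDateTime stop, int stepDays) {
        long spanSeconds = Duration.between(start, stop).getSeconds();
        long stepSeconds = 24L * 60 * 60 * stepDays;
        return (int) (1 + spanSeconds / stepSeconds);
    }

    static ZonedDateTime[] sequence(ZonedDateTime start, ZonedDateTime stop, int stepDays) {
        ZonedDateTime[] result = new ZonedDateTime[estimateLength(start, stop, stepDays)];
        int i = 0;
        ZonedDateTime t = start;
        while (!t.isAfter(stop)) {
            // Defensive check: the estimate was too small, so get an array larger by 1.
            if (i == result.length) {
                result = Arrays.copyOf(result, i + 1);
            }
            result[i] = t;
            i++;
            t = start.plusDays((long) i * stepDays);
        }
        // Trim in case the estimate was too large instead.
        return i == result.length ? result : Arrays.copyOf(result, i);
    }

    // The example from the description, in the America/Los_Angeles time zone.
    static ZonedDateTime[] losAngelesExample() {
        ZoneId la = ZoneId.of("America/Los_Angeles");
        ZonedDateTime start = LocalDateTime.of(2022, 3, 13, 0, 0).atZone(la);
        ZonedDateTime stop = LocalDateTime.of(2022, 3, 14, 0, 0).atZone(la);
        return sequence(start, stop, 1);
    }

    public static void main(String[] args) {
        System.out.println(losAngelesExample().length); // 2
    }
}
```

In this model the estimate for the example is 1, because the span is only 23 hours, yet two elements are produced; without the check, the second write would overrun the array.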

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.

Closes #37699 from bersprockets/date_sequence_array_size_issue_31.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

1 parent ca8cdb2
