https://github.com/apache/spark
Revision 6b021fe72e91e0b6421cc61330ac169a201c2d39 authored by Bruce Robbins on 29 August 2022, 01:45:21 UTC, committed by Hyukjin Kwon on 29 August 2022, 01:45:21 UTC
Backport of #37513 and its follow-up #37542.

### What changes were proposed in this pull request?

Add code to defensively check if the pre-allocated result array is big enough to handle the next element in a date or timestamp sequence.

### Why are the changes needed?

`InternalSequenceBase.getSequenceLength` is a fast method for estimating the size of the result array. It uses an estimated step size in microseconds, which is not always accurate for the given date/time/time-zone combination. As a result, `getSequenceLength` occasionally overestimates the size of the result array and occasionally underestimates it.

`getSequenceLength` sometimes overestimates the size of the result array when the step size is in months (because `InternalSequenceBase` assumes 28 days per month). This case is handled: `InternalSequenceBase` will slice the array, if needed.
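The overestimation case can be illustrated with a small Python sketch. This is a simplified model, not Spark's code: `estimated_length`, `add_months`, and `month_sequence` are hypothetical names, and the formula `1 + (stop - start) / estimated_step` is an assumption about how the estimate is formed. It only shows why a 28-day month leads to an oversized array that is then sliced.

```python
import calendar
from datetime import date

DAYS_PER_MONTH_ESTIMATE = 28  # the per-month assumption described above


def add_months(d: date, months: int) -> date:
    """Calendar-aware month addition, clamping the day to the month's end."""
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def estimated_length(start: date, stop: date, step_months: int) -> int:
    """Fast estimate (assumed formula): treat every month as 28 days."""
    est_step_days = step_months * DAYS_PER_MONTH_ESTIMATE
    return 1 + (stop - start).days // est_step_days


def month_sequence(start: date, stop: date, step_months: int) -> list:
    """Fill a pre-allocated buffer, then slice off the unused tail."""
    buf = [None] * estimated_length(start, stop, step_months)
    i = 0
    current = start
    while current <= stop:
        buf[i] = current
        i += 1
        # Step from the start value so day clamping does not accumulate.
        current = add_months(start, i * step_months)
    return buf[:i]


start, stop = date(2022, 1, 1), date(2023, 1, 1)
print(estimated_length(start, stop, 1))     # estimate with 28-day months
print(len(month_sequence(start, stop, 1)))  # actual number of elements
```

For this one-year range the estimate is 14 slots while only 13 values are produced, so the final slice discards one unused slot.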

`getSequenceLength` sometimes underestimates the size of the result array when the sequence crosses a DST "spring forward" without a corresponding "fall backward". This case is not handled (thus, this PR).

For example:
```
select sequence(
  timestamp'2022-03-13 00:00:00',
  timestamp'2022-03-14 00:00:00',
  interval 1 day) as x;
```
In the America/Los_Angeles time zone, this results in the following error:
```
java.lang.ArrayIndexOutOfBoundsException: 1
	at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:77)
```
This happens because `InternalSequenceBase` calculates an estimated step size of 24 hours. If you add 24 hours to 2022-03-13 00:00:00 in the America/Los_Angeles time zone, you get 2022-03-14 01:00:00 (because 2022-03-13 has only 23 hours due to "spring forward"). Since 2022-03-14 01:00:00 is later than the specified stop value, `getSequenceLength` assumes the stop value is not included in the result. Therefore, `getSequenceLength` estimates an array size of 1.

However, when actually creating the sequence, `InternalSequenceBase` does not use a step of 24 hours, but of 1 day. When you add 1 day to 2022-03-13 00:00:00, you get 2022-03-14 00:00:00. Now the stop value *is* included, and we overrun the end of the result array.
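The difference between the two steps can be reproduced outside Spark. The following is a minimal Python sketch using the standard `zoneinfo` module (Python 3.9+, with time zone data available on the system). The variable names are illustrative only. Adding 24 hours is modeled as elapsed time on the UTC timeline, and adding 1 day as wall-clock arithmetic.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LA = ZoneInfo("America/Los_Angeles")
start = datetime(2022, 3, 13, 0, 0, tzinfo=LA)
stop = datetime(2022, 3, 14, 0, 0, tzinfo=LA)

# Elapsed-time step: add 24 hours on the UTC timeline, then convert back.
plus_24h = (start.astimezone(timezone.utc) + timedelta(hours=24)).astimezone(LA)

# Calendar step: Python's aware-datetime arithmetic moves the wall-clock
# fields, so this lands on the next local midnight.
plus_1_day = start + timedelta(days=1)

print(plus_24h)            # local time after 24 elapsed hours
print(plus_24h > stop)     # the estimate would treat stop as excluded
print(plus_1_day == stop)  # the calendar step hits stop exactly
```

With the 24-hour step the next value falls one hour past the stop value, while the 1-day step lands exactly on it, which is the mismatch between the length estimate and the generated sequence.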

The new unit test includes examples of problematic date sequences.

This PR adds code to handle the underestimation case: it checks if we're about to overrun the array, and if so, gets a new array that's larger by 1.
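The defensive check can be sketched generically. The Python below is not the Spark implementation: `fill_sequence` and `step_fn` are hypothetical names, the sequence is assumed to be ascending, and the estimated length is passed in so that an undersized estimate can be simulated directly.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def fill_sequence(
    start: T, stop: T, step_fn: Callable[[T], T], estimated_len: int
) -> List[T]:
    """Fill a pre-allocated buffer, growing it by 1 if the estimate was short."""
    buf: List[Optional[T]] = [None] * estimated_len
    i = 0
    current = start
    while current <= stop:
        if i == len(buf):
            # About to overrun: get a new array that is larger by 1
            # and copy the existing elements over.
            bigger: List[Optional[T]] = [None] * (len(buf) + 1)
            bigger[: len(buf)] = buf
            buf = bigger
        buf[i] = current
        i += 1
        current = step_fn(current)
    # Also covers the overestimation case: drop any unused slots.
    return buf[:i]


# Estimate of 1, but two values fit between start and stop (as in the
# DST example, where the estimate excluded the stop value).
print(fill_sequence(0, 1, lambda x: x + 1, estimated_len=1))
```

Growing by exactly one element matches the description above. A general-purpose buffer would more commonly grow geometrically, but here the estimate is expected to be short only by a small amount.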

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.

Closes #37699 from bersprockets/date_sequence_array_size_issue_31.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent ca8cdb2
[SPARK-39184][SQL][3.1] Handle undersized result array in date and timestamp sequences