https://github.com/apache/spark
Revision 6b021fe72e91e0b6421cc61330ac169a201c2d39 authored by Bruce Robbins on 29 August 2022, 01:45:21 UTC, committed by Hyukjin Kwon on 29 August 2022, 01:45:21 UTC
Backport of #37513 and its follow-up #37542.

### What changes were proposed in this pull request?

Add code to defensively check if the pre-allocated result array is big enough to handle the next element in a date or timestamp sequence.

### Why are the changes needed?

`InternalSequenceBase.getSequenceLength` is a fast method for estimating the size of the result array. It uses an estimated step size in micros that is not always accurate for the given date/time/time-zone combination, so `getSequenceLength` occasionally overestimates, and occasionally underestimates, the size of the result array.

`getSequenceLength` sometimes overestimates the size of the result array when the step size is in months (because `InternalSequenceBase` assumes 28 days per month). This case is handled: `InternalSequenceBase` will slice the array, if needed.

`getSequenceLength` sometimes underestimates the size of the result array when the sequence crosses a DST "spring forward" without a corresponding "fall backward". This case is not handled (thus, this PR).

For example:
```
select sequence(
  timestamp'2022-03-13 00:00:00',
  timestamp'2022-03-14 00:00:00',
  interval 1 day) as x;
```
In the America/Los_Angeles time zone, this results in the following error:
```
java.lang.ArrayIndexOutOfBoundsException: 1
	at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:77)
```
This happens because `InternalSequenceBase` calculates an estimated step size of 24 hours. If you add 24 hours to 2022-03-13 00:00:00 in the America/Los_Angeles time zone, you get 2022-03-14 01:00:00 (because 2022-03-13 has only 23 hours due to "spring forward"). Since 2022-03-14 01:00:00 is later than the specified stop value, `getSequenceLength` assumes the stop value is not included in the result. Therefore, `getSequenceLength` estimates an array size of 1.

However, when actually creating the sequence, `InternalSequenceBase` does not use a step of 24 hours, but of 1 day. When you add 1 day to 2022-03-13 00:00:00, you get 2022-03-14 00:00:00. Now the stop value *is* included, and we overrun the end of the result array.
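The mismatch between the two step interpretations can be reproduced outside Spark. Here is a minimal Python sketch (not Spark code) contrasting a 24-hour step in absolute time with a 1-calendar-day step in the America/Los_Angeles time zone:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LA = ZoneInfo("America/Los_Angeles")
start = datetime(2022, 3, 13, 0, 0, tzinfo=LA)
stop = datetime(2022, 3, 14, 0, 0, tzinfo=LA)

# Estimated step: 24 hours of absolute (UTC) time, analogous to what
# getSequenceLength assumes when estimating the array size.
estimated = (start.astimezone(timezone.utc) + timedelta(hours=24)).astimezone(LA)

# Actual step: 1 calendar day in wall-clock time. (Python's arithmetic on
# aware datetimes is wall-clock, so adding timedelta(days=1) keeps the
# local time of day.)
actual = start + timedelta(days=1)

print(estimated)  # 2022-03-14 01:00:00-07:00 -> past stop, so estimated length is 1
print(actual)     # 2022-03-14 00:00:00-07:00 -> equals stop, so 2 elements are produced
```

Because 2022-03-13 has only 23 hours in this zone, the estimated endpoint lands one hour past the stop value while the actual endpoint lands exactly on it, which is the length-1 vs. length-2 discrepancy described above.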

The new unit test includes examples of problematic date sequences.

This PR adds code to handle the underestimation case: it checks whether we are about to overrun the array, and if so, gets a new array that is larger by one element.
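The defensive check can be sketched as follows. This is an illustrative Python model (the hypothetical `make_sequence` helper is not the actual Scala in `InternalSequenceBase`), showing the pattern of testing the write index against the pre-allocated length and growing the array by one before writing:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def make_sequence(start, stop, step_days, est_len):
    """Build [start, start+step, ...] up to stop in a pre-sized array,
    growing it defensively when the fast length estimate undercounts."""
    result = [None] * est_len
    i = 0
    t = start
    while t <= stop:
        if i >= len(result):
            # The defensive check added by this PR: we are about to
            # overrun the array, so get a new array larger by one.
            result = result + [None]
        result[i] = t
        i += 1
        t = t + timedelta(days=step_days)  # wall-clock (calendar-day) step
    # Overestimates (e.g. month steps assuming 28 days) are sliced off.
    return result[:i]

LA = ZoneInfo("America/Los_Angeles")
seq = make_sequence(datetime(2022, 3, 13, tzinfo=LA),
                    datetime(2022, 3, 14, tzinfo=LA),
                    step_days=1, est_len=1)  # undersized estimate, as in the bug
```

With the undersized estimate of 1, the loop grows the array once and returns both elements instead of throwing the equivalent of the `ArrayIndexOutOfBoundsException` shown above.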

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.

Closes #37699 from bersprockets/date_sequence_array_size_issue_31.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent ca8cdb2
[SPARK-39184][SQL][3.1] Handle undersized result array in date and timestamp sequences
.gitignore
*#*#
*.#*
*.iml
*.ipr
*.iws
*.pyc
*.pyo
*.swp
*~
.DS_Store
.bsp/
.cache
.classpath
.ensime
.ensime_cache/
.ensime_lucene
.generated-mima*
.idea/
.idea_modules/
.project
.pydevproject
.scala_dependencies
.settings
/lib/
R-unit-tests.log
R/unit-tests.out
R/cran-check.out
R/pkg/vignettes/sparkr-vignettes.html
R/pkg/tests/fulltests/Rplots.pdf
build/*.jar
build/apache-maven*
build/scala*
build/zinc*
cache
checkpoint
conf/*.cmd
conf/*.conf
conf/*.properties
conf/*.sh
conf/*.xml
conf/java-opts
conf/slaves
dependency-reduced-pom.xml
derby.log
dev/create-release/*final
dev/create-release/*txt
dev/pr-deps/
dist/
docs/_site/
docs/api
sql/docs
sql/site
lib_managed/
lint-r-report.log
log/
logs/
out/
project/boot/
project/build/target/
project/plugins/lib_managed/
project/plugins/project/build.properties
project/plugins/src_managed/
project/plugins/target/
python/lib/pyspark.zip
python/.eggs/
python/deps
python/docs/_site/
python/docs/source/reference/api/
python/test_coverage/coverage_data
python/test_coverage/htmlcov
python/pyspark/python
.mypy_cache/
reports/
scalastyle-on-compile.generated.xml
scalastyle-output.xml
scalastyle.txt
spark-*-bin-*.tgz
spark-tests.log
src_managed/
streaming-tests.log
target/
unit-tests.log
work/
docs/.jekyll-metadata
docs/.jekyll-cache

# For Hive
TempStatsStore/
metastore/
metastore_db/
sql/hive-thriftserver/test_warehouses
warehouse/
spark-warehouse/

# For R session data
.RData
.RHistory
.Rhistory
*.Rproj
*.Rproj.*

.Rproj.user

# For SBT
.jvmopts