https://github.com/apache/spark
Revision 660a9f845f954b4bf2c3a7d51988b33ae94e3207 authored by Ivan Sadikov on 02 May 2022, 23:30:05 UTC, committed by Hyukjin Kwon on 02 May 2022, 23:32:40 UTC
[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion

### What changes were proposed in this pull request?

This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084, where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure.
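For reference, a minimal sketch of the failing call pattern (the data here is illustrative; the specific dataset that triggers the crash is in the JIRA ticket):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Illustrative data only; SPARK-39084 has the specific dataset that
# triggers the crash on affected versions.
df = spark.range(100000).selectExpr("id", "id % 3 AS key")

# Evaluating the DataFrame as an RDD involves a Python worker. On affected
# versions, the Python iterator could keep consuming data after the Java
# task had already completed, which is what isEmpty() exposed as a crash.
print(df.rdd.isEmpty())
```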

The issue was due to the Python iterator not being synchronised with the Java iterator: when the task completes, the Python iterator continues to process data. We introduced `ContextAwareIterator` as part of https://issues.apache.org/jira/browse/SPARK-33277, but we did not fix all of the places where it should be used.
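The fix ties the iterator to the task lifecycle. The real `ContextAwareIterator` is Scala code on the JVM side; below is an illustrative Python sketch of the same pattern, where `context.is_completed()` is a hypothetical stand-in for `TaskContext.isCompleted()`:

```python
class ContextAwareIterator:
    """Wraps a delegate iterator and stops yielding once the owning task
    completes, so a consumer that lags behind task teardown cannot touch
    resources the task has already released."""

    def __init__(self, context, delegate):
        self.context = context    # hypothetical task-state handle
        self.delegate = delegate  # the underlying data iterator

    def __iter__(self):
        return self

    def __next__(self):
        # Re-check the task state on every element, not just once up front.
        if self.context.is_completed():
            raise StopIteration
        return next(self.delegate)
```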

### Why are the changes needed?

Fixes the JVM crash when checking `isEmpty()` on a dataset.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a test case that deterministically reproduces the issue. I confirmed that the test fails without the fix and passes with it.

Closes #36425 from sadikovi/fix-pyspark-iter-2.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>