Revision - 92a71a6 - [SPARK-20685] Fix BatchPythonEvaluation bug in case of single [...] - origin: https://github.com/apache/spark

05 April 2024, 20:24:39 UTC

Revision 92a71a667dd3e13664015f2a9dd2a39e2c1514eb authored by Josh Rosen on 10 May 2017, 23:50:57 UTC, committed by Xiao Li on 10 May 2017, 23:51:16 UTC

[SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg.

## What changes were proposed in this pull request?

There's a latent corner-case bug in PySpark UDF evaluation: a `BatchPythonEvaluation` containing a single multi-argument UDF in which _at least one argument value is repeated_ crashes at execution time with a confusing error.

This problem was introduced in #12057: the code there has a fast path for handling a "batch UDF evaluation consisting of a single Python UDF", but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs due to de-duplication of repeated arguments which occurred in the JVM before sending UDF inputs to Python).
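A minimal sketch of the failure mode described above, in plain Python rather than the actual PySpark worker code. The names `fast_path`, `unpacking_path`, `row`, and `arg_offsets` are hypothetical, and the assumption is that the JVM sends a de-duplicated row together with per-argument offsets into that row:

```python
def add(x, y):
    return x + y

# Assumed scenario: the UDF is called as add(col, col). After de-duplication
# the input row holds the repeated value only once, and the offsets record
# that both arguments come from position 0.
row = (21,)
arg_offsets = [0, 0]

def fast_path(udf, row):
    # Hypothetical fast path: passes the row straight through, which is only
    # correct when the row layout matches the UDF's parameter list.
    return udf(*row)

def unpacking_path(udf, arg_offsets, row):
    # Hypothetical general path: rebuilds the argument list from the offsets.
    return udf(*[row[o] for o in arg_offsets])

result = unpacking_path(add, arg_offsets, row)

try:
    fast_path(add, row)
    fast_path_error = None
except TypeError as exc:
    # add() receives one positional argument instead of two.
    fast_path_error = exc
```

In this sketch the general path returns the expected sum, while the fast path raises a `TypeError` about a missing argument, which is one plausible shape for the "confusing error" mentioned above.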

The fix is simply to remove this special-casing: the code in the "multiple UDFs" branch also works for the single-UDF case because Python treats `(x)` as equivalent to `x`, not as a single-argument tuple.
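The parenthesization point can be checked in isolation. The lambdas below are an illustrative assumption about the shape of a mapper that wraps UDF calls in parentheses, not a copy of the worker code:

```python
def double(v):
    return v * 2

def add(x, y):
    return x + y

# One call inside parentheses: the parentheses only group, so the mapper
# returns the bare result.
single = lambda a: (double(a[0]))

# Two calls separated by a comma: the comma is what builds a tuple.
multi = lambda a: (double(a[0]), add(a[0], a[1]))

row = (7, 3)
single_result = single(row)
multi_result = multi(row)

# A one-element tuple would need an explicit trailing comma.
one_tuple = (single_result,)
```

Here `single_result` is a plain integer and `multi_result` is a tuple, so a mapper written for several UDFs yields a bare value when only one UDF is present.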

## How was this patch tested?

New regression test in `pyspark.python.sql.tests` module (tested and confirmed that it fails before my fix).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #17927 from JoshRosen/SPARK-20685.

(cherry picked from commit 8ddbc431d8b21d5ee57d3d209a4f25e301f15283)
Signed-off-by: Xiao Li <gatorsmile@gmail.com>

1 parent bdc08ab

File	Mode	Size
.github
R
assembly
bin
build
common
conf
core
data
dev
docs
examples
external
graphx
launcher
licenses
mesos
mllib
mllib-local
project
python
repl
sbin
sql
streaming
tools
yarn
.gitattributes	-rw-r--r--	40 bytes
.gitignore	-rw-r--r--	1.2 KB
.travis.yml	-rw-r--r--	1.7 KB
CONTRIBUTING.md	-rw-r--r--	995 bytes
LICENSE	-rw-r--r--	17.4 KB
NOTICE	-rw-r--r--	24.1 KB
README.md	-rw-r--r--	3.7 KB
appveyor.yml	-rw-r--r--	1.8 KB
pom.xml	-rw-r--r--	98.5 KB
scalastyle-config.xml	-rw-r--r--	16.7 KB
