https://github.com/apache/spark
Revision 92a71a667dd3e13664015f2a9dd2a39e2c1514eb authored by Josh Rosen on 10 May 2017, 23:50:57 UTC, committed by Xiao Li on 10 May 2017, 23:51:16 UTC
## What changes were proposed in this pull request?

There's a latent corner-case bug in PySpark UDF evaluation: executing a `BatchPythonEvaluation` that contains a single multi-argument UDF in which _at least one argument value is repeated_ crashes at execution time with a confusing error.

This problem was introduced in #12057. The code there has a fast path for handling a "batch UDF evaluation consisting of a single Python UDF", but that branch incorrectly assumes that a single UDF won't have repeated arguments. It therefore skips the code that unpacks arguments from the input row. The schema of that row does not necessarily match the UDF inputs, because repeated arguments are de-duplicated in the JVM before the UDF inputs are sent to Python.
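The mismatch described above can be illustrated with a small standalone sketch. This is not the actual PySpark worker code: `dedup_args`, `call_with_unpacking`, and `call_fast_path` are hypothetical names, and the de-duplication is a simplified stand-in for what happens in the JVM.

```python
# Hypothetical simplification of the JVM/Python hand-off; not PySpark code.

def dedup_args(args):
    """Mimic the JVM side: keep each distinct argument once and
    record, for every UDF parameter, its offset into the input row."""
    unique, offsets = [], []
    for arg in args:
        if arg not in unique:
            unique.append(arg)
        offsets.append(unique.index(arg))
    return unique, offsets


def call_with_unpacking(udf, row, offsets):
    """Rebuild the full argument list from the de-duplicated row."""
    return udf(*[row[o] for o in offsets])


def call_fast_path(udf, row):
    """The flawed shortcut: assume the row already matches the UDF inputs."""
    return udf(*row)


def add(x, y):
    return x + y


# The UDF is invoked as add(a, a): the same column is passed twice.
columns, offsets = dedup_args(["a", "a"])
row = (3,)  # only one value is sent, because column "a" was de-duplicated

print(call_with_unpacking(add, row, offsets))  # 6

try:
    call_fast_path(add, row)
except TypeError as exc:
    # add() receives one positional argument instead of two
    print("fast path failed:", exc)
```

With the unpacking step, the single value in the row is expanded back into both parameters. Without it, the UDF is called with too few arguments, which is one plausible form the confusing error could take.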

The fix is simply to remove this special-casing: the code in the "multiple UDFs" branch also works for the single-UDF case because Python treats `(x)` as equivalent to `x`, not as a single-element tuple.
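A minimal sketch of why the general branch covers the single-UDF case, assuming the mapper is built as a generated lambda whose body is a parenthesized, comma-joined list of UDF calls. `build_mapper` is a hypothetical helper written for illustration, not the function in the PySpark worker.

```python
# Parentheses alone do not create a tuple; a trailing comma does.
x = 42
assert (x) == x
assert not isinstance((x), tuple)
assert isinstance((x,), tuple)


def build_mapper(udfs_with_offsets):
    """Hypothetical helper: build `lambda a: (f0(...), f1(...), ...)`.

    Each entry is (function, offsets), where offsets index into the
    de-duplicated input row `a`.
    """
    calls = []
    env = {}
    for i, (func, offsets) in enumerate(udfs_with_offsets):
        env["f%d" % i] = func
        args = ", ".join("a[%d]" % o for o in offsets)
        calls.append("f%d(%s)" % (i, args))
    source = "lambda a: (%s)" % ", ".join(calls)
    return eval(source, env)


def add(x, y):
    return x + y


def negate(x):
    return -x


# Single UDF with a repeated argument: the body is `(f0(a[0], a[0]))`,
# which evaluates to the bare result rather than a one-element tuple.
single = build_mapper([(add, [0, 0])])
print(single((3,)))  # 6

# Two UDFs: the body is `(f0(a[0], a[0]), f1(a[0]))`, a real tuple.
multi = build_mapper([(add, [0, 0]), (negate, [0])])
print(multi((3,)))  # (6, -3)
```

Because the single-UDF body has no comma inside the parentheses, the same code path returns a plain value for one UDF and a tuple for several, so no separate fast path is needed.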

## How was this patch tested?

New regression test in `pyspark.python.sql.tests` module (tested and confirmed that it fails before my fix).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #17927 from JoshRosen/SPARK-20685.

(cherry picked from commit 8ddbc431d8b21d5ee57d3d209a4f25e301f15283)
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
1 parent bdc08ab
[SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg.