Revision - 53e2e7b - [SPARK-46189][PS][SQL] Perform comparisons and arithmetic [...] - origin: https://github.com/apache/spark

05 April 2024, 20:24:39 UTC

Revision 53e2e7bdd618e2a7dec5a84b9d5ae965fb136179 authored by Bruce Robbins on 01 December 2023, 02:28:33 UTC, committed by Ruifeng Zheng on 01 December 2023, 02:29:00 UTC

[SPARK-46189][PS][SQL] Perform comparisons and arithmetic between same types in various Pandas aggregate functions to avoid interpreted mode errors

### What changes were proposed in this pull request?

In various Pandas aggregate functions, remove each comparison or arithmetic operation between `DoubleType` and `IntegerType` in `evaluateExpression` and replace it with a comparison or arithmetic operation between `DoubleType` and `DoubleType`.

Affected functions are `PandasStddev`, `PandasVariance`, `PandasSkewness`, `PandasKurtosis`, and `PandasCovar`.

### Why are the changes needed?

These functions fail in interpreted mode. For example, `evaluateExpression` in `PandasKurtosis` compares a double to an integer:
```
If(n < 4, Literal.create(null, DoubleType) ...
```
This results in a boxed double and a boxed integer being passed to `SQLOrderingUtil.compareDoubles`, which expects two doubles as arguments. The Scala runtime tries to unbox the boxed integer as a double, resulting in an error.
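The failure mode described above can be sketched outside Spark. The snippet below is a minimal Java illustration, not Spark's code: `BoxedCompareDemo` and its `compareDoubles` helper are hypothetical stand-ins for a comparator that receives boxed values and unboxes both as doubles. It assumes the unboxing behaves like a plain cast to `java.lang.Double`, which is why a boxed integer is rejected while a boxed double works.

```java
public class BoxedCompareDemo {
    // Hypothetical stand-in for a comparator that expects two boxed doubles.
    // Both arguments arrive as Object, as they would in an interpreted evaluator.
    public static int compareDoubles(Object x, Object y) {
        double a = (Double) x; // ClassCastException if x is a boxed Integer
        double b = (Double) y; // ClassCastException if y is a boxed Integer
        return Double.compare(a, b);
    }

    public static void main(String[] args) {
        Object n = Double.valueOf(6.0);           // the aggregate's count, held as a double
        Object intLiteral = Integer.valueOf(4);   // mixed-type operand (the bug)
        Object doubleLiteral = Double.valueOf(4.0); // same-type operand (the fix)

        try {
            compareDoubles(n, intLiteral);
            System.out.println("mixed types: no error");
        } catch (ClassCastException e) {
            System.out.println("mixed types: " + e.getClass().getSimpleName());
        }

        // Comparing a double with a double unboxes cleanly.
        System.out.println("same types: " + compareDoubles(n, doubleLiteral));
    }
}
```

In this sketch the fix amounts to making both operands doubles before they reach the comparator, which mirrors the approach this change takes in `evaluateExpression`.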

Reproduction example:
```
spark.sql("set spark.sql.codegen.wholeStage=false")
spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

import numpy as np
import pandas as pd

import pyspark.pandas as ps

pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a")
psser = ps.from_pandas(pser)

psser.kurt()
```
See Jira (SPARK-46189) for the other reproduction cases.

This works fine in codegen mode because the integer is already unboxed and the Java runtime will implicitly cast it to a double.
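As a rough illustration of why primitives do not hit the problem, the following Java sketch compares a primitive `double` with a primitive `int`. The class and method names (`WideningDemo`, `lessThan`) are hypothetical and are not taken from Spark's generated code; the only point is that Java widens the `int` operand to `double` before comparing, so no cast between boxed types is involved.

```java
public class WideningDemo {
    // With primitives, binary numeric promotion widens `threshold`
    // from int to double before the comparison is evaluated.
    public static boolean lessThan(double n, int threshold) {
        return n < threshold;
    }

    public static void main(String[] args) {
        System.out.println(lessThan(3.0, 4)); // 3.0 < 4.0
        System.out.println(lessThan(6.0, 4)); // 6.0 < 4.0
    }
}
```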

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44099 from bersprockets/unboxing_error.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
(cherry picked from commit 042d8546be5d160e203ad78a8aa2e12e74142338)
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

1 parent 00bb4ad
