https://github.com/apache/spark

44cc676 [SPARK-47111][SQL][TESTS][3.5] Upgrade `PostgreSQL` JDBC driver to 42.7.2 and docker image to 16.2 ### What changes were proposed in this pull request? This PR aims to upgrade `PostgreSQL` JDBC driver and docker images. - JDBC Driver: `org.postgresql:postgresql` from 42.7.0 to 42.7.2 - Docker Image: `postgres` from `15.1-alpine` to `16.2-alpine` ### Why are the changes needed? To use the latest PostgreSQL combination in the following integration tests. - PostgresIntegrationSuite - PostgresKrbIntegrationSuite - v2/PostgresIntegrationSuite - v2/PostgresNamespaceSuite ### Does this PR introduce _any_ user-facing change? No. This is a pure test-environment update. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45899 from dongjoon-hyun/SPARK-47111. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2024, 18:56:34 UTC
d2cee4d [SPARK-46411][BUILD][3.5][FOLLOWUP] Fix pom.xml file in common/network-common module ### What changes were proposed in this pull request? This PR aims to fix `common/network-common/pom.xml`. ### Why are the changes needed? To fix the cherry-pick mistake. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45897 from dongjoon-hyun/SPARK-46411. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2024, 17:46:41 UTC
dec1e74 [SPARK-46411][BUILD] Change to use `bcprov/bcpkix-jdk18on` for UT This PR migrates the test dependency `bcprov/bcpkix` from `jdk15on` to `jdk18on`, and upgrades the version from 1.70 to 1.77, the `jdk18on` jars are compiled to work with anything from Java 1.8 up. The full release notes as follows: - https://www.bouncycastle.org/releasenotes.html#r1rv77 No, just for test. Pass GitHub Actions. No Closes #44359 from LuciferYang/bouncycastle-177. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2024, 16:08:06 UTC
3cb6a44 [SPARK-47734][PYTHON][TESTS] Fix flaky DataFrame.writeStream doctest by stopping streaming query ### What changes were proposed in this pull request? This PR deflakes the `pyspark.sql.dataframe.DataFrame.writeStream` doctest. PR https://github.com/apache/spark/pull/45298 aimed to fix that test but misdiagnosed the root issue. The problem is not that concurrent tests were colliding on a temporary directory. Rather, the issue is specific to the `DataFrame.writeStream` test's logic: that test starts a streaming query that writes files to the temporary directory, then exits the temp directory context manager without first stopping the streaming query. That creates a race condition where the context manager might be deleting the directory while the streaming query is writing new files into it, leading to the following type of error during cleanup: ``` File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in pyspark.sql.dataframe.DataFrame.writeStream Failed example: with tempfile.TemporaryDirectory() as d: # Create a table with Rate source. df.writeStream.toTable( "my_table", checkpointLocation=d) Exception raised: Traceback (most recent call last): File "/usr/lib/python3.11/doctest.py", line 1353, in __run exec(compile(example.source, filename, "single", File "<doctest pyspark.sql.dataframe.DataFrame.writeStream[3]>", line 1, in <module> with tempfile.TemporaryDirectory() as d: File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__ self.cleanup() File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors) File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree _rmtree(name, onerror=onerror) File "/usr/lib/python3.11/shutil.py", line 738, in rmtree onerror(os.rmdir, path, sys.exc_info()) File "/usr/lib/python3.11/shutil.py", line 736, in rmtree os.rmdir(path, dir_fd=dir_fd) OSError: [Errno 39] Directory not empty: '/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq' ``` In this PR, I update the doctest to properly stop the streaming query. ### Why are the changes needed? Fix flaky test. ### Does this PR introduce _any_ user-facing change? No, test-only. Small user-facing doc change, but one that is consistent with other doctest examples. ### How was this patch tested? Manually ran updated test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45885 from JoshRosen/fix-flaky-writestream-doctest. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 0107435cb39d68eccf8a6900c3c781665deca38b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 April 2024, 02:14:53 UTC
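The same stop-before-cleanup pattern in Scala terms, as a minimal sketch (the actual fix is in the Python doctest; the memory sink, query name, and checkpoint path below are illustrative assumptions, and the snippet assumes a spark-shell session):

```scala
// Stop the streaming query before its checkpoint/output directory is removed,
// so directory cleanup cannot race with in-flight writes.
val query = spark.readStream.format("rate").load()
  .writeStream
  .format("memory")                                       // hypothetical sink for this sketch
  .queryName("rate_sketch")                               // hypothetical query name
  .option("checkpointLocation", "/tmp/rate-sketch-ckpt")  // hypothetical checkpoint path
  .start()
try {
  query.processAllAvailable()  // let at least one batch complete
} finally {
  query.stop()                 // stop first; only then is it safe to delete the directory
}
```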
465a853 [SPARK-47568][SS][3.5] Fix race condition between maintenance thread and load/commit for snapshot files Backports https://github.com/apache/spark/pull/45724 to 3.5 ### What changes were proposed in this pull request? This PR fixes a race condition between the maintenance thread and task thread when change-log checkpointing is enabled, and ensure all snapshots are valid. 1. The maintenance thread currently relies on class variable lastSnapshot to find the latest checkpoint and uploads it to DFS. This checkpoint can be modified at commit time by Task thread if a new snapshot is created. 2. The task thread was not resetting the lastSnapshot at load time, which can result in newer snapshots (if a old version is loaded) being considered valid and uploaded to DFS. This results in VersionIdMismatch errors. ### Why are the changes needed? These are logical bugs which can cause `VersionIdMismatch` errors causing user to discard the snapshot and restart the query. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test cases. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45881 from sahnib/rocks-db-fix-3.5. Authored-by: Bhuwan Sahni <bhuwan.sahni@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 April 2024, 01:51:24 UTC
2dda441 [SPARK-47666][SQL][3.5] Fix NPE when reading mysql bit array as LongType ### What changes were proposed in this pull request? This PR fixes NPE when reading mysql bit array as LongType ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45792 from yaooqinn/PR_TOOL_PICK_PR_45790_BRANCH-3.5. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 02 April 2024, 10:16:35 UTC
fed3eb5 [SPARK-47676][BUILD] Clean up the removed `VersionsSuite` references ### What changes were proposed in this pull request? This PR aims to clean up the removed `VersionsSuite` reference. ### Why are the changes needed? At Apache Spark 3.3.0, `VersionsSuite` is removed via SPARK-38036 . - https://github.com/apache/spark/pull/35335 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45800 from dongjoon-hyun/SPARK-47676. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 128f74b055d3f290003f42259ffa23861eaa69e1) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 April 2024, 23:49:47 UTC
2da520e [SPARK-45593][BUILD][3.5] Correct relocation connect guava dependency ### What changes were proposed in this pull request? This PR aims to correct the relocation of the Connect Guava dependency and remove the duplicate connect-common jar from the SBT build jars. This PR cherry-picks https://github.com/apache/spark/pull/43436 and https://github.com/apache/spark/pull/44801 as a backport to the 3.5 branch. ### Why are the changes needed? Bugfix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Follow the steps described at https://github.com/apache/spark/pull/43195#issue-1921234067 to test manually. In addition, will continue to observe the GA situation in recent days. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45775 from Yikf/branch-3.5. Authored-by: yikaifei <yikaifei@apache.org> Signed-off-by: yangjie01 <yangjie01@baidu.com> 01 April 2024, 07:35:23 UTC
e049a6c [SPARK-47646][SQL] Make try_to_number return NULL for malformed input ### What changes were proposed in this pull request? This PR proposes to add NULL check after parsing the number so the output can be safely null for `try_to_number` expression. ```scala import org.apache.spark.sql.functions._ val df = spark.createDataset(spark.sparkContext.parallelize(Seq("11"))) df.select(try_to_number($"value", lit("$99.99"))).show() ``` ``` java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.types.Decimal.toPlainString()" because "<local7>" is null at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:894) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:894) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:368) at org.apache.spark.rdd.RDD.iterator(RDD.scala:332) ``` ### Why are the changes needed? To fix the bug, and let `try_to_number` return `NULL` for malformed input as designed. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug. Previously, `try_to_number` failed with NPE. ### How was this patch tested? Unittest was added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45771 from HyukjinKwon/SPARK-47646. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit d709e20066becf15adf5aa35e1bdd8eecf500b4b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 March 2024, 08:38:22 UTC
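For reference, a sketch of the expected behavior after the fix, adapted from the repro above (assumes a spark-shell session; the stated result is an expectation, not a captured output):

```scala
import org.apache.spark.sql.functions._

// "11" does not match the '$99.99' format, so the input is malformed for this format.
val df = Seq("11").toDF("value")
// With the fix, this returns a single row containing NULL instead of failing
// with a NullPointerException at execution time.
df.select(try_to_number($"value", lit("$99.99"))).show()
```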
edae8ed [SPARK-47636][K8S][3.5] Use Java `17` instead of `17-jre` image in K8s Dockerfile ### What changes were proposed in this pull request? This PR aims to use Java 17 instead of 17-jre in the K8s Dockerfile. ### Why are the changes needed? Since Apache Spark 3.5.0, SPARK-44153 started to use `jmap` like the following. https://github.com/apache/spark/blob/c832e2ac1d04668c77493577662c639785808657/core/src/main/scala/org/apache/spark/util/Utils.scala#L2030 ``` $ docker run -it --rm eclipse-temurin:17-jre jmap /__cacert_entrypoint.sh: line 30: exec: jmap: not found ``` ``` $ docker run -it --rm eclipse-temurin:17 jmap | head -n2 Usage: jmap -clstats <pid> ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45762 from dongjoon-hyun/SPARK-47636. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 March 2024, 23:34:25 UTC
8bcbf77 [SPARK-47561][SQL] Fix analyzer rule order issues about Alias ### What changes were proposed in this pull request? We found two analyzer rule execution order issues in our internal workloads: - `CreateStruct.apply` creates `NamePlaceholder` for unresolved `NamedExpression`. However, with certain rule execution order, the `NamedExpression` may be removed (e.g. remove unnecessary `Alias`) before `NamePlaceholder` is resolved, then `NamePlaceholder` can't be resolved anymore. - UNPIVOT uses `UnresolvedAlias` to wrap `UnresolvedAttribute`. There is a conflict about how to determine the final alias name. If `ResolveAliases` runs first, then `UnresolvedAlias` will be removed and eventually the alias will be `b` for nested column `a.b`. If `ResolveReferences` runs first, then we resolve `a.b` first and then `UnresolvedAlias` will determine the alias as `a.b` not `b`. This PR fixes the two issues - `CreateStruct.apply` should determine the field name immediately if the input is `Alias` - The parser rule for UNPIVOT should follow how we parse SELECT and return `UnresolvedAttribute` directly without the `UnresolvedAlias` wrapper. It's a bit risky to fix the order issue between `ResolveAliases` and `ResolveReferences` as it can change the final query schema, we will save it for later. ### Why are the changes needed? fix unstable analyzer behavior with different rule execution orders. ### Does this PR introduce _any_ user-facing change? Yes, some failed queries can run now. The issue for UNPIVOT only affects the error message. ### How was this patch tested? verified by our internal workloads. The repro query is quite complicated to trigger a certain rule execution order so we won't add tests for it. The fix is quite obvious. ### Was this patch authored or co-authored using generative AI tooling? no Closes #45718 from cloud-fan/rule. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 March 2024, 23:50:16 UTC
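A surface-level illustration of the first issue's territory (assumes a spark-shell session; the columns are made up, and the underlying bug only manifested under a specific analyzer rule execution order): the struct field name should come from the alias and stay stable even if the alias node is later removed.

```scala
import org.apache.spark.sql.functions._

val df = Seq((1, 2)).toDF("a", "b")
// The field built from the aliased column is named by the alias, i.e. s: struct<x, b>.
df.select(struct($"a".as("x"), $"b").as("s")).printSchema()
```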
9ad7b75 [SPARK-47537][SQL][3.5] Fix error data type mapping on MySQL Connector/J ### What changes were proposed in this pull request? This PR fixes: - BIT(n>1) is wrongly mapped to boolean instead of long for MySQL Connector/J. This is because we only have a case branch for Maria Connector/J. - MySQL Docker Integration Tests were using Maria Connector/J, not MySQL Connector/J ### Why are the changes needed? Bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45690 from yaooqinn/SPARK-47537-B. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 March 2024, 15:50:00 UTC
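A hedged sketch of how the corrected mapping surfaces to users (the connection URL, credentials, and table are made up; it needs a reachable MySQL instance with MySQL Connector/J on the classpath, and assumes a spark-shell session):

```scala
// After the fix, a BIT(8) column read through MySQL Connector/J resolves to
// LongType, matching the existing MariaDB Connector/J behavior.
val flags = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")  // assumed URL
  .option("driver", "com.mysql.cj.jdbc.Driver")       // MySQL Connector/J
  .option("dbtable", "flags")                         // hypothetical table with a BIT(8) column
  .option("user", "root")                             // assumed credentials
  .option("password", "secret")
  .load()
flags.printSchema()  // the BIT(8) column is expected to appear as bigint (LongType)
```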
2016db6 [SPARK-47503][SQL][3.5] Make makeDotNode escape graph node name always ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/45640. To prevent corruption of the dot file, a node name should be escaped even if there are no metrics to display. ### Why are the changes needed? This PR fixes a bug in the Spark History Server, which fails to display the query for a cached JDBC relation named in quotes. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45684 from alex35736/branch-3.5. Authored-by: alexey <seko.alexsey13@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 March 2024, 22:17:10 UTC
30cb7ed [SPARK-47521][CORE] Use `Utils.tryWithResource` during reading shuffle data from external storage ### What changes were proposed in this pull request? In the method FallbackStorage.open, the file open is guarded by Utils.tryWithResource to avoid file handle leakage in case of failure during read. ### Why are the changes needed? To avoid file handle leakage in case of read failure. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs ### Was this patch authored or co-authored using generative AI tooling? No Closes #45663 from maheshk114/SPARK-47521. Authored-by: maheshbehera <maheshbehera@microsoft.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 245669053a34cb1d4a84689230e5bd1d163be5c6) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 March 2024, 17:45:03 UTC
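The guard pattern in question, as a self-contained sketch (a simplified stand-in for Spark's `Utils.tryWithResource`, not the actual `FallbackStorage` code; the file is hypothetical):

```scala
import java.io.{File, FileInputStream}

// The resource is closed even if the body throws, so a failed read cannot leak
// the file handle.
def tryWithResource[R <: AutoCloseable, T](createResource: => R)(f: R => T): T = {
  val resource = createResource
  try f(resource) finally resource.close()
}

// Usage shape analogous to guarding the open in FallbackStorage.open:
def readFirstByte(file: File): Int =
  tryWithResource(new FileInputStream(file)) { in => in.read() }
```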
91d1156 [SPARK-47440][SQL] Fix pushing unsupported syntax to MsSqlServer ### What changes were proposed in this pull request? In this PR, I propose a change in the SQLQuery builder of the MsSqlServer dialect. I override the build method to check for boolean operators in binary comparisons and throw an exception if one is encountered. ### Why are the changes needed? MsSqlServer syntax does not allow boolean operators in binary comparisons, because MsSqlServer lacks a boolean data type. It was possible to construct a Spark query that would generate such a comparison, and the engine would throw a syntax exception on the MsSqlServer side. This PR fixes that bug. For example, in table `people` there is a `name` column. In MsSqlServer, if we try to execute `SELECT * FROM people WHERE (name LIKE 'a%') = (name LIKE '%b')` we would get a syntax error. However, this query is fine in other major engines. ### Does this PR introduce _any_ user-facing change? Yes, users will no longer encounter a syntax exception from MsSqlServer when writing these queries. ### How was this patch tested? By running a unit test in MsSqlServerIntegrationSuite ### Was this patch authored or co-authored using generative AI tooling? No Closes #45564 from stefanbuk-db/SQLServer_like_operator_bugfix. Authored-by: Stefan Bukorovic <stefan.bukorovic@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 227a50a1766ac1476b0031e1c60d2604eccdb9a7) Signed-off-by: Kent Yao <yao@apache.org> 22 March 2024, 10:44:46 UTC
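A sketch of how such a predicate can arise from the DataFrame API, reusing the commit's own example query (the JDBC URL and credentials are made up, a reachable SQL Server instance is required, and a spark-shell session is assumed):

```scala
// The filter compares two boolean expressions, which Spark accepts but SQL Server
// cannot execute; after the fix this comparison is evaluated in Spark rather than
// being pushed down as invalid T-SQL.
val people = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://localhost:1433;databaseName=test")  // assumed URL
  .option("dbtable", "people")                                         // table from the example
  .option("user", "sa")                                                // assumed credentials
  .option("password", "secret")
  .load()
people.filter("(name LIKE 'a%') = (name LIKE '%b')").show()
```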
e57a7d0 [SPARK-47462][SQL][FOLLOWUP][3.5] Add migration guide for TINYINT mapping changes ### What changes were proposed in this pull request? Add migration guide for TINYINT type mapping changes ### Why are the changes needed? behavior change doc ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? doc build ### Was this patch authored or co-authored using generative AI tooling? no Closes #45658 from yaooqinn/SPARK-47462-FB. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 March 2024, 06:06:03 UTC
1fe3962 [SPARK-47398][SQL] Extract a trait for InMemoryTableScanExec to allow for extending functionality ### What changes were proposed in this pull request? We are proposing to allow the users to have a custom `InMemoryTableScanExec`. To accomplish this we can follow the same convention we followed for `ShuffleExchangeLike` and `BroadcastExchangeLike` ### Why are the changes needed? In the PR added by ulysses-you, we are wrapping `InMemoryTableScanExec` inside `TableCacheQueryStageExec`. This could potentially cause problems, especially in the RAPIDS Accelerator for Apache Spark, where we replace `InMemoryTableScanExec` with a customized version that has optimizations needed by us. This situation could lead to the loss of benefits from [SPARK-42101](https://issues.apache.org/jira/browse/SPARK-42101) or even result in Spark throwing an Exception. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran the existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45525 from razajafri/extract-inmem-trait. Authored-by: Raza Jafri <rjafri@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit 6a27789ad7d59cd133653a49be0bb49729542abe) Signed-off-by: Thomas Graves <tgraves@apache.org> 21 March 2024, 19:46:44 UTC
203f943 [SPARK-47507][BUILD][3.5] Upgrade ORC to 1.9.3 ### What changes were proposed in this pull request? This PR aims to upgrade ORC to 1.9.3 for Apache Spark 3.5.2. ### Why are the changes needed? Apache ORC 1.9.3 is the latest maintenance release. To bring the latest bug fixes, we had better upgrade. - https://orc.apache.org/news/2024/03/20/ORC-1.9.3/ - https://github.com/apache/orc/pull/1692 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45646 from dongjoon-hyun/SPARK-47507. Lead-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 March 2024, 19:16:25 UTC
e17fdba [MINOR][CORE] Fix a comment typo `slf4j-to-jul` to `jul-to-slf4j` ### What changes were proposed in this pull request? This PR aims to fix a typo `slf4j-to-jul` to `jul-to-slf4j`. There exists only one. ``` $ git grep slf4j-to-jul common/utils/src/main/scala/org/apache/spark/internal/Logging.scala: // slf4j-to-jul bridge order to route their logs to JUL. ``` Apache Spark uses `jul-to-slf4j` which includes a `java.util.logging` (jul) handler, namely `SLF4JBridgeHandler`, which routes all incoming jul records to the SLF4j API. https://github.com/apache/spark/blob/bb3e27581887a094ead0d2f7b4a6b2a17ee84b6f/pom.xml#L735 ### Why are the changes needed? This typo was there since Apache Spark 1.0.0. ### Does this PR introduce _any_ user-facing change? No, this is a comment fix. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45625 from dongjoon-hyun/jul-to-slf4j. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bb0867f54d437f6467274e854506aea2900bceb1) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 March 2024, 05:01:44 UTC
430a407 [SPARK-47494][DOC] Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 ### What changes were proposed in this pull request? Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 ### Why are the changes needed? Show the behavior change to users. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It's just doc change ### Was this patch authored or co-authored using generative AI tooling? Yes, there are some doc suggestion from copilot in docs/sql-migration-guide.md Closes #45623 from gengliangwang/SPARK-47494. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 11247d804cd370aaeb88736a706c587e7f5c83b3) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 March 2024, 22:17:30 UTC
9baf82b [SPARK-47481][INFRA][3.5] Fix Python linter ### What changes were proposed in this pull request? The pr aims to fix `python linter issue` on `branch-3.5` through pinning `matplotlib==3.7.2` ### Why are the changes needed? Fix `python linter issue` on `branch-3.5`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45550 from panbingkun/branch-3.5_scheduled_job. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 March 2024, 14:19:29 UTC
df2fddd [SPARK-47473][SQL] Fix correctness issue of converting postgres INFINITY timestamps This PR fixes a bug related to #41843 where epoch seconds were used instead of epoch milliseconds to create a timestamp value. bugfix no revised tests no Closes #45599 from yaooqinn/SPARK-47473. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit ad8ac17dbdfa763236ab3303eac6a3115ba710cc) Signed-off-by: Kent Yao <yao@apache.org> 20 March 2024, 07:59:50 UTC
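The arithmetic behind this kind of correctness issue, illustrated with the plain JVM time API (the constant is hypothetical; this is not the dialect code itself):

```scala
import java.time.Instant

val millis = 1710000000000L            // a timestamp expressed in epoch milliseconds
Instant.ofEpochMilli(millis)           // the intended instant
Instant.ofEpochSecond(millis)          // treating millis as seconds lands roughly 1000x too far in the future
Instant.ofEpochSecond(millis / 1000)   // equivalent to the correct conversion
```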
8fcd9a1 [SPARK-47455][BUILD] Fix resource leak during the initialization of `scalaStyleOnCompileConfig` in `SparkBuild.scala` ### What changes were proposed in this pull request? https://github.com/apache/spark/blob/e01ed0da22f24204fe23143032ff39be7f4b56af/project/SparkBuild.scala#L157-L173 `Source.fromFile(in)` opens a `BufferedSource` resource handle but does not close it; this PR fixes the issue. ### Why are the changes needed? Close the resource after use. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #45582 from LuciferYang/SPARK-47455. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 85bf7615f85eea3e9192a7684ef711cf44042e05) Signed-off-by: yangjie01 <yangjie01@baidu.com> 20 March 2024, 07:19:50 UTC
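The shape of the leak and of the fix, as a minimal sketch (the path is hypothetical; the real change lives in `scalaStyleOnCompileConfig` in `SparkBuild.scala`):

```scala
import scala.io.Source

val in = "scalastyle-config.xml"   // hypothetical input path
val source = Source.fromFile(in)   // opens a BufferedSource handle
val contents =
  try source.mkString              // consume the file...
  finally source.close()           // ...and close the handle instead of leaking it
```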
3857c16 [SPARK-47435][SPARK-45561][SQL][3.5] Fix overflow issue of MySQL UNSIGNED TINYINT caused by ### What changes were proposed in this pull request? SPARK-45561 mapped java.sql.Types.TINYINT to ByteType in the MySQL dialect, which caused unsigned TINYINT overflow, because java.sql.Types.TINYINT is reported regardless of whether the column is signed or unsigned. In this PR, we put the signedness info into the metadata for mapping TINYINT to short or byte. ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? Users can read MySQL UNSIGNED TINYINT values after this PR, as in versions before 3.5.0; this has been broken since 3.5.1. ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45579 from yaooqinn/SPARK-47435-B. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 19 March 2024, 05:37:28 UTC
bb7a613 [SPARK-47434][WEBUI] Fix `statistics` link in `StreamingQueryPage` ### What changes were proposed in this pull request? Like SPARK-24553, this PR aims to fix redirect issues (incorrect 302) when one is using proxy settings. Change the generated link to be consistent with other links and include a trailing slash ### Why are the changes needed? When using a proxy, an invalid redirect is issued if this is not included ### Does this PR introduce _any_ user-facing change? Only that people will be able to use these links if they are using a proxy ### How was this patch tested? With a proxy installed I went to the location this link would generate and could go to the page, when it redirects with the link as it exists. Edit: Further tested by building a version of our application with this patch applied, the links work now. ### Was this patch authored or co-authored using generative AI tooling? No. Page with working link <img width="913" alt="Screenshot 2024-03-18 at 4 45 27 PM" src="https://github.com/apache/spark/assets/5205457/dbcd1ffc-b7e6-4f84-8ca7-602c41202bf3"> Goes correctly to <img width="539" alt="Screenshot 2024-03-18 at 4 45 36 PM" src="https://github.com/apache/spark/assets/5205457/89111c82-b24a-4b33-895f-9c0131e8acb5"> Before it would redirect and we'd get a 404. <img width="639" alt="image" src="https://github.com/apache/spark/assets/5205457/1adfeba1-a1f6-4c35-9c39-e077c680baef"> Closes #45527 from HuwCampbell/patch-1. Authored-by: Huw Campbell <huw.campbell@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 9b466d329c3c75e89b80109755a41c2d271b8acc) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 March 2024, 14:38:17 UTC
cc6912e [SPARK-47432][PYTHON][CONNECT][DOCS][3.5] Add `pyarrow` upper bound requirement, `<13.0.0` ### What changes were proposed in this pull request? This PR aims to add `pyarrow` upper bound requirement, `<13.0.0`, to Apache Spark 3.5.x. ### Why are the changes needed? PyArrow 13.0.0 has breaking changes mentioned by #42920 which is a part of Apache Spark 4.0.0. ### Does this PR introduce _any_ user-facing change? No, this only clarifies the upper bound. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45553 from dongjoon-hyun/SPARK-47432. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 March 2024, 22:15:50 UTC
8049a20 [SPARK-45141][PYTHON][INFRA][TESTS] Pin `pyarrow==12.0.1` in CI ### What changes were proposed in this pull request? Pin `pyarrow==12.0.1` in CI ### Why are the changes needed? to fix test failure, https://github.com/apache/spark/actions/runs/6167186123/job/16738683632 ``` ====================================================================== FAIL [0.095s]: test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_datetime_ops.DatetimeOpsTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 122, in _assert_pandas_equal assert_series_equal( File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 931, in assert_series_equal assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}") File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 415, in assert_attr_equal raise_assert_detail(obj, msg, left_attr, right_attr) File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 599, in raise_assert_detail raise AssertionError(msg) AssertionError: Attributes of Series are different Attribute "dtype" are different [left]: datetime64[ns] [right]: datetime64[us] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI and manually test ### Was this patch authored or co-authored using generative AI tooling? No Closes #42897 from zhengruifeng/pin_pyarrow. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit e3d2dfa8b514f9358823c3cb1ad6523da8a6646b) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 March 2024, 21:28:34 UTC
8c6eeb8 [SPARK-45587][INFRA] Skip UNIDOC and MIMA in `build` GitHub Action job ### What changes were proposed in this pull request? This PR aims to skip `Unidoc` and `MIMA` phases in many general test pipelines. `mima` test is moved to `lint` job. ### Why are the changes needed? By having an independent document generation and mima checking GitHub Action job, we can skip them in the following many jobs. https://github.com/apache/spark/blob/73f9f5296e36541db78ab10c4c01a56fbc17cca8/.github/workflows/build_and_test.yml#L142-L190 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually check the GitHub action logs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43422 from dongjoon-hyun/SPARK-45587. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 March 2024, 05:37:39 UTC
d594252 [SPARK-47428][BUILD][3.5] Upgrade Jetty to 9.4.54.v20240208 ### What changes were proposed in this pull request? This PR aims to upgrade Jetty to 9.4.54.v20240208 ### Why are the changes needed? To bring the latest bug fixes. - https://github.com/jetty/jetty.project/releases/tag/jetty-9.4.54.v20240208 - https://github.com/jetty/jetty.project/releases/tag/jetty-9.4.53.v20231009 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45543 from dongjoon-hyun/SPARK-47428. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 March 2024, 05:28:45 UTC
ef7e7db [SPARK-47375][DOC][FOLLOWUP] Fix a mistake in JDBC's preferTimestampNTZ option doc ### What changes were proposed in this pull request? Fix a mistake in JDBC's preferTimestampNTZ option doc ### Why are the changes needed? Fix a mistake in doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #45510 from gengliangwang/reviseJdbcDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 63b79c1eac01fe7ec88e608008916258b088aeff) Signed-off-by: Kent Yao <yao@apache.org> 14 March 2024, 13:01:31 UTC
e3317a7 [SPARK-47385] Fix tuple encoders with Option inputs ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/40755 adds a null check on the input of the child deserializer in the tuple encoder. It breaks the deserializer for the `Option` type, because null should be deserialized into `None` rather than null. This PR adds a boolean parameter to `ExpressionEncoder.tuple` so that only the user that https://github.com/apache/spark/pull/40755 intended to fix has this null check. ## How was this patch tested? Unit test. Closes #45508 from chenhao-db/SPARK-47385. Authored-by: Chenhao Li <chenhao.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9986462811f160eacd766da8a4e14a9cbb4b8710) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 March 2024, 06:27:58 UTC
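A sketch of the round-trip behavior the fix preserves (assumes a spark-shell session; the tiny dataset is made up):

```scala
// Option-typed tuple fields must deserialize null back to None, not to null,
// which is what the extra null check had broken.
val ds = Seq((Some(1), "a"), (None: Option[Int], "b")).toDS()
ds.collect()  // expected: Array((Some(1),a), (None,b)); the second element comes back as None, not null
```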
706f54f [SPARK-47375][DOC][FOLLOWUP] Correct the preferTimestampNTZ option description in JDBC doc ### What changes were proposed in this pull request? Correct the preferTimestampNTZ option description in JDBC doc as per https://github.com/apache/spark/pull/45496 ### Why are the changes needed? The current doc is wrong about the jdbc option preferTimestampNTZ ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #45502 from gengliangwang/ntzJdbc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit abfbd2718159d62e3322cca8c2d4ef1c29781b21) Signed-off-by: Gengliang Wang <gengliang@apache.org> 14 March 2024, 04:00:45 UTC
3018a5d [SPARK-47368][SQL][3.5] Remove inferTimestampNTZ config check in ParquetRo… ### What changes were proposed in this pull request? The configuration `spark.sql.parquet.inferTimestampNTZ.enabled` is not related to the Parquet row converter. This PR removes the check for `spark.sql.parquet.inferTimestampNTZ.enabled` in the ParquetRowConverter. ### Why are the changes needed? Bug fix. Otherwise reading TimestampNTZ columns may fail when `spark.sql.parquet.inferTimestampNTZ.enabled` is disabled. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #45492 from gengliangwang/PR_TOOL_PICK_PR_45480_BRANCH-3.5. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 March 2024, 05:42:45 UTC
6629b9a [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files ### What changes were proposed in this pull request? Add migration doc: TimestampNTZ type inference on Parquet files ### Why are the changes needed? Update docs. The behavior change was not mentioned in the SQL migration guide ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It's just doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #45482 from gengliangwang/ntzMigrationDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 621f2c88f3e56257ee517d65e093d32fb44b783e) Signed-off-by: Gengliang Wang <gengliang@apache.org> 12 March 2024, 22:11:45 UTC
d7ce89a [MINOR][DOCS][PYTHON] Fix documentation typo in takeSample method ### What changes were proposed in this pull request? Fixed an error in the docstring documentation for the parameter `withReplacement` of `takeSample` method in `pyspark.RDD`, should be of type `bool`, but is `list` instead. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.takeSample.html ### Why are the changes needed? They correct a mistake in the documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? \- ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45419 from kimborowicz/master. Authored-by: Michał Kimborowicz <michal.kimbor@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 7a429aa84a5ed2c4b6448d43e88c475919ea2210) Signed-off-by: Kent Yao <yao@apache.org> 07 March 2024, 11:34:42 UTC
37307fb [SPARK-47241][SQL] Fix rule order issues for ExtractGenerator ### What changes were proposed in this pull request? The rule `ExtractGenerator` does not define any trigger condition when rewriting generator functions in `Project`, which makes the behavior quite unstable and heavily depends on the execution order of analyzer rules. Two bugs I've found so far: 1. By design, we want to forbid users from using more than one generator function in SELECT. However, we can't really enforce it if two generator functions are not resolved at the same time: the rule thinks there is only one generate function (the other is still unresolved), then rewrite it. The other one gets resolved later and gets rewritten later. 2. When a generator function is put after `SELECT *`, it's possible that `*` is not expanded yet when we enter `ExtractGenerator`. The rule rewrites the generator function: insert a `Generate` operator below, and add a new column to the projectList for the generator function output. Then we expand `*` to the child plan output which is `Generate`, we end up with two identical columns for the generate function output. This PR fixes it by adding a trigger condition when rewriting generator functions in `Project`: the projectList should be resolved or a generator function. This is the same trigger condition we used for `Aggregate`. To avoid breaking changes, this PR also allows multiple generator functions in `Project`, which works totally fine. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? Yes, now multiple generator functions are allowed in `Project`. And there won't be duplicated columns for generator function output. ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? No Closes #45350 from cloud-fan/generate. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 51f4cfa7560bba576577d3a5f254daaad516849d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 March 2024, 09:02:26 UTC
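A minimal sketch of the query shape behind the second issue (assumes a spark-shell session; data and names are made up, and the duplicated-column bug only manifested under a particular rule execution order):

```scala
// A generator function placed after `*` in the select list. With the fix, the
// exploded output appears exactly once alongside the expanded columns.
val df = Seq((1, Seq("a", "b"))).toDF("id", "xs")
df.createOrReplaceTempView("t")
spark.sql("SELECT *, explode(xs) AS x FROM t").show()  // expected columns: id, xs, x
```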
679f3b1 [SPARK-47305][SQL] Fix PruneFilters to tag the isStreaming flag of LocalRelation correctly when the plan has both batch and streaming ### What changes were proposed in this pull request? This PR proposes to fix PruneFilters to tag the isStreaming flag of LocalRelation correctly when the plan has both batch and streaming. ### Why are the changes needed? When filter is evaluated to be always false, PruneFilters replaces the filter with empty LocalRelation, which effectively prunes filter. The logic cares about migration of the isStreaming flag, but incorrectly migrated in some case, via picking up the value of isStreaming flag from root node rather than filter (or child). isStreaming flag is true if the value of isStreaming flag from any of children is true. Flipping the coin, some children might have isStreaming flag as "false". If the filter being pruned is a descendant to such children (in other word, ancestor of streaming node), LocalRelation is incorrectly tagged as streaming where it should be batch. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT verifying the fix. The new UT fails without this PR and passes with this PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45406 from HeartSaVioR/SPARK-47305. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 8d6bd9bbd29da6023e5740b622e12c7e1f8581ce) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 07 March 2024, 06:11:30 UTC
3f54769 [SPARK-47146][CORE][FOLLOWUP] Rename incorrect logger name ### What changes were proposed in this pull request? Rename incorrect logger name in `UnsafeSorterSpillReader`. ### Why are the changes needed? The logger name in UnsafeSorterSpillReader is incorrect. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No ### Was this patch authored or co-authored using generative AI tooling? No Closes #45404 from JacobZheng0927/loggerNameFix. Authored-by: JacobZheng0927 <zsh517559523@163.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5089140e2e6a43ffef584b42aed5cd9bc11268b6) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 06 March 2024, 13:00:04 UTC
ba22961 [SPARK-47277][3.5] PySpark util function assertDataFrameEqual should not support streaming DF ### What changes were proposed in this pull request? Backport https://github.com/apache/spark/pull/45380 to branch-3.5 The handy util function should not support streaming dataframes, currently if you call it upon streaming queries, it throws a relatively hard-to-understand error: ``` >>> df1 = spark.readStream.format("rate").load() >>> df2 = spark.readStream.format("rate").load() >>> from pyspark.testing.utils import QuietTest, assertDataFrameEqual >>> assertDataFrameEqual(df1, df2) /Users/wei.liu/oss-spark/python/pyspark/pandas/__init__.py:43: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched. warnings.warn( Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/wei.liu/oss-spark/python/pyspark/testing/utils.py", line 936, in assertDataFrameEqual actual_list = actual.collect() File "/Users/wei.liu/oss-spark/python/pyspark/sql/dataframe.py", line 1453, in collect sock_info = self._jdf.collectToPython() File "/Users/wei.liu/oss-spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__ File "/Users/wei.liu/oss-spark/python/pyspark/errors/exceptions/captured.py", line 221, in deco raise converted from None pyspark.errors.exceptions.captured.AnalysisException: Queries with streaming sources must be executed with writeStream.start(); rate ``` Because the function calls `collect` which is not supported on streaming dataframes. It'd be good if we can catch this earlier. ### Why are the changes needed? Improve usability ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Github Copilot It helped me to pick the error class UNSUPPORTED_OPERATION Closes #45395 from WweiL/assertDFEqual-3.5. Authored-by: Wei Liu <wei.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 06 March 2024, 03:25:32 UTC
e9f7d36 [SPARK-47146][CORE][3.5] Possible thread leak when doing sort merge join This PR backports https://github.com/apache/spark/pull/45327 to branch-3.5 ### What changes were proposed in this pull request? Add a TaskCompletionListener to close the inputStream to avoid thread leakage caused by an unclosed ReadAheadInputStream. ### Why are the changes needed? To fix the issue SPARK-47146 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #45390 from JacobZheng0927/SPARK-47146-3.5. Authored-by: JacobZheng0927 <zsh517559523@163.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> 06 March 2024, 02:35:42 UTC
14762b3 [SPARK-47177][SQL] Cached SQL plan does not display final AQE plan in explain string ### What changes were proposed in this pull request? This PR adds a lock for ExplainUtils.processPlan to avoid a tag race condition. ### Why are the changes needed? To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177) ### Does this PR introduce _any_ user-facing change? Yes, it affects the plan explain output. ### How was this patch tested? Added a test. ### Was this patch authored or co-authored using generative AI tooling? no Closes #45282 from ulysses-you/SPARK-47177. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com> (cherry picked from commit 6e62a5690b810edb99e4fc6ad39afbd4d49ef85e) Signed-off-by: youxiduo <youxiduo@corp.netease.com> 05 March 2024, 02:13:48 UTC
8991530 [SPARK-47202][PYTHON][TESTS][FOLLOW-UP] Run the test only with Python 3.9+ This PR proposes to run the tzinfo test only with Python 3.9+. This is a followup of https://github.com/apache/spark/pull/45308. To make the Python build passing with Python 3.8. It fails as below: ```python Starting test(pypy3): pyspark.sql.tests.test_arrow (temp output: /__w/spark/spark/python/target/605c2e61-b7c8-4898-ac7b-1d86f495bd4f/pypy3__pyspark.sql.tests.test_arrow__qrwyvw4l.log) Traceback (most recent call last): File "/usr/local/pypy/pypy3.8/lib/pypy3.8/runpy.py", line 197, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/local/pypy/pypy3.8/lib/pypy3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/__w/spark/spark/python/pyspark/sql/tests/test_arrow.py", line 26, in <module> from zoneinfo import ZoneInfo ModuleNotFoundError: No module named 'zoneinfo' ``` https://github.com/apache/spark/actions/runs/8082492167/job/22083534905 No, test-only. CI in this PR should test it out. No, Closes #45324 from HyukjinKwon/SPARK-47202-followup2. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 11e8ae42af9234adf56dc2f32b92e87697c777e4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 01 March 2024, 05:09:50 UTC
9770016 [SPARK-47236][CORE] Fix `deleteRecursivelyUsingJavaIO` to skip non-existing file input ### What changes were proposed in this pull request? This PR aims to fix `deleteRecursivelyUsingJavaIO` to skip non-existing file input. ### Why are the changes needed? `deleteRecursivelyUsingJavaIO` is a fallback of `deleteRecursivelyUsingUnixNative`. We should have identical capability. Currently, it fails. ``` [info] java.nio.file.NoSuchFileException: /Users/dongjoon/APACHE/spark-merge/target/tmp/spark-e264d853-42c0-44a2-9a30-22049522b04f [info] at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) [info] at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) [info] at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) [info] at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) [info] at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:148) [info] at java.base/java.nio.file.Files.readAttributes(Files.java:1851) [info] at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:126) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is difficult to test this `private static` Java method. I tested this with #45344 . ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45346 from dongjoon-hyun/SPARK-47236. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 1cd7bab5c5c2bd8d595b131c88e6576486dbf123) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 March 2024, 03:08:25 UTC
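The spirit of the guard, as a simplified self-contained sketch (not the actual `JavaUtils` code): check for existence up front so a vanished path is skipped instead of surfacing as `NoSuchFileException`.

```scala
import java.io.File
import java.nio.file.{Files, LinkOption}

// Simplified recursive delete that tolerates a non-existing input, mirroring the
// tolerance of the native (rm -rf style) fallback path.
def deleteRecursively(file: File): Unit = {
  if (!Files.exists(file.toPath, LinkOption.NOFOLLOW_LINKS)) return  // skip non-existing input
  if (file.isDirectory && !Files.isSymbolicLink(file.toPath)) {
    Option(file.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  }
  file.delete()
}
```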
e6f3dd9 [SPARK-47202][PYTHON][TESTS][FOLLOW-UP] Test timestamp with tzinfo in toPandas and createDataFrame with Arrow optimized ### What changes were proposed in this pull request? This PR is a follow up of https://github.com/apache/spark/pull/45301 that actually test the change. ### Why are the changes needed? To prevent a regression. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually ran the tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45308 from HyukjinKwon/SPARK-47202-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 721c2a41a54bb00ea885093f322edf704e63d17f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 February 2024, 07:56:49 UTC
c0a4416 [SPARK-47202][PYTHON] Fix typo breaking datetimes with tzinfo This PR fixes a bug caused due to a typo. This bug is preventing users from having datetime.datetime objects with tzinfo when using the `TimestampType` No, just a bug fix. There ought to be CI that lints code and catches these simple errors at the time of opening the PR. No Closes #45301 from arzavj/SPARK-47202. Authored-by: Arzav Jain <arzavj@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit bf8396eaef9f95de3dab712e8895e4bc63adef7c) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 February 2024, 05:11:22 UTC
96688ac [SPARK-47063][SQL] CAST long to timestamp has different behavior for codegen vs interpreted ### What changes were proposed in this pull request? When an overflow occurs while casting long to timestamp, codegen and interpreted execution behave differently ``` scala> Seq(Long.MaxValue, Long.MinValue).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) +--------------------+-------------------+---------------+ |v |ts |unix_micros(ts)| +--------------------+-------------------+---------------+ |9223372036854775807 |1969-12-31 20:59:59|-1000000 | |-9223372036854775808|1969-12-31 21:00:00|0 | +--------------------+-------------------+---------------+ scala> spark.conf.set("spark.sql.codegen.wholeStage", false) scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN") scala> Seq(Long.MaxValue, Long.MinValue).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) +--------------------+-----------------------------+--------------------+ |v |ts |unix_micros(ts) | +--------------------+-----------------------------+--------------------+ |9223372036854775807 |+294247-01-10 01:00:54.775807|9223372036854775807 | |-9223372036854775808|-290308-12-21 15:16:20.224192|-9223372036854775808| +--------------------+-----------------------------+--------------------+ ``` To align the behavior, this PR changes the codegen function to be the same as the interpreted one (https://github.com/apache/spark/blob/f0090c95ad4eca18040104848117a7da648ffa3c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687) ### Why are the changes needed? This is necessary to be consistent in all cases ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? With unit tests and manually ### Was this patch authored or co-authored using generative AI tooling? No Closes #45294 from planga82/bugfix/spark47063_cast_codegen. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit f18d945af7b69fbc89b38b9ca3ca79263b0881ed) Signed-off-by: Kent Yao <yao@apache.org> 28 February 2024, 03:44:23 UTC
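The two conversion behaviors visible in the transcript above, reduced to plain arithmetic that reproduces the reported `unix_micros` values (constants only, no Spark involved):

```scala
import java.util.concurrent.TimeUnit

val v = Long.MaxValue
TimeUnit.SECONDS.toMicros(v)  // saturates at Long.MaxValue, matching the interpreted result
v * 1000000L                  // wraps around on overflow to -1000000, matching the old codegen result
```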
b4118e0 [SPARK-45599][CORE][3.5] Use object equality in OpenHashSet ### What changes were proposed in this pull request? This is a backport of fcc5dbc9b6c8a8e16dc2e0854f3eebc8758a5826 / https://github.com/apache/spark/pull/45036 with a tweak so that it works on Scala 2.12. ### Why are the changes needed? This is a correctness bug fix. The original fix against `master` suppresses a warning category that doesn't exist on certain versions of Scala 2.13 and 2.12, and the exact versions are [not documented anywhere][1]. To be safe, this backport simply suppresses all warnings instead of just `other-non-cooperative-equals`. It would be interesting to see if `-Wconf:nowarn` complains, since the warning about non-cooperative equals itself is also not thrown on all versions of Scala, but I don't think that's a priority. [1]: https://github.com/scala/scala/pull/8120#issuecomment-1967413860 ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? CI + added tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45296 from nchammas/SPARK-45599-OpenHashSet. Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 February 2024, 02:30:57 UTC
cbf25fb Revert "[SPARK-45599][CORE] Use object equality in OpenHashSet" This reverts commit 588a55d010fefda7a63cde3b616ac38728fe4cfe. 27 February 2024, 16:38:54 UTC
588a55d [SPARK-45599][CORE] Use object equality in OpenHashSet ### What changes were proposed in this pull request? Change `OpenHashSet` to use object equality instead of cooperative equality when looking up keys. ### Why are the changes needed? This brings the behavior of `OpenHashSet` more in line with the semantics of `java.util.HashSet`, and fixes its behavior when comparing values for which `equals` and `==` return different results, like 0.0/-0.0 and NaN/NaN. For example, in certain cases where both 0.0 and -0.0 are provided as keys to the set, lookups of one or the other key may return the [wrong position][wrong] in the set. This leads to the bug described in SPARK-45599 and summarized in [this comment][1]. [wrong]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283 [1]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954 ### Does this PR introduce _any_ user-facing change? Yes, it resolves the bug described in SPARK-45599. `OpenHashSet` is used widely under the hood, including in: - `OpenHashMap`, which itself backs: - `TypedImperativeAggregate` - aggregate functions like `percentile` and `mode` - many algorithms in ML and MLlib - `SQLOpenHashSet`, which backs array functions like `array_union` and `array_distinct` However, the user-facing changes should be limited to the kind of edge case described in SPARK-45599. ### How was this patch tested? New and existing unit tests. Of the new tests added in this PR, some simply validate that we have not changed existing SQL semantics, while others confirm that we have fixed the specific bug reported in SPARK-45599 along with any related incorrect behavior. New tests failing on `master` but passing on this branch: - [Handling 0.0 and -0.0 in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R273) - [Handling NaN in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R302) - [Handling 0.0 and -0.0 in `OpenHashMap`](https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R253) - [Handling 0.0 and -0.0 when computing percentile](https://github.com/apache/spark/pull/45036/files#diff-bd3d5c79ede5675f4bf10d2efb313db893d57443d6d6d67b1f8766e6ce741271R1092) New tests passing both on `master` and this branch: - [Handling 0.0, -0.0, and NaN in `array_union`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R793) - [Handling 0.0, -0.0, and NaN in `array_distinct`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R801) - [Handling 0.0, -0.0, and NaN in `GROUP BY`](https://github.com/apache/spark/pull/45036/files#diff-496edb8b03201f078c3772ca81f7c7f80002acc11dff00b1d06d288b87855264R1107) - [Normalizing -0 and -0.0](https://github.com/apache/spark/pull/45036/files#diff-4bdd04d06a2d88049dd5c8a67715c5566dd68a1c4ebffc689dc74b6b2e0b3b04R782) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45036 from nchammas/SPARK-45599-plus-and-minus-zero. 
Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fcc5dbc9b6c8a8e16dc2e0854f3eebc8758a5826) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 February 2024, 08:04:00 UTC
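The equality semantics at the heart of this change, shown directly on the JVM (plain Scala, no Spark involved):

```scala
// Cooperative equality (==) and object equality (equals) disagree on signed zero and NaN.
0.0 == -0.0                                                                         // true
java.lang.Double.valueOf(0.0).equals(java.lang.Double.valueOf(-0.0))                // false, as in java.util.HashSet
Double.NaN == Double.NaN                                                            // false
java.lang.Double.valueOf(Double.NaN).equals(java.lang.Double.valueOf(Double.NaN))   // true
```

Switching `OpenHashSet` to object equality makes its key lookups agree with `java.util.HashSet` on exactly these edge cases.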
e81df1f [SPARK-47125][SQL] Return null if Univocity never triggers parsing ### What changes were proposed in this pull request? This PR proposes to guard against `null` from `tokenizer.getContext`. This is similar to https://github.com/apache/spark/pull/28029. `getContext` comes from the univocity library, and it can return null if `beginParsing` is not invoked (https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/AbstractParser.java#L53). This can happen when `parseLine` is not invoked at https://github.com/apache/spark/blob/e081f06ea401a2b6b8c214a36126583d35eaf55f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L300 - `parseLine` invokes `beginParsing`. ### Why are the changes needed? To fix up a bug. ### Does this PR introduce _any_ user-facing change? Yes. In a very rare case, when `CsvToStructs` is used as a sole predicate against an empty row, it might trigger an NPE. This PR fixes it. ### How was this patch tested? Manually tested; a test case will be added in a separate PR. We should backport this to all branches. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45210 from HyukjinKwon/SPARK-47125. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 February 2024, 03:14:04 UTC
3f4425b [SPARK-47036][SS][3.5] Cleanup RocksDB file tracking for previously uploaded files if files were deleted from local directory Backports PR https://github.com/apache/spark/pull/45092 to Spark 3.5 ### What changes were proposed in this pull request? This change cleans up any dangling files tracked as being previously uploaded if they were cleaned up from the filesystem. The cleaning can happen due to a compaction racing in parallel with commit, where compaction completes after commit and an older version is loaded on the same executor. ### Why are the changes needed? The changes are needed to prevent RocksDB versionId mismatch errors (which require users to clean the checkpoint directory and retry the query). A particular scenario where this can happen is provided below: 1. Version V1 is loaded on executor A, RocksDB State Store has 195.sst, 196.sst, 197.sst and 198.sst files. 2. State changes are made, which result in creation of a new table file 200.sst. 3. State store is committed as version V2. The SST file 200.sst (as 000200-8c80161a-bc23-4e3b-b175-cffe38e427c7.sst) is uploaded to DFS, and previous 4 files are reused. A new metadata file is created to track the exact SST files with unique IDs, and uploaded with RocksDB Manifest as part of V1.zip. 4. RocksDB compaction is triggered at the same time. The compaction creates a new L1 file (201.sst), and deletes the existing 5 SST files. 5. Spark Stage is retried. 6. Version V1 is reloaded on the same executor. The local files are inspected, and 201.sst is deleted. The 4 SST files in version V1 are downloaded again to the local file system. 7. Any local files which are deleted (as part of version load) are also removed from local → DFS file upload tracking. **However, the files already deleted as a result of compaction are not removed from tracking. This is the bug which resulted in the failure.** 8. State store is committed as version V1. However, the local mapping of SST files to DFS file path still has 200.sst in its tracking, hence the SST file is not re-uploaded. A new metadata file is created to track the exact SST files with unique IDs, and uploaded with the new RocksDB Manifest as part of V2.zip. (The V2.zip file is overwritten here atomically) 9. A new executor tries to load version V2. However, the SST files in (1) are now incompatible with the Manifest file in (6), resulting in the version ID mismatch failure. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test cases to cover the scenario where some files were deleted on the file system. The test case fails on the existing master with the error `Mismatch in unique ID on table file 16`, and is successful with the changes in this PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45206 from sahnib/spark-3.5-rocks-db-fix. Authored-by: Bhuwan Sahni <bhuwan.sahni@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 22 February 2024, 01:59:15 UTC
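A generic sketch of the bookkeeping change, with hypothetical names rather than the actual `RocksDBFileManager` code: files that disappeared locally must also be dropped from the local-to-DFS upload tracking so the next commit re-uploads them.

```scala
import scala.collection.mutable

// Hypothetical tracking map: local SST file name -> DFS file name it was uploaded as.
// Any entry whose local file no longer exists (e.g. deleted by compaction) is removed,
// so that file is re-uploaded on the next commit instead of being assumed present on DFS.
def pruneUploadTracking(
    localToDfsFile: mutable.Map[String, String],
    localFilesPresent: Set[String]): Unit = {
  val stale = localToDfsFile.keys.filterNot(localFilesPresent.contains).toList
  stale.foreach(localToDfsFile.remove)
}
```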
92a333a [SPARK-47085][SQL][3.5] reduce the complexity of toTRowSet from n^2 to n ### What changes were proposed in this pull request? reduce the complexity of RowSetUtils.toTRowSet from n^2 to n ### Why are the changes needed? This causes performance issues. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests + test manually on AWS EMR ### Was this patch authored or co-authored using generative AI tooling? No Closes #45165 from igreenfield/branch-3.5. Authored-by: Izek Greenfield <izek.greenfield@adenza.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 February 2024, 20:39:24 UTC
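The quadratic behavior comes from repeated positional access inside the row-conversion loop; the sketch below is illustrative only (not the actual `RowSetUtils` code) and shows why a single iterator pass is linear.

```scala
// Positional indexing into a List-backed Seq is O(i) per access, so the loop is O(n^2).
def convertQuadratic(rows: Seq[Seq[Any]]): Int = {
  var converted = 0
  var i = 0
  while (i < rows.length) {
    converted += rows(i).size  // O(i) lookup on a linked Seq
    i += 1
  }
  converted
}

// One iterator pass over the same rows is O(n).
def convertLinear(rows: Seq[Seq[Any]]): Int = {
  var converted = 0
  val it = rows.iterator
  while (it.hasNext) {
    converted += it.next().size
  }
  converted
}
```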
93a09ea [SPARK-47072][SQL][3.5] Fix supported interval formats in error messages ### What changes were proposed in this pull request? In the PR, I propose to add one more field to keys of `supportedFormat` in `IntervalUtils` because current implementation has duplicate keys that overwrites each other. For instance, the following keys are the same: ``` (YM.YEAR, YM.MONTH) ... (DT.DAY, DT.HOUR) ``` because `YM.YEAR = DT.DAY = 0` and `YM.MONTH = DT.HOUR = 1`. This is a backport of https://github.com/apache/spark/pull/45127. ### Why are the changes needed? To fix the incorrect error message when Spark cannot parse ANSI interval string. For example, the expected format should be some year-month format but Spark outputs day-time one: ```sql spark-sql (default)> select interval '-\t2-2\t' year to month; Interval string does not match year-month format of `[+|-]d h`, `INTERVAL [+|-]'[+|-]d h' DAY TO HOUR` when cast to interval year to month: - 2-2 . (line 1, pos 16) == SQL == select interval '-\t2-2\t' year to month ----------------^^^ ``` ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By running the existing test suite: ``` $ build/sbt "test:testOnly *IntervalUtilsSuite" ``` and regenerating the golden files: ``` $ SPARK_GENERATE_GOLDEN_FILES=1 PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Authored-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 074fcf2807000d342831379de0fafc1e49a6bf19) Closes #45139 from MaxGekk/fix-supportedFormat-3.5. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 19 February 2024, 07:29:08 UTC
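A minimal sketch of the key collision, with illustrative names rather than the actual `IntervalUtils` code: keying only on the start/end field ordinals makes the year-month and day-time entries overwrite each other, while adding the interval style to the key keeps them distinct.

```scala
object IntervalFormatKeys {
  sealed trait Style
  case object YearMonth extends Style
  case object DayTime extends Style

  // YEAR/MONTH and DAY/HOUR share ordinals 0/1, so these two entries collide:
  val collapsed: Map[(Int, Int), String] = Map(
    (0, 1) -> "[+|-]y-m", // YEAR TO MONTH
    (0, 1) -> "[+|-]d h"  // DAY TO HOUR silently overwrites the entry above
  )

  // Adding the style as an extra key component keeps both formats:
  val distinct: Map[(Style, Int, Int), String] = Map(
    (YearMonth, 0, 1) -> "[+|-]y-m",
    (DayTime, 0, 1) -> "[+|-]d h"
  )
}
```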
5067447 [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/39856. The configuration changes should be reflected in the Parquet data source doc ### Why are the changes needed? To fix doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview: <img width="1010" alt="image" src="https://github.com/apache/spark/assets/1097932/618df731-49ad-49e7-afa2-22381cb3bbef"> ### Was this patch authored or co-authored using generative AI tooling? No Closes #45145 from gengliangwang/changeConfigName. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit dc2f2673a73ccde44b59cada00e95e869ad64c01) Signed-off-by: Gengliang Wang <gengliang@apache.org> 17 February 2024, 02:21:38 UTC
c61d89a [SPARK-45357][CONNECT][TESTS][3.5] Normalize `dataframeId` when comparing `CollectMetrics` in `SparkConnectProtoSuite` ### What changes were proposed in this pull request? This PR add a new function `normalizeDataframeId` to sets the `dataframeId` to the constant 0 of `CollectMetrics` before comparing `LogicalPlan` in the test case of `SparkConnectProtoSuite`. ### Why are the changes needed? The test scenario in `SparkConnectProtoSuite` does not need to compare the `dataframeId` in `CollectMetrics` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Manually check run ``` build/mvn clean install -pl connector/connect/server -am -DskipTests build/mvn test -pl connector/connect/server ``` **Before** ``` - Test observe *** FAILED *** == FAIL: Plans do not match === !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 0 CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L], 53 +- LocalRelation <empty>, [id#0, name#0] +- LocalRelation <empty>, [id#0, name#0] (PlanTest.scala:179) ``` **After** ``` Run completed in 41 seconds, 631 milliseconds. Total number of tests run: 882 Suites: completed 24, aborted 0 Tests: succeeded 882, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #45141 from LuciferYang/SPARK-45357-35. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 February 2024, 17:20:11 UTC
1c1c5fa [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch This PR fixes the regression introduced by https://github.com/apache/spark/pull/36683. ```python import pandas as pd spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 0) spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", False) spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas() spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", -1) spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas() ``` **Before** ``` /.../spark/python/pyspark/sql/pandas/conversion.py:371: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and will not continue because automatic fallback with 'spark.sql.execution.arrow.pyspark.fallback.enabled' has been set to false. range() arg 3 must not be zero warn(msg) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 1483, in createDataFrame return super(SparkSession, self).createDataFrame( # type: ignore[call-overload] File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 351, in createDataFrame return self._create_from_pandas_with_arrow(data, schema, timezone) File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 633, in _create_from_pandas_with_arrow pdf_slices = (pdf.iloc[start : start + step] for start in range(0, len(pdf), step)) ValueError: range() arg 3 must not be zero ``` ``` Empty DataFrame Columns: [a] Index: [] ``` **After** ``` a 0 123 ``` ``` a 0 123 ``` It fixes a regression. This is a documented behaviour. It should be backported to branch-3.4 and branch-3.5. Yes, it fixes a regression as described above. A unit test was added. No. Closes #45132 from HyukjinKwon/SPARK-47068. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 3bb762dc032866cfb304019cba6db01125556c2f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 February 2024, 03:41:56 UTC
cacb6fa Preparing development version 3.5.2-SNAPSHOT 15 February 2024, 10:56:51 UTC
fd86f85 Preparing Spark release v3.5.1-rc2 15 February 2024, 10:56:47 UTC
9b4778f [SPARK-46906][INFRA][3.5] Bump python libraries (pandas, pyarrow) in Docker image for release script ### What changes were proposed in this pull request? This PR proposes to bump python libraries (pandas to 2.0.3, pyarrow to 4.0.0) in Docker image for release script. ### Why are the changes needed? Without this change, release script (do-release-docker.sh) fails on docs phase. Changing this fixes the release process against branch-3.5. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Confirmed with dry-run of release script against branch-3.5. `dev/create-release/do-release-docker.sh -d ~/spark-release -n -s docs` ``` Generating HTML files for SQL API documentation. INFO - Cleaning site directory INFO - Building documentation to directory: /opt/spark-rm/output/spark/sql/site INFO - Documentation built in 0.85 seconds /opt/spark-rm/output/spark/sql Moving back into docs dir. Making directory api/sql cp -r ../sql/site/. api/sql Source: /opt/spark-rm/output/spark/docs Destination: /opt/spark-rm/output/spark/docs/_site Incremental build: disabled. Enable with --incremental Generating... done in 7.469 seconds. Auto-regeneration: disabled. Use --watch to enable. ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45111 from HeartSaVioR/SPARK-46906-3.5. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 15 February 2024, 05:49:09 UTC
ea6b257 Revert "[SPARK-45396][PYTHON] Add doc entry for `pyspark.ml.connect` module, and adds `Evaluator` to `__all__` at `ml.connect`" This reverts commit 35b627a934b1ab28be7d6ba88fdad63dc129525a. 15 February 2024, 00:20:57 UTC
a8c62d3 [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26 This PR aims to upgrade `aircompressor` to 1.26. `aircompressor` v1.26 has the following bug fixes. - [Fix out of bounds read/write in Snappy decompressor](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) - [Fix ZstdOutputStream corruption on double close](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) No. Pass the CIs. No. Closes #45084 from dongjoon-hyun/SPARK-47023. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 13 February 2024, 18:00:56 UTC
d27bdbe Preparing development version 3.5.2-SNAPSHOT 12 February 2024, 03:39:15 UTC
08fe67b Preparing Spark release v3.5.1-rc1 12 February 2024, 03:39:11 UTC
4e4d9f0 [SPARK-47022][CONNECT][TESTS][3.5] Fix `connect/client/jvm` to have explicit `commons-(io|lang3)` test dependency ### What changes were proposed in this pull request? This PR aims to add `commons-io` and `commons-lang3` test dependency to `connector/client/jvm` module. ### Why are the changes needed? `connector/client/jvm` module uses `commons-io` and `commons-lang3` during testing like the following. https://github.com/apache/spark/blob/9700da7bfc1abb607f3cb916b96724d0fb8f2eba/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala#L26-L28 Currently, it's broken due to that. - https://github.com/apache/spark/actions?query=branch%3Abranch-3.5 ### Does this PR introduce _any_ user-facing change? No, this is a test-dependency only change. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45081 from dongjoon-hyun/SPARK-47022. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 February 2024, 22:38:48 UTC
9700da7 [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency ### What changes were proposed in this pull request? This PR aims to fix `kvstore` module by adding explicit `commons-lang3` test dependency and excluding `htmlunit-driver` from `org.scalatestplus` to use Apache Spark's explicit declaration. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/pom.xml#L711-L716 ### Why are the changes needed? Since Spark 3.3.0 (SPARK-37282), `kvstore` uses `commons-lang3` test dependency like the following, but we didn't declare it explicitly so far. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java#L33 https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBIteratorSuite.java#L23 Previously, it was provided by some unused `htmlunit-driver`'s transitive dependency accidentally. This causes a weird situation which `kvstore` module starts to fail to compile when we upgrade `htmlunit-driver`. We need to fix this first. ``` $ mvn dependency:tree -pl common/kvstore ... [INFO] | \- org.seleniumhq.selenium:htmlunit-driver:jar:4.12.0:test ... [INFO] | +- org.apache.commons:commons-lang3:jar:3.14.0:test ``` ### Does this PR introduce _any_ user-facing change? No. This is only a test dependency fix. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45080 from dongjoon-hyun/SPARK-47021. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a926c7912a78f1a2fb71c5ffd21b5c2f723a0128) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 February 2024, 18:38:11 UTC
7658f77 [SPARK-39910][SQL] Delegate path qualification to filesystem during DataSource file path globbing In current version `DataSource#checkAndGlobPathIfNecessary` qualifies paths via `Path#makeQualified` and `PartitioningAwareFileIndex` qualifies via `FileSystem#makeQualified`. Most `FileSystem` implementations simply delegate to `Path#makeQualified`, but others, like `HarFileSystem` contain fs-specific logic, that can produce different result. Such inconsistencies can lead to a situation, when spark can't find partitions of the source file, because qualified paths, built by `Path` and `FileSystem` are different. Therefore, for uniformity, the `FileSystem` path qualification should be used in `DataSource#checkAndGlobPathIfNecessary`. Allow users to read files from hadoop archives (.har) using DataFrameReader API No New tests were added in `DataSourceSuite` and `DataFrameReaderWriterSuite` No Closes #43463 from tigrulya-exe/SPARK-39910-use-fs-path-qualification. Authored-by: Tigran Manasyan <t.manasyan@arenadata.io> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b7edc5fac0f4e479cbc869d54a9490c553ba2613) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 February 2024, 12:30:05 UTC
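A sketch of the delegation described above, using standard Hadoop APIs (not the exact `DataSource` code): qualifying through the `FileSystem` lets fs-specific overrides such as `HarFileSystem`'s take effect.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def qualifiedPath(pathString: String, hadoopConf: Configuration): Path = {
  val path = new Path(pathString)
  val fs: FileSystem = path.getFileSystem(hadoopConf)
  // before: path.makeQualified(fs.getUri, fs.getWorkingDirectory)
  fs.makeQualified(path) // lets the concrete FileSystem apply its own qualification logic
}
```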
77f8b38 [SPARK-46400][CORE][SQL][3.5] When there are corrupted files in the local maven repo, skip this cache and try again ### What changes were proposed in this pull request? The PR aims to - fix a potential bug (i.e. https://github.com/apache/spark/pull/44208) and enhance the user experience. - make the code more compliant with standards. This backports the above to branch-3.5. Master branch PR: https://github.com/apache/spark/pull/44343 ### Why are the changes needed? We use the local maven repo as the first-level cache in ivy. The original intention was to reduce the time required to parse and obtain the jar, but when there are corrupted files in the local maven repo, the above mechanism is interrupted and the error message is very unfriendly, which will greatly confuse the user. Based on the original intention, we should skip the cache directly in similar situations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45017 from panbingkun/branch-3.5_SPARK-46400. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 08 February 2024, 06:41:51 UTC
1b36e3c [SPARK-46170][SQL][3.5] Support inject adaptive query post planner strategy rules in SparkSessionExtensions This PR backports https://github.com/apache/spark/pull/44074 to branch-3.5 since 3.5 is an LTS version ### What changes were proposed in this pull request? This PR adds a new extension entry point `queryPostPlannerStrategyRules` in `SparkSessionExtensions`. It will be applied between plannerStrategy and queryStagePrepRules in AQE, so it can get the whole plan before injecting exchanges. ### Why are the changes needed? 3.5 is an LTS version ### Does this PR introduce _any_ user-facing change? No, only for developers ### How was this patch tested? Added a test ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44074 from ulysses-you/post-planner. Authored-by: ulysses-you <ulyssesyou18gmail.com> Closes #45037 from ulysses-you/SPARK-46170. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> 06 February 2024, 06:54:45 UTC
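A sketch of how the new entry point might be used; the injection method name `injectQueryPostPlannerStrategyRule` is assumed from the PR description, and the rule is a no-op placeholder that simply observes the plan before AQE inserts exchanges.

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.SparkPlan

// No-op placeholder rule: sees the whole physical plan before exchanges are injected.
case class NoOpPostPlannerRule(session: SparkSession) extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan
}

// Register via spark.sql.extensions=<fully qualified name of MyExtensions>.
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // Method name assumed from the new `queryPostPlannerStrategyRules` entry point.
    extensions.injectQueryPostPlannerStrategyRule(NoOpPostPlannerRule)
  }
}
```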
3f426b5 [MINOR][DOCS] Add Missing space in `docs/configuration.md` ### What changes were proposed in this pull request? Add a missing space in documentation file `docs/configuration.md`, which might lead to some misunderstanding to newcomers. ### Why are the changes needed? To eliminate ambiguity in sentences. ### Does this PR introduce _any_ user-facing change? Yes, it changes the documentation. ### How was this patch tested? I built the docs locally and double-checked the spelling. ### Was this patch authored or co-authored using generative AI tooling? No. It is just a little typo lol. Closes #45021 from KKtheGhost/fix/spell-configuration. Authored-by: KKtheGhost <dev@amd.sh> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit da73c123e648460dc7df04e9eda9d90445dfedff) Signed-off-by: Kent Yao <yao@apache.org> 05 February 2024, 01:49:42 UTC
4b33d28 [SPARK-46953][TEST] Wrap withTable for a test in ResolveDefaultColumnsSuite ### What changes were proposed in this pull request? The table is not cleaned up after this test; test retries or upcoming new tests that reuse 't' as the table name will fail with a TableAlreadyExistsException (TAEE). ### Why are the changes needed? Fix tests, as a follow-up of SPARK-43742. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This test itself. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44993 from yaooqinn/SPARK-43742. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit a3432428e760fc16610cfe3380d3bdea7654f75d) Signed-off-by: Kent Yao <yao@apache.org> 02 February 2024, 07:17:58 UTC
547edb2 [SPARK-46945][K8S][3.5] Add `spark.kubernetes.legacy.useReadWriteOnceAccessMode` for old K8s clusters ### What changes were proposed in this pull request? This PR aims to introduce a legacy configuration for K8s PVC access mode to mitigate migrations issues in old K8s clusters. This is a kind of backport of - #44985 ### Why are the changes needed? - The default value of `spark.kubernetes.legacy.useReadWriteOnceAccessMode` is `true` in branch-3.5. - To help the users who cannot upgrade their K8s versions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44986 from dongjoon-hyun/SPARK-46945-3.5. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Kent Yao <yao@apache.org> 02 February 2024, 02:44:40 UTC
d3b4537 [SPARK-46747][SQL] Avoid scan in getTableExistsQuery for JDBC Dialects ### What changes were proposed in this pull request? [SPARK-46747](https://issues.apache.org/jira/browse/SPARK-46747) reported an issue that Postgres instances suffered from too many shared locks, which was caused by Spark's table-existence check query. In this PR, we replaced `"SELECT 1 FROM $table LIMIT 1"` with `"SELECT 1 FROM $table WHERE 1=0"` to prevent data from being scanned. ### Why are the changes needed? Overhead reduction for JDBC data sources. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing JDBC v1/v2 data source tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44948 from yaooqinn/SPARK-46747. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 031df8fa62666f14f54cf0a792f7fa2acc43afee) Signed-off-by: Kent Yao <yao@apache.org> 31 January 2024, 01:45:19 UTC
343ae82 [SPARK-46893][UI] Remove inline scripts from UI descriptions ### What changes were proposed in this pull request? This PR prevents malicious users from injecting inline scripts via job and stage descriptions. Spark's Web UI [already checks the security of job and stage descriptions](https://github.com/apache/spark/blob/a368280708dd3c6eb90bd3b09a36a68bdd096222/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L528-L545) before rendering them as HTML (or treating them as plain text). The UI already disallows `<script>` tags but doesn't protect against attributes with inline scripts like `onclick` or `onmouseover`. ### Why are the changes needed? On multi-user clusters, bad users can inject scripts into their job and stage descriptions. The UI already finds that [worth protecting against](https://github.com/apache/spark/blob/a368280708dd3c6eb90bd3b09a36a68bdd096222/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L533-L535). So this is extending that protection to scripts in attributes. ### Does this PR introduce _any_ user-facing change? Yes if users relied on inline scripts or attributes in their job or stage descriptions. ### How was this patch tested? Added tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44933 from rshkv/wr/spark-46893. Authored-by: Willi Raschkowski <wraschkowski@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit abd9d27e87b915612e2a89e0d2527a04c7b984e0) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 January 2024, 06:43:29 UTC
accfb39 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled This PR aims to fix `Master` to reject `/workers/kill/` request if `spark.decommission.enabled` is `false` in order to fix the dangling worker issue. Currently, `spark.decommission.enabled` is `false` by default. So, when a user asks to decommission, only Master marked it `DECOMMISSIONED` while the worker is alive. ``` $ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1 ``` **Master UI** ![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be) **Worker Log** ``` 24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled. ``` To be consistent with the existing `Worker` behavior which ignores the request. https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868 No, this is a bug fix. Pass the CI with the newly added test case. No. Closes #44915 from dongjoon-hyun/SPARK-46888. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 20b593811dc02c96c71978851e051d32bf8c3496) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 January 2024, 04:25:45 UTC
a2854ba [SPARK-46862][SQL][FOLLOWUP] Fix column pruning without schema enforcing in V1 CSV datasource ### What changes were proposed in this pull request? In the PR, I propose to invoke `CSVOptons.isColumnPruningEnabled` introduced by https://github.com/apache/spark/pull/44872 while matching of CSV header to a schema in the V1 CSV datasource. ### Why are the changes needed? To fix the failure when column pruning happens and a schema is not enforced: ```scala scala> spark.read. | option("multiLine", true). | option("header", true). | option("escape", "\""). | option("enforceSchema", false). | csv("/Users/maximgekk/tmp/es-939111-data.csv"). | count() 24/01/27 12:43:14 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema: Header length: 4, schema size: 0 CSV file: file:///Users/maximgekk/tmp/es-939111-data.csv ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *CSVv1Suite" $ build/sbt "test:testOnly *CSVv2Suite" $ build/sbt "test:testOnly *CSVLegacyTimeParserSuite" $ build/sbt "testOnly *.CsvFunctionsSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44910 from MaxGekk/check-header-column-pruning. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit bc51c9fea3645c6ae1d9e1e83b0f94f8b849be20) Signed-off-by: Max Gekk <max.gekk@gmail.com> 27 January 2024, 16:23:18 UTC
cf4e867 [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode ### What changes were proposed in this pull request? In the PR, I propose to disable the column pruning feature in the CSV datasource for the `multiLine` mode. ### Why are the changes needed? To workaround the issue in the `uniVocity` parser used by the CSV datasource: https://github.com/uniVocity/univocity-parsers/issues/529 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *CSVv1Suite" $ build/sbt "test:testOnly *CSVv2Suite" $ build/sbt "test:testOnly *CSVLegacyTimeParserSuite" $ build/sbt "testOnly *.CsvFunctionsSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44872 from MaxGekk/csv-disable-column-pruning. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 829e742df8251c6f5e965cb08ad454ac3ee1a389) Signed-off-by: Max Gekk <max.gekk@gmail.com> 26 January 2024, 08:02:30 UTC
e5a654e [SPARK-46855][INFRA][3.5] Add `sketch` to the dependencies of the `catalyst` module in `module.py` ### What changes were proposed in this pull request? This PR adds `sketch` to the dependencies of the `catalyst` module in `module.py`, because `sketch` is a direct dependency of the `catalyst` module. ### Why are the changes needed? Ensure that when modifying the `sketch` module, both `catalyst` and its cascading modules will trigger tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44893 from LuciferYang/SPARK-46855-35. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 January 2024, 06:35:38 UTC
125b2f8 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler * The DAGScheduler could currently run into a deadlock with another thread if both access the partitions of the same RDD at the same time. * To make progress in getCacheLocs, we require both exclusive access to the RDD partitions and the location cache. We first lock on the location cache, and then on the RDD. * When accessing partitions of an RDD, the RDD first acquires exclusive access on the partitions, and then might acquire exclusive access on the location cache. * If thread 1 is able to acquire access on the RDD, while thread 2 holds the access to the location cache, we can run into a deadlock situation. * To fix this, acquire locks in the same order. Change the DAGScheduler to first acquire the lock on the RDD, and then the lock on the location cache. * This is a deadlock you can run into, which can prevent any progress on the cluster. * No * Unit test that reproduces the issue. No Closes #44882 from fred-db/fix-deadlock. Authored-by: fred-db <fredrik.klauss@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 617014cc92d933c70c9865a578fceb265883badd) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 January 2024, 16:35:29 UTC
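A generic illustration of the lock-ordering fix described above (not the actual `DAGScheduler` code): acquiring both locks in one agreed order removes the deadlocking interleaving.

```scala
object LockOrderingSketch {
  private val rddLock = new Object        // stands in for the RDD's partition lock
  private val cacheLocsLock = new Object  // stands in for the location-cache lock

  // Every code path takes the RDD lock first, then the cache lock, so two threads
  // can no longer hold one lock each while waiting for the other.
  def getCacheLocs(): Unit = rddLock.synchronized {
    cacheLocsLock.synchronized {
      // read partitions and cached locations while holding both locks
    }
  }
}
```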
ef33b9c [SPARK-46796][SS] Ensure the correct remote files (mentioned in metadata.zip) are used on RocksDB version load This PR ensures that RocksDB loads do not run into the SST file version ID mismatch issue. RocksDB has added validation to ensure the exact same SST file is used during database load from a snapshot. The current streaming state suffers from certain edge cases where this condition is violated, resulting in state load failure. The changes introduced are: 1. Ensure that the local SST file is exactly the same DFS file (as per the mapping in metadata.zip). We keep track of the DFS file path for a local SST file, and re-download the SST file in case the DFS file has a different UUID in metadata.zip. 2. Reset lastSnapshotVersion in RocksDB when RocksDB is loaded. Changelog checkpointing relies on this version for future snapshots. Currently, if an older version is reloaded, we were not uploading snapshots as lastSnapshotVersion was pointing to a higher snapshot of a cleaned-up database. We need to ensure that the correct SST files are used on the executor during RocksDB load as per the mapping in metadata.zip. With the current implementation, it's possible that the executor uses an SST file (with a different UUID) from an older version which is not the exact file mapped in the metadata.zip. This can cause version ID mismatch errors while loading RocksDB, leading to streaming query failures. See https://issues.apache.org/jira/browse/SPARK-46796 for failure scenarios. No. Added exhaustive unit test cases covering the scenarios. No. Closes #44837 from sahnib/SPARK-46796. Authored-by: Bhuwan Sahni <bhuwan.sahni@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit f25ebe52b9b84ece9b3c5ae30b83eaaef52ec55b) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 24 January 2024, 12:48:59 UTC
0956db6 [SPARK-46590][SQL][FOLLOWUP] Update CoalesceShufflePartitions comments ### What changes were proposed in this pull request? After #44661 ,In addition to Union, children of CartesianProduct, BroadcastHashJoin and BroadcastNestedLoopJoin can also be coalesced independently, update comments. ### Why are the changes needed? Improve the readability and maintainability. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44854 from zml1206/SPARK-46590-FOLLOWUP. Authored-by: zml1206 <zhuml1206@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fe4f8eac3efee42d53f7f24763a59c82ef03d343) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 January 2024, 07:07:13 UTC
be7f1e9 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command ### What changes were proposed in this pull request? This PR aims to fix `spark-daemon.sh` usage by adding `decommission` command. ### Why are the changes needed? This was missed when SPARK-20628 added `decommission` command at Apache Spark 3.1.0. The command has been used like the following. https://github.com/apache/spark/blob/0356ac00947282b1a0885ad7eaae1e25e43671fe/sbin/decommission-worker.sh#L41 ### Does this PR introduce _any_ user-facing change? No, this is only a change on usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44856 from dongjoon-hyun/SPARK-46817. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 00a92d328576c39b04cfd0fdd8a30c5a9bc37e36) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 January 2024, 00:38:53 UTC
05f7aa5 [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints This PR modifies `LogicalRDD` to filter out all subqueries from its `constraints`. Fixes a correctness bug. Spark can produce incorrect results when using a checkpointed `DataFrame` with a filter containing a scalar subquery. This subquery is included in the constraints of the resulting `LogicalRDD`, and may then be propagated as a filter when joining with the checkpointed `DataFrame`. This causes the subquery to be evaluated twice: once during checkpointing and once while evaluating the query. These two subquery evaluations may return different results, e.g. when the subquery contains a limit with an underspecified sort order. No Added a test to `DataFrameSuite`. No Closes #44833 from tomvanbussel/SPARK-46794. Authored-by: Tom van Bussel <tom.vanbussel@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d26e871136e0c6e1f84a25978319733a516b7b2e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2024, 16:46:35 UTC
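A minimal sketch of the filtering described above, assuming Catalyst's `SubqueryExpression.hasSubquery` helper (not the exact `LogicalRDD` code):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, SubqueryExpression}

// Drop every constraint containing a subquery so it is never propagated as a
// filter (and re-evaluated) when joining with the checkpointed DataFrame.
def constraintsWithoutSubqueries(constraints: Seq[Expression]): Seq[Expression] =
  constraints.filterNot(SubqueryExpression.hasSubquery)
```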
20da7c0 Revert "[SPARK-46417][SQL] Do not fail when calling hive.getTable and throwException is false" This reverts commit 8abf9583ac2303765255299af3e843d8248f313f. 23 January 2024, 09:35:59 UTC
a559ff7 [SPARK-46763] Fix assertion failure in ReplaceDeduplicateWithAggregate for duplicate attributes ### What changes were proposed in this pull request? - Updated the `ReplaceDeduplicateWithAggregate` implementation to reuse aliases generated for an attribute. - Added a unit test to ensure scenarios with duplicate non-grouping keys are correctly optimized. ### Why are the changes needed? - `ReplaceDeduplicateWithAggregate` replaces `Deduplicate` with an `Aggregate` operator with grouping expressions for the deduplication keys and aggregate expressions for the non-grouping keys (to preserve the output schema and keep the non-grouping columns). - For non-grouping key `a#X`, it generates an aggregate expression of the form `first(a#X, false) AS a#Y` - In case the non-grouping keys have a repeated attribute (with the same name and exprId), the existing logic would generate two different aggregate expressions both having two different exprId. - This then leads to duplicate rewrite attributes error (in `transformUpWithNewOutput`) when transforming the remaining tree. - For example, for the query ``` Project [a#0, b#1] +- Deduplicate [b#1] +- Project [a#0, a#0, b#1] +- LocalRelation <empty>, [a#0, b#1] ``` the existing logic would transform it to ``` Project [a#3, b#1] +- Aggregate [b#1], [first(a#0, false) AS a#3, first(a#0, false) AS a#5, b#1] +- Project [a#0, a#0, b#1] +- LocalRelation <empty>, [a#0, b#1] ``` with the aggregate mapping having two entries `a#0 -> a#3, a#0 -> a#5`. The correct transformation would be ``` Project [a#3, b#1] +- Aggregate [b#1], [first(a#0, false) AS a#3, first(a#0, false) AS a#3, b#1] +- Project [a#0, a#0, b#1] +- LocalRelation <empty>, [a#0, b#1] ``` with the aggregate mapping having only one entry `a#0 -> a#3`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test in `ResolveOperatorSuite`. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44835 from nikhilsheoran-db/SPARK-46763. Authored-by: Nikhil Sheoran <125331115+nikhilsheoran-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 715b43428913d6a631f8f9043baac751b88cb5d4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 January 2024, 09:15:48 UTC
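A minimal sketch of the alias reuse, assuming Catalyst's `Alias`/`First` helpers (not the exact rule code): caching the generated alias per `exprId` ensures a duplicated attribute maps to a single new `exprId`.

```scala
import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, ExprId, NamedExpression}
import org.apache.spark.sql.catalyst.expressions.aggregate.First

def dedupAggregateExprs(
    keepExprs: Seq[Attribute],
    groupingKeys: Set[Attribute]): Seq[NamedExpression] = {
  // One alias per source exprId, so a#0 appearing twice yields a#3 twice, not a#3 and a#5.
  val aliases = mutable.Map.empty[ExprId, Alias]
  keepExprs.map {
    case attr if !groupingKeys.contains(attr) =>
      aliases.getOrElseUpdate(
        attr.exprId,
        Alias(new First(attr).toAggregateExpression(), attr.name)())
    case attr => attr
  }
}
```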
6403a84 [SPARK-46590][SQL] Fix coalesce failure with unexpected partition indices ### What changes were proposed in this pull request? As outlined in JIRA issue [SPARK-46590](https://issues.apache.org/jira/browse/SPARK-46590), when a broadcast join follows a union within the same stage, the [collectCoalesceGroups](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala#L144) method will indiscriminately traverse all sub-plans, aggregating them into a single group, which is not expected. ### Why are the changes needed? In fact, for a broadcast join, we do not expect the broadcast exchange to have the same partition number. Therefore, we can safely disregard the broadcast join and continue traversing the subplan. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Newly added unit test. It would fail without this PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44661 from jackylee-ch/fix_coalesce_problem_with_broadcastjoin_and_union. Authored-by: jackylee-ch <lijunqing@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit de0c4ad3947f1188f02aaa612df8278d1c7c3ce5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 January 2024, 08:10:51 UTC
a6869b2 [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script ### What changes were proposed in this pull request? This PR proposes to avoid treating the exit code 5 as a test failure in Python testing script. ### Why are the changes needed? ``` ... ======================================================================== Running PySpark tests ======================================================================== Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.12'] Will test the following Python modules: ['pyspark-core', 'pyspark-streaming', 'pyspark-errors'] python3.12 python_implementation is CPython python3.12 version is: Python 3.12.1 Starting test(python3.12): pyspark.streaming.tests.test_context (temp output: /__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log) Finished test(python3.12): pyspark.streaming.tests.test_context (12s) Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp output: /__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log) Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s) Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp output: /__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log) test_kinesis_stream (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." test_kinesis_stream_api (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." ---------------------------------------------------------------------- Ran 0 tests in 0.000s NO TESTS RAN (skipped=2) Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; see logs. Error: running /__w/spark/spark/python/run-tests --modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 --python-executables=python3.12 ; received return code 255 Error: Process completed with exit code 19. ``` Scheduled job fails because of exit 5, see https://github.com/pytest-dev/pytest/issues/2393. This isn't a test failure. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No, Closes #44841 from HyukjinKwon/SPARK-46801. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 52b62921cadb05da5b1183f979edf7d608256f2e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2024, 01:07:11 UTC
68d9f35 [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent When canonicalizing `output` in `InMemoryRelation`, use `output` itself as the schema for determining the ordinals, rather than `cachedPlan.output`. `InMemoryRelation.output` and `InMemoryRelation.cachedPlan.output` don't necessarily use the same exprIds. E.g.: ``` +- InMemoryRelation [c1#340, c2#341], StorageLevel(disk, memory, deserialized, 1 replicas) +- LocalTableScan [c1#254, c2#255] ``` Because of this, `InMemoryRelation` will sometimes fail to fully canonicalize, resulting in cases where two semantically equivalent `InMemoryRelation` instances appear to be semantically nonequivalent. Example: ``` create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (3, 7), (4, 5); cache table data; select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all; ``` If plan change validation checking is on (i.e., `spark.sql.planChangeValidation=true`), the failure is: ``` [PLAN_VALIDATION_FAILED_RULE_EXECUTOR] The input plan of org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 is invalid: Aggregate: Aggregate [c1#78, scalar-subquery#77 [c1#78]], [c1#78, scalar-subquery#77 [c1#78] AS scalarsubquery(c1)#90L, count(c2#79) AS count(c2)#83L] ... is not a valid aggregate expression: [SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION] The correlated scalar subquery '"scalarsubquery(c1)"' is neither present in GROUP BY, nor in an aggregate function. ``` If plan change validation checking is off, the failure is more mysterious: ``` [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 ``` If you remove the cache command, the query succeeds. The above failures happen because the subquery in the aggregate expressions and the subquery in the grouping expressions seem semantically nonequivalent since the `InMemoryRelation` in one of the subquery plans failed to completely canonicalize. In `CacheManager#useCachedData`, two lookups for the same cached plan may create `InMemoryRelation` instances that have different exprIds in `output`. That's because the plan fragments used as lookup keys may have been deduplicated by `DeduplicateRelations`, and thus have different exprIds in their respective output schemas. When `CacheManager#useCachedData` creates an `InMemoryRelation` instance, it borrows the output schema of the plan fragment used as the lookup key. The failure to fully canonicalize has other effects. For example, this query fails to reuse the exchange: ``` create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (2, 4), (3, 7), (7, 22); cache table data; set spark.sql.autoBroadcastJoinThreshold=-1; set spark.sql.adaptive.enabled=false; select * from data l join data r on l.c1 = r.c1; ``` No. New tests. No. Closes #44806 from bersprockets/plan_validation_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b80e8cb4552268b771fc099457b9186807081c4a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 January 2024, 19:09:44 UTC
04d3249 [SPARK-46789][K8S][TESTS] Add `VolumeSuite` to K8s IT ### What changes were proposed in this pull request? This PR aims to add `VolumeSuite` to K8s IT. ### Why are the changes needed? To improve the test coverage on various K8s volume use cases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44827 from dongjoon-hyun/SPARK-46789. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Kent Yao <yao@apache.org> 22 January 2024, 09:37:40 UTC
687c297 [SPARK-44495][INFRA][K8S][3.5] Use the latest minikube in K8s IT ### What changes were proposed in this pull request? This is a backport of #44813. This PR aims to recover the GitHub Action K8s IT to use the latest Minikube and to make sure that the Apache Spark K8s module is tested with all Minikube versions without any issues. **BEFORE** - Minikube: v1.30.1 - K8s: v1.26.3 **AFTER** - Minikube: v1.32.0 - K8s: v1.28.3 ### Why are the changes needed? - Previously, it was pinned due to the failure. - After this PR, we will always track the latest Minikube and K8s versions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44819 from dongjoon-hyun/SPARK-44495-3.5. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 January 2024, 02:53:21 UTC
b98cf95 [SPARK-46786][K8S] Fix `MountVolumesFeatureStep` to use `ReadWriteOncePod` instead of `ReadWriteOnce` This PR aims to fix a duplicated volume mounting bug by using `ReadWriteOncePod` instead of `ReadWriteOnce`. This bug fix is based on the stable K8s feature which is available since v1.22. - [KEP-2485: ReadWriteOncePod PersistentVolume AccessMode](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2485-read-write-once-pod-pv-access-mode/README.md) - https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes - v1.22 Alpha - v1.27 Beta - v1.29 Stable For the record, the minimum K8s version of GKE/EKS/AKE is **v1.24** as of today and the latest v1.29 is supported like the following. - [2024.01 (GKE Regular Channel)](https://cloud.google.com/kubernetes-engine/docs/release-schedule) - [2024.02 (AKE GA)](https://learn.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar) This is a bug fix. Pass the CIs with the existing PV-related tests. No. Closes #44817 from dongjoon-hyun/SPARK-46786. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 45ec74415a4a89851968941b80c490e37ee88daf) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 January 2024, 01:51:49 UTC
c19bf01 [SPARK-46769][SQL] Refine timestamp related schema inference This is a refinement of https://github.com/apache/spark/pull/43243. This PR enforces one thing: we only infer TIMESTAMP NTZ type using the NTZ parser, and only infer LTZ type using the LTZ parser. This consistency is important to avoid nondeterministic behaviors. Avoid non-deterministic behaviors. After https://github.com/apache/spark/pull/43243, we can still have inconsistency if the LEGACY mode is enabled. Yes, for the legacy parser: now it's more likely to infer string type instead of inferring timestamp type "by luck". Existing tests. No. Closes https://github.com/apache/spark/pull/44789 Closes #44800 from cloud-fan/infer. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e4e40762ca41931646b8f201028b1f2298252d96) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 January 2024, 12:58:52 UTC
fa6bf22 [SPARK-46676][SS] dropDuplicatesWithinWatermark should not fail on canonicalization of the plan ### What changes were proposed in this pull request? This PR proposes to fix the bug on canonicalizing the plan which contains the physical node of dropDuplicatesWithinWatermark (`StreamingDeduplicateWithinWatermarkExec`). ### Why are the changes needed? Canonicalization of the plan will replace the expressions (including attributes) to remove out cosmetic, including name, "and metadata", which denotes the event time column marker. StreamingDeduplicateWithinWatermarkExec assumes that the input attributes of child node contain the event time column, and it is determined at the initialization of the node instance. Once canonicalization is being triggered, child node will lose the notion of event time column from its attributes, and copy of StreamingDeduplicateWithinWatermarkExec will be performed which instantiating a new node of `StreamingDeduplicateWithinWatermarkExec` with new child node, which no longer has an event time column, hence instantiation will fail. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44688 from HeartSaVioR/SPARK-46676. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit c1ed3e60e67f53bb323e2b9fa47789fcde70a75a) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 19 January 2024, 02:39:07 UTC
b27d169 [MINOR][DOCS] Add zstandard as a candidate to fix the desc of spark.sql.avro.compression.codec ### What changes were proposed in this pull request? Add zstandard as a candidate to fix the desc of spark.sql.avro.compression.codec ### Why are the changes needed? docfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? doc build ### Was this patch authored or co-authored using generative AI tooling? no Closes #44783 from yaooqinn/avro_minor. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c040824fd75c955dbc8e5712bc473a0ddb9a8c0f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 January 2024, 16:21:32 UTC
d083da7 [SPARK-46663][PYTHON][3.5] Disable memory profiler for pandas UDFs with iterators ### What changes were proposed in this pull request? When using pandas UDFs with iterators, if users enable the profiling spark conf, a warning indicating non-support should be raised, and profiling should be disabled. However, currently, after raising the not-supported warning, the memory profiler is still being enabled. The PR proposed to fix that. ### Why are the changes needed? A bug fix to eliminate misleading behavior. ### Does this PR introduce _any_ user-facing change? The noticeable changes will affect only those using the PySpark shell. This is because, in the PySpark shell, the memory profiler will raise an error, which in turn blocks the execution of the UDF. ### How was this patch tested? Manual test. ### Was this patch authored or co-authored using generative AI tooling? Setup: ```py $ ./bin/pyspark --conf spark.python.profile=true >>> from typing import Iterator >>> from pyspark.sql.functions import * >>> import pandas as pd >>> pandas_udf("long") ... def plus_one(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]: ... for s in iterator: ... yield s + 1 ... >>> df = spark.createDataFrame(pd.DataFrame([1, 2, 3], columns=["v"])) ``` Before: ``` >>> df.select(plus_one(df.v)).show() UserWarning: Profiling UDFs with iterators input/output is not supported. Traceback (most recent call last): ... OSError: could not get source code ``` After: ``` >>> df.select(plus_one(df.v)).show() /Users/xinrong.meng/spark/python/pyspark/sql/udf.py:417: UserWarning: Profiling UDFs with iterators input/output is not supported. +-----------+ |plus_one(v)| +-----------+ | 2| | 3| | 4| +-----------+ ``` Closes #44760 from xinrong-meng/PR_TOOL_PICK_PR_44668_BRANCH-3.5. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 January 2024, 00:01:05 UTC
5a0bc96 [SPARK-46732][CONNECT][3.5] Make Subquery/Broadcast thread work with Connect's artifact management ### What changes were proposed in this pull request? Similar with SPARK-44794, propagate JobArtifactState to broadcast/subquery thread. This is an example: ```scala val add1 = udf((i: Long) => i + 1) val tableA = spark.range(2).alias("a") val tableB = broadcast(spark.range(2).select(add1(col("id")).alias("id"))).alias("b") tableA.join(tableB). where(col("a.id")===col("b.id")). select(col("a.id").alias("a_id"), col("b.id").alias("b_id")). collect(). mkString("[", ", ", "]") ``` Before this pr, this example will throw exception `ClassNotFoundException`. Subquery and Broadcast execution use a separate ThreadPool which don't have the `JobArtifactState`. ### Why are the changes needed? Fix bug. Make Subquery/Broadcast thread work with Connect's artifact management. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add a new test to `ReplE2ESuite` ### Was this patch authored or co-authored using generative AI tooling? No Closes #44763 from xieshuaihu/SPARK-46732backport. Authored-by: xieshuaihu <xieshuaihu@agora.io> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 January 2024, 06:24:58 UTC
10d5d89 [SPARK-46715][INFRA][3.5] Pin `sphinxcontrib-*` ### What changes were proposed in this pull request? backport https://github.com/apache/spark/pull/44727 to branch-3.5 ### Why are the changes needed? to restore doc build ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44744 from zhengruifeng/infra_pin_shinxcontrib. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 16 January 2024, 06:16:03 UTC
679e4b6 [SPARK-46700][CORE] Count the last spilling for the shuffle disk spilling bytes metric ### What changes were proposed in this pull request? This PR fixes a long-standing bug in ShuffleExternalSorter about the "spilled disk bytes" metrics. When we close the sorter, we will spill the remaining data in the buffer, with a flag `isLastFile = true`. This flag means the spilling will not increase the "spilled disk bytes" metrics. This makes sense if the sorter has never spilled before, then the final spill file will be used as the final shuffle output file, and we should keep the "spilled disk bytes" metrics as 0. However, if spilling did happen before, then we simply miscount the final spill file for the "spilled disk bytes" metrics today. This PR fixes this issue, by setting that flag when closing the sorter only if this is the first spilling. ### Why are the changes needed? make metrics accurate ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #44709 from cloud-fan/shuffle. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 4ea374257c1fdb276abcd6b953ba042593e4d5a3) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 January 2024, 21:49:48 UTC
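An illustrative sketch of the metrics logic (hypothetical names, not `ShuffleExternalSorter` itself): only a closing spill with no prior spills may be left out of the metric, since only then is it reused as the shuffle output file.

```scala
final class SpillMetricsSketch {
  private var spillCount = 0
  private var diskBytesSpilled = 0L

  def recordSpill(bytesWritten: Long, closingSorter: Boolean): Unit = {
    // Only the very first spill performed while closing doubles as the final
    // shuffle output file and is excluded from the "spilled disk bytes" metric.
    val isLastFile = closingSorter && spillCount == 0
    if (!isLastFile) diskBytesSpilled += bytesWritten
    spillCount += 1
  }

  def metricValue: Long = diskBytesSpilled
}
```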
2fe253e [SPARK-46704][CORE][UI] Fix `MasterPage` to sort `Running Drivers` table by `Duration` column correctly ### What changes were proposed in this pull request? This PR aims to fix `MasterPage` to sort `Running Drivers` table by `Duration` column correctly. ### Why are the changes needed? Since Apache Spark 3.0.0, `MasterPage` shows `Duration` column of `Running Drivers`. **BEFORE** <img width="111" src="https://github.com/apache/spark/assets/9700541/50276e34-01be-4474-803d-79066e06cb2c"> **AFTER** <img width="111" src="https://github.com/apache/spark/assets/9700541/a427b2e6-eab0-4d73-9114-1d8ff9d052c2"> ### Does this PR introduce _any_ user-facing change? Yes, this is a bug fix of UI. ### How was this patch tested? Manual. Run a Spark standalone cluster. ``` $ SPARK_MASTER_OPTS="-Dspark.master.rest.enabled=true -Dspark.deploy.maxDrivers=2" sbin/start-master.sh $ sbin/start-worker.sh spark://$(hostname):7077 ``` Submit multiple jobs via REST API. ``` $ curl -s -k -XPOST http://localhost:6066/v1/submissions/create \ --header "Content-Type:application/json;charset=UTF-8" \ --data '{ "appResource": "", "sparkProperties": { "spark.master": "spark://localhost:7077", "spark.app.name": "Test 1", "spark.submit.deployMode": "cluster", "spark.jars": "/Users/dongjoon/APACHE/spark-merge/examples/target/scala-2.13/jars/spark-examples_2.13-4.0.0-SNAPSHOT.jar" }, "clientSparkVersion": "", "mainClass": "org.apache.spark.examples.SparkPi", "environmentVariables": {}, "action": "CreateSubmissionRequest", "appArgs": [ "10000" ] }' ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44711 from dongjoon-hyun/SPARK-46704. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 25c680cfd4dc63aeb9d16a673ee431c57188b80d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 January 2024, 20:54:37 UTC
d422aed [SPARK-46684][PYTHON][CONNECT][3.5] Fix CoGroup.applyInPandas/Arrow to pass arguments properly ### What changes were proposed in this pull request? This is a backport of apache/spark#44695. Fix `CoGroup.applyInPandas/Arrow` to pass arguments properly. ### Why are the changes needed? In Spark Connect, `CoGroup.applyInPandas/Arrow` doesn't take arguments properly, so the arguments of the UDF can be broken: ```py >>> import pandas as pd >>> >>> df1 = spark.createDataFrame( ... [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", "v1", "v2") ... ) >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3")) >>> >>> def summarize(left, right): ... return pd.DataFrame( ... { ... "left_rows": [len(left)], ... "left_columns": [len(left.columns)], ... "right_rows": [len(right)], ... "right_columns": [len(right.columns)], ... } ... ) ... >>> df = ( ... df1.groupby("id") ... .cogroup(df2.groupby("id")) ... .applyInPandas( ... summarize, ... schema="left_rows long, left_columns long, right_rows long, right_columns long", ... ) ... ) >>> >>> df.show() +---------+------------+----------+-------------+ |left_rows|left_columns|right_rows|right_columns| +---------+------------+----------+-------------+ | 2| 1| 2| 1| | 2| 1| 1| 1| +---------+------------+----------+-------------+ ``` The result should be: ```py +---------+------------+----------+-------------+ |left_rows|left_columns|right_rows|right_columns| +---------+------------+----------+-------------+ |        2|           3|         2|            2| |        2|           3|         1|            2| +---------+------------+----------+-------------+ ``` ### Does this PR introduce _any_ user-facing change? This is a bug fix. ### How was this patch tested? Added the related tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44696 from ueshin/issues/SPARK-46684/3.5/cogroup. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 January 2024, 04:19:12 UTC
8a0f642 [SPARK-46640][SQL] Fix RemoveRedundantAlias by excluding subquery attributes - In `RemoveRedundantAliases`, we have an `excluded` AttributeSet argument denoting the references for which we should not remove aliases. For a query with subquery expressions, adding the attributes referenced by the subquery to the `excluded` set prevents rewrites that might remove presumably redundant aliases. (Changes in `RemoveRedundantAliases`) - Added a configuration flag to disable this fix, if not needed. - Added a unit test with a Filter exists subquery expression to show how the alias would have been removed. - `RemoveRedundantAliases` does not take into account the outer attributes of a `SubqueryExpression` when considering redundant aliases, potentially removing them if it thinks they are redundant. - This can cause scenarios where a subquery expression has conditions like `a#x = a#x`, i.e. both the attribute names and the expression ID(s) are the same. This can then lead to a conflicting expression ID(s) error. - For example, in the query below, `RemoveRedundantAliases` would remove the alias `a#0 AS a#1` and replace `a#1` with `a#0` in the Filter exists subquery expression, which would create an issue if the subquery expression had an attribute with reference `a#0` (possible because different scan relation instances can have the same attribute ID(s); Ref: #40662). ``` Filter exists [a#1 && (a#1 = b#2)] : +- LocalRelation <empty>, [b#2] +- Project [a#0 AS a#1] +- LocalRelation <empty>, [a#0] ``` becomes ``` Filter exists [a#0 && (a#0 = b#2)] : +- LocalRelation <empty>, [b#2] +- LocalRelation <empty>, [a#0] ``` - The changes are needed to fix this bug. No - Added a unit test with a Filter exists subquery expression to show how the alias would have been removed. No Closes #44645 from nikhilsheoran-db/SPARK-46640. Authored-by: Nikhil Sheoran <125331115+nikhilsheoran-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit bbeb8d7417bafa09ad5202347175a47b3217e27f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 January 2024, 02:22:39 UTC
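A toy model may help make the rule change concrete. The Scala sketch below does not use real Catalyst classes; `Attr`, `Alias`, and `removeRedundant` are simplified stand-ins for the optimizer machinery. It shows the effect of feeding subquery-referenced attributes into the `excluded` set: the alias `a#0 AS a#1` is kept once `a#1` is known to be referenced from the EXISTS subquery, so the conflicting-expression-ID situation shown in the plans above cannot arise.

```scala
// Simplified stand-in for RemoveRedundantAliases; real Catalyst types are not used.
object RemoveRedundantAliasSketch {

  final case class Attr(name: String, id: Int) { override def toString = s"$name#$id" }
  final case class Alias(child: Attr, output: Attr)

  // `excluded` holds attributes whose aliases must never be removed; the fix
  // additionally feeds attributes referenced by subqueries into this set.
  def removeRedundant(aliases: Seq[Alias], excluded: Set[Attr]): Seq[Alias] =
    aliases.filterNot { a =>
      val looksRedundant = a.child.name == a.output.name   // same name, new exprId
      looksRedundant && !excluded.contains(a.output)
    }

  def main(args: Array[String]): Unit = {
    val aliases = Seq(Alias(Attr("a", 0), Attr("a", 1)))       // a#0 AS a#1

    // Attribute referenced by the Filter EXISTS subquery (the outer reference a#1).
    val subqueryRefs = Set(Attr("a", 1))

    println(removeRedundant(aliases, excluded = Set.empty))    // List() -> alias dropped
    println(removeRedundant(aliases, excluded = subqueryRefs)) // alias kept
  }
}
```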