https://github.com/apache/spark

fd86f85 Preparing Spark release v3.5.1-rc2 15 February 2024, 10:56:47 UTC
9b4778f [SPARK-46906][INFRA][3.5] Bump python libraries (pandas, pyarrow) in Docker image for release script ### What changes were proposed in this pull request? This PR proposes to bump python libraries (pandas to 2.0.3, pyarrow to 4.0.0) in Docker image for release script. ### Why are the changes needed? Without this change, release script (do-release-docker.sh) fails on docs phase. Changing this fixes the release process against branch-3.5. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Confirmed with dry-run of release script against branch-3.5. `dev/create-release/do-release-docker.sh -d ~/spark-release -n -s docs` ``` Generating HTML files for SQL API documentation. INFO - Cleaning site directory INFO - Building documentation to directory: /opt/spark-rm/output/spark/sql/site INFO - Documentation built in 0.85 seconds /opt/spark-rm/output/spark/sql Moving back into docs dir. Making directory api/sql cp -r ../sql/site/. api/sql Source: /opt/spark-rm/output/spark/docs Destination: /opt/spark-rm/output/spark/docs/_site Incremental build: disabled. Enable with --incremental Generating... done in 7.469 seconds. Auto-regeneration: disabled. Use --watch to enable. ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45111 from HeartSaVioR/SPARK-46906-3.5. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 15 February 2024, 05:49:09 UTC
ea6b257 Revert "[SPARK-45396][PYTHON] Add doc entry for `pyspark.ml.connect` module, and adds `Evaluator` to `__all__` at `ml.connect`" This reverts commit 35b627a934b1ab28be7d6ba88fdad63dc129525a. 15 February 2024, 00:20:57 UTC
a8c62d3 [SPARK-47023][BUILD] Upgrade `aircompressor` to 1.26 This PR aims to upgrade `aircompressor` to 1.26. `aircompressor` v1.26 has the following bug fixes. - [Fix out of bounds read/write in Snappy decompressor](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) - [Fix ZstdOutputStream corruption on double close](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) No. Pass the CIs. No. Closes #45084 from dongjoon-hyun/SPARK-47023. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 13 February 2024, 18:00:56 UTC
d27bdbe Preparing development version 3.5.2-SNAPSHOT 12 February 2024, 03:39:15 UTC
08fe67b Preparing Spark release v3.5.1-rc1 12 February 2024, 03:39:11 UTC
4e4d9f0 [SPARK-47022][CONNECT][TESTS][3.5] Fix `connect/client/jvm` to have explicit `commons-(io|lang3)` test dependency ### What changes were proposed in this pull request? This PR aims to add `commons-io` and `commons-lang3` test dependency to `connector/client/jvm` module. ### Why are the changes needed? `connector/client/jvm` module uses `commons-io` and `commons-lang3` during testing like the following. https://github.com/apache/spark/blob/9700da7bfc1abb607f3cb916b96724d0fb8f2eba/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala#L26-L28 Currently, it's broken due to that. - https://github.com/apache/spark/actions?query=branch%3Abranch-3.5 ### Does this PR introduce _any_ user-facing change? No, this is a test-dependency only change. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45081 from dongjoon-hyun/SPARK-47022. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 February 2024, 22:38:48 UTC
9700da7 [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency ### What changes were proposed in this pull request? This PR aims to fix `kvstore` module by adding explicit `commons-lang3` test dependency and excluding `htmlunit-driver` from `org.scalatestplus` to use Apache Spark's explicit declaration. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/pom.xml#L711-L716 ### Why are the changes needed? Since Spark 3.3.0 (SPARK-37282), `kvstore` uses `commons-lang3` test dependency like the following, but we didn't declare it explicitly so far. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java#L33 https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBIteratorSuite.java#L23 Previously, it was provided by some unused `htmlunit-driver`'s transitive dependency accidentally. This causes a weird situation which `kvstore` module starts to fail to compile when we upgrade `htmlunit-driver`. We need to fix this first. ``` $ mvn dependency:tree -pl common/kvstore ... [INFO] | \- org.seleniumhq.selenium:htmlunit-driver:jar:4.12.0:test ... [INFO] | +- org.apache.commons:commons-lang3:jar:3.14.0:test ``` ### Does this PR introduce _any_ user-facing change? No. This is only a test dependency fix. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45080 from dongjoon-hyun/SPARK-47021. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a926c7912a78f1a2fb71c5ffd21b5c2f723a0128) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 February 2024, 18:38:11 UTC
7658f77 [SPARK-39910][SQL] Delegate path qualification to filesystem during DataSource file path globbing In the current version, `DataSource#checkAndGlobPathIfNecessary` qualifies paths via `Path#makeQualified` while `PartitioningAwareFileIndex` qualifies via `FileSystem#makeQualified`. Most `FileSystem` implementations simply delegate to `Path#makeQualified`, but others, like `HarFileSystem`, contain fs-specific logic that can produce a different result. Such inconsistencies can lead to a situation where Spark can't find partitions of the source file, because the qualified paths built by `Path` and `FileSystem` differ. Therefore, for uniformity, the `FileSystem` path qualification should be used in `DataSource#checkAndGlobPathIfNecessary`. Allow users to read files from hadoop archives (.har) using DataFrameReader API No New tests were added in `DataSourceSuite` and `DataFrameReaderWriterSuite` No Closes #43463 from tigrulya-exe/SPARK-39910-use-fs-path-qualification. Authored-by: Tigran Manasyan <t.manasyan@arenadata.io> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b7edc5fac0f4e479cbc869d54a9490c553ba2613) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 February 2024, 12:30:05 UTC
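A minimal Scala sketch of the two qualification paths this commit reconciles; the file path and Hadoop configuration below are illustrative, not taken from the patch itself.

```scala
import org.apache.hadoop.fs.Path

// Illustrative only: compare Path-based vs FileSystem-based qualification.
val path = new Path("/tmp/data")             // hypothetical; a .har archive would use a har:// URI
val conf = spark.sparkContext.hadoopConfiguration
val fs = path.getFileSystem(conf)

val qualifiedByPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
val qualifiedByFs   = fs.makeQualified(path) // what DataSource#checkAndGlobPathIfNecessary now relies on

// For plain filesystems the two agree; for HarFileSystem they can differ,
// which is why the FileSystem variant is the consistent choice.
```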
77f8b38 [SPARK-46400][CORE][SQL][3.5] When there are corrupted files in the local maven repo, skip this cache and try again ### What changes were proposed in this pull request? The pr aims to - fix a potential bug (i.e. https://github.com/apache/spark/pull/44208) and enhance the user experience. - make the code more compliant with standards Backport of the above to branch-3.5. Master branch pr: https://github.com/apache/spark/pull/44343 ### Why are the changes needed? We use the local maven repo as the first-level cache in ivy. The original intention was to reduce the time required to parse and obtain the artifact, but when there are corrupted files in the local maven repo, the above mechanism is interrupted directly and the error message is very unfriendly, which greatly confuses the user. In keeping with the original intention, we should skip this cache directly in such situations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45017 from panbingkun/branch-3.5_SPARK-46400. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 08 February 2024, 06:41:51 UTC
1b36e3c [SPARK-46170][SQL][3.5] Support inject adaptive query post planner strategy rules in SparkSessionExtensions This pr is a backport of https://github.com/apache/spark/pull/44074 for branch-3.5 since 3.5 is an LTS version ### What changes were proposed in this pull request? This pr adds a new extension entrance `queryPostPlannerStrategyRules` in `SparkSessionExtensions`. It is applied between plannerStrategy and queryStagePrepRules in AQE, so it can get the whole plan before exchanges are injected. ### Why are the changes needed? 3.5 is an LTS version ### Does this PR introduce _any_ user-facing change? no, only for developers ### How was this patch tested? add test ### Was this patch authored or co-authored using generative AI tooling? no Closes #44074 from ulysses-you/post-planner. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Closes #45037 from ulysses-you/SPARK-46170. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> 06 February 2024, 06:54:45 UTC
3f426b5 [MINOR][DOCS] Add Missing space in `docs/configuration.md` ### What changes were proposed in this pull request? Add a missing space in documentation file `docs/configuration.md`, which might lead to some misunderstanding to newcomers. ### Why are the changes needed? To eliminate ambiguity in sentences. ### Does this PR introduce _any_ user-facing change? Yes, it changes the documentation. ### How was this patch tested? I built the docs locally and double-checked the spelling. ### Was this patch authored or co-authored using generative AI tooling? No. It is just a little typo lol. Closes #45021 from KKtheGhost/fix/spell-configuration. Authored-by: KKtheGhost <dev@amd.sh> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit da73c123e648460dc7df04e9eda9d90445dfedff) Signed-off-by: Kent Yao <yao@apache.org> 05 February 2024, 01:49:42 UTC
4b33d28 [SPARK-46953][TEST] Wrap withTable for a test in ResolveDefaultColumnsSuite ### What changes were proposed in this pull request? The table is not cleaned up after this test; test retries or upcoming tests that reuse 't' as the table name will fail with a TableAlreadyExistsException (TAEE). ### Why are the changes needed? fix tests as a follow-up of SPARK-43742 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? this test itself ### Was this patch authored or co-authored using generative AI tooling? no Closes #44993 from yaooqinn/SPARK-43742. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit a3432428e760fc16610cfe3380d3bdea7654f75d) Signed-off-by: Kent Yao <yao@apache.org> 02 February 2024, 07:17:58 UTC
547edb2 [SPARK-46945][K8S][3.5] Add `spark.kubernetes.legacy.useReadWriteOnceAccessMode` for old K8s clusters ### What changes were proposed in this pull request? This PR aims to introduce a legacy configuration for K8s PVC access mode to mitigate migrations issues in old K8s clusters. This is a kind of backport of - #44985 ### Why are the changes needed? - The default value of `spark.kubernetes.legacy.useReadWriteOnceAccessMode` is `true` in branch-3.5. - To help the users who cannot upgrade their K8s versions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44986 from dongjoon-hyun/SPARK-46945-3.5. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Kent Yao <yao@apache.org> 02 February 2024, 02:44:40 UTC
d3b4537 [SPARK-46747][SQL] Avoid scan in getTableExistsQuery for JDBC Dialects ### What changes were proposed in this pull request? [SPARK-46747](https://issues.apache.org/jira/browse/SPARK-46747) reported an issue that Postgres instances suffered from too many shared locks, which was caused by Spark's table-existence check query. In this PR, we replaced `"SELECT 1 FROM $table LIMIT 1"` with `"SELECT 1 FROM $table WHERE 1=0"` to prevent data from being scanned. ### Why are the changes needed? overhead reduction for JDBC datasources ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing JDBC v1/v2 datasource tests. ### Was this patch authored or co-authored using generative AI tooling? no Closes #44948 from yaooqinn/SPARK-46747. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 031df8fa62666f14f54cf0a792f7fa2acc43afee) Signed-off-by: Kent Yao <yao@apache.org> 31 January 2024, 01:45:19 UTC
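A hedged Scala sketch of the same probe-query idea, written here as a custom dialect override rather than the built-in dialect code; the URL prefix and object name are illustrative.

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Illustration of the change: probe table existence with an always-false predicate
// so the database plans the query without scanning (or locking) any rows.
object NoScanProbeDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")
  override def getTableExistsQuery(table: String): String =
    s"SELECT 1 FROM $table WHERE 1=0"   // previously: SELECT 1 FROM $table LIMIT 1
}

JdbcDialects.registerDialect(NoScanProbeDialect)
```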
343ae82 [SPARK-46893][UI] Remove inline scripts from UI descriptions ### What changes were proposed in this pull request? This PR prevents malicious users from injecting inline scripts via job and stage descriptions. Spark's Web UI [already checks the security of job and stage descriptions](https://github.com/apache/spark/blob/a368280708dd3c6eb90bd3b09a36a68bdd096222/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L528-L545) before rendering them as HTML (or treating them as plain text). The UI already disallows `<script>` tags but doesn't protect against attributes with inline scripts like `onclick` or `onmouseover`. ### Why are the changes needed? On multi-user clusters, bad users can inject scripts into their job and stage descriptions. The UI already finds that [worth protecting against](https://github.com/apache/spark/blob/a368280708dd3c6eb90bd3b09a36a68bdd096222/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L533-L535). So this is extending that protection to scripts in attributes. ### Does this PR introduce _any_ user-facing change? Yes if users relied on inline scripts or attributes in their job or stage descriptions. ### How was this patch tested? Added tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44933 from rshkv/wr/spark-46893. Authored-by: Willi Raschkowski <wraschkowski@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit abd9d27e87b915612e2a89e0d2527a04c7b984e0) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 January 2024, 06:43:29 UTC
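A small sketch of where such descriptions come from; after this fix a string like the one below is rendered as plain text in the UI rather than as live HTML (the description content is, of course, illustrative).

```scala
// User-supplied job description containing an inline event handler.
spark.sparkContext.setJobDescription("""<span onmouseover="alert('xss')">nightly ETL</span>""")
spark.range(1000).count()   // the job/stage pages now show the tag as text, not an active handler
```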
accfb39 [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled This PR aims to fix `Master` to reject `/workers/kill/` request if `spark.decommission.enabled` is `false` in order to fix the dangling worker issue. Currently, `spark.decommission.enabled` is `false` by default. So, when a user asks to decommission, only Master marked it `DECOMMISSIONED` while the worker is alive. ``` $ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1 ``` **Master UI** ![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be) **Worker Log** ``` 24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled. ``` To be consistent with the existing `Worker` behavior which ignores the request. https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868 No, this is a bug fix. Pass the CI with the newly added test case. No. Closes #44915 from dongjoon-hyun/SPARK-46888. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 20b593811dc02c96c71978851e051d32bf8c3496) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 January 2024, 04:25:45 UTC
a2854ba [SPARK-46862][SQL][FOLLOWUP] Fix column pruning without schema enforcing in V1 CSV datasource ### What changes were proposed in this pull request? In the PR, I propose to invoke `CSVOptions.isColumnPruningEnabled` introduced by https://github.com/apache/spark/pull/44872 while matching the CSV header to a schema in the V1 CSV datasource. ### Why are the changes needed? To fix the failure when column pruning happens and a schema is not enforced: ```scala scala> spark.read. | option("multiLine", true). | option("header", true). | option("escape", "\""). | option("enforceSchema", false). | csv("/Users/maximgekk/tmp/es-939111-data.csv"). | count() 24/01/27 12:43:14 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema: Header length: 4, schema size: 0 CSV file: file:///Users/maximgekk/tmp/es-939111-data.csv ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *CSVv1Suite" $ build/sbt "test:testOnly *CSVv2Suite" $ build/sbt "test:testOnly *CSVLegacyTimeParserSuite" $ build/sbt "testOnly *.CsvFunctionsSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44910 from MaxGekk/check-header-column-pruning. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit bc51c9fea3645c6ae1d9e1e83b0f94f8b849be20) Signed-off-by: Max Gekk <max.gekk@gmail.com> 27 January 2024, 16:23:18 UTC
cf4e867 [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode ### What changes were proposed in this pull request? In the PR, I propose to disable the column pruning feature in the CSV datasource for the `multiLine` mode. ### Why are the changes needed? To workaround the issue in the `uniVocity` parser used by the CSV datasource: https://github.com/uniVocity/univocity-parsers/issues/529 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *CSVv1Suite" $ build/sbt "test:testOnly *CSVv2Suite" $ build/sbt "test:testOnly *CSVLegacyTimeParserSuite" $ build/sbt "testOnly *.CsvFunctionsSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44872 from MaxGekk/csv-disable-column-pruning. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 829e742df8251c6f5e965cb08ad454ac3ee1a389) Signed-off-by: Max Gekk <max.gekk@gmail.com> 26 January 2024, 08:02:30 UTC
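For context, a hedged sketch of the knob these two CSV fixes interact with; the conf name is assumed to be the long-standing CSV pruning switch, and the file path is hypothetical.

```scala
// Column pruning for the CSV parser is normally controlled by this SQL conf;
// with multiLine enabled, the fix above disables pruning internally regardless.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

val df = spark.read
  .option("multiLine", "true")
  .option("header", "true")
  .option("enforceSchema", "false")
  .csv("/tmp/es-data.csv")   // hypothetical path
df.count()
```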
e5a654e [SPARK-46855][INFRA][3.5] Add `sketch` to the dependencies of the `catalyst` in `module.py` ### What changes were proposed in this pull request? This pr add `sketch` to the dependencies of the `catalyst` module in `module.py` due to `sketch` is direct dependency of `catalyst` module. ### Why are the changes needed? Ensure that when modifying the `sketch` module, both `catalyst` and cascading modules will trigger tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44893 from LuciferYang/SPARK-46855-35. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 January 2024, 06:35:38 UTC
125b2f8 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler * The DAGScheduler could currently run into a deadlock with another thread if both access the partitions of the same RDD at the same time. * To make progress in getCacheLocs, we require both exclusive access to the RDD partitions and the location cache. We first lock on the location cache, and then on the RDD. * When accessing partitions of an RDD, the RDD first acquires exclusive access on the partitions, and then might acquire exclusive access on the location cache. * If thread 1 is able to acquire access on the RDD, while thread 2 holds the access to the location cache, we can run into a deadlock situation. * To fix this, acquire locks in the same order. Change the DAGScheduler to first acquire the lock on the RDD, and then the lock on the location cache. * This is a deadlock you can run into, which can prevent any progress on the cluster. * No * Unit test that reproduces the issue. No Closes #44882 from fred-db/fix-deadlock. Authored-by: fred-db <fredrik.klauss@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 617014cc92d933c70c9865a578fceb265883badd) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 January 2024, 16:35:29 UTC
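A generic Scala illustration of the lock-ordering principle behind this fix; it is not the actual DAGScheduler code, just the pattern of always taking the two monitors in the same order.

```scala
// Two stand-in locks: one for the RDD's partitions, one for the location cache.
val rddLock = new Object
val cacheLock = new Object

// Every code path that needs both must acquire them in the same order (RDD first);
// otherwise one thread holding rddLock and another holding cacheLock can deadlock.
def withBothLocks[T](body: => T): T =
  rddLock.synchronized {
    cacheLock.synchronized {
      body
    }
  }
```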
ef33b9c [SPARK-46796][SS] Ensure the correct remote files (mentioned in metadata.zip) are used on RocksDB version load This PR ensures that RocksDB loads do not run into SST file Version ID mismatch issue. RocksDB has added validation to ensure exact same SST file is used during database load from snapshot. Current streaming state suffers from certain edge cases where this condition is violated resulting in state load failure. The changes introduced are: 1. Ensure that the local SST file is exactly the same DFS file (as per mapping in metadata.zip). We keep track of the DFS file path for a local SST file, and re download the SST file in case DFS file has a different UUID in metadata zip. 2. Reset lastSnapshotVersion in RocksDB when Rocks DB is loaded. Changelog checkpoint relies on this version for future snapshots. Currently, if a older version is reloaded we were not uploading snapshots as lastSnapshotVersion was pointing to a higher snapshot of a cleanup database. We need to ensure that the correct SST files are used on executor during RocksDB load as per mapping in metadata.zip. With current implementation, its possible that the executor uses a SST file (with a different UUID) from a older version which is not the exact file mapped in the metadata.zip. This can cause version Id mismatch errors while loading RocksDB leading to streaming query failures. See https://issues.apache.org/jira/browse/SPARK-46796 for failure scenarios. No Added exhaustive unit testcases covering the scenarios. No Closes #44837 from sahnib/SPARK-46796. Authored-by: Bhuwan Sahni <bhuwan.sahni@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit f25ebe52b9b84ece9b3c5ae30b83eaaef52ec55b) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 24 January 2024, 12:48:59 UTC
0956db6 [SPARK-46590][SQL][FOLLOWUP] Update CoalesceShufflePartitions comments ### What changes were proposed in this pull request? After #44661 ,In addition to Union, children of CartesianProduct, BroadcastHashJoin and BroadcastNestedLoopJoin can also be coalesced independently, update comments. ### Why are the changes needed? Improve the readability and maintainability. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44854 from zml1206/SPARK-46590-FOLLOWUP. Authored-by: zml1206 <zhuml1206@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fe4f8eac3efee42d53f7f24763a59c82ef03d343) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 January 2024, 07:07:13 UTC
be7f1e9 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command ### What changes were proposed in this pull request? This PR aims to fix `spark-daemon.sh` usage by adding `decommission` command. ### Why are the changes needed? This was missed when SPARK-20628 added `decommission` command at Apache Spark 3.1.0. The command has been used like the following. https://github.com/apache/spark/blob/0356ac00947282b1a0885ad7eaae1e25e43671fe/sbin/decommission-worker.sh#L41 ### Does this PR introduce _any_ user-facing change? No, this is only a change on usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44856 from dongjoon-hyun/SPARK-46817. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 00a92d328576c39b04cfd0fdd8a30c5a9bc37e36) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 January 2024, 00:38:53 UTC
05f7aa5 [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints This PR modifies `LogicalRDD` to filter out all subqueries from its `constraints`. Fixes a correctness bug. Spark can produce incorrect results when using a checkpointed `DataFrame` with a filter containing a scalar subquery. This subquery is included in the constraints of the resulting `LogicalRDD`, and may then be propagated as a filter when joining with the checkpointed `DataFrame`. This causes the subquery to be evaluated twice: once during checkpointing and once while evaluating the query. These two subquery evaluations may return different results, e.g. when the subquery contains a limit with an underspecified sort order. No Added a test to `DataFrameSuite`. No Closes #44833 from tomvanbussel/SPARK-46794. Authored-by: Tom van Bussel <tom.vanbussel@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d26e871136e0c6e1f84a25978319733a516b7b2e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2024, 16:46:35 UTC
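A rough sketch of the query shape described above, with made-up data and checkpoint location; the scalar subquery in the filter is what used to leak into the `LogicalRDD` constraints.

```scala
// Hypothetical reproduction shape (not the exact test from the patch).
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
spark.range(0, 100).selectExpr("id", "rand() AS v").createOrReplaceTempView("base")

val filtered = spark.sql(
  "SELECT * FROM base WHERE v >= (SELECT avg(v) FROM base)")   // scalar subquery in a filter
val checkpointed = filtered.checkpoint()                        // subquery evaluated here

// Joining with the checkpointed DataFrame could previously re-propagate the subquery
// as a constraint/filter and evaluate it a second time, possibly with a different result.
checkpointed.join(spark.table("base"), "id").count()
```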
20da7c0 Revert "[SPARK-46417][SQL] Do not fail when calling hive.getTable and throwException is false" This reverts commit 8abf9583ac2303765255299af3e843d8248f313f. 23 January 2024, 09:35:59 UTC
a559ff7 [SPARK-46763] Fix assertion failure in ReplaceDeduplicateWithAggregate for duplicate attributes ### What changes were proposed in this pull request? - Updated the `ReplaceDeduplicateWithAggregate` implementation to reuse aliases generated for an attribute. - Added a unit test to ensure scenarios with duplicate non-grouping keys are correctly optimized. ### Why are the changes needed? - `ReplaceDeduplicateWithAggregate` replaces `Deduplicate` with an `Aggregate` operator with grouping expressions for the deduplication keys and aggregate expressions for the non-grouping keys (to preserve the output schema and keep the non-grouping columns). - For non-grouping key `a#X`, it generates an aggregate expression of the form `first(a#X, false) AS a#Y` - In case the non-grouping keys have a repeated attribute (with the same name and exprId), the existing logic would generate two different aggregate expressions both having two different exprId. - This then leads to duplicate rewrite attributes error (in `transformUpWithNewOutput`) when transforming the remaining tree. - For example, for the query ``` Project [a#0, b#1] +- Deduplicate [b#1] +- Project [a#0, a#0, b#1] +- LocalRelation <empty>, [a#0, b#1] ``` the existing logic would transform it to ``` Project [a#3, b#1] +- Aggregate [b#1], [first(a#0, false) AS a#3, first(a#0, false) AS a#5, b#1] +- Project [a#0, a#0, b#1] +- LocalRelation <empty>, [a#0, b#1] ``` with the aggregate mapping having two entries `a#0 -> a#3, a#0 -> a#5`. The correct transformation would be ``` Project [a#3, b#1] +- Aggregate [b#1], [first(a#0, false) AS a#3, first(a#0, false) AS a#3, b#1] +- Project [a#0, a#0, b#1] +- LocalRelation <empty>, [a#0, b#1] ``` with the aggregate mapping having only one entry `a#0 -> a#3`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test in `ResolveOperatorSuite`. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44835 from nikhilsheoran-db/SPARK-46763. Authored-by: Nikhil Sheoran <125331115+nikhilsheoran-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 715b43428913d6a631f8f9043baac751b88cb5d4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 January 2024, 09:15:48 UTC
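A quick, hedged way to produce the plan shape in this description from the DataFrame API: selecting the same column twice yields duplicate non-grouping attributes under the Deduplicate node.

```scala
import spark.implicits._

val df = Seq((1, 2), (1, 3), (3, 7)).toDF("a", "b")
// Duplicate attribute `a` (same exprId) among the non-grouping keys of Deduplicate [b].
df.select($"a", $"a", $"b")
  .dropDuplicates("b")
  .explain(true)   // the optimized plan shows the Aggregate rewrite discussed above
```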
6403a84 [SPARK-46590][SQL] Fix coalesce failed with unexpected partition indices ### What changes were proposed in this pull request? As outlined in JIRA issue [SPARK-46590](https://issues.apache.org/jira/browse/SPARK-46590), when a broadcast join follows a union within the same stage, the [collectCoalesceGroups](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala#L144) method will indiscriminately traverse all sub-plans, aggregating them into a single group, which is not expected. ### Why are the changes needed? In fact, for a broadcast join, we do not expect the broadcast exchange to have the same partition number. Therefore, we can safely disregard the broadcast join and continue traversing the subplan. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Newly added unit test. It would fail without this pr. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44661 from jackylee-ch/fix_coalesce_problem_with_broadcastjoin_and_union. Authored-by: jackylee-ch <lijunqing@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit de0c4ad3947f1188f02aaa612df8278d1c7c3ce5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 January 2024, 08:10:51 UTC
a6869b2 [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script ### What changes were proposed in this pull request? This PR proposes to avoid treating the exit code 5 as a test failure in Python testing script. ### Why are the changes needed? ``` ... ======================================================================== Running PySpark tests ======================================================================== Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.12'] Will test the following Python modules: ['pyspark-core', 'pyspark-streaming', 'pyspark-errors'] python3.12 python_implementation is CPython python3.12 version is: Python 3.12.1 Starting test(python3.12): pyspark.streaming.tests.test_context (temp output: /__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log) Finished test(python3.12): pyspark.streaming.tests.test_context (12s) Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp output: /__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log) Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s) Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp output: /__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log) test_kinesis_stream (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." test_kinesis_stream_api (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." ---------------------------------------------------------------------- Ran 0 tests in 0.000s NO TESTS RAN (skipped=2) Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; see logs. Error: running /__w/spark/spark/python/run-tests --modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 --python-executables=python3.12 ; received return code 255 Error: Process completed with exit code 19. ``` Scheduled job fails because of exit 5, see https://github.com/pytest-dev/pytest/issues/2393. This isn't a test failure. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No, Closes #44841 from HyukjinKwon/SPARK-46801. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 52b62921cadb05da5b1183f979edf7d608256f2e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2024, 01:07:11 UTC
68d9f35 [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent When canonicalizing `output` in `InMemoryRelation`, use `output` itself as the schema for determining the ordinals, rather than `cachedPlan.output`. `InMemoryRelation.output` and `InMemoryRelation.cachedPlan.output` don't necessarily use the same exprIds. E.g.: ``` +- InMemoryRelation [c1#340, c2#341], StorageLevel(disk, memory, deserialized, 1 replicas) +- LocalTableScan [c1#254, c2#255] ``` Because of this, `InMemoryRelation` will sometimes fail to fully canonicalize, resulting in cases where two semantically equivalent `InMemoryRelation` instances appear to be semantically nonequivalent. Example: ``` create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (3, 7), (4, 5); cache table data; select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all; ``` If plan change validation checking is on (i.e., `spark.sql.planChangeValidation=true`), the failure is: ``` [PLAN_VALIDATION_FAILED_RULE_EXECUTOR] The input plan of org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 is invalid: Aggregate: Aggregate [c1#78, scalar-subquery#77 [c1#78]], [c1#78, scalar-subquery#77 [c1#78] AS scalarsubquery(c1)#90L, count(c2#79) AS count(c2)#83L] ... is not a valid aggregate expression: [SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION] The correlated scalar subquery '"scalarsubquery(c1)"' is neither present in GROUP BY, nor in an aggregate function. ``` If plan change validation checking is off, the failure is more mysterious: ``` [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 ``` If you remove the cache command, the query succeeds. The above failures happen because the subquery in the aggregate expressions and the subquery in the grouping expressions seem semantically nonequivalent since the `InMemoryRelation` in one of the subquery plans failed to completely canonicalize. In `CacheManager#useCachedData`, two lookups for the same cached plan may create `InMemoryRelation` instances that have different exprIds in `output`. That's because the plan fragments used as lookup keys may have been deduplicated by `DeduplicateRelations`, and thus have different exprIds in their respective output schemas. When `CacheManager#useCachedData` creates an `InMemoryRelation` instance, it borrows the output schema of the plan fragment used as the lookup key. The failure to fully canonicalize has other effects. For example, this query fails to reuse the exchange: ``` create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (2, 4), (3, 7), (7, 22); cache table data; set spark.sql.autoBroadcastJoinThreshold=-1; set spark.sql.adaptive.enabled=false; select * from data l join data r on l.c1 = r.c1; ``` No. New tests. No. Closes #44806 from bersprockets/plan_validation_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b80e8cb4552268b771fc099457b9186807081c4a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 January 2024, 19:09:44 UTC
04d3249 [SPARK-46789][K8S][TESTS] Add `VolumeSuite` to K8s IT ### What changes were proposed in this pull request? This PR aims to add `VolumeSuite` to K8s IT. ### Why are the changes needed? To improve the test coverage on various K8s volume use cases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44827 from dongjoon-hyun/SPARK-46789. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Kent Yao <yao@apache.org> 22 January 2024, 09:37:40 UTC
687c297 [SPARK-44495][INFRA][K8S][3.5] Use the latest minikube in K8s IT ### What changes were proposed in this pull request? This is a backport of #44813 . This PR aims to recover GitHub Action K8s IT to use the latest Minikube and to make it sure that Apache Spark K8s module are tested with all Minikubes without any issues. **BEFORE** - Minikube: v1.30.1 - K8s: v1.26.3 **AFTER** - Minikube: v1.32.0 - K8s: v1.28.3 ### Why are the changes needed? - Previously, it was pinned due to the failure. - After this PR, we will track the latest Minikube and K8s version always. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44819 from dongjoon-hyun/SPARK-44495-3.5. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 January 2024, 02:53:21 UTC
b98cf95 [SPARK-46786][K8S] Fix `MountVolumesFeatureStep` to use `ReadWriteOncePod` instead of `ReadWriteOnce` This PR aims to fix a duplicated volume mounting bug by using `ReadWriteOncePod` instead of `ReadWriteOnce`. This bug fix is based on the stable K8s feature which is available since v1.22. - [KEP-2485: ReadWriteOncePod PersistentVolume AccessMode](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2485-read-write-once-pod-pv-access-mode/README.md) - https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes - v1.22 Alpha - v1.27 Beta - v1.29 Stable For the record, the minimum K8s version of GKE/EKS/AKE is **v1.24** as of today and the latest v1.29 is supported like the following. - [2024.01 (GKE Regular Channel)](https://cloud.google.com/kubernetes-engine/docs/release-schedule) - [2024.02 (AKE GA)](https://learn.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar) This is a bug fix. Pass the CIs with the existing PV-related tests. No. Closes #44817 from dongjoon-hyun/SPARK-46786. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 45ec74415a4a89851968941b80c490e37ee88daf) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 January 2024, 01:51:49 UTC
c19bf01 [SPARK-46769][SQL] Refine timestamp related schema inference This is a refinement of https://github.com/apache/spark/pull/43243 . This PR enforces one thing: we only infer TIMESTAMP NTZ type using NTZ parser, and only infer LTZ type using LTZ parser. This consistency is important to avoid nondeterministic behaviors. Avoid non-deterministic behaviors. After https://github.com/apache/spark/pull/43243 , we can still have inconsistency if the LEGACY mode is enabled. Yes for the legacy parser. Now it's more likely to infer string type instead of inferring timestamp type "by luck" existing tests no Closes https://github.com/apache/spark/pull/44789 Closes #44800 from cloud-fan/infer. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e4e40762ca41931646b8f201028b1f2298252d96) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 January 2024, 12:58:52 UTC
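A hedged sketch of how the inference behaviour surfaces to users: the inferred timestamp flavour follows the session's `spark.sql.timestampType`, and after this change only the matching parser is used for inference. The file path and data are hypothetical.

```scala
// With NTZ configured, timestamp-looking CSV columns are inferred as TIMESTAMP_NTZ
// (or fall back to string), never "by luck" as TIMESTAMP_LTZ.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/events.csv")     // hypothetical file
  .printSchema()
```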
fa6bf22 [SPARK-46676][SS] dropDuplicatesWithinWatermark should not fail on canonicalization of the plan ### What changes were proposed in this pull request? This PR proposes to fix the bug on canonicalizing the plan which contains the physical node of dropDuplicatesWithinWatermark (`StreamingDeduplicateWithinWatermarkExec`). ### Why are the changes needed? Canonicalization of the plan will replace the expressions (including attributes) to remove out cosmetic, including name, "and metadata", which denotes the event time column marker. StreamingDeduplicateWithinWatermarkExec assumes that the input attributes of child node contain the event time column, and it is determined at the initialization of the node instance. Once canonicalization is being triggered, child node will lose the notion of event time column from its attributes, and copy of StreamingDeduplicateWithinWatermarkExec will be performed which instantiating a new node of `StreamingDeduplicateWithinWatermarkExec` with new child node, which no longer has an event time column, hence instantiation will fail. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44688 from HeartSaVioR/SPARK-46676. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit c1ed3e60e67f53bb323e2b9fa47789fcde70a75a) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 19 January 2024, 02:39:07 UTC
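For reference, a minimal sketch of the operator whose physical node is fixed here (the user-facing API itself is unchanged); the rate source is used just to get an event-time column.

```scala
// Streaming dedup bounded by the watermark; the fix only affects how the physical
// node is canonicalized internally, not this API.
val deduped = spark.readStream
  .format("rate")
  .load()                                    // columns: timestamp, value
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicatesWithinWatermark("value")
```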
b27d169 [MINOR][DOCS] Add zstandard as a candidate to fix the desc of spark.sql.avro.compression.codec ### What changes were proposed in this pull request? Add zstandard as a candidate to fix the desc of spark.sql.avro.compression.codec ### Why are the changes needed? docfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? doc build ### Was this patch authored or co-authored using generative AI tooling? no Closes #44783 from yaooqinn/avro_minor. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c040824fd75c955dbc8e5712bc473a0ddb9a8c0f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 January 2024, 16:21:32 UTC
d083da7 [SPARK-46663][PYTHON][3.5] Disable memory profiler for pandas UDFs with iterators ### What changes were proposed in this pull request? When using pandas UDFs with iterators, if users enable the profiling spark conf, a warning indicating non-support should be raised, and profiling should be disabled. However, currently, after raising the not-supported warning, the memory profiler is still being enabled. The PR proposed to fix that. ### Why are the changes needed? A bug fix to eliminate misleading behavior. ### Does this PR introduce _any_ user-facing change? The noticeable changes will affect only those using the PySpark shell. This is because, in the PySpark shell, the memory profiler will raise an error, which in turn blocks the execution of the UDF. ### How was this patch tested? Manual test. ### Was this patch authored or co-authored using generative AI tooling? Setup: ```py $ ./bin/pyspark --conf spark.python.profile=true >>> from typing import Iterator >>> from pyspark.sql.functions import * >>> import pandas as pd >>> pandas_udf("long") ... def plus_one(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]: ... for s in iterator: ... yield s + 1 ... >>> df = spark.createDataFrame(pd.DataFrame([1, 2, 3], columns=["v"])) ``` Before: ``` >>> df.select(plus_one(df.v)).show() UserWarning: Profiling UDFs with iterators input/output is not supported. Traceback (most recent call last): ... OSError: could not get source code ``` After: ``` >>> df.select(plus_one(df.v)).show() /Users/xinrong.meng/spark/python/pyspark/sql/udf.py:417: UserWarning: Profiling UDFs with iterators input/output is not supported. +-----------+ |plus_one(v)| +-----------+ | 2| | 3| | 4| +-----------+ ``` Closes #44760 from xinrong-meng/PR_TOOL_PICK_PR_44668_BRANCH-3.5. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 January 2024, 00:01:05 UTC
5a0bc96 [SPARK-46732][CONNECT][3.5] Make Subquery/Broadcast thread work with Connect's artifact management ### What changes were proposed in this pull request? Similar with SPARK-44794, propagate JobArtifactState to broadcast/subquery thread. This is an example: ```scala val add1 = udf((i: Long) => i + 1) val tableA = spark.range(2).alias("a") val tableB = broadcast(spark.range(2).select(add1(col("id")).alias("id"))).alias("b") tableA.join(tableB). where(col("a.id")===col("b.id")). select(col("a.id").alias("a_id"), col("b.id").alias("b_id")). collect(). mkString("[", ", ", "]") ``` Before this pr, this example will throw exception `ClassNotFoundException`. Subquery and Broadcast execution use a separate ThreadPool which don't have the `JobArtifactState`. ### Why are the changes needed? Fix bug. Make Subquery/Broadcast thread work with Connect's artifact management. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add a new test to `ReplE2ESuite` ### Was this patch authored or co-authored using generative AI tooling? No Closes #44763 from xieshuaihu/SPARK-46732backport. Authored-by: xieshuaihu <xieshuaihu@agora.io> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 January 2024, 06:24:58 UTC
10d5d89 [SPARK-46715][INFRA][3.5] Pin `sphinxcontrib-*` ### What changes were proposed in this pull request? backport https://github.com/apache/spark/pull/44727 to branch-3.5 ### Why are the changes needed? to restore doc build ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44744 from zhengruifeng/infra_pin_shinxcontrib. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 16 January 2024, 06:16:03 UTC
679e4b6 [SPARK-46700][CORE] Count the last spilling for the shuffle disk spilling bytes metric ### What changes were proposed in this pull request? This PR fixes a long-standing bug in ShuffleExternalSorter about the "spilled disk bytes" metrics. When we close the sorter, we will spill the remaining data in the buffer, with a flag `isLastFile = true`. This flag means the spilling will not increase the "spilled disk bytes" metrics. This makes sense if the sorter has never spilled before, then the final spill file will be used as the final shuffle output file, and we should keep the "spilled disk bytes" metrics as 0. However, if spilling did happen before, then we simply miscount the final spill file for the "spilled disk bytes" metrics today. This PR fixes this issue, by setting that flag when closing the sorter only if this is the first spilling. ### Why are the changes needed? make metrics accurate ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #44709 from cloud-fan/shuffle. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 4ea374257c1fdb276abcd6b953ba042593e4d5a3) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 January 2024, 21:49:48 UTC
2fe253e [SPARK-46704][CORE][UI] Fix `MasterPage` to sort `Running Drivers` table by `Duration` column correctly ### What changes were proposed in this pull request? This PR aims to fix `MasterPage` to sort `Running Drivers` table by `Duration` column correctly. ### Why are the changes needed? Since Apache Spark 3.0.0, `MasterPage` shows `Duration` column of `Running Drivers`. **BEFORE** <img width="111" src="https://github.com/apache/spark/assets/9700541/50276e34-01be-4474-803d-79066e06cb2c"> **AFTER** <img width="111" src="https://github.com/apache/spark/assets/9700541/a427b2e6-eab0-4d73-9114-1d8ff9d052c2"> ### Does this PR introduce _any_ user-facing change? Yes, this is a bug fix of UI. ### How was this patch tested? Manual. Run a Spark standalone cluster. ``` $ SPARK_MASTER_OPTS="-Dspark.master.rest.enabled=true -Dspark.deploy.maxDrivers=2" sbin/start-master.sh $ sbin/start-worker.sh spark://$(hostname):7077 ``` Submit multiple jobs via REST API. ``` $ curl -s -k -XPOST http://localhost:6066/v1/submissions/create \ --header "Content-Type:application/json;charset=UTF-8" \ --data '{ "appResource": "", "sparkProperties": { "spark.master": "spark://localhost:7077", "spark.app.name": "Test 1", "spark.submit.deployMode": "cluster", "spark.jars": "/Users/dongjoon/APACHE/spark-merge/examples/target/scala-2.13/jars/spark-examples_2.13-4.0.0-SNAPSHOT.jar" }, "clientSparkVersion": "", "mainClass": "org.apache.spark.examples.SparkPi", "environmentVariables": {}, "action": "CreateSubmissionRequest", "appArgs": [ "10000" ] }' ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44711 from dongjoon-hyun/SPARK-46704. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 25c680cfd4dc63aeb9d16a673ee431c57188b80d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 January 2024, 20:54:37 UTC
d422aed [SPARK-46684][PYTHON][CONNECT][3.5] Fix CoGroup.applyInPandas/Arrow to pass arguments properly ### What changes were proposed in this pull request? This is a backport of apache/spark#44695. Fix `CoGroup.applyInPandas/Arrow` to pass arguments properly. ### Why are the changes needed? In Spark Connect, `CoGroup.applyInPandas/Arrow` doesn't take arguments properly, so the arguments of the UDF can be broken: ```py >>> import pandas as pd >>> >>> df1 = spark.createDataFrame( ... [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", "v1", "v2") ... ) >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3")) >>> >>> def summarize(left, right): ... return pd.DataFrame( ... { ... "left_rows": [len(left)], ... "left_columns": [len(left.columns)], ... "right_rows": [len(right)], ... "right_columns": [len(right.columns)], ... } ... ) ... >>> df = ( ... df1.groupby("id") ... .cogroup(df2.groupby("id")) ... .applyInPandas( ... summarize, ... schema="left_rows long, left_columns long, right_rows long, right_columns long", ... ) ... ) >>> >>> df.show() +---------+------------+----------+-------------+ |left_rows|left_columns|right_rows|right_columns| +---------+------------+----------+-------------+ | 2| 1| 2| 1| | 2| 1| 1| 1| +---------+------------+----------+-------------+ ``` The result should be: ```py +---------+------------+----------+-------------+ |left_rows|left_columns|right_rows|right_columns| +---------+------------+----------+-------------+ |        2|           3|         2|            2| |        2|           3|         1|            2| +---------+------------+----------+-------------+ ``` ### Does this PR introduce _any_ user-facing change? This is a bug fix. ### How was this patch tested? Added the related tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44696 from ueshin/issues/SPARK-46684/3.5/cogroup. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 January 2024, 04:19:12 UTC
8a0f642 [SPARK-46640][SQL] Fix RemoveRedundantAlias by excluding subquery attributes - In `RemoveRedundantAliases`, we have an `excluded` AttributeSet argument denoting the references for which we should not remove aliases. For a query with subquery expressions, adding the attributes references by the subquery in the `excluded` set prevents rewrites that might remove presumedly redundant aliases. (Changes in RemoveRedundantAlias) - Added a configuration flag to disable this fix, if not needed. - Added a unit test with Filter exists subquery expression to show how the alias would have been removed. - `RemoveRedundantAliases` does not take into account the outer attributes of a `SubqueryExpression` when considering redundant aliases, potentially removing them if it thinks they are redundant. - This can cause scenarios where a subquery expression has conditions like `a#x = a#x` i.e. both the attribute names and the expression ID(s) are the same. This can then lead to conflicting expression ID(s) error. - For example, in the query example below, the `RemoveRedundantAliases` would remove the alias `a#0 as a#1` and replace `a#1` with `a#0` in the Filter exists subquery expression which would create an issue if the subquery expression had an attribute with reference `a#0` (possible due to different scan relation instances possibly having the same attribute ID(s) (Ref: #40662) ``` Filter exists [a#1 && (a#1 = b#2)] : +- LocalRelation <empty>, [b#2] +- Project [a#0 AS a#1] +- LocalRelation <empty>, [a#0] ``` becomes ``` Filter exists [a#0 && (a#0 = b#2)] : +- LocalRelation <empty>, [b#2] +- LocalRelation <empty>, [a#0] ``` - The changes are needed to fix this bug. No - Added a unit test with Filter exists subquery expression to show how the alias would have been removed. No Closes #44645 from nikhilsheoran-db/SPARK-46640. Authored-by: Nikhil Sheoran <125331115+nikhilsheoran-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit bbeb8d7417bafa09ad5202347175a47b3217e27f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 January 2024, 02:22:39 UTC
a4b184d [SPARK-46547][SS] Swallow non-fatal exception in maintenance task to avoid deadlock between maintenance thread and streaming aggregation operator ### What changes were proposed in this pull request? Swallow non-fatal exception in maintenance task to avoid deadlock between maintenance thread and streaming aggregation operator ### Why are the changes needed? This change fixes a race condition that causes a deadlock between the task thread and the maintenance thread. This is primarily only possible with the streaming aggregation operator. In this case, we use 2 physical operators - `StateStoreRestoreExec` and `StateStoreSaveExec`. The first one opens the store in read-only mode and the 2nd one does the actual commit. However, the following sequence of events creates an issue 1. Task thread runs the `StateStoreRestoreExec` and gets the store instance and thereby the DB instance lock 2. Maintenance thread fails with an error for some reason 3. Maintenance thread takes the `loadedProviders` lock and tries to call `close` on all the loaded providers 4. Task thread tries to execute the StateStoreRDD for the `StateStoreSaveExec` operator and tries to acquire the `loadedProviders` lock which is held by the thread above So basically if the maintenance thread is interleaved between the `restore/save` operations, there is a deadlock condition based on the `loadedProviders` lock and the DB instance lock. The fix proposes to simply release the resources at the end of the `StateStoreRestoreExec` operator (note that `abort` for `ReadStateStore` is likely a misnomer - but we choose to follow the already provided API in this case) Relevant Logs: Link - https://github.com/anishshri-db/spark/actions/runs/7356847259/job/20027577445?pr=4 ``` 2023-12-27T09:59:02.6362466Z 09:59:02.635 WARN org.apache.spark.sql.execution.streaming.state.StateStore: Error in maintenanceThreadPool 2023-12-27T09:59:02.6365616Z java.io.FileNotFoundException: File file:/home/runner/work/spark/spark/target/tmp/spark-8ef51f34-b9de-48f2-b8df-07e14599b4c9/state/0/1 does not exist 2023-12-27T09:59:02.6367861Z at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:733) 2023-12-27T09:59:02.6369383Z at org.apache.hadoop.fs.DelegateToFileSystem.listStatus(DelegateToFileSystem.java:177) 2023-12-27T09:59:02.6370693Z at org.apache.hadoop.fs.ChecksumFs.listStatus(ChecksumFs.java:571) 2023-12-27T09:59:02.6371781Z at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1940) 2023-12-27T09:59:02.6372876Z at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1936) 2023-12-27T09:59:02.6373967Z at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) 2023-12-27T09:59:02.6375104Z at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1942) 2023-12-27T09:59:02.6376676Z 09:59:02.636 WARN org.apache.spark.sql.execution.streaming.state.StateStore: Error running maintenance thread 2023-12-27T09:59:02.6379079Z java.io.FileNotFoundException: File file:/home/runner/work/spark/spark/target/tmp/spark-8ef51f34-b9de-48f2-b8df-07e14599b4c9/state/0/1 does not exist 2023-12-27T09:59:02.6381083Z at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:733) 2023-12-27T09:59:02.6382490Z at org.apache.hadoop.fs.DelegateToFileSystem.listStatus(DelegateToFileSystem.java:177) 2023-12-27T09:59:02.6383816Z at org.apache.hadoop.fs.ChecksumFs.listStatus(ChecksumFs.java:571) 2023-12-27T09:59:02.6384875Z at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1940) 2023-12-27T09:59:02.6386294Z at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1936) 2023-12-27T09:59:02.6387439Z at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) 2023-12-27T09:59:02.6388674Z at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1942) ... 2023-12-27T10:01:02.4292831Z [info] - changing schema of state when restarting query - state format version 2 (RocksDBStateStore) *** FAILED *** (2 minutes) 2023-12-27T10:01:02.4295311Z [info] Timed out waiting for stream: The code passed to failAfter did not complete within 120 seconds. 2023-12-27T10:01:02.4297271Z [info] java.base/java.lang.Thread.getStackTrace(Thread.java:1619) 2023-12-27T10:01:02.4299084Z [info] org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:277) 2023-12-27T10:01:02.4300948Z [info] org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231) ... 2023-12-27T10:01:02.6474472Z 10:01:02.646 WARN org.apache.spark.sql.execution.streaming.state.RocksDB StateStoreId(opId=0,partId=0,name=default): Error closing RocksDB 2023-12-27T10:01:02.6482792Z org.apache.spark.SparkException: [CANNOT_LOAD_STATE_STORE.UNRELEASED_THREAD_ERROR] An error occurred during loading state. StateStoreId(opId=0,partId=0,name=default): RocksDB instance could not be acquired by [ThreadId: Some(1858)] as it was not released by [ThreadId: Some(3835), task: partition 0.0 in stage 513.0, TID 1369] after 120009 ms. 2023-12-27T10:01:02.6488483Z Thread holding the lock has trace: app//org.apache.spark.sql.execution.streaming.state.StateStore$.getStateStoreProvider(StateStore.scala:577) 2023-12-27T10:01:02.6490896Z app//org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:565) 2023-12-27T10:01:02.6493072Z app//org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:128) 2023-12-27T10:01:02.6494915Z app//org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) 2023-12-27T10:01:02.6496232Z app//org.apache.spark.rdd.RDD.iterator(RDD.scala:329) 2023-12-27T10:01:02.6497655Z app//org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 2023-12-27T10:01:02.6499153Z app//org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) 2023-12-27T10:01:02.6556758Z 10:01:02.654 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 513.0 (TID 1369) (localhost executor driver): TaskKilled (Stage cancelled: [SPARK_JOB_CANCELLED] Job 260 cancelled part of cancelled job group cf26288c-0158-48ce-8a86-00a596dd45d8 SQLSTATE: XXKDA) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests ``` [info] Run completed in 6 minutes, 20 seconds. [info] Total number of tests run: 80 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 80, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? Yes Closes #44542 from anishshri-db/task/SPARK-46547. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit f7b0b453791707b904ed0fa5508aa4b648d56bba) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 10 January 2024, 14:18:21 UTC
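A generic Scala sketch of the "swallow non-fatal exceptions in the maintenance task" pattern named in the title; it is not the actual Spark state-store code, which additionally releases the read store at the end of `StateStoreRestoreExec`.

```scala
import scala.util.control.NonFatal

// Keep the background maintenance loop alive on recoverable errors so it never dies
// (or tears down providers) while a task thread is mid-way through restore/commit.
def runMaintenance(task: () => Unit): Unit =
  try task()
  catch {
    case NonFatal(e) =>
      println(s"maintenance task failed, will retry on next interval: ${e.getMessage}")
  }
```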
d3e3084 [SPARK-46637][DOCS] Enhancing the Visual Appeal of Spark doc website ### What changes were proposed in this pull request? Enhance the Visual Appeal of Spark doc website after https://github.com/apache/spark/pull/40269: #### 1. There is a weird indent on the top right side of the first paragraph of the Spark 3.5.0 doc overview page Before this PR <img width="680" alt="image" src="https://github.com/apache/spark/assets/1097932/84d21ca1-a4d0-4bd4-8f20-a34fa5db4000"> After this PR: <img width="1035" alt="image" src="https://github.com/apache/spark/assets/1097932/4ffc0d5a-ed75-44c5-b20a-475ff401afa8"> #### 2. All the titles are too big and therefore less readable. In the website https://spark.apache.org/downloads.html, titles are h2 while in doc site https://spark.apache.org/docs/latest/ titles are h1. So we should make the font size of titles smaller. Before this PR: <img width="935" alt="image" src="https://github.com/apache/spark/assets/1097932/5bbbd9eb-432a-42c0-98be-ff00a9099cd6"> After this PR: <img width="965" alt="image" src="https://github.com/apache/spark/assets/1097932/dc94c1fb-6ac1-41a8-b4a4-19b3034125d7"> #### 3. The banner image can't be displayed correct. Even when it shows up, it will be hover by the text. To make it simple, let's not show the banner image as we did in https://spark.apache.org/docs/3.4.2/ <img width="570" alt="image" src="https://github.com/apache/spark/assets/1097932/f6d34261-a352-44e2-9633-6e96b311a0b3"> <img width="1228" alt="image" src="https://github.com/apache/spark/assets/1097932/c49ce6b6-13d9-4d8f-97a9-7ed8b037be57"> ### Why are the changes needed? Improve the Visual Appeal of Spark doc website ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually build doc and verify on local setup. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44642 from gengliangwang/enhance_doc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 10 January 2024, 01:13:51 UTC
a753239 [SPARK-46600][SQL] Move shared code between SqlConf and SqlApiConf to SqlApiConfHelper ### What changes were proposed in this pull request? This code proposes to introduce a new object named `SqlApiConfHelper` to contain shared code between `SqlApiConf` and `SqlConf`. ### Why are the changes needed? As of now, SqlConf will access some of the variables of SqlApiConf while SqlApiConf also try to initialize SqlConf upon initialization. This PR is to avoid potential circular dependency between SqlConf and SqlApiConf. The shared variables or access to the shared variables are moved to the new `SqlApiConfHelper`. So either SqlApiConf and SqlConf wants to initialize the other side, they will only initialize the same third object. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Existing UT ### Was this patch authored or co-authored using generative AI tooling? NO Closes #44602 from amaliujia/refactor_sql_api. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 03fc5e26b866491b52f89f4d24beade7d1669a37) Signed-off-by: Herman van Hovell <herman@databricks.com> 09 January 2024, 02:22:59 UTC
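A simplified sketch of the refactoring idea, with hypothetical object names: both config objects depend on a small third object, so neither one triggers the other's static initialization.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for SqlApiConfHelper: the only shared state lives here.
object ConfHelper {
  val confGetter: AtomicReference[String => Option[String]] =
    new AtomicReference[String => Option[String]]((_: String) => None)
}

// Hypothetical stand-in for SqlApiConf: reads through the helper only.
object ApiConf {
  def get(key: String): Option[String] = ConfHelper.confGetter.get()(key)
}

// Hypothetical stand-in for SQLConf: registers the real lookup with the helper,
// instead of being referenced directly by ApiConf (which would create a cycle).
object CoreConf {
  private val settings = Map("spark.sql.ansi.enabled" -> "false")
  ConfHelper.confGetter.set(key => settings.get(key))
}
```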
2b0c3e1 [SPARK-46610][SQL] Create table should throw exception when no value for a key in options ### What changes were proposed in this pull request? Before SPARK-43529, there was a check from `visitPropertyKeyValues` that throws for null values for option keys. After SPARK-43529, a new function is used to support expressions in options but the new function lose the check. This PR adds the check back. ### Why are the changes needed? Throw exception when a option value is null. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT ### Was this patch authored or co-authored using generative AI tooling? NO Closes #44615 from amaliujia/fix_create_table_options. Lead-authored-by: Rui Wang <rui.wang@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e7536f2484afce412256bf711452acde8df5a287) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 January 2024, 01:07:49 UTC
75b567d [SPARK-46628][INFRA] Use SPDX short identifier in `license` name ### What changes were proposed in this pull request? This PR aims to use SPDX short identifier as `license`'s `name` field. - https://spdx.org/licenses/Apache-2.0.html ### Why are the changes needed? SPDX short identifier is recommended as `name` field by `Apache Maven`. - https://maven.apache.org/pom.html#Licenses ASF pom file has been using it. This PR aims to match with ASF pom file. - https://github.com/apache/maven-apache-parent/pull/118 - https://github.com/apache/maven-apache-parent/blob/7888bdb8ee653ecc03b5fee136540a607193c240/pom.xml#L46 ``` <name>Apache-2.0</name> ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44631 from dongjoon-hyun/SPARK-46628. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d008f81a9d8d4b5e8e434469755405f6ae747e75) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 January 2024, 00:24:14 UTC
fe22ec7 [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column This PR fixes a long-standing bug where `OrcColumnarBatchReader` does not respect the memory mode when creating column vectors for missing columns. To not violate the memory mode requirement No new test no Closes #44598 from cloud-fan/orc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 0c1c5e93e376b97a6d2dae99e973b9385155727a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 January 2024, 20:40:11 UTC
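A hedged sketch of what respecting the memory mode means when allocating the vector for a column that is absent from the ORC file; the vector classes are real Spark classes, but the helper itself is illustrative:

```scala
import org.apache.spark.memory.MemoryMode
import org.apache.spark.sql.execution.vectorized.{OffHeapColumnVector, OnHeapColumnVector, WritableColumnVector}
import org.apache.spark.sql.types.DataType

// The missing-column vector must follow the batch's memory mode rather than
// unconditionally using the on-heap implementation.
def allocateMissingColumn(mode: MemoryMode, capacity: Int, dt: DataType): WritableColumnVector = {
  val vector = if (mode == MemoryMode.OFF_HEAP) {
    new OffHeapColumnVector(capacity, dt)
  } else {
    new OnHeapColumnVector(capacity, dt)
  }
  vector.putNulls(0, capacity) // a missing column is read as all-null
  vector
}
```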
9f095b7 [SPARK-46609][SQL] Avoid exponential explosion in PartitioningPreservingUnaryExecNode ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/37525 . When expanding the output partitioning/ordering with aliases, we have a threshold to avoid exponential explosion. However, we missed to apply this threshold in one place. This PR fixes it. ### Why are the changes needed? to avoid OOM ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #44614 from cloud-fan/oom. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f8115da1a2bb33e6344dd69cc38ca7a68c3654b1) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 January 2024, 19:23:45 UTC
1b7ee9e [SPARK-46602][SQL] Propagate `allowExisting` in view creation when the view/table does not exists ### What changes were proposed in this pull request? This PR fixes the undesired behavior that concurrent `CREATE VIEW IF NOT EXISTS` queries could throw `TABLE_OR_VIEW_ALREADY_EXISTS` exceptions. It's because the current implementation did not propagate the 'IF NOT EXISTS' when the detecting view/table does not exists. ### Why are the changes needed? Fix the above issue. ### Does this PR introduce _any_ user-facing change? Yes in the sense that if fixes an issue in concurrent case. ### How was this patch tested? Without the fix the following test failed while with this PR if passed. But following the [comment](https://github.com/apache/spark/pull/44603#discussion_r1442515458), I removed the test from this PR. ```scala test("CREATE VIEW IF NOT EXISTS never throws TABLE_OR_VIEW_ALREADY_EXISTS") { // Concurrently create a view with the same name, so that some of the queries may all // get that the view does not exist and try to create it. But with IF NOT EXISTS, the // queries should not fail. import ExecutionContext.Implicits.global val concurrency = 10 val tableName = "table_name" val viewName = "view_name" withTable(tableName) { sql(s"CREATE TABLE $tableName (id int) USING parquet") withView("view_name") { val futures = (0 to concurrency).map { _ => Future { Try { sql(s"CREATE VIEW IF NOT EXISTS $viewName AS SELECT * FROM $tableName") } } } futures.map { future => val res = ThreadUtils.awaitResult(future, 5.seconds) assert( res.isSuccess, s"Failed to create view: ${if (res.isFailure) res.failed.get.getMessage}" ) } } } } ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44603 from anchovYu/create-view-if-not-exist-fix. Authored-by: Xinyi Yu <xinyi.yu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9b3c70f6094c97ed61018d9fca8a50320574ab49) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 January 2024, 14:57:46 UTC
fb90ade [SPARK-46546][DOCS] Fix the formatting of tables in `running-on-yarn` pages ### What changes were proposed in this pull request? The pr aims to fix the formatting of tables in `running-on-yarn` pages. ### Why are the changes needed? Make the tables on the page display normally. Before: <img width="1288" alt="image" src="https://github.com/apache/spark/assets/15246973/26facec4-d805-4549-a640-120c499bd7fd"> After: <img width="1310" alt="image" src="https://github.com/apache/spark/assets/15246973/cf6c20ef-a4ce-4532-9acd-ab9cec41881a"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually check. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44540 from panbingkun/SPARK-46546. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 85b44ccef4c4aeec302c12e03833590c7d8d6b9e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 January 2024, 20:07:49 UTC
2891d92 [SPARK-46577][SQL] HiveMetastoreLazyInitializationSuite leaks hive's SessionState ### What changes were proposed in this pull request? The upcoming tests with the new hive configurations will have no effect due to the leaked SessionState. ``` 06:21:12.848 pool-1-thread-1 INFO ThriftServerWithSparkContextInHttpSuite: Trying to start HiveThriftServer2: mode=http, attempt=0 .... 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:OperationManager is inited. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:SessionManager is inited. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: CLIService is inited. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:ThriftBinaryCLIService is inited. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: HiveServer2 is inited. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:OperationManager is started. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:SessionManager is started. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: CLIService is started. 06:21:12.852 pool-1-thread-1 INFO AbstractService: Service:ThriftBinaryCLIService is started. 06:21:12.852 pool-1-thread-1 INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads 06:21:12.852 pool-1-thread-1 INFO AbstractService: Service:HiveServer2 is started. ``` As the logs above revealed, ThriftServerWithSparkContextInHttpSuite started the ThriftBinaryCLIService instead of the ThriftHttpCLIService. This is because in HiveClientImpl, the new configurations are only applied to hive conf during initializing but not for existing ones. This cause ThriftServerWithSparkContextInHttpSuite retrying or even aborting. ### Why are the changes needed? Fix flakiness in tests ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ran tests locally with the hive-thriftserver module locally, ### Was this patch authored or co-authored using generative AI tooling? no Closes #44578 from yaooqinn/SPARK-46577. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 605fecd22cc18fc9b93fb26d4aa6088f5a314f92) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 January 2024, 13:55:05 UTC
6e8dbac [SPARK-46425][INFRA] Pin the bundler version in CI Currently documentation build is broken: https://github.com/apache/spark/actions/runs/7226413850/job/19691970695 ``` ... ERROR: Error installing bundler: The last version of bundler (>= 0) to support your Ruby & RubyGems was 2.4.22. Try installing it with `gem install bundler -v 2.4.22` bundler requires Ruby version >= 3.0.0. The current ruby version is 2.7.0.0. ``` This PR uses the suggestion. To recover the CI. No, dev-only. CI in this PR verify it. No. Closes #44376 from HyukjinKwon/SPARK-46425. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit d0da1172b7d87b68a8af8464c6486aa586324241) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 January 2024, 16:36:11 UTC
f0e5fc9 [SPARK-46562][SQL] Remove retrieval of `keytabFile` from `UserGroupInformation` in `HiveAuthFactory` ### What changes were proposed in this pull request? This pr removed the retrieval of `keytabFile` from `UserGroupInformation` in `HiveAuthFactory` because `keytabFile` no longer exists in `UserGroupInformation` after Hadoop 3.0.3. Therefore, in `HiveAuthFactory`, `keytabFile` will always be null and in `HiveAuthFactory`, `keytabFile` will only be used when it is not null. For the specific changes in Hadoop, please refer to https://issues.apache.org/jira/browse/HADOOP-9747 | https://github.com/apache/hadoop/commit/59cf7588779145ad5850ad63426743dfe03d8347. ### Why are the changes needed? Clean up the invalid code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44557 from LuciferYang/remove-keytabFile. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bc7e949cf99382ecf70d5b59fca9e7e415fbbb48) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 January 2024, 16:13:38 UTC
6838f0d [SPARK-46535][SQL] Fix NPE when describe extended a column without col stats ### What changes were proposed in this pull request? ### Why are the changes needed? Currently executing DESCRIBE TABLE EXTENDED a column without col stats with v2 table will throw a null pointer exception. ```text Cannot invoke "org.apache.spark.sql.connector.read.colstats.ColumnStatistics.min()" because the return value of "scala.Option.get()" is null java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.connector.read.colstats.ColumnStatistics.min()" because the return value of "scala.Option.get()" is null at org.apache.spark.sql.execution.datasources.v2.DescribeColumnExec.run(DescribeColumnExec.scala:63) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:118) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:150) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:241) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:116) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:918) ``` This RP will fix it ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Add a new test `describe extended (formatted) a column without col stats` ### Was this patch authored or co-authored using generative AI tooling? Closes #44524 from Zouxxyy/dev/fix-stats. Lead-authored-by: zouxxyy <zouxinyu.zxy@alibaba-inc.com> Co-authored-by: Kent Yao <yao@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit af8228ce9aee99eae9d08dbdefaaad32cf5438eb) Signed-off-by: Max Gekk <max.gekk@gmail.com> 28 December 2023, 16:57:29 UTC
5d4a913 [SPARK-46514][TESTS] Fix HiveMetastoreLazyInitializationSuite ### What changes were proposed in this pull request? This PR enables the assertion in HiveMetastoreLazyInitializationSuite ### Why are the changes needed? fix the test's intention ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass HiveMetastoreLazyInitializationSuite ### Was this patch authored or co-authored using generative AI tooling? no Closes #44500 from yaooqinn/SPARK-46514. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit d0245d34c004935bb2c904bfd906836df3d574fa) Signed-off-by: Kent Yao <yao@apache.org> 28 December 2023, 02:53:52 UTC
432ab15 [SPARK-46478][SQL][3.5] Revert SPARK-43049 to use oracle varchar(255) for string ### What changes were proposed in this pull request? Revert SPARK-43049 to use Oracle Varchar (255) for string for performance consideration ### Why are the changes needed? for performance consideration ### Does this PR introduce _any_ user-facing change? yes, storing strings in Oracle table, which is defined by spark DDL with string columns. Users will get an error if string values exceed 255 ```java org.apache.spark.SparkRuntimeException: [EXCEED_LIMIT_LENGTH] Exceeds char/varchar type length limitation: 255. SQLSTATE: 54006 [info] at org.apache.spark.sql.errors.QueryExecutionErrors$.exceedMaxLimit(QueryExecutionErrors.scala:2512) ``` ### How was this patch tested? revised unit tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #44493 from yaooqinn/SPARK-46478-B. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 December 2023, 19:46:43 UTC
0948e24 [SPARK-46466][SQL][3.5] Vectorized parquet reader should never do rebase for timestamp ntz backport https://github.com/apache/spark/pull/44428 ### What changes were proposed in this pull request? This fixes a correctness bug. TIMESTAMP_NTZ is a new data type in Spark and has no legacy files that need calendar rebase. However, the vectorized parquet reader treats it the same as LTZ and may do rebase if the parquet file was written with the legacy rebase mode. This PR fixes it to never do rebase for NTZ. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? Yes, now we can correctly write and read back NTZ values even if the date is before 1582. ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? No Closes #44446 from cloud-fan/ntz2. Authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 December 2023, 15:25:12 UTC
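A small, hedged repro sketch of the correctness claim above (local session and output path are illustrative): a TIMESTAMP_NTZ value before the 1582 Gregorian cutover should roundtrip through Parquet unchanged, with no rebase applied.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("ntz-check").getOrCreate()
val path = "/tmp/ntz_rebase_check" // illustrative output location

// Write a pre-1582 NTZ value; the reader must not apply Julian/Gregorian rebase to it.
spark.sql("SELECT TIMESTAMP_NTZ'1001-01-01 00:00:00' AS ts")
  .write.mode("overwrite").parquet(path)
spark.read.parquet(path).show(truncate = false) // expected: 1001-01-01 00:00:00
```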
a001482 [SPARK-46480][CORE][SQL][3.5] Fix NPE when table cache task attempt This pr backports https://github.com/apache/spark/pull/44445 for branch-3.5 ### What changes were proposed in this pull request? This pr adds a check: we only mark the cached partition is materialized if the task is not failed and not interrupted. And adds a new method `isFailed` in `TaskContext`. ### Why are the changes needed? Before this pr, when do cache, task failure can cause NPE in other tasks ``` java.lang.NullPointerException at java.nio.ByteBuffer.wrap(ByteBuffer.java:396) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.accessors1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:155) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) ``` ### Does this PR introduce _any_ user-facing change? yes, it's a bug fix ### How was this patch tested? add test ### Was this patch authored or co-authored using generative AI tooling? no Closes #44457 from ulysses-you/fix-cache-3.5. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com> 22 December 2023, 09:28:59 UTC
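A hedged sketch of the guard added here; note that `TaskContext.isFailed` is the new method introduced by this PR, so its exact shape below is an assumption.

```scala
import org.apache.spark.TaskContext

// Only mark a cached partition as materialized when the writing task neither
// failed nor was interrupted; otherwise readers may see a half-written buffer
// and hit NPEs like the one in the description.
def markMaterializedIfTaskSucceeded(markMaterialized: () => Unit): Unit = {
  val ctx = TaskContext.get()
  if (ctx != null && !ctx.isInterrupted() && !ctx.isFailed()) {
    markMaterialized()
  }
}
```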
98042e3 [SPARK-46464][DOC] Fix the scroll issue of tables when overflow ### What changes were proposed in this pull request? https://spark.apache.org/docs/3.4.1/running-on-kubernetes.html#spark-properties https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties As listed above, the doc content in 3.5.0 cannot scroll horizontally. Users can only see the rest of its content when a table overflows if they zoom out as much as possible, resulting in hard-to-read minor characters. This PR changes the HTML body overflow-x from hidden to auto to enable the underlying table to scroll horizontally. ### Why are the changes needed? Fix documentation ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? #### Before ![image](https://github.com/apache/spark/assets/8326978/437bee91-ab0d-4616-aaaf-f99171dcf9f9) #### After ![image](https://github.com/apache/spark/assets/8326978/327ed82b-3e14-4a27-be1a-835a7b21c000) ### Was this patch authored or co-authored using generative AI tooling? no Closes #44423 from yaooqinn/SPARK-46464. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit fc7d7bce7732a2bccb3a7ccf3ed6bed4ac65f8fc) Signed-off-by: Kent Yao <yao@apache.org> 22 December 2023, 03:45:25 UTC
286c469 [SPARK-46443][SQL] Decimal precision and scale should decided by H2 dialect ### What changes were proposed in this pull request? This PR fix a but by make JDBC dialect decide the decimal precision and scale. **How to reproduce the bug?** https://github.com/apache/spark/pull/44397 proposed DS V2 push down `PERCENTILE_CONT` and `PERCENTILE_DISC`. The bug fired when pushdown the below SQL to H2 JDBC. `SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"` **The root cause** `getQueryOutputSchema` used to get the output schema of query by call `JdbcUtils.getSchema`. The query for database H2 show below. `SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"` We can get the five variables from `ResultSetMetaData`, please refer: ``` columnName = "PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SALARY NULLS FIRST)" dataType = 2 typeName = "NUMERIC" fieldSize = 100000 fieldScale = 50000 ``` Then we get the catalyst schema with `JdbcUtils.getCatalystType`, it calls `DecimalType.bounded(precision, scale)` actually. The `DecimalType.bounded(100000, 50000)` returns `DecimalType(38, 38)`. At finally, `makeGetter` throws exception. ``` Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003 at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339) ``` ### Why are the changes needed? This PR fix the bug that `JdbcUtils` can't get the correct decimal type. ### Does this PR introduce _any_ user-facing change? 'Yes'. Fix a bug. ### How was this patch tested? Manual tests in https://github.com/apache/spark/pull/44397 ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #44398 from beliefer/SPARK-46443. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a921da8509a19b2d23c30ad657725f760932236c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 December 2023, 01:55:20 UTC
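A hedged illustration of the dialect hook involved in the fix above: a JDBC dialect can override `getCatalystType` and map an out-of-range NUMERIC precision/scale to a representable decimal itself, rather than letting the generic path collapse it into `DecimalType(38, 38)`. The clamping choice and object name below are illustrative, not the actual H2 dialect change.

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

object IllustrativeH2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    // H2 reports PERCENTILE_CONT as NUMERIC(100000, 50000); clamp it to something
    // Spark can hold instead of falling through to DecimalType.bounded's (38, 38).
    if (sqlType == Types.NUMERIC && size > DecimalType.MAX_PRECISION) {
      Some(DecimalType.SYSTEM_DEFAULT)
    } else {
      None // defer to the default JDBC type mapping
    }
  }
}
```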
d7534a3 [SPARK-46380][SQL] Replace current time/date prior to evaluating inline table expressions With this PR proposal is to do inline table resolution in two phases: 1) If there are no expressions that depend on current context (e.g. expressions that depend on CURRENT_DATABASE, CURRENT_USER, CURRENT_TIME etc.) they will be evaluated as part of ResolveInlineTable rule. 2) Expressions that do depend on CURRENT_* evaluation will be kept as expressions and they evaluation will be delayed to post analysis phase. This PR aims to solve two problems with inline tables. Example1: ```sql SELECT COUNT(DISTINCT ct) FROM VALUES (CURRENT_TIMESTAMP()), (CURRENT_TIMESTAMP()), (CURRENT_TIMESTAMP()) as data(ct) ``` Prior to this change this example would return 3 (i.e. all CURRENT_TIMESTAMP expressions would return different value since they would be evaluated individually as part of inline table evaluation). After this change result is 1. Example 2: ```sql CREATE VIEW V as (SELECT * FROM VALUES(CURRENT_TIMESTAMP()) ``` In this example VIEW would be saved with literal evaluated during VIEW creation. After this change CURRENT_TIMESTAMP() will eval during VIEW execution. See section above. New test that validates this behaviour is introduced. No. Closes #44316 from dbatomic/inline_tables_curr_time_fix. Lead-authored-by: Aleksandar Tomic <aleksandar.tomic@databricks.com> Co-authored-by: Aleksandar Tomic <150942779+dbatomic@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5fe963f8560ef05925d127e82ab7ef28d6a1d7bc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 December 2023, 08:00:09 UTC
0c00c54 [SPARK-46330] Loading of Spark UI blocks for a long time when HybridStore enabled ### What changes were proposed in this pull request? Move `LoadedAppUI` invalidate operation out of `FsHistoryProvider` synchronized block. ### Why are the changes needed? When closing a HybridStore of a `LoadedAppUI` with a lot of data waiting to be written to disk, loading of other Spark UIs will be blocked for a long time. See more details at https://issues.apache.org/jira/browse/SPARK-46330 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44260 from zhouyifan279/SPARK-46330. Authored-by: zhouyifan279 <zhouyifan279@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit cf54e8f9a51bf54e8fa3e1011ac370e46134b134) Signed-off-by: Kent Yao <yao@apache.org> 20 December 2023, 08:51:12 UTC
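The general shape of that change, as a hedged sketch with toy names: detach the UI entry while holding the shared lock, but run the slow HybridStore close after releasing it so other UI loads are not serialized behind it.

```scala
import scala.collection.mutable

// Toy stand-in for the history provider's cache of loaded application UIs.
object IllustrativeHistoryProvider {
  private val loadedUis = mutable.Map.empty[String, AutoCloseable]

  def registerUi(appId: String, ui: AutoCloseable): Unit =
    synchronized { loadedUis.update(appId, ui) }

  def invalidateUi(appId: String): Unit = {
    // Hold the lock only long enough to remove the entry from the shared map...
    val detached = synchronized { loadedUis.remove(appId) }
    // ...then perform the potentially slow close (HybridStore flush) unlocked.
    detached.foreach(_.close())
  }
}
```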
cc4f578 [SPARK-46453][CONNECT] Throw exception from `internalError()` in `SessionHolder` ### What changes were proposed in this pull request? In the PR, I propose to throw `SparkException` returned by `internalError` in `SessionHolder`. ### Why are the changes needed? Without the bug fix user won't see the internal error. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/a ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44400 from MaxGekk/throw-internal-error. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit dc0bfc4c700c347f2f58625facec8c5771bde59a) Signed-off-by: Max Gekk <max.gekk@gmail.com> 19 December 2023, 09:21:35 UTC
8abf958 [SPARK-46417][SQL] Do not fail when calling hive.getTable and throwException is false ### What changes were proposed in this pull request? Users can set up their own HMS and let Spark connect to it. We have no control over it and sometimes it's not even Hive but just an HMS-API-compatible service. Spark should be more fault-tolerant when calling HMS APIs. This PR fixes an issue in `hive.getTable` with `throwException = false`, to make sure we don't throw an error when the table can't be fetched. ### Why are the changes needed? avoid query failure caused by HMS bugs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? in our production environment ### Was this patch authored or co-authored using generative AI tooling? No Closes #44364 from cloud-fan/hive. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 59488039f58b18617cd6dfd6dbe3bf014af222e7) Signed-off-by: Kent Yao <yao@apache.org> 15 December 2023, 10:55:25 UTC
908c472 [SPARK-46396][SQL] Timestamp inference should not throw exception ### What changes were proposed in this pull request? When setting `spark.sql.legacy.timeParserPolicy=LEGACY`, Spark will use the LegacyFastTimestampFormatter to infer potential timestamp columns. The inference shouldn't throw exception. However, when the input is 23012150952, there is exception: ``` For input string: "23012150952" java.lang.NumberFormatException: For input string: "23012150952" at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) at java.base/java.lang.Integer.parseInt(Integer.java:668) at java.base/java.lang.Integer.parseInt(Integer.java:786) at org.apache.commons.lang3.time.FastDateParser$NumberStrategy.parse(FastDateParser.java:304) at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:1045) at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:651) at org.apache.spark.sql.catalyst.util.LegacyFastTimestampFormatter.parseOptional(TimestampFormatter.scala:418) ``` This PR is to fix the issue. ### Why are the changes needed? Bug fix, Timestamp inference should not throw exception ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? New test case + existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #44338 from gengliangwang/fixParseOptional. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 4a79ae9d821e9b04fbe949251050c3e4819dff92) Signed-off-by: Gengliang Wang <gengliang@apache.org> 14 December 2023, 08:06:35 UTC
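A hedged sketch of the "inference never throws" contract, written against the Apache Commons formatter that appears in the stack trace; the wrapper function is illustrative, not the actual `LegacyFastTimestampFormatter.parseOptional`.

```scala
import java.util.{Date, Locale, TimeZone}
import org.apache.commons.lang3.time.FastDateFormat
import scala.util.control.NonFatal

// Schema inference probes candidate strings against a pattern; malformed input such
// as "23012150952" should simply produce None instead of a NumberFormatException.
def parseOptionalLegacy(pattern: String, candidate: String): Option[Date] = {
  val format = FastDateFormat.getInstance(pattern, TimeZone.getTimeZone("UTC"), Locale.US)
  try Option(format.parse(candidate))
  catch { case NonFatal(_) => None }
}
```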
eb1e6ad [SPARK-46388][SQL] HiveAnalysis misses the pattern guard of `query.resolved` ### What changes were proposed in this pull request? This PR adds `query.resolved` as a pattern guard when HiveAnalysis converts InsertIntoStatement to InsertIntoHiveTable. ### Why are the changes needed? Due to https://github.com/apache/spark/pull/41262/files#diff-ed19f376a63eba52eea59ca71f3355d4495fad4fad4db9a3324aade0d4986a47R1080, the `table` field is resolved regardless of the query field. Before, it never got a chance to be resolved as `HiveTableRelation` and then match any rule of HiveAnalysis. But now, it gets the chance always and results in a spark-kernel bug - `Invalid call to toAttribute on unresolved object.` ``` insert into t2 select cast(a as short) from t where b=1; Invalid call to toAttribute on unresolved object ``` ### Does this PR introduce _any_ user-facing change? no, bugfix for 3.5 and later ### How was this patch tested? added new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #44326 from yaooqinn/SPARK-46388. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit ccc436d829cd0b07088e2864cb1ecc55ab97a491) Signed-off-by: Kent Yao <yao@apache.org> 13 December 2023, 10:04:53 UTC
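A hedged, simplified skeleton of what the added guard amounts to; the real rule converts to `InsertIntoHiveTable`, which is elided here since only the guard matters.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{InsertIntoStatement, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative rule skeleton: only rewrite once BOTH the target table and the input
// query are resolved; matching on a resolved table with an unresolved query is what
// triggered "Invalid call to toAttribute on unresolved object".
object IllustrativeHiveAnalysis extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
    case i: InsertIntoStatement if i.table.resolved && i.query.resolved =>
      // ... convert to the Hive-specific insert command here ...
      i
  }
}
```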
ac031d6 [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage` ### What changes were proposed in this pull request? This PR aims to remove `kill` hyperlink from `RELAUNCHING` drivers in `MasterPage`. ### Why are the changes needed? Since Apache Spark 1.4.0 (SPARK-5495), `RELAUNCHING` drivers have `kill` hyperlinks in the `Completed Drivers` table. ![Screenshot 2023-12-11 at 1 02 29 PM](https://github.com/apache/spark/assets/9700541/38f4bf08-efb9-47e5-8a7a-f7d127429012) However, this is a bug because the driver was already terminated by definition. Newly relaunched driver has an independent ID and there is no relationship with the previously terminated ID. https://github.com/apache/spark/blob/7db85642600b1e3b39ca11e41d4e3e0bf1c8962b/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala#L27 If we clicked the `kill` link, `Master` always complains like the following. ``` 23/12/11 21:25:50 INFO Master: Asked to kill driver 202312112113-00000 23/12/11 21:25:50 WARN Master: Driver 202312112113-00000 has already finished or does not exist ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44301 from dongjoon-hyun/SPARK-46369. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit e434c9f0d5792b7af43c87dd6145fd8a6a04d8e2) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 December 2023, 23:06:01 UTC
9c83bf5 [MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL migration guide ### What changes were proposed in this pull request? Fixes an error in the SQL migration guide documentation for `spark.sql.legacy.doLooseUpcast`. I corrected the config name and moved it to the section for migration from Spark 2.4 to 3.0 since it was not made available until Spark 3.0. ### Why are the changes needed? The config was documented as `spark.sql.legacy.looseUpcast` and is inaccurately included in the Spark 2.4 to Spark 2.4.1 section. I changed the docs to match what is implemented in https://github.com/apache/spark/blob/20df062d85e80422a55afae80ddbf2060f26516c/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3873 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Docs only change ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44262 from amytsai-stripe/fix-migration-docs-loose-upcast. Authored-by: Amy Tsai <amytsai@stripe.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit bab884082c0f82e3f9053adac6c7e8a3fcfab11c) Signed-off-by: Max Gekk <max.gekk@gmail.com> 11 December 2023, 15:35:45 UTC
cbaefe9 [SPARK-45969][DOCS] Document configuration change of executor failure tracker As a follow-up of SPARK-41210 (a new JIRA ticket is used because that change was released in 3.5.0), this PR updates the docs/migration guide about the configuration change of the executor failure tracker. The docs update was missing from the previous changes and was also requested in https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb#r132516892 by tgravescs Yes, docs changed Review No Closes #43863 from pan3793/SPARK-45969. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 7a43de193aa5a0856e098088728dccea37f169c5) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 December 2023, 22:04:27 UTC
28a8b18 [SPARK-46339][SS] Directory with batch number name should not be treated as metadata log ### What changes were proposed in this pull request? This patch updates the documentation of the `CheckpointFileManager.list` method to reflect the fact that it returns both files and directories, to reduce confusion. For usages like `HDFSMetadataLog` that assume the file statuses returned by `list` are all files, we add a filter there to avoid a confusing error. ### Why are the changes needed? `HDFSMetadataLog` takes a metadata path as a parameter. When it retrieves all batch metadata, it calls `CheckpointFileManager.list` to get all files under the metadata path. However, currently all implementations of `CheckpointFileManager.list` return all files/directories under the given path. So if there is a directory whose name is a batch number (a long value), the directory will be returned too and cause trouble when `HDFSMetadataLog` goes to read it. Actually, the `CheckpointFileManager.list` method clearly defines that it lists the "files" in a path. That said, the current implementations don't follow the doc. We tried to make the `list` method implementations only return files, but some usages (state metadata) of the `list` method already break that assumption and rely on the directories it returns. So we simply update the `list` method documentation to explicitly define that it returns both files and directories. We add a filter in `HDFSMetadataLog` on the file statuses returned by `list` to avoid this issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added test ### Was this patch authored or co-authored using generative AI tooling? No Closes #44272 from viirya/fix_metadatalog. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 75805f07f5caeb01104a7352b02790d03a043ded) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 December 2023, 23:21:04 UTC
ab14430 [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails ### What changes were proposed in this pull request? This updates the the behavior of `from_protobuf()` built function when underlying record fails to deserialize. * **Current behvior**: * By default, this would throw an error and the query fails. [This part is not changed in the PR] * When `mode` is set to 'PERMISSIVE' it returns a non-null struct with each of the inner fields set to null e.g. `{ "field_a": null, "field_b": null }` etc. * This is not very convenient to the users. They don't know if this was due to malformed record or if the input itself has null. It is very hard to check for each field for null in SQL query (imagine a sql query with a struct that has 10 fields). * **New behavior** * When `mode` is set to 'PERMISSIVE' it simply returns `null`. ### Why are the changes needed? This makes it easier for users to detect and handle malformed records. ### Does this PR introduce _any_ user-facing change? Yes, but this does not change the contract. In fact, it clarifies it. ### How was this patch tested? - Unit tests are updated. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44214 from rangadi/protobuf-null. Authored-by: Raghu Angadi <raghu.angadi@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 309c796876f310f8604292d84acc12e711ba7031) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 08 December 2023, 07:40:41 UTC
dbb6198 [SPARK-45580][SQL][3.5] Handle case where a nested subquery becomes an existence join ### What changes were proposed in this pull request? This is a back-port of #44193. In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node. For example: ``` Join LeftSemi, ((a#13 = c1#15) OR exists#19) :- Join ExistenceJoin(exists#19), (a#13 = col1#17) : :- LocalRelation [a#13] : +- LocalRelation [col1#17] +- LocalRelation [c1#15] ``` becomes ``` Project [a#13] +- Join LeftSemi, ((a#13 = c1#15) OR exists#19) :- Join ExistenceJoin(exists#19), (a#13 = col1#17) : :- LocalRelation [a#13] : +- LocalRelation [col1#17] +- LocalRelation [c1#15] ``` This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node. ### Why are the changes needed? This query returns an extraneous boolean column when run in spark-sql: ``` create or replace temp view t1(a) as values (1), (2), (3), (7); create or replace temp view t2(c1) as values (1), (2), (3); create or replace temp view t3(col1) as values (3), (9); select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ); 1 false 2 false 3 true ``` (Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization). This query fails when run in either spark-sql or using the Dataset API: ``` select ( select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ) limit 1 ) from range(1); java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis ``` ### Does this PR introduce _any_ user-facing change? No, except for the removal of the extraneous boolean flag and the fix to the error condition. ### How was this patch tested? New unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44215 from bersprockets/schema_change_br35. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 December 2023, 23:24:48 UTC
a697725 [SPARK-46274][SQL] Fix Range operator computeStats() to check long validity before converting ### What changes were proposed in this pull request? Range operator's `computeStats()` function unsafely casts from `BigInt` to `Long` and causes issues downstream with statistics estimation. Adds bounds checking to avoid crashing. ### Why are the changes needed? Downstream statistics estimation will crash and fail loudly; to avoid this and help maintain clean code we should fix this. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? UT ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44191 from n-young-db/range-compute-stats. Authored-by: Nick Young <nick.young@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9fd575ae46f8a4dbd7da18887a44c693d8788332) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 December 2023, 23:20:28 UTC
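A hedged sketch of the kind of bounds check involved (the helper name and the clamping policy are illustrative; the real change lives in `Range.computeStats`):

```scala
// Row counts in Range statistics are computed as BigInt and can exceed the Long
// range; a blind .toLong would wrap around and poison downstream estimates.
def bigIntToLongSafely(value: BigInt): Long = {
  if (value > BigInt(Long.MaxValue)) Long.MaxValue
  else if (value < BigInt(Long.MinValue)) Long.MinValue
  else value.toLong
}

// e.g. bigIntToLongSafely(BigInt(Long.MaxValue) * 2) == Long.MaxValue
```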
b5cbe1f [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled` This PR adds spark.io.compression.zstd.bufferPool.enabled to documentation - Missing docs - https://github.com/apache/spark/pull/31502#issuecomment-774792276 potential regression no doc build no Closes #44207 from yaooqinn/SPARK-46286. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6b6980de451e655ef4b9f63d502b73c09a513d4c) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 December 2023, 18:47:16 UTC
1321b4e [SPARK-46239][CORE] Hide `Jetty` info **What changes were proposed in this pull request?** The PR sets parameters to hide the version of jetty in spark. **Why are the changes needed?** It can avoid obtaining remote WWW service information through HTTP. **Does this PR introduce any user-facing change?** No **How was this patch tested?** Manual review **Was this patch authored or co-authored using generative AI tooling?** No Closes #44158 from chenyu-opensource/branch-SPARK-46239. Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: chenyu <119398199+chenyu-opensource@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit ff4f59341215b7f3a87e6cd8798d49e25562fcd6) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 December 2023, 22:41:35 UTC
97472c9 [SPARK-46092][SQL][3.5] Don't push down Parquet row group filters that overflow This is a cherry-pick from https://github.com/apache/spark/pull/44006 to spark 3.5 ### What changes were proposed in this pull request? This change adds a check for overflows when creating Parquet row group filters on an INT32 (byte/short/int) parquet type to avoid incorrectly skipping row groups if the predicate value doesn't fit in an INT. This can happen if the read schema is specified as LONG, e.g via `.schema("col LONG")` While the Parquet readers don't support reading INT32 into a LONG, the overflow can lead to row groups being incorrectly skipped, bypassing the reader altogether and producing incorrect results instead of failing. ### Why are the changes needed? Reading a parquet file containing INT32 values with a read schema specified as LONG can produce incorrect results today: ``` Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() ``` will return an empty result. The correct result is either: - Failing the query if the parquet reader doesn't support upcasting integers to longs (all parquet readers in Spark today) - Return result `[0]` if the parquet reader supports that upcast (no readers in Spark as of now, but I'm looking into adding this capability). ### Does this PR introduce _any_ user-facing change? The following: ``` Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() ``` produces an (incorrect) empty result before this change. After this change, the read will fail, raising an error about the unsupported conversion from INT to LONG in the parquet reader. ### How was this patch tested? - Added tests to `ParquetFilterSuite` to ensure that no row group filter is created when the predicate value overflows or when the value type isn't compatible with the parquet type - Added test to `ParquetQuerySuite` covering the correctness issue described above. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44154 from johanl-db/SPARK-46092-row-group-skipping-overflow-3.5. Authored-by: Johan Lasperas <johan.lasperas@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 December 2023, 16:58:03 UTC
273ef57 [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event ### What changes were proposed in this pull request? We found a race condition between lastTaskRunningTime and lastShuffleMigrationTime that could lead to a decommissioned executor exit before all the shuffle blocks have been discovered. The issue could lead to immediate task retry right after an executor exit, thus longer query execution time. To fix the issue, we choose to update the lastTaskRunningTime only when a task updates its status to finished through the StatusUpdate event. This is better than the current approach (which use a thread to check for number of running tasks every second), because in this way we clearly know whether the shuffle block refresh happened after all tasks finished running or not, thus resolved the race condition mentioned above. ### Why are the changes needed? To fix a race condition that could lead to shuffle data lost, thus longer query execution time. ### How was this patch tested? This is a very subtle race condition that is hard to write a unit test using current unit test framework. And we are confident the change is low risk. Thus only verify by passing all the existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44090 from jiangxb1987/SPARK-46182. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6f112f7b1a50a2b8a59952c69f67dd5f80ab6633) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 December 2023, 06:08:40 UTC
fde0fe6 [SPARK-45975][SQL][TESTS][3.5] Reset storeAssignmentPolicy to original ### What changes were proposed in this pull request? Reset storeAssignmentPolicy to original in HiveCompatibilitySuite. ### Why are the changes needed? STORE_ASSIGNMENT_POLICY was not reset in HiveCompatibilitySuite, causing subsequent test cases to fail. Details: https://github.com/wForget/spark/actions/runs/6902668865/job/18779862759 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44126 from LuciferYang/SPARK-45943-FOLLOWUP. Authored-by: wforget <643348094@qq.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 December 2023, 19:25:54 UTC
53e2e7b [SPARK-46189][PS][SQL] Perform comparisons and arithmetic between same types in various Pandas aggregate functions to avoid interpreted mode errors ### What changes were proposed in this pull request? In various Pandas aggregate functions, remove each comparison or arithmetic operation between `DoubleType` and `IntergerType` in `evaluateExpression` and replace with a comparison or arithmetic operation between `DoubleType` and `DoubleType`. Affected functions are `PandasStddev`, `PandasVariance`, `PandasSkewness`, `PandasKurtosis`, and `PandasCovar`. ### Why are the changes needed? These functions fail in interpreted mode. For example, `evaluateExpression` in `PandasKurtosis` compares a double to an integer: ``` If(n < 4, Literal.create(null, DoubleType) ... ``` This results in a boxed double and a boxed integer getting passed to `SQLOrderingUtil.compareDoubles` which expects two doubles as arguments. The scala runtime tries to unbox the boxed integer as a double, resulting in an error. Reproduction example: ``` spark.sql("set spark.sql.codegen.wholeStage=false") spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") import numpy as np import pandas as pd import pyspark.pandas as ps pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a") psser = ps.from_pandas(pser) psser.kurt() ``` See Jira (SPARK-46189) for the other reproduction cases. This works fine in codegen mode because the integer is already unboxed and the Java runtime will implictly cast it to a double. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44099 from bersprockets/unboxing_error. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 042d8546be5d160e203ad78a8aa2e12e74142338) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 01 December 2023, 02:29:00 UTC
00bb4ad [SPARK-46188][DOC][3.5] Fix the CSS of Spark doc's generated tables ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/40269, there is no border in the generated tables of Spark doc(for example, https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html) . This PR is to fix it by restoring part of the table style in https://github.com/apache/spark/pull/40269/files#diff-309b964023ca899c9505205f36d3f4d5b36a6487e5c9b2e242204ee06bbc9ce9L26 This PR also unifies all the styles of tables by removing the `class="table table-striped"` in HTML style tables in markdown docs. ### Why are the changes needed? Fix a regression in the table CSS of Spark docs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually build docs and verify. Before changes: <img width="931" alt="image" src="https://github.com/apache/spark/assets/1097932/1eb7abff-65af-4c4c-bbd5-9077f38c1b43"> After changes: <img width="911" alt="image" src="https://github.com/apache/spark/assets/1097932/be77d4c6-1279-43ec-a234-b69ee02e3dc6"> ### Was this patch authored or co-authored using generative AI tooling? Generated-by: ChatGPT 4 Closes #44097 from gengliangwang/fixTable3.5. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 30 November 2023, 22:56:48 UTC
c2342ba [SPARK-45943][SQL] Move DetermineTableStats to resolution rules ### What changes were proposed in this pull request? Move DetermineTableStats to resolution rules. ### Why are the changes needed? `MergeIntoTable#sourceTable` is used for `ReplaceData#groupFilterCondition` in `RewriteMergeIntoTable`. Because the source table in `ReplaceData#groupFilterCondition` is already resolved, `DetermineTableStats` will not be applied to it through `ResolveSubquery#resolveSubQueries`. So, when `MergeIntoTable#sourceTable` contains a Hive table without stats, an IllegalStateException will occur. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added test case ### Was this patch authored or co-authored using generative AI tooling? No Closes #43867 from wForget/SPARK-45943. Authored-by: wforget <643348094@qq.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit d1aea92daf254334bcbd6d96901a54a2502eda29) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 30 November 2023, 02:07:59 UTC
35ecb32 [SPARK-46029][SQL] Escape the single quote, `_` and `%` for DS V2 pushdown ### What changes were proposed in this pull request? Spark supports push down `startsWith`, `endWith` and `contains` to JDBC database with DS V2 pushdown. But the `V2ExpressionSQLBuilder` didn't escape the single quote, `_` and `%`, it can cause unexpected result. ### Why are the changes needed? Escape the single quote, `_` and `%` for DS V2 pushdown ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Exists test cases. ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #43801 from beliefer/SPARK-38432_followup3. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d2cd98bdd32446b4106e66eb099efd8fb47acf40) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 November 2023, 00:37:59 UTC
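A hedged sketch of the escaping concern: when `startsWith`/`endsWith`/`contains` are compiled to a SQL `LIKE`, the literal fragment has to have `'`, `_` and `%` escaped so they match literally. The helper and the choice of `\` as escape character are illustrative.

```scala
// Escape LIKE wildcards and the quote character inside a pushed-down literal.
def escapeLikeLiteral(value: String): String = value.flatMap {
  case '\'' => "''"    // protect the surrounding SQL string literal
  case '_'  => "\\_"   // LIKE wildcard: exactly one character
  case '%'  => "\\%"   // LIKE wildcard: any sequence of characters
  case '\\' => "\\\\"  // the escape character itself
  case c    => c.toString
}

// e.g. Contains("50%_off") would be rendered as: col LIKE '%50\%\_off%' ESCAPE '\'
```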
92b6619 [SPARK-46006][YARN][FOLLOWUP] YarnAllocator set target executor number to 0 to cancel pending allocate request when driver stop ### What changes were proposed in this pull request? YarnAllocator sets the target executor number to 0 to cancel pending allocate requests when the driver stops. For this issue we now do: 1. AllocationFailure should not be treated as exitCausedByApp when driver is shutting down https://github.com/apache/spark/pull/38622 2. Avoid new allocation requests when sc.stop stuck https://github.com/apache/spark/pull/43906 3. Cancel pending allocation requests, this PR https://github.com/apache/spark/pull/44036 ### Why are the changes needed? Avoid unnecessary allocate requests ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? MT ### Was this patch authored or co-authored using generative AI tooling? No Closes #44036 from AngersZhuuuu/SPARK-46006-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit dbc8756bdac823be42ed10bc011415f405905497) Signed-off-by: Kent Yao <yao@apache.org> 28 November 2023, 03:04:30 UTC
e4731e9 [SPARK-45974][SQL] Add scan.filterAttributes non-empty judgment for RowLevelOperationRuntimeGroupFiltering ### What changes were proposed in this pull request? Add scan.filterAttributes non-empty judgment for RowLevelOperationRuntimeGroupFiltering. ### Why are the changes needed? When scan.filterAttributes is empty, an invalid dynamic pruning condition will be generated in RowLevelOperationRuntimeGroupFiltering. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added test case ### Was this patch authored or co-authored using generative AI tooling? No Closes #43869 from wForget/SPARK-45974. Authored-by: wforget <643348094@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ade861d19910df724d9233df98c059ff9d57f795) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 November 2023, 15:29:04 UTC
132c1a1 [SPARK-46095][DOCS] Document `REST API` for Spark Standalone Cluster This PR aims to document `REST API` for Spark Standalone Cluster. To help the users to understand Apache Spark features. No. Manual review. `REST API` Section is added newly. **AFTER** <img width="704" alt="Screenshot 2023-11-24 at 4 13 53 PM" src="https://github.com/apache/spark/assets/9700541/a4e09d94-d216-4629-8b37-9d350365a428"> No. Closes #44007 from dongjoon-hyun/SPARK-46095. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 November 2023, 01:34:58 UTC
351a5f8 [SPARK-46016][DOCS][PS] Fix pandas API support list properly ### What changes were proposed in this pull request? This PR proposes to fix a critical issue in the [Supported pandas API documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html) where many essential APIs such as `DataFrame.max`, `DataFrame.min`, `DataFrame.mean`, and `DataFrame.median`, etc. were incorrectly marked as not implemented - marked as "N" - as below: <img width="291" alt="Screenshot 2023-11-24 at 12 37 49 PM" src="https://github.com/apache/spark/assets/44108233/95c5785c-711c-400c-b2ec-0db034e90fd8"> The root cause of this issue was that the script used to generate the support list excluded functions inherited from parent classes. For instance, `CategoricalIndex.max` is actually supported by inheriting the `Index` class but was not directly implemented in `CategoricalIndex`, leading to it being marked as unsupported: <img width="397" alt="Screenshot 2023-11-24 at 12 30 08 PM" src="https://github.com/apache/spark/assets/44108233/90e92996-a88a-4a20-bb0c-4909097e2688"> ### Why are the changes needed? The current documentation inaccurately represents the state of supported pandas API, which could significantly hinder user experience and adoption. By correcting these inaccuracies, we ensure that the documentation reflects the true capabilities of Pandas API on Spark, providing users with reliable and accurate information. ### Does this PR introduce _any_ user-facing change? No. This PR only updates the documentation to accurately reflect the current state of supported pandas API. ### How was this patch tested? Manually build documentation, and check if the supported pandas API list is correctly generated as below: <img width="299" alt="Screenshot 2023-11-24 at 12 36 31 PM" src="https://github.com/apache/spark/assets/44108233/a2da0f0b-0973-45cb-b22d-9582bbeb51b5"> ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43996 from itholic/fix_supported_api_gen. Authored-by: Haejoon Lee <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 132bb63a897f4f4049f34deefc065ed3eac6a90f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 November 2023, 10:38:43 UTC
a855262 [SPARK-46062][SQL] Sync the isStreaming flag between CTE definition and reference This PR proposes to sync the flag `isStreaming` from the CTE definition to the CTE reference. The essential issue is that the CTE reference node cannot determine the flag `isStreaming` by itself; it never has a proper value and always takes the default, because the flag is not a constructor parameter. The other flag, `resolved`, is already handled this way, and we need to do the same for `isStreaming`. Once we add the parameter to the constructor, we also need to make sure the flag stays in sync with the CTE definition. We have a rule `ResolveWithCTE` doing that sync, hence we add the logic to sync the flag `isStreaming` there as well. The bug may impact rules which behave differently depending on the isStreaming flag. It would no longer be a problem once the CTE reference is replaced with the CTE definition at some point in the "optimization phase", but all rules in the analyzer and optimizer triggered before that rule takes effect may misbehave based on an incorrect isStreaming flag. No. New UT. No. Closes #43966 from HeartSaVioR/SPARK-46062. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 43046631a5d4ac7201361a00473cc87fa52ab5a7) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 23 November 2023, 14:02:19 UTC
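A small sketch of a query shape where the flag matters: a streaming relation consumed through a SQL CTE in a spark-shell session. The rate source and view name are placeholders; the point is only that the CTE reference should report the same `isStreaming` value as its definition.

```scala
// Sketch only: expose a streaming relation to SQL, then reference it through a CTE.
val stream = spark.readStream.format("rate").load()
stream.createOrReplaceTempView("events")

val counted = spark.sql(
  """WITH recent AS (SELECT value FROM events)
    |SELECT count(*) AS cnt FROM recent
    |""".stripMargin)

// Expected to print true once the CTE reference inherits the flag from its definition.
println(counted.isStreaming)
```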
c367194 [SPARK-46064][SQL][SS] Move out EliminateEventTimeWatermark to the analyzer and change to only take effect on resolved child This PR proposes to move EliminateEventTimeWatermark out to the analyzer (as one of the analysis rules), and also to eliminate the EventTimeWatermark node only when the child of EventTimeWatermark is "resolved". Currently, we apply EliminateEventTimeWatermark immediately when withWatermark is called, which means the rule is applied against the child right away, regardless of whether the child is resolved or not. This is not an issue for DataFrame API usage initiated by read / readStream, because streaming sources have the flag isStreaming set to true even before they are resolved. But mixing SQL and the DataFrame API exposes the issue: we may not know the exact value of the isStreaming flag on an unresolved node, and it is subject to change upon resolution. No. New UTs. No. Closes #43971 from HeartSaVioR/SPARK-46064. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit a703dace0aa400fa24b2bded1500f44ae7ac8db0) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 23 November 2023, 11:20:58 UTC
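A sketch of one way SQL and the DataFrame API can be mixed around `withWatermark`, in a spark-shell session; whether this exact shape reproduced the original bug is not claimed here. The rate source provides `timestamp` and `value` columns.

```scala
import org.apache.spark.sql.functions.{col, window}

// Register a streaming source under a name and pull it back through SQL before
// attaching the watermark with the DataFrame API.
spark.readStream.format("rate").load().createOrReplaceTempView("rate_events")

// With the rule moved to the analyzer, the EventTimeWatermark node is only eliminated
// once its child is resolved and known to be non-streaming; here the child is
// streaming, so the watermark is kept for the windowed aggregation.
val windowed = spark.sql("SELECT timestamp, value FROM rate_events")
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()
```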
18bcd02 [MINOR][BUILD] Rename gprcVersion to grpcVersion in SparkBuild ### What changes were proposed in this pull request? This PR aims to fix a typo. ``` - val gprcVersion = "1.56.0" + val grpcVersion = "1.56.0" ``` There are two occurrences. ``` $ git grep gprc project/SparkBuild.scala: val gprcVersion = "1.56.0" project/SparkBuild.scala: "io.grpc" % "protoc-gen-grpc-java" % BuildCommons.gprcVersion asProtocPlugin(), ``` ### Why are the changes needed? To fix a typo. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43923 from dongjoon-hyun/minor_grpc. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 November 2023, 10:08:30 UTC
1f81e26 [SPARK-46006][YARN] YarnAllocator misses cleaning targetNumExecutorsPerResourceProfileId after YarnSchedulerBackend calls stop ### What changes were proposed in this pull request? We hit a case where the user calls sc.stop() after all custom code has run, but the shutdown gets stuck somewhere. This causes the following situation: 1. The user calls sc.stop(). 2. sc.stop() gets stuck in some process, but SchedulerBackend.stop was already called. 3. Since the YARN ApplicationMaster has not finished, it still calls YarnAllocator.allocateResources(). 4. Since the driver endpoint has stopped, newly allocated executors fail to register. 5. Eventually the maximum number of executor failures is triggered. 6. Cause: before CoarseGrainedSchedulerBackend.stop() is called, YarnSchedulerBackend.requestTotalExecutor() is called to clean the request info ![image](https://github.com/apache/spark/assets/46485123/4a61fb40-5986-4ecc-9329-369187d5311d) When YarnAllocator handles the resulting empty resource request, resourceTotalExecutorsWithPreferedLocalities is empty, so it misses cleaning targetNumExecutorsPerResourceProfileId. ![image](https://github.com/apache/spark/assets/46485123/0133f606-e1d7-4db7-95fe-140c61379102) ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No ### Was this patch authored or co-authored using generative AI tooling? No Closes #43906 from AngersZhuuuu/SPARK-46006. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 06635e25f170e61f6cfe53232d001993ec7d376d) Signed-off-by: Kent Yao <yao@apache.org> 22 November 2023, 08:51:16 UTC
2065754 [SPARK-46012][CORE][FOLLOWUP] Invoke `fs.listStatus` once and reuse the result ### What changes were proposed in this pull request? This PR is a follow-up of #43914 and aims to invoke `fs.listStatus` once and reuse the result. ### Why are the changes needed? This prevents an increase in the number of `listStatus` invocations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the existing test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43944 from dongjoon-hyun/SPARK-46012-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6be4a0358265fb81f68a27589f9940bd726c8ee7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 November 2023, 01:51:19 UTC
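A generic sketch of the pattern the follow-up applies, independent of the actual `EventLogFileReader` code: list the directory once and answer every subsequent question from the cached result. The path is hypothetical, and the `appstatus_`/`events_` prefixes follow the rolling event log naming referenced in SPARK-46012.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val dir = new Path("/tmp/eventlog_v2_app-123") // hypothetical rolling event log directory

// One listStatus call against the (possibly remote) file system...
val statuses: Array[FileStatus] = fs.listStatus(dir)

// ...and every later check reuses the cached listing instead of listing again.
val hasAppStatus = statuses.exists(_.getPath.getName.startsWith("appstatus_"))
val eventFiles   = statuses.filter(_.getPath.getName.startsWith("events_"))
```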
8f52fd5 [SPARK-44973][SQL] Fix `ArrayIndexOutOfBoundsException` in `conv()` ### What changes were proposed in this pull request? Increase the size of the buffer allocated for the result of base conversion in `NumberConverter` to prevent ArrayIndexOutOfBoundsException when evaluating `conv(s"${Long.MinValue}", 10, -2)`. ### Why are the changes needed? I don't think the ArrayIndexOutOfBoundsException is intended behaviour. ### Does this PR introduce _any_ user-facing change? Users will no longer experience an ArrayIndexOutOfBoundsException for this specific set of arguments and will instead receive the expected base conversion. ### How was this patch tested? New unit test cases ### Was this patch authored or co-authored using generative AI tooling? No Closes #43880 from markj-db/SPARK-44973. Authored-by: Mark Jarvin <mark.jarvin@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 2ac8ff76a5169fe1f6cf130cc82738ba78bd8c65) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 19:38:40 UTC
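The failing call from the description as a runnable snippet, using the `conv` DataFrame function in a spark-shell session; after the fix it returns the converted value instead of throwing.

```scala
import org.apache.spark.sql.functions.{conv, lit}

// Before the fix this threw ArrayIndexOutOfBoundsException because the result buffer
// in NumberConverter was too small for this particular base conversion.
spark.range(1)
  .select(conv(lit(Long.MinValue.toString), 10, -2).as("converted"))
  .show(truncate = false)
```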
fcf5573 [SPARK-46019][SQL][TESTS] Fix `HiveThriftServer2ListenerSuite` and `ThriftServerPageSuite` to create `java.io.tmpdir` if it doesn't exist ### What changes were proposed in this pull request? The PR aims to fix `HiveThriftServer2ListenerSuite` and `ThriftServerPageSuite`, which fail when they are run locally. ``` [info] ThriftServerPageSuite: [info] - thriftserver page should load successfully *** FAILED *** (515 milliseconds) [info] java.lang.IllegalStateException: Could not initialize plugin: interface org.mockito.plugins.MockMaker (alternate: null) [info] at org.mockito.internal.configuration.plugins.PluginLoader$1.invoke(PluginLoader.java:84) [info] at jdk.proxy2/jdk.proxy2.$Proxy20.isTypeMockable(Unknown Source) [info] at org.mockito.internal.util.MockUtil.typeMockabilityOf(MockUtil.java:78) [info] at org.mockito.internal.util.MockCreationValidator.validateType(MockCreationValidator.java:22) [info] at org.mockito.internal.creation.MockSettingsImpl.validatedSettings(MockSettingsImpl.java:267) [info] at org.mockito.internal.creation.MockSettingsImpl.build(MockSettingsImpl.java:234) [info] at org.mockito.internal.MockitoCore.mock(MockitoCore.java:86) [info] at org.mockito.Mockito.mock(Mockito.java:2037) [info] at org.mockito.Mockito.mock(Mockito.java:2010) [info] at org.apache.spark.sql.hive.thriftserver.ui.ThriftServerPageSuite.getStatusStore(ThriftServerPageSuite.scala:49) ``` It can be reproduced simply by running the following commands: ``` build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite" -Phive-thriftserver build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.ThriftServerPageSuite" -Phive-thriftserver ``` ### Why are the changes needed? Fix failing tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test: ``` build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite" -Phive-thriftserver build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.ThriftServerPageSuite" -Phive-thriftserver ``` After it: ``` [info] - listener events should store successfully (live = true) (1 second, 711 milliseconds) [info] - listener events should store successfully (live = false) (6 milliseconds) [info] - cleanup session if exceeds the threshold (live = true) (21 milliseconds) [info] - cleanup session if exceeds the threshold (live = false) (3 milliseconds) [info] - update execution info when jobstart event come after execution end event (9 milliseconds) [info] - SPARK-31387 - listener update methods should not throw exception with unknown input (8 milliseconds) [info] Run completed in 3 seconds, 734 milliseconds. [info] Total number of tests run: 6 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 6, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 156 s (02:36), completed Nov 21, 2023, 1:57:21 PM ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43921 from panbingkun/SPARK-46019. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit fdcd20f4b51c3ddddaae12f7d3f429e7b77c9f5e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 18:26:44 UTC
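A hedged sketch of the kind of guard the fix adds to the test setup: resolve `java.io.tmpdir` and create the directory if it is missing, so that Mockito can initialize when the suites run locally; where exactly this lands in the suites may differ from the sketch.

```scala
import java.nio.file.{Files, Paths}

// Ensure java.io.tmpdir exists before any mock(...) call; a missing temp dir is what
// surfaced as "Could not initialize plugin: interface org.mockito.plugins.MockMaker".
val tmpDir = Paths.get(System.getProperty("java.io.tmpdir"))
if (!Files.exists(tmpDir)) {
  Files.createDirectories(tmpDir)
}
```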
ece4ebe [SPARK-46033][SQL][TESTS] Fix flaky ArithmeticExpressionSuite ### What changes were proposed in this pull request? This PR aims to fix the flaky ArithmeticExpressionSuite. https://github.com/panbingkun/spark/actions/runs/6940660146/job/18879997046 <img width="1000" alt="image" src="https://github.com/apache/spark/assets/15246973/9fe6050a-7a06-4110-9152-d4512a49b284"> ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Manual test. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43935 from panbingkun/SPARK-46033. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b7930e718f453f8a9d923ad57161a982f16ca8e8) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 18:17:40 UTC
a436736 [MINOR][DOCS] Correct Python Spark Connect documentation about pip installation ### What changes were proposed in this pull request? This PR fixes the Spark Connect documentation from `pyspark==3.5.0` to `pyspark[connect]==3.5.0`; otherwise it will fail to execute the example as is because of missing dependencies. This is sort of a followup of SPARK-44867. https://github.com/apache/spark/blob/d31c8596cd714766892d1395e30358bd1cd3cb84/python/setup.py#L325-L332 ### Why are the changes needed? To guide users about using Spark Connect ### Does this PR introduce _any_ user-facing change? Yes, this fixes the user-facing documentation for Python Spark Connect. ### How was this patch tested? Manually checked with Markdown editor. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43919 from HyukjinKwon/SPARK-44867-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit df1280cb10ee71ea362a95705f355402e2bcaff2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 November 2023, 05:38:38 UTC
24079ad [SPARK-46014][SQL][TESTS] Run `RocksDBStateStoreStreamingAggregationSuite` on a dedicated JVM ### What changes were proposed in this pull request? This PR aims to run `RocksDBStateStoreStreamingAggregationSuite` on a dedicated JVM to reduce the flakiness. ### Why are the changes needed? `RocksDBStateStoreStreamingAggregationSuite` is flaky. - https://github.com/apache/spark/actions/runs/6936862847/job/18869845206 - https://github.com/apache/spark/actions/runs/6926542106/job/18838877151 - https://github.com/apache/spark/actions/runs/6924927427/job/18834849433 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43916 from dongjoon-hyun/SPARK-46014. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 9bb1fe2a8410e6a0dbf73a420d8e9b359363b932) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 01:54:49 UTC
96bfd83 [SPARK-46012][CORE] EventLogFileReader should not read rolling logs if app status file is missing ### What changes were proposed in this pull request? This PR aims to prevent `EventLogFileReader` from reading rolling event logs if `appStatus` is missing. ### Why are the changes needed? Since Apache Spark 3.0.0, `appstatus_` is supposed to exist. https://github.com/apache/spark/blob/839f0c98bd85a14eadad13f8aaac876275ded5a4/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala#L277-L283 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43914 from dongjoon-hyun/SPARK-46012. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6ca1c67de082269b9337503bff5161f5a2d87225) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 01:50:13 UTC
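A standalone sketch of the guard described above, not the actual `EventLogFileReader` code: a rolling event log directory only qualifies for reading if it contains an `appstatus_` marker file; the helper name is illustrative.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative helper: skip rolling event log directories that never got the
// appstatus_ marker written by EventLogFileWriters (expected since Spark 3.0.0).
def isReadableRollingLogDir(fs: FileSystem, dir: Path): Boolean =
  fs.listStatus(dir).exists(_.getPath.getName.startsWith("appstatus_"))
```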
f3baf08 [SPARK-43393][SQL][3.5] Address sequence expression overflow bug ### What changes were proposed in this pull request? Spark has a (long-standing) overflow bug in the `sequence` expression. Consider the following operations: ``` spark.sql("CREATE TABLE foo (l LONG);") spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});") spark.sql("SELECT sequence(0, l) FROM foo;").collect() ``` The result of these operations will be: ``` Array[org.apache.spark.sql.Row] = Array([WrappedArray()]) ``` an unintended consequence of overflow. The sequence is applied to values `0` and `Long.MaxValue` with a step size of `1` which uses a length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451). In this calculation, with `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary looks like: ``` 0111111111111111111111111111111111111111111111111111111111111111 - 0000000000000000000000000000000000000000000000000000000000000000 ------------------------------------------------------------------ 0111111111111111111111111111111111111111111111111111111111111111 / 0000000000000000000000000000000000000000000000000000000000000001 ------------------------------------------------------------------ 0111111111111111111111111111111111111111111111111111111111111111 + 0000000000000000000000000000000000000000000000000000000000000001 ------------------------------------------------------------------ 1000000000000000000000000000000000000000000000000000000000000000 ``` The following [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes as the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The following cast to `toInt` uses this representation and [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457) resulting in an empty length of `0`. Other overflows are similarly problematic. This PR addresses the issue by checking numeric operations in the length computation for overflow. ### Why are the changes needed? There is a correctness bug from overflow in the `sequence` expression. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests added in `CollectionExpressionsSuite.scala`. Closes #43820 from thepinetree/spark-sequence-overflow-3.5. Authored-by: Deepayan Patra <deepayan.patra@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 November 2023, 21:17:43 UTC
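A sketch of the overflow-checked length computation in the terms used above, relying on `java.lang.Math`'s exact arithmetic to surface the overflow instead of silently wrapping; the constant and helper name are illustrative, not the exact ones in `collectionOperations.scala`, and the real fix also guards additional corner cases (e.g. division by -1 at Long.MinValue).

```scala
// Illustrative only: compute the sequence length for (start, stop, step) with overflow
// detection. For start = 0, stop = Long.MaxValue, step = 1 the unchecked computation
// wraps to Long.MinValue and later truncates to an empty array; here it fails loudly.
val MaxRoundedArrayLength: Long = Int.MaxValue - 15 // stand-in for ByteArrayMethods' limit

def checkedSequenceLength(start: Long, stop: Long, step: Long): Int = {
  require(step != 0, "step must not be zero")
  val len =
    try Math.addExact(Math.subtractExact(stop, start) / step, 1L)
    catch {
      case _: ArithmeticException =>
        throw new IllegalArgumentException("sequence length overflows Long")
    }
  require(len >= 0 && len <= MaxRoundedArrayLength, s"too many elements: $len")
  len.toInt
}
```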