https://github.com/apache/spark

18db204 Preparing Spark release v3.3.4-rc1 08 December 2023, 18:28:15 UTC
6a4488f [SPARK-45580][SQL][3.3] Handle case where a nested subquery becomes an existence join ### What changes were proposed in this pull request? This is a back-port of https://github.com/apache/spark/pull/44193. In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node. For example: ``` Join LeftSemi, ((a#13 = c1#15) OR exists#19) :- Join ExistenceJoin(exists#19), (a#13 = col1#17) : :- LocalRelation [a#13] : +- LocalRelation [col1#17] +- LocalRelation [c1#15] ``` becomes ``` Project [a#13] +- Join LeftSemi, ((a#13 = c1#15) OR exists#19) :- Join ExistenceJoin(exists#19), (a#13 = col1#17) : :- LocalRelation [a#13] : +- LocalRelation [col1#17] +- LocalRelation [c1#15] ``` This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node. ### Why are the changes needed? This query returns an extraneous boolean column when run in spark-sql: ``` create or replace temp view t1(a) as values (1), (2), (3), (7); create or replace temp view t2(c1) as values (1), (2), (3); create or replace temp view t3(col1) as values (3), (9); select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ); 1 false 2 false 3 true ``` (Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization). This query fails when run in either spark-sql or using the Dataset API: ``` select ( select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ) limit 1 ) from range(1); java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis ``` ### Does this PR introduce _any_ user-facing change? No, except for the removal of the extraneous boolean flag and the fix to the error condition. ### How was this patch tested? New unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44223 from bersprockets/schema_change_br33. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 December 2023, 03:24:13 UTC
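For reference, the reproduction quoted in the commit message above can be assembled into a runnable snippet; the local-mode session setup below is an assumption for illustration and is not part of the patch itself.

```scala
// Reproduction of the extraneous-boolean-column bug described above,
// built from the SQL quoted in the commit message.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sql("create or replace temp view t1(a) as values (1), (2), (3), (7)")
spark.sql("create or replace temp view t2(c1) as values (1), (2), (3)")
spark.sql("create or replace temp view t3(col1) as values (3), (9)")

// Before the fix, spark-sql printed an extra boolean column for this query;
// after the fix only column `a` is returned.
spark.sql(
  """select * from t1
    |where exists (
    |  select c1 from t2
    |  where a = c1 or a in (select col1 from t3)
    |)""".stripMargin).show()
```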
37d10ec [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled` This PR adds spark.io.compression.zstd.bufferPool.enabled to documentation - Missing docs - https://github.com/apache/spark/pull/31502#issuecomment-774792276 potential regression no doc build no Closes #44207 from yaooqinn/SPARK-46286. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6b6980de451e655ef4b9f63d502b73c09a513d4c) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 December 2023, 18:48:41 UTC
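If it helps as context, the newly documented property is set like any other core I/O setting at application start; the sketch below is illustrative only (pairing it with the zstd codec and enabling the pool are example choices, not recommendations from the patch).

```scala
import org.apache.spark.SparkConf

// Illustrative configuration: select the zstd codec and enable the
// documented buffer-pool switch. Values here are examples, not tuning advice.
val conf = new SparkConf()
  .set("spark.io.compression.codec", "zstd")
  .set("spark.io.compression.zstd.bufferPool.enabled", "true")
```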
aaec17f [SPARK-46239][CORE] Hide `Jetty` info **What changes were proposed in this pull request?** The PR sets parameters to hide the Jetty version in Spark's HTTP responses. **Why are the changes needed?** This prevents remote clients from discovering the embedded web server's version information over HTTP. **Does this PR introduce any user-facing change?** No **How was this patch tested?** Manual review **Was this patch authored or co-authored using generative AI tooling?** No Closes #44158 from chenyu-opensource/branch-SPARK-46239. Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: chenyu <119398199+chenyu-opensource@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit ff4f59341215b7f3a87e6cd8798d49e25562fcd6) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 December 2023, 22:41:55 UTC
c941230 [SPARK-46092][SQL][3.3] Don't push down Parquet row group filters that overflow This is a cherry-pick from https://github.com/apache/spark/pull/44006 to spark 3.3 ### What changes were proposed in this pull request? This change adds a check for overflows when creating Parquet row group filters on an INT32 (byte/short/int) parquet type to avoid incorrectly skipping row groups if the predicate value doesn't fit in an INT. This can happen if the read schema is specified as LONG, e.g via `.schema("col LONG")` While the Parquet readers don't support reading INT32 into a LONG, the overflow can lead to row groups being incorrectly skipped, bypassing the reader altogether and producing incorrect results instead of failing. ### Why are the changes needed? Reading a parquet file containing INT32 values with a read schema specified as LONG can produce incorrect results today: ``` Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() ``` will return an empty result. The correct result is either: - Failing the query if the parquet reader doesn't support upcasting integers to longs (all parquet readers in Spark today) - Return result `[0]` if the parquet reader supports that upcast (no readers in Spark as of now, but I'm looking into adding this capability). ### Does this PR introduce _any_ user-facing change? The following: ``` Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() ``` produces an (incorrect) empty result before this change. After this change, the read will fail, raising an error about the unsupported conversion from INT to LONG in the parquet reader. ### How was this patch tested? - Added tests to `ParquetFilterSuite` to ensure that no row group filter is created when the predicate value overflows or when the value type isn't compatible with the parquet type - Added test to `ParquetQuerySuite` covering the correctness issue described above. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44156 from johanl-db/SPARK-46092-row-group-skipping-overflow-3.3. Authored-by: Johan Lasperas <johan.lasperas@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 December 2023, 20:50:57 UTC
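The commit message's reproduction written out as a runnable snippet (assumes an active SparkSession named `spark`, e.g. in spark-shell; the output path is a placeholder):

```scala
import spark.implicits._

// Write INT32 parquet data, then read it back with a LONG read schema.
// Before the fix the row-group filter overflowed and silently returned an
// empty result; after the fix the unsupported INT -> LONG conversion is
// reported as an error instead of skipping row groups.
val path = "/tmp/spark-46092-repro"  // placeholder path
Seq(0).toDF("a").write.mode("overwrite").parquet(path)
spark.read.schema("a LONG").parquet(path)
  .where(s"a < ${Long.MaxValue}")
  .collect()
```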
6ad00b7 [SPARK-46029][SQL][3.3] Escape the single quote, _ and % for DS V2 pushdown ### What changes were proposed in this pull request? This PR used to back port https://github.com/apache/spark/pull/43801 to branch-3.3 ### Why are the changes needed? Escape the single quote, _ and % for DS V2 pushdown ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? New test cases. ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #44089 from beliefer/SPARK-46029_3.3. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Jiaan Geng <beliefer@163.com> 01 December 2023, 06:58:52 UTC
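As an illustration of the kind of predicate this escaping affects (the JDBC URL, table, and column names below are placeholders, and the exact pushdown behavior depends on the connector): a string filter whose literal contains `_`, `%`, or a single quote must be escaped before being translated into a LIKE pattern on the remote source.

```scala
// Placeholder JDBC source; the point is the special characters in the literal.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb")  // placeholder URL
  .option("dbtable", "items")                                // placeholder table
  .load()

// startsWith/contains/endsWith become LIKE patterns when pushed down, so the
// '%', '_' and the single quote in this literal must be escaped correctly.
df.filter(df("name").startsWith("100%_off'")).explain(true)
```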
b530904 [SPARK-46006][YARN][FOLLOWUP] YarnAllocator set target executor number to 0 to cancel pending allocate request when driver stop ### What changes were proposed in this pull request? YarnAllocator sets the target executor number to 0 to cancel pending allocate requests when the driver stops. For this issue we now have: 1. AllocationFailure should not be treated as exitCausedByApp when driver is shutting down https://github.com/apache/spark/pull/38622 2. Avoid new allocation requests when sc.stop stuck https://github.com/apache/spark/pull/43906 3. Cancel pending allocation requests, this pr https://github.com/apache/spark/pull/44036 ### Why are the changes needed? Avoid unnecessary allocate requests ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? MT ### Was this patch authored or co-authored using generative AI tooling? No Closes #44036 from AngersZhuuuu/SPARK-46006-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit dbc8756bdac823be42ed10bc011415f405905497) Signed-off-by: Kent Yao <yao@apache.org> 28 November 2023, 03:05:10 UTC
b38aade [SPARK-46095][DOCS] Document `REST API` for Spark Standalone Cluster ### What changes were proposed in this pull request? This PR aims to document the `REST API` for Spark Standalone Cluster. ### Why are the changes needed? To help users understand Apache Spark features. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. A `REST API` section is newly added. **AFTER** <img width="704" alt="Screenshot 2023-11-24 at 4 13 53 PM" src="https://github.com/apache/spark/assets/9700541/a4e09d94-d216-4629-8b37-9d350365a428"> ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44007 from dongjoon-hyun/SPARK-46095. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 132c1a1f08d6555c950600c102db28b9d7581350) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 03dac186d841308c7230188aa3386b2b240cdf97) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 November 2023, 01:56:31 UTC
f478c3e [SPARK-46006][YARN] YarnAllocator misses cleaning targetNumExecutorsPerResourceProfileId after YarnSchedulerBackend calls stop ### What changes were proposed in this pull request? We hit a case where the user calls sc.stop() after all custom code has finished, but the call gets stuck somewhere, which causes the following situation: 1. The user calls sc.stop(). 2. sc.stop() gets stuck partway through, but SchedulerBackend.stop has already been called. 3. Since the YARN ApplicationMaster has not finished, it still calls YarnAllocator.allocateResources(). 4. Since the driver endpoint has stopped, newly allocated executors fail to register. 5. Eventually the max number of executor failures is triggered. 6. Root cause: before stopping, CoarseGrainedSchedulerBackend.stop() calls YarnSchedulerBackend.requestTotalExecutor() to clean the request info ![image](https://github.com/apache/spark/assets/46485123/4a61fb40-5986-4ecc-9329-369187d5311d) When YarnAllocator then handles the empty resource request, resourceTotalExecutorsWithPreferedLocalities is empty, so it misses cleaning targetNumExecutorsPerResourceProfileId. ![image](https://github.com/apache/spark/assets/46485123/0133f606-e1d7-4db7-95fe-140c61379102) ### Why are the changes needed? Fix the bug described above. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No new tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #43906 from AngersZhuuuu/SPARK-46006. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 06635e25f170e61f6cfe53232d001993ec7d376d) Signed-off-by: Kent Yao <yao@apache.org> 22 November 2023, 08:52:04 UTC
b412cc5 [SPARK-46012][CORE][FOLLOWUP] Invoke `fs.listStatus` once and reuse the result ### What changes were proposed in this pull request? This PR is a follow-up of #43914 and aims to invoke `fs.listStatus` once and reuse the result. ### Why are the changes needed? This will prevent the increase of the number of `listStatus` invocation . ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the existing test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43944 from dongjoon-hyun/SPARK-46012-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6be4a0358265fb81f68a27589f9940bd726c8ee7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 November 2023, 01:51:51 UTC
44808c1 [SPARK-44973][SQL] Fix `ArrayIndexOutOfBoundsException` in `conv()` ### What changes were proposed in this pull request? Increase the size of the buffer allocated for the result of base conversion in `NumberConverter` to prevent ArrayIndexOutOfBoundsException when evaluating `conv(s"${Long.MinValue}", 10, -2)`. ### Why are the changes needed? I don't think the ArrayIndexOutOfBoundsException is intended behaviour. ### Does this PR introduce _any_ user-facing change? Users will no longer experience an ArrayIndexOutOfBoundsException for this specific set of arguments and will instead receive the expected base conversion. ### How was this patch tested? New unit test cases ### Was this patch authored or co-authored using generative AI tooling? No Closes #43880 from markj-db/SPARK-44973. Authored-by: Mark Jarvin <mark.jarvin@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 2ac8ff76a5169fe1f6cf130cc82738ba78bd8c65) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 19:39:00 UTC
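The failing argument combination quoted above, as runnable calls (assumes an active SparkSession named `spark`):

```scala
// conv(Long.MinValue as a string, from base 10 to base -2) previously threw
// ArrayIndexOutOfBoundsException; with the fix it returns the converted value.
spark.sql(s"SELECT conv('${Long.MinValue}', 10, -2)").show(truncate = false)

// Same call through the DataFrame API.
import org.apache.spark.sql.functions.{conv, lit}
spark.range(1).select(conv(lit(Long.MinValue.toString), 10, -2)).show(false)
```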
84e82be [SPARK-46019][SQL][TESTS] Fix `HiveThriftServer2ListenerSuite` and `ThriftServerPageSuite` to create `java.io.tmpdir` if it doesn't exist ### What changes were proposed in this pull request? The pr aims to fix `HiveThriftServer2ListenerSuite` and `ThriftServerPageSuite` failed when there are running on local. ``` [info] ThriftServerPageSuite: [info] - thriftserver page should load successfully *** FAILED *** (515 milliseconds) [info] java.lang.IllegalStateException: Could not initialize plugin: interface org.mockito.plugins.MockMaker (alternate: null) [info] at org.mockito.internal.configuration.plugins.PluginLoader$1.invoke(PluginLoader.java:84) [info] at jdk.proxy2/jdk.proxy2.$Proxy20.isTypeMockable(Unknown Source) [info] at org.mockito.internal.util.MockUtil.typeMockabilityOf(MockUtil.java:78) [info] at org.mockito.internal.util.MockCreationValidator.validateType(MockCreationValidator.java:22) [info] at org.mockito.internal.creation.MockSettingsImpl.validatedSettings(MockSettingsImpl.java:267) [info] at org.mockito.internal.creation.MockSettingsImpl.build(MockSettingsImpl.java:234) [info] at org.mockito.internal.MockitoCore.mock(MockitoCore.java:86) [info] at org.mockito.Mockito.mock(Mockito.java:2037) [info] at org.mockito.Mockito.mock(Mockito.java:2010) [info] at org.apache.spark.sql.hive.thriftserver.ui.ThriftServerPageSuite.getStatusStore(ThriftServerPageSuite.scala:49) ``` It can be simply reproduced by running the following command: ``` build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite" -Phive-thriftserver build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.ThriftServerPageSuite" -Phive-thriftserver ``` ### Why are the changes needed? Fix tests failed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test: ``` build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite" -Phive-thriftserver build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.ThriftServerPageSuite" -Phive-thriftserver ``` After it: ``` [info] - listener events should store successfully (live = true) (1 second, 711 milliseconds) [info] - listener events should store successfully (live = false) (6 milliseconds) [info] - cleanup session if exceeds the threshold (live = true) (21 milliseconds) [info] - cleanup session if exceeds the threshold (live = false) (3 milliseconds) [info] - update execution info when jobstart event come after execution end event (9 milliseconds) [info] - SPARK-31387 - listener update methods should not throw exception with unknown input (8 milliseconds) [info] Run completed in 3 seconds, 734 milliseconds. [info] Total number of tests run: 6 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 6, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 156 s (02:36), completed Nov 21, 2023, 1:57:21 PM ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43921 from panbingkun/SPARK-46019. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit fdcd20f4b51c3ddddaae12f7d3f429e7b77c9f5e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 18:27:06 UTC
feec452 [SPARK-46012][CORE] EventLogFileReader should not read rolling logs if app status file is missing ### What changes were proposed in this pull request? This PR aims to prevent `EventLogFileReader` from reading rolling event logs if `appStatus` is missing. ### Why are the changes needed? Since Apache Spark 3.0.0, `appstatus_` is supposed to exist. https://github.com/apache/spark/blob/839f0c98bd85a14eadad13f8aaac876275ded5a4/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala#L277-L283 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43914 from dongjoon-hyun/SPARK-46012. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6ca1c67de082269b9337503bff5161f5a2d87225) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 01:50:38 UTC
b7bc127 [SPARK-43393][SQL][3.3] Address sequence expression overflow bug ### What changes were proposed in this pull request? Spark has a (long-standing) overflow bug in the `sequence` expression. Consider the following operations: ``` spark.sql("CREATE TABLE foo (l LONG);") spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});") spark.sql("SELECT sequence(0, l) FROM foo;").collect() ``` The result of these operations will be: ``` Array[org.apache.spark.sql.Row] = Array([WrappedArray()]) ``` an unintended consequence of overflow. The sequence is applied to values `0` and `Long.MaxValue` with a step size of `1` which uses a length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451). In this calculation, with `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary looks like: ``` 0111111111111111111111111111111111111111111111111111111111111111 - 0000000000000000000000000000000000000000000000000000000000000000 ------------------------------------------------------------------ 0111111111111111111111111111111111111111111111111111111111111111 / 0000000000000000000000000000000000000000000000000000000000000001 ------------------------------------------------------------------ 0111111111111111111111111111111111111111111111111111111111111111 + 0000000000000000000000000000000000000000000000000000000000000001 ------------------------------------------------------------------ 1000000000000000000000000000000000000000000000000000000000000000 ``` The following [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes as the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The following cast to `toInt` uses this representation and [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457) resulting in an empty length of `0`. Other overflows are similarly problematic. This PR addresses the issue by checking numeric operations in the length computation for overflow. ### Why are the changes needed? There is a correctness bug from overflow in the `sequence` expression. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests added in `CollectionExpressionsSuite.scala`. Closes #43821 from thepinetree/spark-sequence-overflow-3.3. Authored-by: Deepayan Patra <deepayan.patra@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 November 2023, 17:22:39 UTC
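The overflow scenario from the commit message as a runnable snippet (assumes an active SparkSession named `spark`; the table DDL is kept as in the original):

```scala
// sequence(0, Long.MaxValue) with step 1: the length computation used to
// overflow to Long.MinValue, slip past the max-length check, and yield an
// empty array. With the fix the overflow is detected instead of silently
// producing a wrong result.
spark.sql("CREATE TABLE foo (l LONG)")
spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue})")
spark.sql("SELECT sequence(0, l) FROM foo").collect()
```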
c5b874d [SPARK-45920][SQL][3.3] group by ordinal should be idempotent backport https://github.com/apache/spark/pull/43797 ### What changes were proposed in this pull request? GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal. ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #43839 from cloud-fan/3.3-port. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 November 2023, 16:20:22 UTC
bb0dadd [SPARK-45764][PYTHON][DOCS][3.3] Make code block copyable ### What changes were proposed in this pull request? The PR aims to make code blocks copyable in the PySpark docs. Backport to `branch-3.3`. Master branch PR: https://github.com/apache/spark/pull/43799 ### Why are the changes needed? Improving the usability of the PySpark documentation. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to easily copy code blocks in the PySpark docs. ### How was this patch tested? - Manual test. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43830 from panbingkun/branch-3.3_SPARK-45764. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 16 November 2023, 10:11:07 UTC
8e4efb2 [SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error ### What changes were proposed in this pull request? The pr aims to fix RST files `link substitutions` error. Target branch: branch-3.3, branch-3.4, branch-3.5, master. ### Why are the changes needed? When I was reviewing Python documents, I found that `the actual address` of the link was incorrect, eg: https://spark.apache.org/docs/latest/api/python/getting_started/install.html#installing-from-source <img width="916" alt="image" src="https://github.com/apache/spark/assets/15246973/069c1875-1e21-45db-a236-15c27ee7b913"> `The ref link url` of `Building Spark`: from `https://spark.apache.org/docs/3.5.0/#downloading` to `https://spark.apache.org/docs/3.5.0/building-spark.html`. We should fix it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43815 from panbingkun/SPARK-45935. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 79ccdfa31e282ebe9a82c8f20c703b6ad2ea6bc1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 November 2023, 09:01:47 UTC
662edf5 [MINOR][DOCS] Fix the example value in the docs ### What changes were proposed in this pull request? Fix an incorrect example value in the docs. ### Why are the changes needed? To keep the documentation accurate. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Only an example value in the docs changed, so no test is needed. ### Was this patch authored or co-authored using generative AI tooling? No Closes #43750 from jlfsdtc/fix_typo_in_doc. Authored-by: longfei.jiang <longfei.jiang@kyligence.io> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit b501a223bfcf4ddbcb0b2447aa06c549051630b0) Signed-off-by: Kent Yao <yao@apache.org> 11 November 2023, 05:50:22 UTC
6780c78 [SPARK-45885][BUILD][3.3] Upgrade ORC to 1.7.10 ### What changes were proposed in this pull request? This PR aims to upgrade ORC to 1.7.10 for Apache Spark 3.3.4 ### Why are the changes needed? To bring the latest bug fixes. - https://github.com/apache/orc/releases/tag/v1.7.9 - https://github.com/apache/orc/releases/tag/v1.7.10 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43756 from dongjoon-hyun/SPARK-45885. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 November 2023, 16:54:38 UTC
187651b [SPARK-45829][DOCS] Update the default value for spark.executor.logs.rolling.maxSize **What changes were proposed in this pull request?** The PR updates the default value of 'spark.executor.logs.rolling.maxSize' in configuration.html on the website. **Why are the changes needed?** The default value of 'spark.executor.logs.rolling.maxSize' is 1024 * 1024, but the value shown on the website is wrong. **Does this PR introduce any user-facing change?** No **How was this patch tested?** Not needed; this is a documentation-only change. **Was this patch authored or co-authored using generative AI tooling?** No Closes #43712 from chenyu-opensource/branch-SPARK-45829. Authored-by: chenyu <119398199+chenyu-opensource@users.noreply.github.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit a9127068194a48786df4f429ceb4f908c71f7138) Signed-off-by: Kent Yao <yao@apache.org> 08 November 2023, 11:17:38 UTC
d76223c [SPARK-44843][TESTS] Double streamingTimeout for StateStoreMetricsTest to make RocksDBStateStore related streaming tests reliable ### What changes were proposed in this pull request? This PR increases streamingTimeout and the check interval for StateStoreMetricsTest to make RocksDBStateStore-related streaming tests reliable, hopefully. ### Why are the changes needed? ``` SPARK-35896: metrics in StateOperatorProgress are output correctly (RocksDBStateStore with changelog checkpointing) *** FAILED *** (1 minute) [info] Timed out waiting for stream: The code passed to failAfter did not complete within 60 seconds. [info] java.base/java.lang.Thread.getStackTrace(Thread.java:1619) ``` The probability of these tests failing is close to 100%, which seriously affects the UX of making PRs for the contributors. https://github.com/yaooqinn/spark/actions/runs/6744173341/job/18333952141 ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? this can be verified by `sql - slow test` job in CI ### Was this patch authored or co-authored using generative AI tooling? no Closes #43647 from yaooqinn/SPARK-44843. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit afdce266f0ffeb068d47eca2f2af1bcba66b0e95) Signed-off-by: Kent Yao <yao@apache.org> 03 November 2023, 17:17:29 UTC
8072631 [SPARK-45751][DOCS] Update the default value for spark.executor.logs.rolling.maxRetainedFiles **What changes were proposed in this pull request?** The PR updates the default value of 'spark.executor.logs.rolling.maxRetainedFiles' in configuration.html on the website. **Why are the changes needed?** The default value of 'spark.executor.logs.rolling.maxRetainedFiles' is -1, but the value shown on the website is wrong. **Does this PR introduce any user-facing change?** No **How was this patch tested?** Not needed; this is a documentation-only change. **Was this patch authored or co-authored using generative AI tooling?** No Closes #43618 from chenyu-opensource/branch-SPARK-45751. Authored-by: chenyu <119398199+chenyu-opensource@users.noreply.github.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit e6b4fa835de3f6d0057bf3809ea369d785967bcd) Signed-off-by: Kent Yao <yao@apache.org> 01 November 2023, 09:13:59 UTC
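For context on this and the related maxSize doc fix above, the executor-log rolling settings can be set together at application start. The sketch below uses the code defaults quoted in the two commits; the size-based strategy line is an illustrative assumption, not part of either patch.

```scala
import org.apache.spark.SparkConf

// Code defaults referenced by the two doc fixes:
// maxSize = 1024 * 1024 bytes, maxRetainedFiles = -1.
val conf = new SparkConf()
  .set("spark.executor.logs.rolling.strategy", "size")  // assumed example strategy
  .set("spark.executor.logs.rolling.maxSize", (1024 * 1024).toString)
  .set("spark.executor.logs.rolling.maxRetainedFiles", "-1")
```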
b77e7ef [SPARK-45749][CORE][WEBUI] Fix `Spark History Server` to sort `Duration` column properly ### What changes were proposed in this pull request? This PR aims to fix an UI regression at Apache Spark 3.2.0 caused by SPARK-34123. From Apache Spark **3.2.0** to **3.5.0**, `Spark History Server` cannot sort `Duration` column. After this PR, Spark History Server can sort `Duration` column properly like Apache Spark 3.1.3 and before. ### Why are the changes needed? Before SPARK-34123, Apache Spark had the `title` attribute for sorting. - https://github.com/apache/spark/pull/31191 ``` <td><span title="{{durationMillisec}}">{{duration}}</span></td> ``` Without `title`, `title-numeric` doesn't work. ### Does this PR introduce _any_ user-facing change? No. This is a bug fix. ### How was this patch tested? Manual test. Please use `Safari Private Browsing ` or `Chrome Incognito` mode. <img width="96" alt="Screenshot 2023-10-31 at 5 47 34 PM" src="https://github.com/apache/spark/assets/9700541/8c8464d2-c58b-465c-8f98-edab1ec2317d"> <img width="94" alt="Screenshot 2023-10-31 at 5 47 29 PM" src="https://github.com/apache/spark/assets/9700541/03e8373d-bda3-4835-90ad-9a45670e853a"> ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43613 from dongjoon-hyun/SPARK-45749. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit f72510ca9e04ae88660346de440b231fc8225698) Signed-off-by: Kent Yao <yao@apache.org> 01 November 2023, 01:56:36 UTC
94a043d [SPARK-45670][CORE][3.3] SparkSubmit does not support `--total-executor-cores` when deploying on K8s This is the cherry-pick of https://github.com/apache/spark/pull/43536 for branch-3.3 ### What changes were proposed in this pull request? Remove Kubernetes from the support list of `--total-executor-cores` in SparkSubmit ### Why are the changes needed? `--total-executor-cores` does not take effect in Spark on K8s, [the comments from original PR](https://github.com/apache/spark/pull/19717#discussion_r154568773) also proves that ### Does this PR introduce _any_ user-facing change? The output of `spark-submit --help` changed ```patch ... - Spark standalone, Mesos and Kubernetes only: + Spark standalone and Mesos only: --total-executor-cores NUM Total cores for all executors. ... ``` ### How was this patch tested? Pass GA and review. ### Was this patch authored or co-authored using generative AI tooling? No Closes #43548 from pan3793/SPARK-45670-3.3. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 October 2023, 06:24:28 UTC
1bb4143 [SPARK-45430] Fix for FramelessOffsetWindowFunction when IGNORE NULLS and offset > rowCount ### What changes were proposed in this pull request? This is a fix for the failure when function that utilized `FramelessOffsetWindowFunctionFrame` is used with `ignoreNulls = true` and `offset > rowCount`. e.g. ``` select x, lead(x, 5) IGNORE NULLS over (order by x) from (select explode(sequence(1, 3)) x) ``` ### Why are the changes needed? Fix existing bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modify existing unit test to cover this case ### Was this patch authored or co-authored using generative AI tooling? No Closes #43236 from vitaliili-db/SPARK-45430. Authored-by: Vitalii Li <vitalii.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 32e1e58411913517c87d7e75942437f4e1c1d40e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 October 2023, 07:16:19 UTC
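The failing query from the commit message, runnable as-is (assumes an active SparkSession named `spark`):

```scala
// lead() with IGNORE NULLS and an offset (5) larger than the row count (3)
// previously made FramelessOffsetWindowFunctionFrame fail; with the fix the
// query runs and returns NULL where no offset row exists.
spark.sql(
  """select x, lead(x, 5) IGNORE NULLS over (order by x)
    |from (select explode(sequence(1, 3)) x)""".stripMargin).show()
```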
746f936 [MINOR][SQL] Remove signature from Hive thriftserver exception ### What changes were proposed in this pull request? Don't return expected signature to caller in Hive thriftserver exception ### Why are the changes needed? Please see private discussion ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #43402 from srowen/HiveCookieSigner. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit cf59b1f51c16301f689b4e0f17ba4dbd140e1b19) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 October 2023, 23:11:25 UTC
8cd3c1a [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite ### What changes were proposed in this pull request? WholeStageCodegenSparkSubmitSuite is [flaky](https://github.com/apache/spark/actions/runs/6479534195/job/17593342589) because SHUFFLE_PARTITIONS(200) creates 200 reducers for one total core and improper stop progress causes executor launcher reties. The heavy load and reties might result in timeout test failures. ### Why are the changes needed? CI robustness ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing WholeStageCodegenSparkSubmitSuite ### Was this patch authored or co-authored using generative AI tooling? no Closes #43394 from yaooqinn/SPARK-45568. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit f00ec39542a5f9ac75d8c24f0f04a7be703c8d7c) Signed-off-by: Kent Yao <yao@apache.org> 17 October 2023, 14:24:52 UTC
8706ccd [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access Cleaner on Java 9+ This PR adds `--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED` to our JVM flags so that we can access `jdk.internal.ref.Cleaner` in JDK 9+. This allows Spark to allocate direct memory while ignoring the JVM's MaxDirectMemorySize limit. Spark uses JDK internal APIs to directly construct DirectByteBuffers while bypassing that limit, but there is a fallback path at https://github.com/apache/spark/blob/v3.5.0/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java#L213 that is used if we cannot reflectively access the `Cleaner` API. No. Added a unit test in `PlatformUtilSuite`. No. Closes #43344 from JoshRosen/SPARK-45508. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 96bac6c033b5bb37101ebcd8436ab9a84db8e092) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 146fba1a22e3f1555f3e4494522810030f9a7854) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d0eaebbd7f90fd121c7f97ca376b7141ad15731b) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 13 October 2023, 16:54:47 UTC
5c57e4a [SPARK-45389][SQL][3.3] Correct MetaException matching rule on getting partition metadata This is the backport of https://github.com/apache/spark/pull/43191 for `branch-3.3` ### What changes were proposed in this pull request? This PR aims to fix the HMS call fallback logic introduced in SPARK-35437. ```patch try { ... hive.getPartitionNames ... hive.getPartitionsByNames } catch { - case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] => + case ex: HiveException if ex.getCause.isInstanceOf[MetaException] => ... } ``` ### Why are the changes needed? Directly method call won't throw `InvocationTargetException`, and check the code of `hive.getPartitionNames` and `hive.getPartitionsByNames`, both of them will wrap a `HiveException` if `MetaException` throws. ### Does this PR introduce _any_ user-facing change? Yes, it should be a bug fix. ### How was this patch tested? Pass GA and code review. (I'm not sure how to construct/simulate a MetaException during the HMS thrift call with the current HMS testing infrastructure) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43282 from pan3793/SPARK-45389-3.3. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: yangjie01 <yangjie01@baidu.com> 10 October 2023, 06:44:55 UTC
454abbf [SPARK-44729][PYTHON][DOCS][3.3] Add canonical links to the PySpark docs page ### What changes were proposed in this pull request? The pr aims to add canonical links to the PySpark docs page, backport this to branch 3.3. Master branch pr: https://github.com/apache/spark/pull/42425. ### Why are the changes needed? Backport this to branch 3.3. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. - Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43286 from panbingkun/branch-3.3_SPARK-44729. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 October 2023, 20:53:34 UTC
44c960d [MINOR][DOCS] Update `CTAS` with `LOCATION` behavior with Spark 3.2+ ### What changes were proposed in this pull request? This PR aims to update `CTAS` with `LOCATION` behavior according to Spark 3.2+. ### Why are the changes needed? SPARK-28551 changed the behavior at Apache Spark 3.2.0. https://github.com/apache/spark/blob/24b82dfd6cfb9a658af615446be5423695830dd9/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2306-L2313 ### Does this PR introduce _any_ user-facing change? No. This is a documentation fix. ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43277 from dongjoon-hyun/minor_ctas. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 2d6d09b71e77b362a4c774170e2ca992a31fb1ea) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 October 2023, 11:45:29 UTC
a837105 [SPARK-45227][CORE] Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend Backport of #43021 to branch 3.3 ### What changes were proposed in this pull request? Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck. ### Why are the changes needed? For each executor, the single-threaded dispatcher can run into an "infinite loop" (as explained in the SPARK-45227). Once an executor process runs into a state, it'd stop launching tasks from the driver or reporting task status back. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ``` $ build/mvn package -DskipTests -pl core $ build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.executor.CoarseGrainedExecutorBackendSuite test ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #43176 from xiongbo-sjtu/branch-3.3. Authored-by: Bo Xiong <xiongbo@amazon.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> 30 September 2023, 17:04:34 UTC
62c2f83 [SPARK-44074][CORE][SQL][TESTS][3.3] Fix loglevel restore behavior of `SparkFunSuite#withLogAppender` and re-enable UT `Logging plan changes for execution` ### What changes were proposed in this pull request? The main change of this pr is to add a call of `SparkFunSuite#wupdateLoggers` after restore loglevel when 'level' of `withLogAppender` function is not `None`, and under the premise of this change, the UT `Logging plan changes for execution` disabled in https://github.com/apache/spark/pull/43160 can be re-enabled. ### Why are the changes needed? - Fix bug of `SparkFunSuite#withLogAppender` when 'level' is not None - Re-enable UT `Logging plan changes for execution` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual test ``` build/sbt "sql/testOnly org.apache.spark.sql.JoinHintSuite org.apache.spark.sql.execution.QueryExecutionSuite" ``` **Before** ``` [info] - Logging plan changes for execution *** FAILED *** (36 milliseconds) [info] testAppender.loggingEvents.exists(((x$10: org.apache.logging.log4j.core.LogEvent) => x$10.getMessage().getFormattedMessage().contains(expectedMsg))) was false (QueryExecutionSuite.scala:232) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) [info] at org.apache.spark.sql.execution.QueryExecutionSuite.$anonfun$new$34(QueryExecutionSuite.scala:232) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.apache.spark.sql.execution.QueryExecutionSuite.$anonfun$new$31(QueryExecutionSuite.scala:231) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) ... ``` The failure reason is `withLogAppender(hintAppender, level = Some(Level.WARN))` used in `JoinHintSuite`, but `SparkFunSuite#wupdateLoggers` doesn't have the correct restore Loglevel. The test was successful before SPARK-44034 due to there was `AdaptiveQueryExecSuite` between `JoinHintSuite` and `QueryExecutionSuite`, and `AdaptiveQueryExecSuite` called `withLogAppender(hintAppender, level = Some(Level.DEBUG))`, but `AdaptiveQueryExecSuite` move to `slow sql` test group after SPARK-44034 **After** ``` [info] Run completed in 7 seconds, 485 milliseconds. [info] Total number of tests run: 32 [info] Suites: completed 2, aborted 0 [info] Tests: succeeded 32, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` Closes #43175 from LuciferYang/SPARK-44074-33. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 September 2023, 18:07:58 UTC
1c02ed0 [SPARK-44034][TESTS][3.3] Add a new test group for sql module ### What changes were proposed in this pull request? The purpose of this pr is to add a new test tag `SlowSQLTest` to the sql module, and identified some Suites with test cases more than 3 seconds, and apply it to GA testing task to reduce the testing pressure of the `sql others` group. ### Why are the changes needed? For a long time, the sql module UTs has only two groups: `slow` and `others`. The test cases in group `slow` are fixed, while the number of test cases in group `others` continues to increase, which has had a certain impact on the testing duration and stability of group `others`. So this PR proposes to add a new testing group to share the testing pressure of `sql others` group, which has made the testing time of the three groups more average, and hope it can improve the stability of the GA task. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Should monitor GA ### Was this patch authored or co-authored using generative AI tooling? No Closes #43160 from LuciferYang/SPARK-44034-33. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 29 September 2023, 06:49:58 UTC
48138eb [SPARK-45057][CORE] Avoid acquire read lock when keepReadLock is false ### What changes were proposed in this pull request? Add `keepReadLock` parameter in `lockNewBlockForWriting()`. When `keepReadLock` is `false`, skip `lockForReading()` to avoid block on read Lock or potential deadlock issue. When 2 tasks try to compute same rdd with replication level of 2 and running on only 2 executors. Deadlock will happen. Details refer [SPARK-45057] Task thread hold write lock and waiting for replication to remote executor while shuffle server thread which handling block upload request waiting on `lockForReading` in [BlockInfoManager.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala#L457C24-L457C24) ### Why are the changes needed? This could save unnecessary read lock acquire and avoid deadlock issue mention above. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT in BlockInfoManagerSuite ### Was this patch authored or co-authored using generative AI tooling? No Closes #43067 from warrenzhu25/deadlock. Authored-by: Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 0d6fda5bbee99f9d1821952195efc6764816ec2f) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 28 September 2023, 23:52:08 UTC
e965cc8 [SPARK-44937][CORE] Mark connection as timedOut in TransportClient.close ### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in https://github.com/apache/spark/pull/9853. ### Why are the changes needed? Without this change, some of the new tests added in https://github.com/apache/spark/pull/42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in https://github.com/apache/spark/pull/42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes #43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2a88feadd4b7cec9e01bc744e589783e3390e5ce) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 28 September 2023, 23:17:24 UTC
9a28200 [SPARK-45286][DOCS] Add back Matomo analytics ### What changes were proposed in this pull request? Add analytics to doc pages using the ASF's Matomo service ### Why are the changes needed? We had previously removed Google Analytics from the website and release docs, per ASF policy: https://github.com/apache/spark/pull/36310 We just restored analytics using the ASF-hosted Matomo service on the website: https://github.com/apache/spark-website/commit/a1548627b48a62c2e51870d1488ca3e09397bd30 This change would put the same new tracking code back into the release docs. It would let us see what docs and resources are most used, I suppose. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No Closes #43063 from srowen/SPARK-45286. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit a881438114ea3e8e918d981ef89ed1ab956d6fca) Signed-off-by: Sean Owen <srowen@gmail.com> 24 September 2023, 19:18:28 UTC
578233e [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below) ### What changes were proposed in this pull request? This PR proposes to recover the availity of switching languages consistently across docs for all code snippets in Spark 3.4 and below by using the proper class selector in the JQuery. Previously the selector was a string `.nav-link tab_python` which did not comply multiple class selection: https://www.w3.org/TR/CSS21/selector.html#class-html. I assume it worked as a legacy behaviour somewhere. Now it uses the standard way `.nav-link.tab_python`. Note that https://github.com/apache/spark/pull/42657 works because there's only single class assigned (after we refactored the site at https://github.com/apache/spark/pull/40269) ### Why are the changes needed? This is a regression in our documentation site. ### Does this PR introduce _any_ user-facing change? Yes, once you click the language tab, it will apply to the examples in the whole page. ### How was this patch tested? Manually tested after building the site. ![Screenshot 2023-09-19 at 12 08 17 PM](https://github.com/apache/spark/assets/6477701/09d0c117-9774-4404-8e2e-d454b7f700a3) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42989 from HyukjinKwon/SPARK-45210. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 796d8785c61e09d1098350657fd44707763687db) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 19 September 2023, 05:51:46 UTC
b170a67 [SPARK-45187][CORE] Fix `WorkerPage` to use the same pattern for `logPage` urls ### What changes were proposed in this pull request? This PR aims to use the same pattern for `logPage` urls of `WorkerPage` to make it work consistently when `spark.ui.reverseProxy=true`. ### Why are the changes needed? Since Apache Spark 3.2.0 (SPARK-34635, #31753), Apache Spark adds trailing slashes to reduce redirections for `logPage`. ```scala s"$workerUrlRef/logPage?driverId=$driverId&logType=stdout") s"$workerUrlRef/logPage/?driverId=$driverId&logType=stdout") ... <a href={s"$workerUrlRef/logPage?appId=${executor <a href={s"$workerUrlRef/logPage/?appId=${executor ``` This PR aims to fix a leftover in `WorkerPage` to make it work consistently in case of the reverse proxy situation via `spark.ui.reverseProxy`. Currently, in some proxy environments, `appId` link is working but `driverId` link is broken due to the redirections. This inconsistent behavior makes the users confused. ``` - <a href={s"$workerUrlRef/logPage?driverId=${driver.driverId}&logType=stdout"}>stdout</a> - <a href={s"$workerUrlRef/logPage?driverId=${driver.driverId}&logType=stderr"}>stderr</a> + <a href={s"$workerUrlRef/logPage/?driverId=${driver.driverId}&logType=stdout"}>stdout</a> + <a href={s"$workerUrlRef/logPage/?driverId=${driver.driverId}&logType=stderr"}>stderr</a> ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual tests because it requires a reverse proxy. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42959 from dongjoon-hyun/SPARK-45187. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f8f2735426ee7ad3d7a1f5bd07e72643516f4a35) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 84a053e72ac9d9cfc91bab777cea94958d3a91da) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b4eb947ea45c7382a460f562c6240e8b48e67f0e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 September 2023, 17:35:45 UTC
6dcab1f [SPARK-45127][DOCS] Exclude README.md from document build ### What changes were proposed in this pull request? The pr aims to exclude `README.md` from document build. ### Why are the changes needed? - Currently, our document `README.html` does not have any CSS style applied to it, as shown below: https://spark.apache.org/docs/latest/README.html <img width="1432" alt="image" src="https://github.com/apache/spark/assets/15246973/1dfe5f69-30d9-4ce4-8d82-1bba5e721ccd"> **If we do not intend to display the above page to users, we should remove it during the document build process.** - As we saw in the project `spark-website`, it has already set the following configuration: https://github.com/apache/spark-website/blob/642d1fb834817014e1799e73882d53650c1c1662/_config.yml#L7 <img width="720" alt="image" src="https://github.com/apache/spark/assets/15246973/421b7be5-4ece-407e-9d49-8e7487b74a47"> Let's stay consistent. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Manually test. After this pr, the README.html file will no longer be generated ``` (base) panbingkun:~/Developer/spark/spark-community/docs/_site$ls -al README.html ls: README.html: No such file or directory ``` - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42883 from panbingkun/SPARK-45127. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 804f741453fb146b5261084fa3baf26631badb79) Signed-off-by: Sean Owen <srowen@gmail.com> 16 September 2023, 14:05:07 UTC
9ee184a [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode' **What changes were proposed in this pull request?** The PR updates the default value of 'spark.submit.deployMode' in configuration.html on the website. **Why are the changes needed?** The default value of 'spark.submit.deployMode' is 'client', but the value shown on the website is wrong. **Does this PR introduce any user-facing change?** No **How was this patch tested?** Not needed; this is a documentation-only change. **Was this patch authored or co-authored using generative AI tooling?** No Closes #42902 from chenyu-opensource/branch-SPARK-45146. Authored-by: chenyu-opensource <119398199+chenyu-opensource@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 076cb7aabac2f0ff11ca77ca530b7b8db5310a5e) Signed-off-by: Sean Owen <srowen@gmail.com> 13 September 2023, 13:48:49 UTC
8bb0546 [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present. The following query gets incorrect results: ``` drop table if exists t1; create table t1 using parquet as select * from values (named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2))) as (value); select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1; {"f1":[1.0,2.0,3.0],"f2":[0,0,0]} ``` The result should be: ``` {"f1":[1.0,2.0,3.0],"f2":[1,2,3]} ``` The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary: ``` java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet ... row group 1: RC:1 TS:112 OFFSET:4 ------------------------------------------------------------------------------------------------------------------------------------------------------- value: .f1: ..list: ...element: INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0] .f2: ..list: ...element: INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0] ``` The same bug also occurs when field f2 is a map. This PR fixes that case as well. No, except for fixing the correctness issue. New tests. No. Closes #42850 from bersprockets/vector_oddity. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit fac236e1350d1c71dd772251709db3af877a69c2) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 September 2023, 20:07:44 UTC
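The reproduction quoted above, assembled into a runnable snippet (assumes an active SparkSession named `spark` with a writable warehouse directory):

```scala
// Create a parquet table where the second nested array ends up
// PLAIN_DICTIONARY-encoded, then cast the struct. Before the fix, getInts()
// on the dictionary-encoded column vector ignored the dictionary and f2 came
// back as all zeros; after the fix the original f2 values are preserved.
spark.sql("drop table if exists t1")
spark.sql(
  """create table t1 using parquet as
    |select * from values
    |  (named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2))) as (value)""".stripMargin)
spark.sql(
  "select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1"
).show(truncate = false)
```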
a4d40e8 [SPARK-45100][SQL][3.3] Fix an internal error from `reflect()`on `NULL` class and method ### What changes were proposed in this pull request? In the PR, I propose to check that the `class` and `method` arguments are not a NULL in `CallMethodViaReflection`. And if they are, throw an `AnalysisException` with new error class `DATATYPE_MISMATCH.UNEXPECTED_NULL`. This is a backport of https://github.com/apache/spark/pull/42849. ### Why are the changes needed? To fix the issue demonstrated by the example: ```sql $ spark-sql (default)> select reflect('java.util.UUID', CAST(NULL AS STRING)); [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running new test: ``` $ build/sbt "test:testOnly *.MiscFunctionsSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Authored-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit fd424caf6c46e7030ac2deb2afbe3f4a5fc1095c) Closes #42856 from MaxGekk/fix-internal-error-in-reflect-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 08 September 2023, 15:59:22 UTC
5250ed6 [SPARK-45079][SQL][3.3] Fix an internal error from `percentile_approx()` on `NULL` accuracy ### What changes were proposed in this pull request? In the PR, I propose to check the `accuracy` argument is not a NULL in `ApproximatePercentile`. And if it is, throw an `AnalysisException` with new error class `DATATYPE_MISMATCH.UNEXPECTED_NULL`. This is a backport of https://github.com/apache/spark/pull/42817. ### Why are the changes needed? To fix the issue demonstrated by the example: ```sql $ spark-sql (default)> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), NULL) FROM VALUES (0), (1), (2), (10) AS tab(col); [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running new test: ``` $ build/sbt "test:testOnly *.ApproximatePercentileQuerySuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Authored-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 24b29adcf53616067a9fa2ca201e3f4d2f54436b) Closes #42835 from MaxGekk/fix-internal-error-in-percentile_approx-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 06 September 2023, 15:56:14 UTC
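Both this fix and the reflect() fix above turn a NULL argument into a proper AnalysisException instead of an internal error; a small check covering the two quoted queries might look like the sketch below (assumes an active SparkSession named `spark`).

```scala
import org.apache.spark.sql.AnalysisException

// After the two fixes, a NULL accuracy / NULL method name raises an
// AnalysisException (DATATYPE_MISMATCH.UNEXPECTED_NULL) rather than an
// internal error.
def expectAnalysisError(sql: String): Unit =
  try {
    spark.sql(sql).collect()
    println(s"unexpectedly succeeded: $sql")
  } catch {
    case e: AnalysisException => println(e.getMessage)
  }

expectAnalysisError(
  "SELECT percentile_approx(col, array(0.5, 0.4, 0.1), NULL) " +
    "FROM VALUES (0), (1), (2), (10) AS tab(col)")
expectAnalysisError("SELECT reflect('java.util.UUID', CAST(NULL AS STRING))")
```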
c19ea9e [SPARK-44990][SQL] Reduce the frequency of reading `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` ### What changes were proposed in this pull request? This PR moves the read of the `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` config into a lazy val of `UnivocityGenerator`, so the config is fetched once per generator instead of once per written value. As reported, the repeated lookups affect performance. ### Why are the changes needed? Reduce how often `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` is read. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #42738 from Hisoka-X/SPARK-44990_csv_null_value_config. Authored-by: Jia Fan <fanjiaeminem@qq.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit dac750b855c35a88420b6ba1b943bf0b6f0dded1) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 August 2023, 17:55:28 UTC
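A small Scala sketch of the pattern applied here, with hypothetical names rather than the real `UnivocityGenerator`: read a configuration flag once per writer instance instead of once per written value.

```scala
// Hypothetical stand-in for a per-query configuration lookup that is
// comparatively expensive when repeated for every value.
object FakeConf {
  def getBoolean(key: String): Boolean =
    sys.props.getOrElse(key, "false").toBoolean
}

class CsvValueWriter {
  // Read the legacy flag once per writer instance, not once per value.
  private lazy val nullAsQuotedEmptyString: Boolean =
    FakeConf.getBoolean("spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv")

  def render(value: Option[String]): String = value match {
    case Some(v) => v
    case None => if (nullAsQuotedEmptyString) "\"\"" else ""
  }
}
```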
0b8c43f [MINOR][SQL][DOC] Fix incorrect link in sql menu and typo ### What changes were proposed in this pull request? Fix incorrect link in sql menu and typo. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? run `SKIP_API=1 bundle exec jekyll build` ![image](https://github.com/apache/spark/assets/17894939/7cc564ec-41cd-4e92-b19e-d33a53188a10) ### Was this patch authored or co-authored using generative AI tooling? No Closes #42697 from wForget/doc. Authored-by: wforget <643348094@qq.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 421ff4e3c047865a1887cae94b85dbf40bb7bac9) Signed-off-by: Kent Yao <yao@apache.org> 28 August 2023, 08:10:37 UTC
5e417ff [SPARK-44547][CORE] Ignore fallback storage for cached RDD migration ### What changes were proposed in this pull request? Fix bugs that makes the RDD decommissioner never finish ### Why are the changes needed? The cached RDD decommissioner is in a forever retry loop when the only viable peer is the fallback storage, which it doesn't know how to handle. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests are added and tested using Spark jobs. Closes #42155 from ukby1234/franky.SPARK-44547. Authored-by: Frank Yin <franky@ziprecruiter.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 47555da2ae292b07488ba181db1aceac8e7ddb3a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 August 2023, 05:20:10 UTC
9541881 Revert "[SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef" This reverts commit d529d32c5c06d5bbc34c91f70b0f3d8281ed5347. 24 August 2023, 18:02:47 UTC
d529d32 [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef ### What changes were proposed in this pull request? Originally, when a CTE has duplicate expression IDs in its output, the rule PushdownPredicatesAndPruneColumnsForCTEDef wrongly assesses that the columns in the CTE were pruned, as it compares the size of the attribute set containing the union of columns (which is unique) and the original output of the CTE (which contains duplicate columns) and notices that the former is less than the latter. This causes incorrect pruning of the CTE output, resulting in a missing reference and causing the error as documented in the ticket. This PR changes the logic to use the needsPruning function to assess whether a CTE has been pruned, which uses the outputSet to check if any columns has been pruned instead of the output. ### Why are the changes needed? The incorrect behaviour of PushdownPredicatesAndPruneColumnsForCTEDef in CTEs with duplicate expression IDs in its output causes a crash when such a query is run. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test for the crashing case was added. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42635 from wenyuen-db/SPARK-44934. Authored-by: Wen Yuen Pang <wenyuen.pang@databricks.com> Signed-off-by: Peter Toth <peter.toth@gmail.com> (cherry picked from commit 3b405948ee47702e5a7250dc27430836145b0e19) Signed-off-by: Peter Toth <peter.toth@gmail.com> 24 August 2023, 15:05:05 UTC
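A small REPL-style sketch of why `output` and `outputSet` disagree when expression IDs repeat (a simplified model, not Spark's `AttributeSet`):

```scala
// Simplified model of attributes keyed by expression ID.
final case class Attr(name: String, exprId: Long)

val cteOutput = Seq(Attr("a", 1L), Attr("a", 1L))   // duplicate expression IDs in the CTE output
val referenced = Set(Attr("a", 1L))                 // what the consumers actually use

// Comparing a de-duplicated set against the raw output wrongly suggests a
// column was pruned (1 < 2) ...
val looksPruned = referenced.size < cteOutput.size  // true, but misleading

// ... whereas comparing set against set (the `outputSet` approach) does not.
val actuallyPruned = referenced != cteOutput.toSet  // false: nothing was pruned
```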
392b12e [SPARK-44935][K8S] Fix `RELEASE` file to have the correct information in Docker images if exists ### What changes were proposed in this pull request? This PR aims to fix `RELEASE` file to have the correct information in Docker images if `RELEASE` file exists. Please note that `RELEASE` file doesn't exists in SPARK_HOME directory when we run the K8s integration test from Spark Git repository. So, we keep the following empty `RELEASE` file generation and use `COPY` conditionally via glob syntax. https://github.com/apache/spark/blob/2a3aec1f9040e08999a2df88f92340cd2710e552/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile#L37 ### Why are the changes needed? Currently, it's an empty file in the official Apache Spark Docker images. ``` $ docker run -it --rm apache/spark:latest ls -al /opt/spark/RELEASE -rw-r--r-- 1 spark spark 0 Jun 25 03:13 /opt/spark/RELEASE $ docker run -it --rm apache/spark:v3.1.3 ls -al /opt/spark/RELEASE | tail -n1 -rw-r--r-- 1 root root 0 Feb 21 2022 /opt/spark/RELEASE ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually build image and check it with `docker run -it --rm NEW_IMAGE ls -al /opt/spark/RELEASE` I copied this `Dockerfile` into Apache Spark 3.5.0 RC2 binary distribution and tested in the following way. ``` $ cd spark-3.5.0-rc2-bin-hadoop3 $ cp /tmp/Dockerfile kubernetes/dockerfiles/spark/Dockerfile $ bin/docker-image-tool.sh -t SPARK-44935 build $ docker run -it --rm docker.io/library/spark:SPARK-44935 ls -al /opt/spark/RELEASE | tail -n1 -rw-r--r-- 1 root root 165 Aug 18 21:10 /opt/spark/RELEASE $ docker run -it --rm docker.io/library/spark:SPARK-44935 cat /opt/spark/RELEASE | tail -n2 Spark 3.5.0 (git revision 010c4a6a05) built for Hadoop 3.3.4 Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 -Phadoop-3 -Phive -Phive-thriftserver ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42636 from dongjoon-hyun/SPARK-44935. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d382c6b3aef28bde6adcdf62b7be565ff1152942) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 August 2023, 23:01:26 UTC
aa6f6f7 [SPARK-44871][SQL][3.3] Fix percentile_disc behaviour ### What changes were proposed in this pull request? This PR fixes the `percentile_disc()` function, which currently returns incorrect results in some cases. E.g.: ``` SELECT percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0, percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1, percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2, percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3, percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4, percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5, percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6, percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7, percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8, percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9, percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10 FROM VALUES (0), (1), (2), (3), (4) AS v(a) ``` currently returns: ``` +---+---+---+---+---+---+---+---+---+---+---+ | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| +---+---+---+---+---+---+---+---+---+---+---+ |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0| +---+---+---+---+---+---+---+---+---+---+---+ ``` but after this PR it returns the correct: ``` +---+---+---+---+---+---+---+---+---+---+---+ | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| +---+---+---+---+---+---+---+---+---+---+---+ |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0| +---+---+---+---+---+---+---+---+---+---+---+ ``` ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? Yes, fixes a correctness bug, but the old behaviour can be restored with `spark.sql.legacy.percentileDiscCalculation=true`. ### How was this patch tested? Added new UTs. Closes #42611 from peter-toth/SPARK-44871-fix-percentile-disc-behaviour-3.3. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 23 August 2023, 07:48:48 UTC
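For reference, a discrete percentile picks an actual member of the sorted input rather than interpolating. A hedged Scala sketch of that selection rule: it reproduces the corrected table above for the values 0..4, but it is an illustration, not Spark's implementation.

```scala
// Discrete percentile: the smallest value whose cumulative position covers p.
def percentileDisc(sorted: IndexedSeq[Double], p: Double): Double = {
  require(sorted.nonEmpty && p >= 0.0 && p <= 1.0)
  val rank = math.max(1, math.ceil(p * sorted.length).toInt)  // 1-based rank
  sorted(rank - 1)
}

val data = IndexedSeq(0.0, 1.0, 2.0, 3.0, 4.0)
(0 to 10).map(i => percentileDisc(data, i / 10.0))
// Vector(0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0) -- matches the fixed output
```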
352810b [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient() ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call. Note that the other pre-existing branches in this method were already using `await()`. ### Why are the changes needed? Uninterruptible waiting can cause problems when cancelling tasks. For details, see https://github.com/apache/spark/pull/16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42619 from JoshRosen/remove-awaitUninterruptibly. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 2137606a6686b33bed57800c6b166059b134a089) Signed-off-by: Kent Yao <yao@apache.org> 23 August 2023, 05:55:20 UTC
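The distinction matters because an uninterruptible wait swallows the interrupt that task cancellation relies on. A minimal sketch of the two behaviors using plain `java.util.concurrent` primitives, as an analogy only, not the Netty `ChannelFuture` API itself:

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

val connected = new CountDownLatch(1)

// Interruptible: an interrupt from task cancellation unblocks the caller promptly.
def awaitConnection(timeoutMs: Long): Boolean =
  connected.await(timeoutMs, TimeUnit.MILLISECONDS)  // throws InterruptedException if interrupted

// Uninterruptible: keeps waiting across interrupts and only re-asserts the flag at the end,
// which is the kind of behavior that makes task cancellation hang.
def awaitConnectionUninterruptibly(timeoutMs: Long): Boolean = {
  val deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs)
  var interrupted = false
  var done = false
  try {
    while (!done && System.nanoTime() < deadline) {
      try {
        done = connected.await(deadline - System.nanoTime(), TimeUnit.NANOSECONDS)
      } catch {
        case _: InterruptedException => interrupted = true  // swallow and keep waiting
      }
    }
    done
  } finally {
    if (interrupted) Thread.currentThread().interrupt()
  }
}
```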
f94b9fb [SPARK-44925][K8S] K8s default service token file should not be materialized into token ### What changes were proposed in this pull request? This PR aims to stop materializing `OAuth token` from the default service token file, `/var/run/secrets/kubernetes.io/serviceaccount/token`, because the content of volumes varies which means being renewed or expired by K8s control plane. We need to read the content in a on-demand manner to be in the up-to-date status. Note the followings: - Since we use `autoConfigure` for K8s client, K8s client still uses the default service tokens if exists and needed. https://github.com/apache/spark/blob/13588c10cbc380ecba1231223425eaad2eb9ec80/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesClientFactory.scala#L91 - This PR doesn't change Spark's behavior for the user-provided token file location. Spark will load the content of the user-provided token file locations to get `OAuth token` because Spark cannot assume that the files of that locations are refreshed or not in the future. ### Why are the changes needed? [BoundServiceAccountTokenVolume](https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume) became `Stable` at K8s 1.22. - [KEP-1205 Bound Service Account Tokens](https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md#boundserviceaccounttokenvolume-1) : **BoundServiceAccountTokenVolume** Alpha | Beta | GA -- | -- | -- 1.13 | 1.21 | 1.22 - [EKS Service Account with 90 Days Expiration](https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html) > For Amazon EKS clusters, the extended expiry period is 90 days. Your Amazon EKS cluster's Kubernetes API server rejects requests with tokens that are greater than 90 days old. - As of today, [all supported EKS clusters are from 1.23 to 1.27](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html) which means we always use `BoundServiceAccountTokenVolume`. ### Does this PR introduce _any_ user-facing change? No. This fixes only the bugs caused by some outdated tokens where K8s control plane denies Spark's K8s API invocation. ### How was this patch tested? Pass the CIs with the all existing unit tests and integration tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42624 from dongjoon-hyun/SPARK-44925. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 7b1a3494107b304a93f571920fc3816cde71f706) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 August 2023, 05:46:30 UTC
557f9f5 [SPARK-43327][CORE][3.3] Trigger `committer.setupJob` before plan execution in `FileFormatWriter#write` ### What changes were proposed in this pull request? Trigger `committer.setupJob` before plan execution in `FileFormatWriter#write`. ### Why are the changes needed? https://github.com/apache/spark/pull/38358 resolved the case where `outputOrdering` might not work if AQE is enabled. However, since it materializes the AQE plan in advance (triggers `getFinalPhysicalPlan`), `committer.setupJob(job)` may never execute when `AdaptiveSparkPlanExec#getFinalPhysicalPlan()` fails with an error. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test Closes #41154 from zzzzming95/spark3-SPARK-43327. Lead-authored-by: zzzzming95 <505306252@qq.com> Co-authored-by: zhiming she <505306252@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 August 2023, 03:08:00 UTC
69ca57c [SPARK-44813][INFRA] The Jira Python misses our assignee when it searches users again ### What changes were proposed in this pull request? This PR creates an alternative to the assign_issue function in jira.client.JIRA. The original one has an issue that it will search users again and only choose the assignee from 20 candidates. If it's unmatched, it picks the head blindly. For example, ```python >>> assignee = asf_jira.user("yao") >>> "SPARK-44801" 'SPARK-44801' >>> asf_jira.assign_issue(issue.key, assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'issue' is not defined >>> asf_jira.assign_issue("SPARK-44801", assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 123, in wrapper result = func(*arg_list, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 1891, in assign_issue self._session.put(url, data=json.dumps(payload)) File "/Users/hzyaoqin/python/lib/python3.11/site-packages/requests/sessions.py", line 649, in put return self.request("PUT", url, data=data, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 246, in request elif raise_on_error(response, **processed_kwargs): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 71, in raise_on_error raise JIRAError( jira.exceptions.JIRAError: JiraError HTTP 400 url: https://issues.apache.org/jira/rest/api/latest/issue/SPARK-44801/assignee response text = {"errorMessages":[],"errors":{"assignee":"User 'airhot' cannot be assigned issues."}} ``` The Jira userid 'yao' fails to return my JIRA profile as a candidate(20 in total) to match. So, 'airhot' from the head replaces me as an assignee. ### Why are the changes needed? bugfix for merge_spark_pr ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test locally ```python >>> def assign_issue(client: jira.client.JIRA, issue: int, assignee: str) -> bool: ... """Assign an issue to a user. ... ... Args: ... issue (Union[int, str]): the issue ID or key to assign ... assignee (str): the user to assign the issue to. None will set it to unassigned. -1 will set it to Automatic. ... ... Returns: ... bool ... """ ... url = getattr(client, "_get_latest_url")(f"issue/{issue}/assignee") ... payload = {"name": assignee} ... getattr(client, "_session").put(url, data=json.dumps(payload)) ... return True ... >>> >>> assign_issue(asf_jira, "SPARK-44801", "yao") True ``` Closes #42496 from yaooqinn/SPARK-44813. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 8fb799d47bbd5d5ce9db35283d08ab1a31dc37b9) Signed-off-by: Kent Yao <yao@apache.org> 18 August 2023, 18:04:24 UTC
0ef1419 Revert "[SPARK-44813][INFRA] The Jira Python misses our assignee when it searches users again" This reverts commit 7e7c41bf1007ca05ffc3d818d34d75570d234a6d. 18 August 2023, 17:57:03 UTC
7e7c41b [SPARK-44813][INFRA] The Jira Python misses our assignee when it searches users again ### What changes were proposed in this pull request? This PR creates an alternative to the assign_issue function in jira.client.JIRA. The original one has an issue that it will search users again and only choose the assignee from 20 candidates. If it's unmatched, it picks the head blindly. For example, ```python >>> assignee = asf_jira.user("yao") >>> "SPARK-44801" 'SPARK-44801' >>> asf_jira.assign_issue(issue.key, assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'issue' is not defined >>> asf_jira.assign_issue("SPARK-44801", assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 123, in wrapper result = func(*arg_list, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 1891, in assign_issue self._session.put(url, data=json.dumps(payload)) File "/Users/hzyaoqin/python/lib/python3.11/site-packages/requests/sessions.py", line 649, in put return self.request("PUT", url, data=data, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 246, in request elif raise_on_error(response, **processed_kwargs): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 71, in raise_on_error raise JIRAError( jira.exceptions.JIRAError: JiraError HTTP 400 url: https://issues.apache.org/jira/rest/api/latest/issue/SPARK-44801/assignee response text = {"errorMessages":[],"errors":{"assignee":"User 'airhot' cannot be assigned issues."}} ``` The Jira userid 'yao' fails to return my JIRA profile as a candidate(20 in total) to match. So, 'airhot' from the head replaces me as an assignee. ### Why are the changes needed? bugfix for merge_spark_pr ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test locally ```python >>> def assign_issue(client: jira.client.JIRA, issue: int, assignee: str) -> bool: ... """Assign an issue to a user. ... ... Args: ... issue (Union[int, str]): the issue ID or key to assign ... assignee (str): the user to assign the issue to. None will set it to unassigned. -1 will set it to Automatic. ... ... Returns: ... bool ... """ ... url = getattr(client, "_get_latest_url")(f"issue/{issue}/assignee") ... payload = {"name": assignee} ... getattr(client, "_session").put(url, data=json.dumps(payload)) ... return True ... >>> >>> assign_issue(asf_jira, "SPARK-44801", "yao") True ``` Closes #42496 from yaooqinn/SPARK-44813. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 00255bc63b1a3bbe80bedc639b88d4a8e3f88f72) Signed-off-by: Sean Owen <srowen@gmail.com> 18 August 2023, 15:03:19 UTC
3d27f20 [SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark Worker LogPage UI buttons ### What changes were proposed in this pull request? This PR aims to fix `getBaseURI` errors when we clicks Spark Worker LogPage UI buttons in Apache Spark 3.2.0+. ### Why are the changes needed? Run a Spark job and open the Spark Worker UI, http://localhost:8081 . ``` $ sbin/start-master.sh $ sbin/start-worker.sh spark://127.0.0.1:7077 $ bin/spark-shell --master spark://127.0.0.1:7077 ``` Click `stderr` and `Load New` button. The button is out of order currently due to the following error because `getBaseURI` is defined in `utils.js`. ![Screenshot 2023-08-17 at 2 38 45 PM](https://github.com/apache/spark/assets/9700541/c2358ae3-46d2-43fe-9cc1-ce343725ce4c) ### Does this PR introduce _any_ user-facing change? This will make the buttons work. ### How was this patch tested? Manual. Closes #42546 from dongjoon-hyun/SPARK-44857. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f807bd239aebe182a04f5452d6efdf458e44143c) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 August 2023, 01:07:39 UTC
330f7a7 [SPARK-44581][YARN] Fix the bug that ShutdownHookManager gets the wrong UGI from the SecurityManager of ApplicationMaster ### What changes were proposed in this pull request? Make the SecurityManager instance a lazy value ### Why are the changes needed? Fix the bug in issue [SPARK-44581](https://issues.apache.org/jira/browse/SPARK-44581) **Bug:** In Spark 3.2 it throws org.apache.hadoop.security.AccessControlException, but in Spark 2.4 this hook does not throw an exception. I rebuilt the hadoop-client-api.jar and added some debug logging before the Hadoop shutdown hook is created, and rebuilt the spark-yarn.jar to add some debug logging when creating the Spark shutdown hook manager; here is a screenshot of the log: ![image](https://github.com/apache/spark/assets/62563545/ea338db3-646c-432c-bf16-1f445adc2ad9) We can see from the screenshot that the ShutdownHookManager is initialized before the ApplicationMaster creates a new UGI. **Reason:** The main cause is that the shutdown hook thread is created before we create the UGI in ApplicationMaster. When we set the config key "hadoop.security.credential.provider.path", the ApplicationMaster tries to get a filesystem when generating SSLOptions, and initializing that filesystem spawns a new thread whose UGI is inherited from the current process (yarn). After this, ApplicationMaster creates a new UGI (SPARK_USER) and executes the doAs() function. Here is the call chain: ApplicationMaster.(ApplicationMaster.scala:83) -> org.apache.spark.SecurityManager.(SecurityManager.scala:98) -> org.apache.spark.SSLOptions$.parse(SSLOptions.scala:188) -> org.apache.hadoop.conf.Configuration.getPassword(Configuration.java:2353) -> org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:2434) -> org.apache.hadoop.security.alias.CredentialProviderFactory.getProviders(CredentialProviderFactory.java:82) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I didn't add a new unit test for this, but I rebuilt the package and ran a program on my cluster, and the user that deletes the staging file now matches SPARK_USER. Closes #42405 from liangyu-1/SPARK-44581. Authored-by: 余良 <yul165@chinaunicom.cn> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit e584ed4ad96a0f0573455511d7be0e9b2afbeb96) Signed-off-by: Kent Yao <yao@apache.org> 09 August 2023, 05:48:08 UTC
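A tiny sketch of why making the field lazy changes the ordering, with hypothetical classes rather than the real ApplicationMaster: a lazy val is only evaluated at first use, so the expensive construction happens after the user has been switched to SPARK_USER rather than during ApplicationMaster construction.

```scala
// Hypothetical stand-ins to show the ordering effect of `lazy val`.
class FakeSecurityManager(owner: String) {
  println(s"SecurityManager constructed as user: $owner")  // here: filesystem access, credential lookup, ...
}

class FakeApplicationMaster(processUser: String) {
  private var currentUser: String = processUser             // starts as the YARN process user

  // Eager: would run during construction, i.e. still as `processUser`.
  // private val securityManager = new FakeSecurityManager(currentUser)

  // Lazy: runs at first use, after the user has been switched.
  private lazy val securityManager = new FakeSecurityManager(currentUser)

  def doAsSparkUser(sparkUser: String): Unit = {
    currentUser = sparkUser
    val _ = securityManager                                  // first use forces construction here
  }
}

new FakeApplicationMaster("yarn").doAsSparkUser("spark_user")
// prints: SecurityManager constructed as user: spark_user
```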
8d706c7 [SPARK-44725][DOCS] Document `spark.network.timeoutInterval` ### What changes were proposed in this pull request? This PR aims to document `spark.network.timeoutInterval` configuration. ### Why are the changes needed? Like `spark.network.timeout`, `spark.network.timeoutInterval` exists since Apache Spark 1.3.x. https://github.com/apache/spark/blob/418bba5ad6053449a141f3c9c31ed3ad998995b8/core/src/main/scala/org/apache/spark/internal/config/Network.scala#L48-L52 Since this is a user-facing configuration like the following, we had better document it. https://github.com/apache/spark/blob/418bba5ad6053449a141f3c9c31ed3ad998995b8/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala#L91-L93 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual because this is a doc-only change. Closes #42402 from dongjoon-hyun/SPARK-44725. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3af2e77bd20cad3d9fe23cc0689eed29d5f5a537) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 August 2023, 23:04:07 UTC
951cb64 Preparing development version 3.3.4-SNAPSHOT 04 August 2023, 04:33:52 UTC
8c2b331 Preparing Spark release v3.3.3-rc1 04 August 2023, 04:33:19 UTC
407bb57 [SPARK-44653][SQL] Non-trivial DataFrame unions should not break caching We have a long-standing tricky optimization in `Dataset.union`, which invokes the optimizer rule `CombineUnions` to pre-optimize the analyzed plan. This is to avoid too large analyzed plan for a specific dataframe query pattern `df1.union(df2).union(df3).union...`. This tricky optimization is designed to break dataframe caching, but we thought it was fine as people usually won't cache the intermediate dataframe in a union chain. However, `CombineUnions` gets improved from time to time (e.g. https://github.com/apache/spark/pull/35214) and now it can optimize a wide range of Union patterns. Now it's possible that people union two dataframe, do something with `select`, and cache it. Then the dataframe is unioned again with other dataframes and people expect the df cache to work. However the cache won't work due to the tricky optimization in `Dataset.union`. This PR updates `Dataset.union` to only combine adjacent Unions to match the original purpose. Fix perf regression due to breaking df caching no new test Closes #42315 from cloud-fan/union. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ce1fe57cdd7004a891ef8b97c77ac96b3719efcd) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 August 2023, 03:32:30 UTC
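A condensed sketch of the difference between collapsing a whole chain and only combining adjacent Unions, on a toy plan ADT rather than Spark's `LogicalPlan`:

```scala
sealed trait Plan
final case class Leaf(name: String) extends Plan
final case class Select(child: Plan) extends Plan
final case class Union(children: Seq[Plan]) extends Plan

// Only merge Union children that are themselves Unions; never look through other
// operators such as Select, so a cached Select(Union(...)) subtree keeps its shape
// and can still be matched against the cache.
def combineAdjacentUnions(plan: Plan): Plan = plan match {
  case Union(children) =>
    Union(children.map(combineAdjacentUnions).flatMap {
      case Union(grandChildren) => grandChildren
      case other => Seq(other)
    })
  case Select(child) => Select(combineAdjacentUnions(child))
  case leaf: Leaf => leaf
}

val cachedPart = Select(Union(Seq(Leaf("df1"), Leaf("df2"))))   // imagine this subtree was cached
val query = Union(Seq(cachedPart, Leaf("df3")))
combineAdjacentUnions(query)
// Union(List(Select(Union(List(Leaf(df1), Leaf(df2)))), Leaf(df3))) -- cached subtree preserved
```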
9e86aed [SPARK-44588][CORE][3.3] Fix double encryption issue for migrated shuffle blocks ### What changes were proposed in this pull request? Fix double encryption issue for migrated shuffle blocks Shuffle blocks upon migration are sent without decryption when io.encryption is enabled. The code on the receiving side ends up using serializer.wrapStream on the OutputStream to the file which results in the already encrypted bytes being encrypted again when the bytes are written out. This patch removes the usage of serializerManager.wrapStream on the receiving side and also adds tests that validate that this works as expected. I have also validated that the added tests will fail if the fix is not in place. Jira ticket with more details: https://issues.apache.org/jira/browse/SPARK-44588 ### Why are the changes needed? Migrated shuffle blocks will be double encrypted when `spark.io.encryption = true` without this fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests were added to test shuffle block migration with spark.io.encryption enabled and also fixes a test helper method to properly construct the SerializerManager with the encryption key. Closes #42277 from henrymai/branch-3.3_backport_double_encryption. Authored-by: Henry Mai <henrymai@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 02 August 2023, 04:07:51 UTC
073d0b6 [SPARK-44251][SQL] Set nullable correctly on coalesced join key in full outer USING join ### What changes were proposed in this pull request? For full outer joins employing USING, set the nullability of the coalesced join columns to true. ### Why are the changes needed? The following query produces incorrect results: ``` create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2); create or replace temp view v2 as values (2, 3) as (c1, c2); select explode(array(c1)) as x from v1 full outer join v2 using (c1); -1 <== should be null 1 2 ``` The following query fails with a `NullPointerException`: ``` create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2); create or replace temp view v2 as values ('2', 3) as (c1, c2); select explode(array(c1)) as x from v1 full outer join v2 using (c1); 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43) ... ``` The above full outer joins implicitly add an aliased coalesce to the parent projection of the join: `coalesce(v1.c1, v2.c1) as c1`. In the case where only one side's key is nullable, the coalesce's nullability is false. As a result, the generator's output has nullable set as false. But this is incorrect: If one side has a row with explicit null key values, the other side's row will also have null key values (because the other side's row will be "made up"), and both the `coalesce` and the `explode` will return a null value. While `UpdateNullability` actually repairs the nullability of the `coalesce` before execution, it doesn't recreate the generator output, so the nullability remains incorrect in `Generate#output`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #41809 from bersprockets/using_oddity2. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 7a27bc68c849041837e521285e33227c3d1f9853) Signed-off-by: Yuming Wang <yumwang@ebay.com> 11 July 2023, 03:19:49 UTC
edc13fd [SPARK-44215][3.3][SHUFFLE] If num chunks are 0, then server should throw a RuntimeException ### What changes were proposed in this pull request? The executor expects `numChunks` to be > 0. If it is zero, then we see that the executor fails with ``` 23/06/20 19:07:37 ERROR task 2031.0 in stage 47.0 (TID 25018) Executor: Exception in task 2031.0 in stage 47.0 (TID 25018) java.lang.ArithmeticException: / by zero at org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse(PushBasedFetchHelper.scala:128) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:1047) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:90) at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ``` Because this is an `ArithmeticException`, the executor doesn't fallback. It's not a `FetchFailure` either, so the stage is not retried and the application fails. ### Why are the changes needed? The executor should fallback to fetch original blocks and not fail because this suggests that there is an issue with push-merged block. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Modified the existing UTs to validate that RuntimeException is thrown when numChunks are 0. Closes #41859 from otterc/SPARK-44215-branch-3.3. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> 06 July 2023, 01:48:05 UTC
e9b525e [SPARK-42784] Should still create subDirs when the number of subDirs in the merge dir is less than configured ### What changes were proposed in this pull request? Fixed a minor issue with DiskBlockManager after push-based shuffle is enabled ### Why are the changes needed? This bug affects the efficiency of push-based shuffle ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #40412 from Stove-hust/feature-42784. Authored-by: meifencheng <meifencheng@meituan.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 35d51571a803b8fa7d14542236276425b517d3af) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 01 July 2023, 03:51:05 UTC
84620f2 [SPARK-44241][CORE] Mistakenly setting io.connectionTimeout/connectionCreationTimeout to zero or a negative value causes incessant executor constructions/destructions ### What changes were proposed in this pull request? This PR clamps io.connectionTimeout/connectionCreationTimeout to zero when they are negative. Zero here means - connectionCreationTimeout = 0: an unlimited connection timeout for connection establishment - connectionTimeout = 0: the `IdleStateHandler` for triggering `IdleStateEvent` is disabled. ### Why are the changes needed? 1. This PR fixes a bug when connectionCreationTimeout is 0, which means unlimited to Netty, but ChannelFuture.await(0) fails directly and inappropriately. 2. This PR fixes a bug when connectionCreationTimeout is less than 0, which causes meaningless transport client reconnections and endless executor reconstructions ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit tests Closes #41785 from yaooqinn/SPARK-44241. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 38645fa470b5af7c2e41efa4fb092bdf2463fbbd) Signed-off-by: Kent Yao <yao@apache.org> 30 June 2023, 10:33:54 UTC
b7d8ddb [SPARK-44184][PYTHON][DOCS] Remove a wrong doc about `ARROW_PRE_0_15_IPC_FORMAT` ### What changes were proposed in this pull request? This PR aims to remove a wrong documentation about `ARROW_PRE_0_15_IPC_FORMAT`. ### Why are the changes needed? Since Apache Spark 3.0.0, Spark doesn't allow `ARROW_PRE_0_15_IPC_FORMAT` environment variable at all. https://github.com/apache/spark/blob/2407183cb8637b6ac2d1b76320cae9cbde3411da/python/pyspark/sql/pandas/utils.py#L69-L73 ### Does this PR introduce _any_ user-facing change? No. This is a removal of outdated wrong documentation. ### How was this patch tested? Manual review. Closes #41730 from dongjoon-hyun/SPARK-44184. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 00e7c08606d0b6de22604d2a7350ea0711355300) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 June 2023, 01:53:43 UTC
d52aa4f [SPARK-44158][K8S] Remove unused `spark.kubernetes.executor.lostCheckmaxAttempts` ### What changes were proposed in this pull request? This PR aims to remove `spark.kubernetes.executor.lostCheckmaxAttempts` because it was not used after SPARK-24248 (Apache Spark 2.4.0) ### Why are the changes needed? To clean up this from documentation and code. ### Does this PR introduce _any_ user-facing change? No because it was no-op already. ### How was this patch tested? Pass the CIs. Closes #41713 from dongjoon-hyun/SPARK-44158. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6590e7db5212bb0dc90f22133a96e3d5e385af65) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 June 2023, 20:49:45 UTC
a7bbaca [SPARK-44134][CORE] Fix setting resources (GPU/FPGA) to 0 when they are set in spark-defaults.conf ### What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-44134 With resource aware scheduling, if you specify a default value in the spark-defaults.conf, a user can't override that to set it to 0. Meaning spark-defaults.conf has something like: spark.executor.resource.{resourceName}.amount=1 spark.task.resource.{resourceName}.amount =1 If the user tries to override when submitting an application with spark.executor.resource.{resourceName}.amount=0 and spark.task.resource.{resourceName}.amount =0, the applicatoin fails to submit. it should submit and just not try to allocate those resources. This worked back in Spark 3.0 but was broken when the stage level scheduling feature was added. Here I fixed it by simply removing any task resources from the list if they are set to 0. Note I also fixed a typo in the exception message when no executor resources are specified but task resources are. Note, ideally this is backported to all of the maintenance releases ### Why are the changes needed? Fix a bug described above ### Does this PR introduce _any_ user-facing change? no api changes ### How was this patch tested? Added unit test and then ran manually on standalone and YARN clusters to verify overriding the configs now works. Closes #41703 from tgravescs/fixResource0. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit cf6e90c205bd76ff9e1fc2d88757d9d44ec93162) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 June 2023, 00:41:06 UTC
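A hedged sketch of the filtering idea described above, using a hypothetical config model rather than Spark's `ResourceProfile` code: resource requests whose amount is explicitly set to 0 are simply dropped instead of failing validation at submit time.

```scala
// Hypothetical, simplified view of merged configs: defaults from
// spark-defaults.conf overridden by --conf values at submit time.
val defaults = Map(
  "spark.executor.resource.gpu.amount" -> "1",
  "spark.task.resource.gpu.amount" -> "1")
val submitOverrides = Map(
  "spark.executor.resource.gpu.amount" -> "0",
  "spark.task.resource.gpu.amount" -> "0")

val effective = defaults ++ submitOverrides

// Keep only resource requests with a positive amount; an explicit 0 now means
// "don't request this resource" rather than "fail the submission".
val requestedResources = effective.collect {
  case (key, amount) if key.endsWith(".amount") && amount.toDouble > 0 => key -> amount
}
// requestedResources is empty here, so no GPU is requested at all.
```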
6868d6c [SPARK-44142][PYTHON] Replace type with tpe in utility to convert python types to spark types ### What changes were proposed in this pull request? In the typehints utility to convert python types to spark types, use the variable `tpe` for comparison to the string representation of categorical types rather than `type`. ### Why are the changes needed? Currently, the Python keyword `type` is used in the comparison, which will always be false. The user's type is stored in variable `tpe`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing `test_categorical.py` tests Closes #41697 from ted-jenks/patch-3. Authored-by: Ted <77486246+ted-jenks@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5feab1072ba057da3811069c3eb8efec0de1044c) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 23 June 2023, 00:09:33 UTC
b969906 [SPARK-44040][SQL] Fix stats computation when an AggregateExec node is above QueryStageExec ### What changes were proposed in this pull request? This PR fixes stats computation when `BaseAggregateExec` nodes are above `QueryStageExec`. For aggregation, when the number of shuffle output rows is 0, the final result may be 1. For example: ```sql SELECT count(*) FROM tbl WHERE false; ``` The number of shuffle output rows is 0, and the final result is 1. Please see the [UI](https://github.com/apache/spark/assets/5399861/9d9ad999-b3a9-433e-9caf-c0b931423891). ### Why are the changes needed? Fixes a data correctness issue. `OptimizeOneRowPlan` will use stats to remove `Aggregate`: ``` === Applying Rule org.apache.spark.sql.catalyst.optimizer.OptimizeOneRowPlan === !Aggregate [id#5L], [id#5L] Project [id#5L] +- Union false, false +- Union false, false :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)]) :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)]) +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)]) +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)]) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #41576 from wangyum/SPARK-44040. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 55ba63c257b6617ec3d2aca5bc1d0989d4f29de8) Signed-off-by: Yuming Wang <yumwang@ebay.com> 16 June 2023, 04:06:37 UTC
3586ca3 [SPARK-32559][SQL] Fix the trim logic that didn't handle ASCII control characters correctly ### What changes were proposed in this pull request? The trim logic in the Cast expression introduced in https://github.com/apache/spark/pull/29375 trims ASCII control characters unexpectedly. Before this patch ![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7) And Hive ![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe) ### Why are the changes needed? The behavior described above isn't consistent with the behavior of Hive ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added a unit test Closes #41535 from Kwafoor/trim_bugfix. Lead-authored-by: wangjunbo <wangjunbo@qiyi.com> Co-authored-by: Junbo wang <1042815068@qq.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 80588e4ddfac1d2e2fdf4f9a7783c56be6a97cdd) Signed-off-by: Kent Yao <yao@apache.org> 13 June 2023, 03:44:40 UTC
8840614 [SPARK-43398][CORE] Executor timeout should be max of idle shuffle and rdd timeout ### What changes were proposed in this pull request? Executor timeout should be max of idle, shuffle and rdd timeout ### Why are the changes needed? Wrong timeout value when combining idle, shuffle and rdd timeout ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added test in `ExecutorMonitorSuite` Closes #41082 from warrenzhu25/max-timeout. Authored-by: Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7107742a381cde2e6de9425e3e436282a8c0d27c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 June 2023, 07:40:57 UTC
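A small sketch of the corrected combination rule, as an illustration only and not the `ExecutorMonitor` code: while an executor still holds shuffle data or cached RDD blocks, its expiry must use the maximum of the applicable timeouts, not the minimum.

```scala
// Pick the executor's idle deadline: the executor stays alive as long as any
// retention reason (shuffle data, cached RDD blocks) still applies.
def idleDeadline(
    now: Long,
    idleTimeoutMs: Long,
    shuffleTimeoutMs: Long,
    rddTimeoutMs: Long,
    hasShuffleData: Boolean,
    hasCachedBlocks: Boolean): Long = {
  val applicable = Seq(
    if (hasShuffleData) Some(shuffleTimeoutMs) else None,
    if (hasCachedBlocks) Some(rddTimeoutMs) else None).flatten
  // With no shuffle or RDD data the plain idle timeout applies; otherwise keep
  // the executor until the *longest* applicable timeout expires.
  now + (if (applicable.isEmpty) idleTimeoutMs else applicable.max)
}
```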
ababc57 [SPARK-43976][CORE] Handle the case where modifiedConfigs doesn't exist in event logs ### What changes were proposed in this pull request? This prevents NPE by handling the case where `modifiedConfigs` doesn't exist in event logs. ### Why are the changes needed? Basically, this is the same solution for that case. - https://github.com/apache/spark/pull/34907 The new code was added here, but we missed the corner case. - https://github.com/apache/spark/pull/35972 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #41472 from dongjoon-hyun/SPARK-43976. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b4ab34bf9b22d0f0ca4ab13f9b6106f38ccfaebe) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 June 2023, 16:35:01 UTC
0c509be [SPARK-41958][CORE][3.3] Disallow arbitrary custom classpath with proxy user in cluster mode Backporting fix for SPARK-41958 to the 3.3 branch from #39474. The description below is from the original PR. -------------------------- ### What changes were proposed in this pull request? This PR proposes to disallow arbitrary custom classpath with proxy user in cluster mode by default. ### Why are the changes needed? To avoid arbitrary classpath in a Spark cluster. ### Does this PR introduce _any_ user-facing change? Yes. Users can re-enable this feature via `spark.submit.proxyUser.allowCustomClasspathInClusterMode`. ### How was this patch tested? Manually tested. Closes #39474 from Ngone51/dev. Lead-authored-by: Peter Toth <peter.toth@gmail.com> Co-authored-by: Yi Wu <yi.wu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 909da96e1471886a01a9e1def93630c4fd40e74a) Closes #41428 from degant/spark-41958-3.3. Lead-authored-by: Degant Puri <depuri@microsoft.com> Co-authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 June 2023, 00:14:36 UTC
86d22f4 [SPARK-43956][SQL][3.3] Fix the bug where the column's SQL is not displayed for Percentile[Cont|Disc] ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/41436 to 3.3. ### Why are the changes needed? Fix the bug where the column's SQL is not displayed for Percentile[Cont|Disc]. ### Does this PR introduce _any_ user-facing change? Yes. Users can see the correct SQL information. ### How was this patch tested? Test cases updated. Closes #41446 from beliefer/SPARK-43956_followup2. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 04 June 2023, 00:30:16 UTC
11d2f73 [SPARK-43751][SQL][DOC] Document `unbase64` behavior change ### What changes were proposed in this pull request? After SPARK-37820, `select unbase64("abcs==")`(malformed input) always throws an exception, this PR does not help in that case, it only improves the error message for `to_binary()`. So, `unbase64()`'s behavior for malformed input changed silently after SPARK-37820: - before: return a best-effort result, because it uses [LENIENT](https://github.com/apache/commons-codec/blob/rel/commons-codec-1.15/src/main/java/org/apache/commons/codec/binary/Base64InputStream.java#L46) policy: any trailing bits are composed into 8-bit bytes where possible. The remainder are discarded. - after: throw an exception And there is no way to restore the previous behavior. To tolerate the malformed input, the user should migrate `unbase64(<input>)` to `try_to_binary(<input>, 'base64')` to get NULL instead of interrupting by exception. ### Why are the changes needed? Add the behavior change to migration guide. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Manuelly review. Closes #41280 from pan3793/SPARK-43751. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit af6c1ec7c795584c28e15e4963eed83917e2f06a) Signed-off-by: Kent Yao <yao@apache.org> 26 May 2023, 03:34:25 UTC
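A hedged example of the migration suggested above, assuming a `spark` SparkSession in scope and Spark 3.3+ where `try_to_binary` is available:

```scala
// Strict behavior after SPARK-37820: malformed base64 input fails the query at execution.
spark.sql("SELECT unbase64('abcs==')").collect()       // throws

// Tolerant replacement: NULL for malformed input instead of an exception.
spark.sql("SELECT try_to_binary('abcs==', 'base64')").show()
```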
457bb34 [MINOR][PS][TESTS] Fix `SeriesDateTimeTests.test_quarter` to work properly ### What changes were proposed in this pull request? This PR proposes to fix `SeriesDateTimeTests.test_quarter` to work properly. ### Why are the changes needed? Test has not been properly testing ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested, and the existing CI should pass Closes #41274 from itholic/minor_quarter_test. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a5c53384def22b01b8ef28bee6f2d10648bce1a1) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2023, 07:04:42 UTC
a3e79f4 [SPARK-43719][WEBUI] Handle `missing row.excludedInStages` field ### What changes were proposed in this pull request? This PR aims to handle a corner case when `row.excludedInStages` field is missing. ### Why are the changes needed? To fix the following type error when Spark loads some very old 2.4.x or 3.0.x logs. ![missing](https://github.com/apache/spark/assets/9700541/f402df10-bf92-4c9f-8bd1-ec9b98a67966) We have two places and this PR protects both places. ``` $ git grep row.excludedInStages core/src/main/resources/org/apache/spark/ui/static/executorspage.js: if (typeof row.excludedInStages === "undefined" || row.excludedInStages.length == 0) { core/src/main/resources/org/apache/spark/ui/static/executorspage.js: return "Active (Excluded in Stages: [" + row.excludedInStages.join(", ") + "])"; ``` ### Does this PR introduce _any_ user-facing change? No, this will remove the error case only. ### How was this patch tested? Manual review. Closes #41266 from dongjoon-hyun/SPARK-43719. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit eeab2e701330f7bc24e9b09ce48925c2c3265aa8) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2023, 05:17:33 UTC
ebf1572 [SPARK-43718][SQL] Set nullable correctly for keys in USING joins ### What changes were proposed in this pull request? In `Anaylzer#commonNaturalJoinProcessing`, set nullable correctly when adding the join keys to the Project's hidden_output tag. ### Why are the changes needed? The value of the hidden_output tag will be used for resolution of attributes in parent operators, so incorrect nullabilty can cause problems. For example, assume this data: ``` create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); ``` The following query produces incorrect results: ``` spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> ``` Similar issues occur with right outer join and left outer join. `t1.c1` and `t2.c1` have the wrong nullability at the time the array is resolved, so the array's `containsNull` value is incorrect. `UpdateNullability` will update the nullability of `t1.c1` and `t2.c1` in the `CreateArray` arguments, but will not update `containsNull` in the function's data type. Queries that don't use arrays also can get wrong results. Assume this data: ``` create or replace temp view t1 as values (0), (1), (2) as (c1); create or replace temp view t2 as values (1), (2), (3) as (c1); create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b); ``` The following query produces incorrect results: ``` select t1.c1 as t1_c1, t2.c1 as t2_c1, b from t1 full outer join t2 using (c1), lateral ( select b from t3 where a = coalesce(t2.c1, 1) ) lt3; 1 1 2 NULL 3 4 Time taken: 2.395 seconds, Fetched 2 row(s) spark-sql (default)> ``` The result should be the following: ``` 0 NULL 2 1 1 2 NULL 3 4 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. Closes #41267 from bersprockets/using_anomaly. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 217b30a4f7ca18ade19a9552c2b87dd4caeabe57) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2023, 04:47:56 UTC
99f10aa [SPARK-43589][SQL][3.3] Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString` ### What changes were proposed in this pull request? This is a logical backporting of #41232 This PR aims to fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString` instead of shift operations. ### Why are the changes needed? To avoid user confusion by giving more accurate values. For example, `maxBroadcastTableBytes` is 1GB and `dataSize` is `2GB - 1 byte`. **BEFORE** ``` Cannot broadcast the table that is larger than 1GB: 1 GB. ``` **AFTER** ``` Cannot broadcast the table that is larger than 1024.0 MiB: 2048.0 MiB. ``` ### Does this PR introduce _any_ user-facing change? Yes, but only error message. ### How was this patch tested? Pass the CIs with newly added test case. Closes #41234 from dongjoon-hyun/SPARK-43589-3.3. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 May 2023, 15:34:07 UTC
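For illustration, a simplified Scala re-implementation sketch that mimics the rounding behavior shown in the message above; the real helper in Spark is `Utils.bytesToString`, and this is only a stand-in.

```scala
// Move to a larger unit only once the value reaches 2x that unit, which is why
// 1 GiB renders as "1024.0 MiB" and (2 GiB - 1 byte) as "2048.0 MiB" above.
def bytesToString(size: Long): String = {
  val KiB = 1L << 10; val MiB = 1L << 20; val GiB = 1L << 30; val TiB = 1L << 40
  val (value, unit) =
    if (size >= 2 * TiB) (size.toDouble / TiB, "TiB")
    else if (size >= 2 * GiB) (size.toDouble / GiB, "GiB")
    else if (size >= 2 * MiB) (size.toDouble / MiB, "MiB")
    else if (size >= 2 * KiB) (size.toDouble / KiB, "KiB")
    else (size.toDouble, "B")
  f"$value%.1f $unit"
}

bytesToString(1L << 30)         // "1024.0 MiB"
bytesToString((2L << 30) - 1)   // "2048.0 MiB"
```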
b4d454f [SPARK-43587][CORE][TESTS] Run `HealthTrackerIntegrationSuite` in a dedicated JVM ### What changes were proposed in this pull request? This PR aims to run `HealthTrackerIntegrationSuite` in a dedicated JVM to mitigate a flaky tests. ### Why are the changes needed? `HealthTrackerIntegrationSuite` has been flaky and SPARK-25400 and SPARK-37384 increased the timeout `from 1s to 10s` and `10s to 20s`, respectively. The usual suspect of this flakiness is some unknown side-effect like GCs. In this PR, we aims to run this in a separate JVM instead of increasing the timeout more. https://github.com/apache/spark/blob/abc140263303c409f8d4b9632645c5c6cbc11d20/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala#L56-L58 This is the recent failure. - https://github.com/apache/spark/actions/runs/5020505360/jobs/9002039817 ``` [info] HealthTrackerIntegrationSuite: [info] - If preferred node is bad, without excludeOnFailure job will fail (92 milliseconds) [info] - With default settings, job can succeed despite multiple bad executors on node (3 seconds, 163 milliseconds) [info] - Bad node with multiple executors, job will still succeed with the right confs *** FAILED *** (20 seconds, 43 milliseconds) [info] java.util.concurrent.TimeoutException: Futures timed out after [20 seconds] [info] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259) [info] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187) [info] at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:355) [info] at org.apache.spark.scheduler.SchedulerIntegrationSuite.awaitJobTermination(SchedulerIntegrationSuite.scala:276) [info] at org.apache.spark.scheduler.HealthTrackerIntegrationSuite.$anonfun$new$9(HealthTrackerIntegrationSuite.scala:92) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #41229 from dongjoon-hyun/SPARK-43587. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit eb2456bce2522779bf6b866a5fbb728472d35097) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 May 2023, 06:31:41 UTC
132678f [SPARK-43541][SQL][3.3] Propagate all `Project` tags in resolving of expressions and missing columns ### What changes were proposed in this pull request? In the PR, I propose to propagate all tags in a `Project` while resolving of expressions and missing columns in `ColumnResolutionHelper.resolveExprsAndAddMissingAttrs()`. This is a backport of https://github.com/apache/spark/pull/41204. ### Why are the changes needed? To fix the bug reproduced by the query below: ```sql spark-sql (default)> WITH > t1 AS (select key from values ('a') t(key)), > t2 AS (select key from values ('a') t(key)) > SELECT t1.key > FROM t1 FULL OUTER JOIN t2 USING (key) > WHERE t1.key NOT LIKE 'bb.%'; [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `t1`.`key` cannot be resolved. Did you mean one of the following? [`key`].; line 4 pos 7; ``` ### Does this PR introduce _any_ user-facing change? No. It fixes a bug, and outputs the expected result: `a`. ### How was this patch tested? By new test added to `using-join.sql`: ``` $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z using-join.sql" ``` and the related test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.HiveContextCompatibilitySuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 09d5742a8679839d0846f50e708df98663a6d64c) Closes #41221 from MaxGekk/fix-using-join-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 May 2023, 00:24:50 UTC
9110c05 [SPARK-37829][SQL][3.3] Dataframe.joinWith outer-join should return a null value for unmatched row ### What changes were proposed in this pull request? This is a pull request to port the fix from the master branch to version 3.3. [PR](https://github.com/apache/spark/pull/40755) When doing an outer join with joinWith on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not a expected behavior, and it's a regression introduced in [this commit](https://github.com/apache/spark/commit/cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59). This pull request aims to fix the regression, note this is not a full rollback of the commit, do not add back "schema" variable. ``` case class ClassData(a: String, b: Int) val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF left.joinWith(right, left("b") === right("b"), "left_outer").collect ``` ``` Wrong results (current behavior): Array(([a,1],[null,null]), ([b,2],[x,2])) Correct results: Array(([a,1],null), ([b,2],[x,2])) ``` ### Why are the changes needed? We need to address the regression mentioned above. It results in unexpected behavior changes in the Dataframe joinWith API between versions 2.4.8 and 3.0.0+. This could potentially cause data correctness issues for users who expect the old behavior when using Spark 3.0.0+. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test (use the same test in previous [closed pull request](https://github.com/apache/spark/pull/35140), credit to Clément de Groc) Run sql-core and sql-catalyst submodules locally with ./build/mvn clean package -pl sql/core,sql/catalyst Closes #40755 from kings129/encoder_bug_fix. Authored-by: --global <xuqiang129gmail.com> Closes #40858 from kings129/fix_encoder_branch_33. Authored-by: --global <xuqiang129@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 May 2023, 21:14:30 UTC
85ff71f [SPARK-43395][BUILD] Exclude macOS tar extended metadata in make-distribution.sh ### What changes were proposed in this pull request? Add args `--no-mac-metadata --no-xattrs --no-fflags` to `tar` on macOS in `dev/make-distribution.sh` to exclude macOS-specific extended metadata. ### Why are the changes needed? The binary tarball created on macOS includes extended macOS-specific metadata and xattrs, which causes warnings when unarchiving it on Linux. Steps to reproduce: 1. Create the tarball on macOS (13.3.1) ``` ➜ apache-spark git:(master) tar --version bsdtar 3.5.3 - libarchive 3.5.3 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.8 ``` ``` ➜ apache-spark git:(master) dev/make-distribution.sh --tgz ``` 2. Unarchive the binary tarball on Linux (CentOS-7) ``` ➜ ~ tar --version tar (GNU tar) 1.26 Copyright (C) 2011 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>. This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Written by John Gilmore and Jay Fenlason. ``` ``` ➜ ~ tar -xzf spark-3.5.0-SNAPSHOT-bin-3.3.5.tgz tar: Ignoring unknown extended header keyword `SCHILY.fflags' tar: Ignoring unknown extended header keyword `LIBARCHIVE.xattr.com.apple.FinderInfo' ``` ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Created a binary tarball on macOS, then unarchived it on Linux; the warnings disappear after this change. Closes #41074 from pan3793/SPARK-43395. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 2d0240df3c474902e263f67b93fb497ca13da00f) Signed-off-by: Sean Owen <srowen@gmail.com> 06 May 2023, 14:38:19 UTC
bc2c955 [SPARK-43337][UI][3.3] Asc/desc arrow icons for sorting column does not get displayed in the table column ### What changes were proposed in this pull request? Remove the CSS `!important` rule for the asc/desc arrow icons in jquery.dataTables.1.10.25.min.css ### Why are the changes needed? Upgrading to DataTables 1.10.25 broke the asc/desc arrow icons for sorting columns. The sorting icon is not displayed when the column is clicked to sort by asc/desc. This is because the new jquery.dataTables.1.10.25.min.css file in DataTables 1.10.25 added an `!important` rule, preventing the override set in webui-dataTables.css ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. ![image](https://user-images.githubusercontent.com/52679095/236394863-e0004e7b-5173-495a-af23-32c1343e0ee6.png) ![image](https://user-images.githubusercontent.com/52679095/236394879-db0e5e0e-f6b3-48c3-9c79-694dd9abcb76.png) Closes #41060 from maytasm/fix-arrow-3. Authored-by: Maytas Monsereenusorn <maytasm@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> 05 May 2023, 19:07:59 UTC
e9aab41 [SPARK-43293][SQL] `__qualified_access_only` should be ignored in normal columns This is a follow-up of https://github.com/apache/spark/pull/39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only for metadata columns. It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS whose table schema contains the special flag. No. New analysis test. Closes #40961 from cloud-fan/col. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 021f02e02fb88bbbccd810ae000e14e0c854e2e6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 April 2023, 02:55:21 UTC
058dcbf [SPARK-43240][SQL][3.3] Fix the wrong result issue when calling df.describe() method ### What changes were proposed in this pull request? The df.describe() method caches the RDD. If the cached RDD is an RDD[UnsafeRow], whose rows may be released after they are used, then the result will be wrong. Here we need to copy the RDD before caching, as the [TakeOrderedAndProjectExec](https://github.com/apache/spark/blob/d68d46c9e2cec04541e2457f4778117b570d8cdb/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L204) operator does. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Closes #40914 from JkSelf/describe. Authored-by: Jia Ke <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 April 2023, 09:24:46 UTC
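A minimal sketch of the defensive-copy pattern the fix relies on (simplified, not the actual `describe()` code path): copy each `InternalRow` before the RDD is cached so cached entries cannot point at reused `UnsafeRow` buffers.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.storage.StorageLevel

// UnsafeRow instances can be backed by buffers that upstream operators reuse,
// so materialize a private copy of every row before handing the RDD to the
// cache; otherwise cached entries may later observe overwritten data.
def cacheSafely(rows: RDD[InternalRow]): RDD[InternalRow] =
  rows.map(_.copy()).persist(StorageLevel.MEMORY_AND_DISK)
```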
8f0c75c [SPARK-43113][SQL][3.3] Evaluate stream-side variables when generating code for a bound condition ### What changes were proposed in this pull request? This is a back-port of #40766 and #40881. In `JoinCodegenSupport#getJoinCondition`, evaluate any referenced stream-side variables before using them in the generated code. This patch doesn't evaluate the passed stream-side variables directly, but instead evaluates a copy (`streamVars2`). This is because `SortMergeJoin#codegenFullOuter` will want to evaluate the stream-side vars within a different scope than the condition check, so we mustn't delete the initialization code from the original `ExprCode` instances. ### Why are the changes needed? When a bound condition of a full outer join references the same stream-side column more than once, wholestage codegen generates bad code. For example, the following query fails with a compilation error: ``` create or replace temp view v1 as select * from values (1, 1), (2, 2), (3, 1) as v1(key, value); create or replace temp view v2 as select * from values (1, 22, 22), (3, -1, -1), (7, null, null) as v2(a, b, c); select * from v1 full outer join v2 on key = a and value > b and value > c; ``` The error is: ``` org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 277, Column 9: Redefinition of local variable "smj_isNull_7" ``` The same error occurs with code generated from ShuffleHashJoinExec: ``` select /*+ SHUFFLE_HASH(v2) */ * from v1 full outer join v2 on key = a and value > b and value > c; ``` In this case, the error is: ``` org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 174, Column 5: Redefinition of local variable "shj_value_1" ``` Neither `SortMergeJoin#codegenFullOuter` nor `ShuffledHashJoinExec#doProduce` evaluate the stream-side variables before calling `consumeFullOuterJoinRow#getJoinCondition`. As a result, `getJoinCondition` generates definition/initialization code for each referenced stream-side variable at the point of use. If a stream-side variable is used more than once in the bound condition, the definition/initialization code is generated more than once, resulting in the "Redefinition of local variable" error. In the end, the query succeeds, since Spark disables wholestage codegen and tries again. (In the case of other join-type/strategy pairs, either the implementations don't call `JoinCodegenSupport#getJoinCondition`, or the stream-side variables are pre-evaluated before the call is made, so no error happens in those cases). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. Closes #40917 from bersprockets/full_join_codegen_issue_br33. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 April 2023, 00:58:15 UTC
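A conceptual sketch of the approach described above (the helper shape is illustrative, not the actual `JoinCodegenSupport` signature): evaluate copies of the stream-side `ExprCode` variables once, so the bound condition can reference them repeatedly without redefining them, while the caller's originals keep their initialization code.

```scala
import org.apache.spark.sql.catalyst.expressions.codegen.ExprCode

// streamVars2 is a copy so that blanking out the emitted definition code does
// not affect the caller's ExprCode instances (codegenFullOuter still needs to
// evaluate the originals in its own scope).
def evalStreamVarsOnce(
    streamVars: Seq[ExprCode],
    evaluateVariables: Seq[ExprCode] => String): (Seq[ExprCode], String) = {
  val streamVars2 = streamVars.map(_.copy())
  val evalCode = evaluateVariables(streamVars2) // emits each definition exactly once
  (streamVars2, evalCode)
}
```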
6d0a271 [SPARK-41660][SQL][3.3] Only propagate metadata columns if they are used ### What changes were proposed in this pull request? backporting https://github.com/apache/spark/pull/39152 to 3.3 ### Why are the changes needed? bug fixing ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #40889 from huaxingao/metadata. Authored-by: huaxingao <huaxin_gao@apple.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> 21 April 2023, 14:46:00 UTC
cd16624 [SPARK-43158][DOCS] Set upperbound of pandas version for Binder integration This PR proposes to set the upperbound for pandas in Binder integration. We don't currently support pandas 2.0.0 properly, see also https://issues.apache.org/jira/browse/SPARK-42618 To make the quickstarts working. Yes, it fixes the quickstart. Tested in: - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb Closes #40814 from HyukjinKwon/set-lower-bound-pandas. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a2592e6d46caf82c642f4a01a3fd5c282bbe174e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 April 2023, 03:59:18 UTC
cc6d3f9 [SPARK-43050][SQL] Fix construct aggregate expressions by replacing grouping functions This PR fixes constructing aggregate expressions by replacing grouping functions if an expression is part of an aggregation. In the following example, the second `b` should also be replaced: <img width="545" alt="image" src="https://user-images.githubusercontent.com/5399861/230415618-84cd6334-690e-4b0b-867b-ccc4056226a8.png"> It fixes the following bug: ``` spark-sql (default)> SELECT CASE WHEN a IS NULL THEN count(b) WHEN b IS NULL THEN count(c) END > FROM grouping > GROUP BY GROUPING SETS (a, b, c); [MISSING_AGGREGATION] The non-aggregating expression "b" is based on columns which are not participating in the GROUP BY clause. ``` No. Unit test. Closes #40685 from wangyum/SPARK-43050. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 45b84cd37add1b9ce274273ad5e519e6bc1d8013) Signed-off-by: Yuming Wang <yumwang@ebay.com> 15 April 2023, 02:17:43 UTC
5ce51c3 [SPARK-43069][BUILD] Use `sbt-eclipse` instead of `sbteclipse-plugin` ### What changes were proposed in this pull request? This PR aims to use `sbt-eclipse` instead of `sbteclipse-plugin`. ### Why are the changes needed? Thanks to SPARK-34959, Apache Spark 3.2+ uses SBT 1.5.0 and we can use `sbt-eclipse` instead of the old `sbteclipse-plugin`. - https://github.com/sbt/sbt-eclipse/releases/tag/6.0.0 ### Does this PR introduce _any_ user-facing change? No, this is a dev-only plugin. ### How was this patch tested? Pass the CIs and manual tests. ``` $ build/sbt eclipse Using /Users/dongjoon/.jenv/versions/1.8 as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. Using SPARK_LOCAL_IP=localhost Attempting to fetch sbt Launching sbt from build/sbt-launch-1.8.2.jar [info] welcome to sbt 1.8.2 (AppleJDK-8.0.302.8.1 Java 1.8.0_302) [info] loading settings for project spark-merge-build from plugins.sbt ... [info] loading project definition from /Users/dongjoon/APACHE/spark-merge/project [info] Updating https://repo1.maven.org/maven2/com/github/sbt/sbt-eclipse_2.12_1.0/6.0.0/sbt-eclipse-6.0.0.pom 100.0% [##########] 2.5 KiB (4.5 KiB / s) ... ``` Closes #40708 from dongjoon-hyun/SPARK-43069. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 9cba5529d1fc3faf6b743a632df751d84ec86a07) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 April 2023, 19:54:28 UTC
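For reference, the swap boils down to one `addSbtPlugin` line in `project/plugins.sbt`; the new coordinates below match the `com/github/sbt/sbt-eclipse` artifact fetched in the log above, while the commented-out old line is only illustrative:

```scala
// project/plugins.sbt
// addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.2.4")  // old plugin (version illustrative)
addSbtPlugin("com.github.sbt" % "sbt-eclipse" % "6.0.0")
```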
f228f84 [SPARK-39696][CORE] Fix data race in access to TaskMetrics.externalAccums ### What changes were proposed in this pull request? This PR fixes a data race around concurrent access to `TaskMetrics.externalAccums`. The race occurs between the `executor-heartbeater` thread and the thread executing the task. This data race is not known to cause issues on 2.12, but in 2.13 ~due to this change https://github.com/scala/scala/pull/9258~ (LuciferYang bisected this to first cause failures in Scala 2.13.7; one possible reason could be https://github.com/scala/scala/pull/9786) it leads to an uncaught exception in the `executor-heartbeater` thread, which means that the executor will eventually be terminated due to missing heartbeats. This fix, using `CopyOnWriteArrayList`, is cherry-picked from https://github.com/apache/spark/pull/37206 where it was suggested as a fix by LuciferYang, since `TaskMetrics.externalAccums` is also accessed from outside the class `TaskMetrics`. The old PR was closed because at that point there was no clear understanding of the race condition. JoshRosen commented here https://github.com/apache/spark/pull/37206#issuecomment-1189930626 saying that there should be no such race, because all accumulators should be deserialized as part of task deserialization here: https://github.com/apache/spark/blob/0cc96f76d8a4858aee09e1fa32658da3ae76d384/core/src/main/scala/org/apache/spark/executor/Executor.scala#L507-L508 and therefore no writes should occur while the heartbeat thread reads the accumulators. But my understanding is that this is incorrect, as accumulators will also be deserialized as part of the taskBinary here: https://github.com/apache/spark/blob/169f828b1efe10d7f21e4b71a77f68cdd1d706d6/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala#L87-L88 which will happen while the heartbeater thread is potentially reading the accumulators. This can happen both due to user code using accumulators (see the new test case) and when using the Dataframe/Dataset API, as SQL metrics will also be `externalAccums`. One way metrics will be sent as part of the taskBinary is when the dep is a `ShuffleDependency`: https://github.com/apache/spark/blob/fbbcf9434ac070dd4ced4fb9efe32899c6db12a9/core/src/main/scala/org/apache/spark/Dependency.scala#L85 with a ShuffleWriteProcessor that comes from https://github.com/apache/spark/blob/fbbcf9434ac070dd4ced4fb9efe32899c6db12a9/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L411-L422 ### Why are the changes needed? The current code has a data race. ### Does this PR introduce _any_ user-facing change? It will fix an uncaught exception in the `executor-heartbeater` thread when using Scala 2.13. ### How was this patch tested? This patch adds a new test case that, before the fix was applied, consistently produced the uncaught exception in the heartbeater thread when using Scala 2.13. Closes #40663 from eejbyfeldt/SPARK-39696. Lead-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Co-authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6ce0822f76e11447487d5f6b3cce94a894f2ceef) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit b2ff4c4f7ec21d41cb173b413bd5aa5feefd7eee) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 April 2023, 03:52:08 UTC
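A simplified sketch of the approach (not the actual `TaskMetrics` code; class and method names are illustrative): back the external-accumulator registry with a `CopyOnWriteArrayList` so registrations from the task thread and reads from the `executor-heartbeater` thread can safely interleave.

```scala
import java.util.concurrent.CopyOnWriteArrayList
import org.apache.spark.util.AccumulatorV2

// Registrations (task thread, e.g. while the taskBinary is deserialized) and
// iteration (heartbeat thread) may interleave; CopyOnWriteArrayList gives the
// reader a consistent snapshot without extra locking.
class TaskMetricsLike {
  private val _externalAccums = new CopyOnWriteArrayList[AccumulatorV2[_, _]]()

  def registerAccumulator(a: AccumulatorV2[_, _]): Unit = _externalAccums.add(a)

  def foreachExternalAccum(f: AccumulatorV2[_, _] => Unit): Unit =
    _externalAccums.forEach(a => f(a))
}
```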
b40e408 [SPARK-43005][PYSPARK] Fix typo in pyspark/pandas/config.py Comparing compute.isin_limit and plotting.max_rows, `v is v` is likely a typo. ### What changes were proposed in this pull request? Replace `v is v >= 0` with `v >= 0`. ### Why are the changes needed? Comparing compute.isin_limit and plotting.max_rows, `v is v` is likely a typo. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By GitHub Actions. Closes #40620 from thyecust/patch-2. Authored-by: thyecust <thy@mail.ecust.edu.cn> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 5ac2b0fc024ae499119dfd5ab2ee4d038418c5fd) Signed-off-by: Sean Owen <srowen@gmail.com> 03 April 2023, 13:24:39 UTC
92e8d26 [SPARK-43004][CORE] Fix typo in ResourceRequest.equals() `vendor == vendor` is always true; this is likely a typo. ### What changes were proposed in this pull request? Replace `vendor == vendor` with `that.vendor == vendor`, and `discoveryScript == discoveryScript` with `that.discoveryScript == discoveryScript`. ### Why are the changes needed? `vendor == vendor` is always true; this is likely a typo. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By GitHub Actions. Closes #40622 from thyecust/patch-4. Authored-by: thyecust <thy@mail.ecust.edu.cn> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 52c000ece27c9ef34969a7fb252714588f395926) Signed-off-by: Sean Owen <srowen@gmail.com> 03 April 2023, 03:36:21 UTC
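A simplified stand-in (field names are illustrative, not the exact `ResourceRequest` fields) showing the corrected comparison shape, where every field is compared against `that` rather than against itself:

```scala
class ResourceRequestLike(
    val resourceName: String,
    val amount: Long,
    val discoveryScript: Option[String],
    val vendor: Option[String]) {

  override def equals(obj: Any): Boolean = obj match {
    case that: ResourceRequestLike =>
      that.resourceName == resourceName &&
      that.amount == amount &&
      that.discoveryScript == discoveryScript && // was: discoveryScript == discoveryScript
      that.vendor == vendor                      // was: vendor == vendor
    case _ => false
  }

  override def hashCode(): Int = (resourceName, amount, discoveryScript, vendor).hashCode()
}
```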
6f266ee [SPARK-42967][CORE][3.2][3.3][3.4] Fix SparkListenerTaskStart.stageAttemptId when a task is started after the stage is cancelled ### What changes were proposed in this pull request? The PR fixes a bug where SparkListenerTaskStart can have `stageAttemptId = -1` when a task is launched after the stage is cancelled. Instead, we should use the information within `Task` to update the `stageAttemptId` field. ### Why are the changes needed? -1 is not a legal stageAttemptId value, thus it can lead to unexpected problems if a subscriber tries to parse the stage information from the SparkListenerTaskStart event. ### Does this PR introduce _any_ user-facing change? No, it's a bugfix. ### How was this patch tested? Manually verified. Closes #40592 from jiangxb1987/SPARK-42967. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 1a6b1770c85f37982b15d261abf9cc6e4be740f4) Signed-off-by: Gengliang Wang <gengliang@apache.org> 30 March 2023, 22:48:32 UTC
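A conceptual sketch of the fix (the wrapper name is hypothetical): build the `SparkListenerTaskStart` event from the stage attempt id carried by the task itself, instead of a stage-registry lookup that can yield -1 once the stage has been cleaned up after cancellation.

```scala
import org.apache.spark.scheduler.{SparkListenerTaskStart, TaskInfo}

// The task already knows which stage attempt launched it, so use that value
// directly; a lookup against the scheduler's stage registry can return -1 if
// the stage was cancelled and removed before the task started.
def taskStartEvent(stageId: Int, taskStageAttemptId: Int, info: TaskInfo): SparkListenerTaskStart =
  SparkListenerTaskStart(stageId, taskStageAttemptId, info)
```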
794bea5 [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true Change `PlanSubqueries` to set `shouldBroadcast` to true when instantiating an `InSubqueryExec` instance. The below left outer join gets an error: ``` create or replace temp view v1 as select * from values (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, value9, value10); create or replace temp view v2 as select * from values (1, 2), (3, 8), (7, 9) as v2(a, b); create or replace temp view v3 as select * from values (3), (8) as v3(col1); set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100 set spark.sql.adaptive.enabled=false; select * from v1 left outer join v2 on key = a and key in (select col1 from v3); ``` The join fails during predicate codegen: ``` 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to interpreter mode java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) at org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278) at scala.collection.immutable.List.map(List.scala:293) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33) at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73) at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70) at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51) at org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86) at org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40) ``` It fails again after fallback to interpreter mode: ``` 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) at org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151) at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146) at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205) ``` Both the predicate codegen and the evaluation fail for the same reason: `PlanSubqueries` creates `InSubqueryExec` with `shouldBroadcast=false`. The driver waits for the subquery to finish, but it's the executor that uses the results of the subquery (for predicate codegen or evaluation). Because `shouldBroadcast` is set to false, the result is stored in a transient field (`InSubqueryExec#result`), so the result of the subquery is not serialized when the `InSubqueryExec` instance is sent to the executor. The issue occurs, as far as I can tell, only when both whole stage codegen is disabled and adaptive execution is disabled. When wholestage codegen is enabled, the predicate codegen happens on the driver, so the subquery's result is available. When adaptive execution is enabled, `PlanAdaptiveSubqueries` always sets `shouldBroadcast=true`, so the subquery's result is available on the executor, if needed. No. New unit test. Closes #40569 from bersprockets/join_subquery_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5b20f3d94095f54017be3d31d11305e597334d8b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 March 2023, 12:33:22 UTC
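A conceptual sketch of the change (parameter and constructor shape as recalled from the 3.3 code, so treat it as an assumption rather than the exact diff): construct `InSubqueryExec` with `shouldBroadcast = true` so the subquery result is serialized with the plan and is available on executors, mirroring what `PlanAdaptiveSubqueries` already does when AQE is enabled.

```scala
import org.apache.spark.sql.catalyst.expressions.{ExprId, Expression}
import org.apache.spark.sql.execution.{BaseSubqueryExec, InSubqueryExec}

// With shouldBroadcast = true the subquery result is kept in a serializable
// form rather than only in a transient driver-side field, so executors can
// evaluate (or codegen) the IN-subquery predicate.
def planInSubquery(expr: Expression, subquery: BaseSubqueryExec, exprId: ExprId): InSubqueryExec =
  InSubqueryExec(expr, subquery, exprId, shouldBroadcast = true)
```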