https://github.com/apache/spark

Revision | Message | Commit Date
0c0e7d4 Preparing Spark release v3.4.2-rc1 25 November 2023, 06:40:32 UTC
03dac18 [SPARK-46095][DOCS] Document `REST API` for Spark Standalone Cluster This PR aims to document the `REST API` for the Spark Standalone Cluster, to help users understand Apache Spark features. No. Manual review. A new `REST API` section is added. **AFTER** <img width="704" alt="Screenshot 2023-11-24 at 4 13 53 PM" src="https://github.com/apache/spark/assets/9700541/a4e09d94-d216-4629-8b37-9d350365a428"> No. Closes #44007 from dongjoon-hyun/SPARK-46095. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 132c1a1f08d6555c950600c102db28b9d7581350) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 November 2023, 01:41:37 UTC
a53c16a [SPARK-46016][DOCS][PS] Fix pandas API support list properly ### What changes were proposed in this pull request? This PR proposes to fix a critical issue in the [Supported pandas API documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html) where many essential APIs such as `DataFrame.max`, `DataFrame.min`, `DataFrame.mean`, `and DataFrame.median`, etc. were incorrectly marked as not implemented - marked as "N" - as below: <img width="291" alt="Screenshot 2023-11-24 at 12 37 49 PM" src="https://github.com/apache/spark/assets/44108233/95c5785c-711c-400c-b2ec-0db034e90fd8"> The root cause of this issue was that the script used to generate the support list excluded functions inherited from parent classes. For instance, `CategoricalIndex.max` is actually supported by inheriting the `Index` class but was not directly implemented in `CategoricalIndex`, leading to it being marked as unsupported: <img width="397" alt="Screenshot 2023-11-24 at 12 30 08 PM" src="https://github.com/apache/spark/assets/44108233/90e92996-a88a-4a20-bb0c-4909097e2688"> ### Why are the changes needed? The current documentation inaccurately represents the state of supported pandas API, which could significantly hinder user experience and adoption. By correcting these inaccuracies, we ensure that the documentation reflects the true capabilities of Pandas API on Spark, providing users with reliable and accurate information. ### Does this PR introduce _any_ user-facing change? No. This PR only updates the documentation to accurately reflect the current state of supported pandas API. ### How was this patch tested? Manually build documentation, and check if the supported pandas API list is correctly generated as below: <img width="299" alt="Screenshot 2023-11-24 at 12 36 31 PM" src="https://github.com/apache/spark/assets/44108233/a2da0f0b-0973-45cb-b22d-9582bbeb51b5"> ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43996 from itholic/fix_supported_api_gen. Authored-by: Haejoon Lee <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 132bb63a897f4f4049f34deefc065ed3eac6a90f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 November 2023, 10:39:02 UTC
e87d166 [SPARK-46062][SQL] Sync the isStreaming flag between CTE definition and reference This PR proposes to sync the flag `isStreaming` from CTE definition to CTE reference. The essential issue is that CTE reference node cannot determine the flag `isStreaming` by itself, and never be able to have a proper value and always takes the default as it does not have a parameter in constructor. The other flag `resolved` is handled, and we need to do the same for `isStreaming`. Once we add the parameter to the constructor, we will also need to make sure the flag is in sync with CTE definition. We have a rule `ResolveWithCTE` doing the sync, hence we add the logic to sync the flag `isStreaming` as well. The bug may impact some rules which behaves differently depending on isStreaming flag. It would no longer be a problem once CTE reference is replaced with CTE definition at some point in "optimization phase", but all rules in analyzer and optimizer being triggered before the rule takes effect may misbehave based on incorrect isStreaming flag. No. New UT. No. Closes #43966 from HeartSaVioR/SPARK-46062. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 43046631a5d4ac7201361a00473cc87fa52ab5a7) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 23 November 2023, 14:32:48 UTC
cfe072a [SPARK-46064][SQL][SS] Move out EliminateEventTimeWatermark to the analyzer and change to only take effect on resolved child This PR proposes to move out EliminateEventTimeWatermark to the analyzer (one of the analysis rule), and also make a change to eliminate EventTimeWatermark node only when the child of EventTimeWatermark is "resolved". Currently, we apply EliminateEventTimeWatermark immediately when withWatermark is called, which means the rule is applied immediately against the child, regardless whether child is resolved or not. It is not an issue for the usage of DataFrame API initiated by read / readStream, because streaming sources have the flag isStreaming set to true even it is yet resolved. But mix-up of SQL and DataFrame API would expose the issue; we may not know the exact value of isStreaming flag on unresolved node and it is subject to change upon resolution. No. New UTs. No. Closes #43971 from HeartSaVioR/SPARK-46064. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit a703dace0aa400fa24b2bded1500f44ae7ac8db0) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 23 November 2023, 11:26:08 UTC
0fd1501 [MINOR][BUILD] Rename gprcVersion to grpcVersion in SparkBuild This PR aims to fix a typo. ``` - val gprcVersion = "1.56.0" + val grpcVersion = "1.56.0" ``` There are two occurrences. ``` $ git grep gprc project/SparkBuild.scala: val gprcVersion = "1.56.0" project/SparkBuild.scala: "io.grpc" % "protoc-gen-grpc-java" % BuildCommons.gprcVersion asProtocPlugin(), ``` To fix a typo. No. Pass the CIs. No. Closes #43923 from dongjoon-hyun/minor_grpc. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 18bcd020118a8efb49c03546ec501be6f0fc0852) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 November 2023, 10:18:57 UTC
94bb1b2 [SPARK-46006][YARN] YarnAllocator miss clean targetNumExecutorsPerResourceProfileId after YarnSchedulerBackend call stop ### What changes were proposed in this pull request? We hit a case where a user calls sc.stop() after all custom code has run, but the shutdown gets stuck, which leads to the following situation: 1. The user calls sc.stop() 2. sc.stop() gets stuck partway through, but SchedulerBackend.stop has already been called 3. Since the YARN ApplicationMaster has not finished, it keeps calling YarnAllocator.allocateResources() 4. Since the driver endpoint has stopped, newly allocated executors fail to register 5. Eventually the max number of executor failures is triggered 6. Root cause: before stopping, CoarseGrainedSchedulerBackend.stop() calls YarnSchedulerBackend.requestTotalExecutors() to clear the request info ![image](https://github.com/apache/spark/assets/46485123/4a61fb40-5986-4ecc-9329-369187d5311d) When YarnAllocator handles this empty resource request, resourceTotalExecutorsWithPreferedLocalities is empty, so targetNumExecutorsPerResourceProfileId is never cleaned up. ![image](https://github.com/apache/spark/assets/46485123/0133f606-e1d7-4db7-95fe-140c61379102) ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No ### Was this patch authored or co-authored using generative AI tooling? No Closes #43906 from AngersZhuuuu/SPARK-46006. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 06635e25f170e61f6cfe53232d001993ec7d376d) Signed-off-by: Kent Yao <yao@apache.org> 22 November 2023, 08:51:36 UTC
d127783 [SPARK-46012][CORE][FOLLOWUP] Invoke `fs.listStatus` once and reuse the result ### What changes were proposed in this pull request? This PR is a follow-up of #43914 and aims to invoke `fs.listStatus` once and reuse the result. ### Why are the changes needed? This avoids increasing the number of `listStatus` invocations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the existing test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43944 from dongjoon-hyun/SPARK-46012-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6be4a0358265fb81f68a27589f9940bd726c8ee7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 November 2023, 01:51:39 UTC
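To illustrate the list-once-and-reuse pattern this follow-up applies, here is a sketch only; the path and the `events_` prefix are illustrative, while the `appstatus_` marker file is the one described in SPARK-46012 below:
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical rolling event-log directory; names are illustrative.
val dir = new Path("/tmp/spark-events/eventlog_v2_app-123")
val fs: FileSystem = dir.getFileSystem(new Configuration())

// Single listStatus call; every subsequent check reuses the same array
// instead of listing the directory again.
val statuses: Array[FileStatus] = fs.listStatus(dir)
val hasAppStatus  = statuses.exists(_.getPath.getName.startsWith("appstatus_"))
val eventLogFiles = statuses.filter(_.getPath.getName.startsWith("events_"))
```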
2ccd652 [SPARK-44973][SQL] Fix `ArrayIndexOutOfBoundsException` in `conv()` ### What changes were proposed in this pull request? Increase the size of the buffer allocated for the result of base conversion in `NumberConverter` to prevent ArrayIndexOutOfBoundsException when evaluating `conv(s"${Long.MinValue}", 10, -2)`. ### Why are the changes needed? I don't think the ArrayIndexOutOfBoundsException is intended behaviour. ### Does this PR introduce _any_ user-facing change? Users will no longer experience an ArrayIndexOutOfBoundsException for this specific set of arguments and will instead receive the expected base conversion. ### How was this patch tested? New unit test cases ### Was this patch authored or co-authored using generative AI tooling? No Closes #43880 from markj-db/SPARK-44973. Authored-by: Mark Jarvin <mark.jarvin@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 2ac8ff76a5169fe1f6cf130cc82738ba78bd8c65) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 19:38:49 UTC
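The failing call is easy to reproduce from spark-shell; a minimal sketch using the exact arguments from the commit message (with the fix applied it returns the base conversion instead of throwing):
```scala
// Before the fix: ArrayIndexOutOfBoundsException inside NumberConverter.
// After the fix: the base -2 representation of Long.MinValue is returned.
spark.sql(s"SELECT conv('${Long.MinValue}', 10, -2)").show(truncate = false)
```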
92d49e3 [SPARK-46019][SQL][TESTS] Fix `HiveThriftServer2ListenerSuite` and `ThriftServerPageSuite` to create `java.io.tmpdir` if it doesn't exist ### What changes were proposed in this pull request? The pr aims to fix `HiveThriftServer2ListenerSuite` and `ThriftServerPageSuite` failed when there are running on local. ``` [info] ThriftServerPageSuite: [info] - thriftserver page should load successfully *** FAILED *** (515 milliseconds) [info] java.lang.IllegalStateException: Could not initialize plugin: interface org.mockito.plugins.MockMaker (alternate: null) [info] at org.mockito.internal.configuration.plugins.PluginLoader$1.invoke(PluginLoader.java:84) [info] at jdk.proxy2/jdk.proxy2.$Proxy20.isTypeMockable(Unknown Source) [info] at org.mockito.internal.util.MockUtil.typeMockabilityOf(MockUtil.java:78) [info] at org.mockito.internal.util.MockCreationValidator.validateType(MockCreationValidator.java:22) [info] at org.mockito.internal.creation.MockSettingsImpl.validatedSettings(MockSettingsImpl.java:267) [info] at org.mockito.internal.creation.MockSettingsImpl.build(MockSettingsImpl.java:234) [info] at org.mockito.internal.MockitoCore.mock(MockitoCore.java:86) [info] at org.mockito.Mockito.mock(Mockito.java:2037) [info] at org.mockito.Mockito.mock(Mockito.java:2010) [info] at org.apache.spark.sql.hive.thriftserver.ui.ThriftServerPageSuite.getStatusStore(ThriftServerPageSuite.scala:49) ``` It can be simply reproduced by running the following command: ``` build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite" -Phive-thriftserver build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.ThriftServerPageSuite" -Phive-thriftserver ``` ### Why are the changes needed? Fix tests failed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test: ``` build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite" -Phive-thriftserver build/sbt "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ui.ThriftServerPageSuite" -Phive-thriftserver ``` After it: ``` [info] - listener events should store successfully (live = true) (1 second, 711 milliseconds) [info] - listener events should store successfully (live = false) (6 milliseconds) [info] - cleanup session if exceeds the threshold (live = true) (21 milliseconds) [info] - cleanup session if exceeds the threshold (live = false) (3 milliseconds) [info] - update execution info when jobstart event come after execution end event (9 milliseconds) [info] - SPARK-31387 - listener update methods should not throw exception with unknown input (8 milliseconds) [info] Run completed in 3 seconds, 734 milliseconds. [info] Total number of tests run: 6 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 6, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 156 s (02:36), completed Nov 21, 2023, 1:57:21 PM ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43921 from panbingkun/SPARK-46019. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit fdcd20f4b51c3ddddaae12f7d3f429e7b77c9f5e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 18:26:55 UTC
c8597c2 [SPARK-46033][SQL][TESTS] Fix flaky ArithmeticExpressionSuite ### What changes were proposed in this pull request? The pr aims to fix flaky ArithmeticExpressionSuite. https://github.com/panbingkun/spark/actions/runs/6940660146/job/18879997046 <img width="1000" alt="image" src="https://github.com/apache/spark/assets/15246973/9fe6050a-7a06-4110-9152-d4512a49b284"> ### Why are the changes needed? Fix bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Manually test. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43935 from panbingkun/SPARK-46033. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b7930e718f453f8a9d923ad57161a982f16ca8e8) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 18:18:05 UTC
9fb696b [SPARK-46012][CORE] EventLogFileReader should not read rolling logs if app status file is missing ### What changes were proposed in this pull request? This PR aims to prevent `EventLogFileReader` from reading rolling event logs if `appStatus` is missing. ### Why are the changes needed? Since Apache Spark 3.0.0, `appstatus_` is supposed to exist. https://github.com/apache/spark/blob/839f0c98bd85a14eadad13f8aaac876275ded5a4/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala#L277-L283 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43914 from dongjoon-hyun/SPARK-46012. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6ca1c67de082269b9337503bff5161f5a2d87225) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 November 2023, 01:50:24 UTC
c674ac3 [SPARK-45963][SQL][DOCS][3.4] Restore documentation for DSv2 API This PR cherry-picks https://github.com/apache/spark/pull/43855 to branch-3.4 --- ### What changes were proposed in this pull request? This PR restores the DSv2 documentation. https://github.com/apache/spark/pull/38392 mistakenly added `org/apache/spark/sql/connect` as a private that includes `org/apache/spark/sql/connector`. ### Why are the changes needed? For end users to read DSv2 documentation. ### Does this PR introduce _any_ user-facing change? Yes, it restores the DSv2 API documentation that used to be there https://spark.apache.org/docs/3.3.0/api/scala/org/apache/spark/sql/connector/catalog/index.html ### How was this patch tested? Manually tested via: ``` SKIP_PYTHONDOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 bundle exec jekyll build ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43866 from HyukjinKwon/SPARK-45963-3.4. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 November 2023, 05:34:33 UTC
f5900a5 [SPARK-43393][SQL][3.4] Address sequence expression overflow bug ### What changes were proposed in this pull request? Spark has a (long-standing) overflow bug in the `sequence` expression. Consider the following operations: ``` spark.sql("CREATE TABLE foo (l LONG);") spark.sql(s"INSERT INTO foo VALUES (${Long.MaxValue});") spark.sql("SELECT sequence(0, l) FROM foo;").collect() ``` The result of these operations will be: ``` Array[org.apache.spark.sql.Row] = Array([WrappedArray()]) ``` an unintended consequence of overflow. The sequence is applied to values `0` and `Long.MaxValue` with a step size of `1` which uses a length computation defined [here](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3451). In this calculation, with `start = 0`, `stop = Long.MaxValue`, and `step = 1`, the calculated `len` overflows to `Long.MinValue`. The computation, in binary looks like: ``` 0111111111111111111111111111111111111111111111111111111111111111 - 0000000000000000000000000000000000000000000000000000000000000000 ------------------------------------------------------------------ 0111111111111111111111111111111111111111111111111111111111111111 / 0000000000000000000000000000000000000000000000000000000000000001 ------------------------------------------------------------------ 0111111111111111111111111111111111111111111111111111111111111111 + 0000000000000000000000000000000000000000000000000000000000000001 ------------------------------------------------------------------ 1000000000000000000000000000000000000000000000000000000000000000 ``` The following [check](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3454) passes as the negative `Long.MinValue` is still `<= MAX_ROUNDED_ARRAY_LENGTH`. The following cast to `toInt` uses this representation and [truncates the upper bits](https://github.com/apache/spark/blob/16411188c7ba6cb19c46a2bd512b2485a4c03e2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L3457) resulting in an empty length of `0`. Other overflows are similarly problematic. This PR addresses the issue by checking numeric operations in the length computation for overflow. ### Why are the changes needed? There is a correctness bug from overflow in the `sequence` expression. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests added in `CollectionExpressionsSuite.scala`. Closes #43819 from thepinetree/spark-sequence-overflow-3.4. Authored-by: Deepayan Patra <deepayan.patra@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 November 2023, 23:35:43 UTC
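The length computation described above can be traced in plain Scala; this is only a sketch of the arithmetic, with `Int.MaxValue - 15` standing in for `MAX_ROUNDED_ARRAY_LENGTH`:
```scala
val start = 0L
val stop  = Long.MaxValue
val step  = 1L

val len = (stop - start) / step + 1              // overflows to Long.MinValue
val passesBoundCheck = len <= Int.MaxValue - 15L // true: a negative len slips past an upper-bound-only check
val truncatedLength  = len.toInt                 // lower 32 bits of Long.MinValue are all zero => 0, hence the empty array

println((len, passesBoundCheck, truncatedLength)) // (-9223372036854775808, true, 0)
```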
23f15af [SPARK-45786][SQL][FOLLOWUP][TEST] Fix Decimal random number tests with ANSI enabled ### What changes were proposed in this pull request? This follow-up PR fixes the test for SPARK-45786 that is failing in GHA with SPARK_ANSI_SQL_MODE=true ### Why are the changes needed? The issue discovered in https://github.com/apache/spark/pull/43678#discussion_r1395693417 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test updated ### Was this patch authored or co-authored using generative AI tooling? No Closes #43853 from kazuyukitanimura/SPARK-45786-FollowUp. Authored-by: Kazuyuki Tanimura <ktanimura@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 949de3416a8ef5b7faa22149f5e07d8235237f40) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 November 2023, 10:50:13 UTC
559c97b [SPARK-45961][DOCS][3.4] Document `spark.master.*` configurations ### What changes were proposed in this pull request? This PR documents `spark.master.*` configurations. ### Why are the changes needed? Currently, `spark.master.*` configurations are undocumented. ``` $ git grep 'ConfigBuilder("spark.master' core/src/main/scala/org/apache/spark/internal/config/UI.scala: val MASTER_UI_DECOMMISSION_ALLOW_MODE = ConfigBuilder("spark.master.ui.decommission.allow.mode") core/src/main/scala/org/apache/spark/internal/config/package.scala: private[spark] val MASTER_REST_SERVER_ENABLED = ConfigBuilder("spark.master.rest.enabled") core/src/main/scala/org/apache/spark/internal/config/package.scala: private[spark] val MASTER_REST_SERVER_PORT = ConfigBuilder("spark.master.rest.port") core/src/main/scala/org/apache/spark/internal/config/package.scala: private[spark] val MASTER_UI_PORT = ConfigBuilder("spark.master.ui.port") ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ![Screenshot 2023-11-16 at 2 57 21 PM](https://github.com/apache/spark/assets/9700541/6e9646d6-0144-4d10-bba8-500e9ce5e4cb) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43850 from dongjoon-hyun/SPARK-45961-3.4. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 November 2023, 23:38:08 UTC
f927c0f [SPARK-45920][SQL][3.4] group by ordinal should be idempotent backport https://github.com/apache/spark/pull/43797 ### What changes were proposed in this pull request? GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal. ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #43838 from cloud-fan/3.4-port. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 November 2023, 16:18:40 UTC
c58df4a [SPARK-45764][PYTHON][DOCS][3.4] Make code block copyable ### What changes were proposed in this pull request? The pr aims to make code blocks `copyable` in pyspark docs. Backport above to `branch 3.4`. Master branch pr: https://github.com/apache/spark/pull/43799 ### Why are the changes needed? Improving the usability of PySpark documents. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to easily copy code blocks in pyspark docs. ### How was this patch tested? - Manually test. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43828 from panbingkun/branch-3.4_SPARK-45764. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 16 November 2023, 10:09:10 UTC
83439ee [SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error ### What changes were proposed in this pull request? The pr aims to fix RST files `link substitutions` error. Target branch: branch-3.3, branch-3.4, branch-3.5, master. ### Why are the changes needed? When I was reviewing Python documents, I found that `the actual address` of the link was incorrect, eg: https://spark.apache.org/docs/latest/api/python/getting_started/install.html#installing-from-source <img width="916" alt="image" src="https://github.com/apache/spark/assets/15246973/069c1875-1e21-45db-a236-15c27ee7b913"> `The ref link url` of `Building Spark`: from `https://spark.apache.org/docs/3.5.0/#downloading` to `https://spark.apache.org/docs/3.5.0/building-spark.html`. We should fix it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43815 from panbingkun/SPARK-45935. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 79ccdfa31e282ebe9a82c8f20c703b6ad2ea6bc1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 November 2023, 09:01:26 UTC
b53c170 [SPARK-45592][SPARK-45282][SQL] Correctness issue in AQE with InMemoryTableScanExec This PR fixes an correctness issue while enabling AQE for SQL Cache. This issue was caused by AQE coalescing the top-level shuffle in the physical plan of InMemoryTableScan and wrongfully reported the output partitioning of that InMemoryTableScan as HashPartitioning as if it had not been coalesced. The caller query of that InMemoryTableScan in turn failed to align the partitions correctly and output incorrect join results. The fix addresses the issue by disabling coalescing in InMemoryTableScan for shuffles in the final stage. This fix also guarantees that AQE enabled for SQL cache vs. disabled would always be a performance win, since AQE optimizations are applied to all non-top-level stages and meanwhile no extra shuffle would be introduced between the parent query and the cached relation (if coalescing in top-level shuffles of InMemoryTableScan was not disabled, an extra shuffle would end up being added on top of the cached relation when the cache is used in a join query and the partition key matches the join key in order to avoid the correctness issue). To fix correctness issue and to avoid potential AQE perf regressions in queries using SQL cache. No. Added UTs. No. Closes #43760 from maryannxue/spark-45592. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 128f5523194d5241c7b0f08b5be183288128ba16) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 November 2023, 20:39:48 UTC
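A sketch of the kind of query shape the fix targets (table names and sizes are made up): a cached relation whose plan ends in a shuffle, later joined on the same key, where the cached side's reported `HashPartitioning` must stay consistent with what AQE actually produced:
```scala
import org.apache.spark.sql.functions.sum

// Cached relation: its physical plan ends in a shuffle that AQE could coalesce.
val dim = spark.range(0, 1000000L)
  .selectExpr("id % 1000 AS key", "id AS value")
  .groupBy("key")
  .agg(sum("value").as("total"))
  .cache()

val fact = spark.range(0, 1000L).selectExpr("id AS key")

// Before the fix, the cached side could report un-coalesced HashPartitioning(key)
// while its partitions had in fact been coalesced by AQE, mis-aligning this join.
fact.join(dim, "key").count()
```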
8163909 [SPARK-45882][SQL][3.4] BroadcastHashJoinExec propagate partitioning should respect CoalescedHashPartitioning This pr backport https://github.com/apache/spark/pull/43753 to branch-3.4 ### What changes were proposed in this pull request? Add HashPartitioningLike trait and make HashPartitioning and CoalescedHashPartitioning extend it. When we propagate output partiitoning, we should handle HashPartitioningLike instead of HashPartitioning. This pr also changes the BroadcastHashJoinExec to use HashPartitioningLike to avoid regression. ### Why are the changes needed? Avoid unnecessary shuffle exchange. ### Does this PR introduce _any_ user-facing change? yes, avoid regression ### How was this patch tested? add test ### Was this patch authored or co-authored using generative AI tooling? no Closes #43793 from ulysses-you/partitioning-3.4. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com> 14 November 2023, 11:55:53 UTC
0749938 [SPARK-45896][SQL][3.4] Construct `ValidateExternalType` with the correct expected type ### What changes were proposed in this pull request? This is a backport of #43770. When creating a serializer for a `Map` or `Seq` with an element of type `Option`, pass an expected type of `Option` to `ValidateExternalType` rather than the `Option`'s type argument. ### Why are the changes needed? In 3.4.1, 3.5.0, and master, the following code gets an error: ``` scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a") val df = Seq(Seq(Some(Seq(0)))).toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array<int> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source) ... ``` However, this code works in 3.3.3. Similarly, this code gets an error: ``` scala> val df = Seq(Seq(Some(java.sql.Timestamp.valueOf("2023-01-01 00:00:00")))).toDF("a") val df = Seq(Seq(Some(java.sql.Timestamp.valueOf("2023-01-01 00:00:00")))).toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, TimestampType, fromJavaTimestamp, unwrapoption(ObjectType(class java.sql.Timestamp), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), TimestampType, ObjectType(class scala.Option))), true, false, true), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of timestamp ... ``` As with the first example, this code works in 3.3.3. `ScalaReflection#validateAndSerializeElement` will construct `ValidateExternalType` with an expected type of the `Option`'s type parameter. Therefore, for element types `Option[Seq/Date/Timestamp/BigDecimal]`, `ValidateExternalType` will try to validate that the element is of the contained type (e.g., `BigDecimal`) rather than of type `Option`. Since the element type is of type `Option`, the validation fails. Validation currently works by accident for element types `Option[Map/<primitive-type]`, simply because in that case `ValidateExternalType` ignores that passed expected type and tries to validate based on the encoder's `clsTag` field (which, for the `OptionEncoder`, will be class `Option`). ### Does this PR introduce _any_ user-facing change? Other than fixing the bug, no. ### How was this patch tested? New unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43775 from bersprockets/encoding_error_br34. 
Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 13 November 2023, 02:23:04 UTC
eace7d3 [SPARK-45592][SPARK-45282][SQL][3.4] Correctness issue in AQE with InMemoryTableScanExec ### What changes were proposed in this pull request? This backports https://github.com/apache/spark/pull/43435 SPARK-45592 to the 3.4 branch. This is because it was already reported there as SPARK-45282 but it required enabling some extra configuration to hit the bug. ### Why are the changes needed? Fix correctness issue. ### Does this PR introduce _any_ user-facing change? Yes, fixing correctness issue. ### How was this patch tested? New tests based on the reproduction example in SPARK-45282 ### Was this patch authored or co-authored using generative AI tooling? No Closes #43729 from eejbyfeldt/SPARK-45282. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 November 2023, 21:54:01 UTC
92bea64 [MINOR][DOCS] Fix the example value in the docs ### What changes were proposed in this pull request? fix the example value ### Why are the changes needed? for doc ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Just example value in the docs, no need to test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #43750 from jlfsdtc/fix_typo_in_doc. Authored-by: longfei.jiang <longfei.jiang@kyligence.io> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit b501a223bfcf4ddbcb0b2447aa06c549051630b0) Signed-off-by: Kent Yao <yao@apache.org> 11 November 2023, 05:49:59 UTC
3978bf4 [SPARK-45884][BUILD][3.4] Upgrade ORC to 1.8.6 ### What changes were proposed in this pull request? This PR aims to upgrade ORC to 1.8.6 for Apache Spark 3.4.2. ### Why are the changes needed? To bring the latest maintenance releases as a part of Apache Spark 3.4.2 release - https://github.com/apache/orc/releases/tag/v1.8.6 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43755 from dongjoon-hyun/SPARK-45884. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 November 2023, 16:56:02 UTC
6186852 [SPARK-45878][SQL][TESTS] Fix ConcurrentModificationException in CliSuite ### What changes were proposed in this pull request? This PR changes the ArrayBuffer for logs to immutable for reading to prevent ConcurrentModificationException which hides the actual cause of failure ### Why are the changes needed? ```scala [info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql *** FAILED *** (11 seconds, 105 milliseconds) [info] java.util.ConcurrentModificationException: mutation occurred during iteration [info] at scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43) [info] at scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47) [info] at scala.collection.IterableOnceOps.addString(IterableOnce.scala:1247) [info] at scala.collection.IterableOnceOps.addString$(IterableOnce.scala:1241) [info] at scala.collection.AbstractIterable.addString(Iterable.scala:933) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.runCliWithin(CliSuite.scala:205) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$new$20(CliSuite.scala:501) ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #43749 from yaooqinn/SPARK-45878. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit b347237735094e9092f4100583ed1d6f3eacf1f6) Signed-off-by: Kent Yao <yao@apache.org> 10 November 2023, 13:10:19 UTC
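A sketch of the pattern behind the fix (not the exact CliSuite diff): copy the mutable buffer into an immutable collection before iterating it for assertions, so a concurrent append cannot fail the read with "mutation occurred during iteration":
```scala
import scala.collection.mutable.ArrayBuffer

// Appended to by the spark-sql output-capture thread while the test thread reads it.
val logBuffer = ArrayBuffer[String]()

// Reading the live buffer directly (e.g. logBuffer.mkString("\n")) iterates the
// mutable collection and can trip Scala 2.13's MutationTracker check.
// Copying to an immutable List narrows the read of the mutable buffer to a single snapshot:
def capturedOutput: String = logBuffer.toList.mkString("\n")
```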
c53cadd [SPARK-45847][SQL][TESTS] CliSuite flakiness due to non-sequential guarantee for stdout&stderr ### What changes were proposed in this pull request? In CliSuite, This PR adds a retry for tests that write errors to STDERR. ### Why are the changes needed? To fix flakiness tests as below https://github.com/chenhao-db/apache-spark/actions/runs/6791437199/job/18463313766 https://github.com/dongjoon-hyun/spark/actions/runs/6753670527/job/18361206900 ```sql [info] Spark master: local, Application Id: local-1699402393189 [info] spark-sql> /* SELECT /*+ HINT() 4; */; [info] [info] [PARSE_SYNTAX_ERROR] Syntax error at or near ';'. SQLSTATE: 42601 (line 1, pos 26) [info] [info] == SQL == [info] /* SELECT /*+ HINT() 4; */; [info] --------------------------^^^ [info] [info] spark-sql> /* SELECT /*+ HINT() 4; */ SELECT 1; [info] 1 [info] Time taken: 1.499 seconds, Fetched 1 row(s) [info] [info] [UNCLOSED_BRACKETED_COMMENT] Found an unclosed bracketed comment. Please, append */ at the end of the comment. SQLSTATE: 42601 [info] == SQL == [info] /* Here is a unclosed bracketed comment SELECT 1; [info] spark-sql> /* Here is a unclosed bracketed comment SELECT 1; [info] spark-sql> /* SELECT /*+ HINT() */ 4; */; [info] spark-sql> ``` As you can see the fragment above, the query on the 3rd line from the bottom, came from STDOUT, was printed later than its error output, came from STDERR. In this scenario, the error output would not match anything and would simply go unnoticed. Finally, timed out and failed. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests and CI ### Was this patch authored or co-authored using generative AI tooling? no Closes #43725 from yaooqinn/SPARK-45847. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 06d8cbe073499ff16bca3165e2de1192daad3984) Signed-off-by: Kent Yao <yao@apache.org> 10 November 2023, 06:33:54 UTC
259ac25 [SPARK-45814][CONNECT][SQL][3.4] Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak ### What changes were proposed in this pull request? Make `ArrowBatchIterator` implement `AutoCloseable` and `ArrowConverters.createEmptyArrowBatch()` call close() to avoid memory leak. ### Why are the changes needed? `ArrowConverters.createEmptyArrowBatch` don't call `super.hasNext`, if `TaskContext.get` returns `None`, then memory allocated in `ArrowBatchIterator` is leaked. In spark connect, `createEmptyArrowBatch` is called in [SparkConnectPlanner](https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L2558) and [SparkConnectPlanExecution](https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/SparkConnectPlanExecution.scala#L224), which cause a long running driver consume all off-heap memory specified by `-XX:MaxDirectMemorySize`. This is the exception stack: ``` org.apache.arrow.memory.OutOfMemoryException: Failure allocating buffer. at io.netty.buffer.PooledByteBufAllocatorL.allocate(PooledByteBufAllocatorL.java:67) at org.apache.arrow.memory.NettyAllocationManager.<init>(NettyAllocationManager.java:77) at org.apache.arrow.memory.NettyAllocationManager.<init>(NettyAllocationManager.java:84) at org.apache.arrow.memory.NettyAllocationManager$1.create(NettyAllocationManager.java:34) at org.apache.arrow.memory.BaseAllocator.newAllocationManager(BaseAllocator.java:354) at org.apache.arrow.memory.BaseAllocator.newAllocationManager(BaseAllocator.java:349) at org.apache.arrow.memory.BaseAllocator.bufferWithoutReservation(BaseAllocator.java:337) at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:315) at org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:279) at org.apache.arrow.vector.BaseValueVector.allocFixedDataAndValidityBufs(BaseValueVector.java:192) at org.apache.arrow.vector.BaseFixedWidthVector.allocateBytes(BaseFixedWidthVector.java:338) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:308) at org.apache.arrow.vector.BaseFixedWidthVector.allocateNew(BaseFixedWidthVector.java:273) at org.apache.spark.sql.execution.arrow.ArrowWriter$.$anonfun$create$1(ArrowWriter.scala:44) at scala.collection.StrictOptimizedIterableOps.map(StrictOptimizedIterableOps.scala:100) at scala.collection.StrictOptimizedIterableOps.map$(StrictOptimizedIterableOps.scala:87) at scala.collection.convert.JavaCollectionWrappers$JListWrapper.map(JavaCollectionWrappers.scala:103) at org.apache.spark.sql.execution.arrow.ArrowWriter$.create(ArrowWriter.scala:43) at org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchIterator.<init>(ArrowConverters.scala:93) at org.apache.spark.sql.execution.arrow.ArrowConverters$ArrowBatchWithSchemaIterator.<init>(ArrowConverters.scala:138) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.<init>(ArrowConverters.scala:231) at org.apache.spark.sql.execution.arrow.ArrowConverters$.createEmptyArrowBatch(ArrowConverters.scala:229) at org.apache.spark.sql.connect.planner.SparkConnectPlanner.handleSqlCommand(SparkConnectPlanner.scala:2481) at org.apache.spark.sql.connect.planner.SparkConnectPlanner.process(SparkConnectPlanner.scala:2426) at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handleCommand(ExecuteThreadRunner.scala:202) at 
org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:158) at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:132) at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:189) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:189) at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94) at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withContextClassLoader$1(SessionHolder.scala:176) at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:178) at org.apache.spark.sql.connect.service.SessionHolder.withContextClassLoader(SessionHolder.scala:175) at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:188) at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:132) at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:84) at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:228) Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 4194304 byte(s) of direct memory (used: 1069547799, max: 1073741824) at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:845) at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:774) at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:721) at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:696) at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215) at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:180) at io.netty.buffer.PoolArena.allocate(PoolArena.java:137) at io.netty.buffer.PoolArena.allocate(PoolArena.java:129) at io.netty.buffer.PooledByteBufAllocatorL$InnerAllocator.newDirectBufferL(PooledByteBufAllocatorL.java:181) at io.netty.buffer.PooledByteBufAllocatorL$InnerAllocator.directBuffer(PooledByteBufAllocatorL.java:214) at io.netty.buffer.PooledByteBufAllocatorL.allocate(PooledByteBufAllocatorL.java:58) ... 37 more ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually test ### Was this patch authored or co-authored using generative AI tooling? No Closes #43728 from xieshuaihu/3_4_SPARK-45814. Authored-by: xieshuaihu <xieshuaihu@agora.io> Signed-off-by: yangjie01 <yangjie01@baidu.com> 10 November 2023, 04:33:24 UTC
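A sketch of the resource-handling pattern the fix applies (names and signature are hypothetical, not the actual ArrowConverters code): when there is no TaskContext to hook cleanup onto, the AutoCloseable batch iterator must be closed explicitly:
```scala
// Hypothetical signature: an Arrow batch iterator that owns off-heap allocations.
def createEmptyBatch(batches: Iterator[Array[Byte]] with AutoCloseable): Array[Byte] = {
  try batches.next()      // produce the single empty batch
  finally batches.close() // release the allocator-backed buffers even without a TaskContext
}
```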
f82e238 [SPARK-45829][DOCS] Update the default value for spark.executor.logs.rolling.maxSize **What changes were proposed in this pull request?** The PR updates the default value of 'spark.executor.logs.rolling.maxSize' in configuration.html on the website **Why are the changes needed?** The default value of 'spark.executor.logs.rolling.maxSize' is 1024 * 1024, but the website is wrong. **Does this PR introduce any user-facing change?** No **How was this patch tested?** It doesn't need to. **Was this patch authored or co-authored using generative AI tooling?** No Closes #43712 from chenyu-opensource/branch-SPARK-45829. Authored-by: chenyu <119398199+chenyu-opensource@users.noreply.github.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit a9127068194a48786df4f429ceb4f908c71f7138) Signed-off-by: Kent Yao <yao@apache.org> 08 November 2023, 11:17:19 UTC
77e52fd [SPARK-45786][SQL] Fix inaccurate Decimal multiplication and division results ### What changes were proposed in this pull request? This PR fixes inaccurate Decimal multiplication and division results. ### Why are the changes needed? Decimal multiplication and division results may be inaccurate due to rounding issues. #### Multiplication: ``` scala> sql("select -14120025096157587712113961295153.858047 * -0.4652").show(truncate=false) +----------------------------------------------------+ |(-14120025096157587712113961295153.858047 * -0.4652)| +----------------------------------------------------+ |6568635674732509803675414794505.574764 | +----------------------------------------------------+ ``` The correct answer is `6568635674732509803675414794505.574763` Please note that the last digit is `3` instead of `4` as ``` scala> java.math.BigDecimal("-14120025096157587712113961295153.858047").multiply(java.math.BigDecimal("-0.4652")) val res21: java.math.BigDecimal = 6568635674732509803675414794505.5747634644 ``` Since the factional part `.574763` is followed by `4644`, it should not be rounded up. #### Division: ``` scala> sql("select -0.172787979 / 533704665545018957788294905796.5").show(truncate=false) +-------------------------------------------------+ |(-0.172787979 / 533704665545018957788294905796.5)| +-------------------------------------------------+ |-3.237521E-31 | +-------------------------------------------------+ ``` The correct answer is `-3.237520E-31` Please note that the last digit is `0` instead of `1` as ``` scala> java.math.BigDecimal("-0.172787979").divide(java.math.BigDecimal("533704665545018957788294905796.5"), 100, java.math.RoundingMode.DOWN) val res22: java.math.BigDecimal = -3.237520489418037889998826491401059986665344697406144511563561222578738E-31 ``` Since the factional part `.237520` is followed by `4894...`, it should not be rounded up. ### Does this PR introduce _any_ user-facing change? Yes, users will see correct Decimal multiplication and division results. Directly multiplying and dividing with `org.apache.spark.sql.types.Decimal()` (not via SQL) will return 39 digit at maximum instead of 38 at maximum and round down instead of round half-up ### How was this patch tested? Test added ### Was this patch authored or co-authored using generative AI tooling? No Closes #43678 from kazuyukitanimura/SPARK-45786. Authored-by: Kazuyuki Tanimura <ktanimura@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 5ef3a846f52ab90cb7183953cff3080449d0b57b) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 November 2023, 17:06:19 UTC
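The rounding argument above can be checked directly with `java.math.BigDecimal`, mirroring the snippets quoted in the commit message (sketch for verification only):
```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Multiplication: the exact product ends in ...5747634644, so the 6-digit
// fraction should round to ...574763, not ...574764.
val product = new JBigDecimal("-14120025096157587712113961295153.858047")
  .multiply(new JBigDecimal("-0.4652"))
println(product)

// Division: the exact quotient starts -3.2375204894..., so the result should be
// -3.237520E-31, not -3.237521E-31.
val quotient = new JBigDecimal("-0.172787979")
  .divide(new JBigDecimal("533704665545018957788294905796.5"), 100, RoundingMode.DOWN)
println(quotient)
```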
f916162 [SPARK-44843][TESTS] Double streamingTimeout for StateStoreMetricsTest to make RocksDBStateStore related streaming tests reliable ### What changes were proposed in this pull request? This PR increases streamingTimeout and the check interval for StateStoreMetricsTest to make RocksDBStateStore-related streaming tests reliable, hopefully. ### Why are the changes needed? ``` SPARK-35896: metrics in StateOperatorProgress are output correctly (RocksDBStateStore with changelog checkpointing) *** FAILED *** (1 minute) [info] Timed out waiting for stream: The code passed to failAfter did not complete within 60 seconds. [info] java.base/java.lang.Thread.getStackTrace(Thread.java:1619) ``` The probability of these tests failing is close to 100%, which seriously affects the UX of making PRs for the contributors. https://github.com/yaooqinn/spark/actions/runs/6744173341/job/18333952141 ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? this can be verified by `sql - slow test` job in CI ### Was this patch authored or co-authored using generative AI tooling? no Closes #43647 from yaooqinn/SPARK-44843. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit afdce266f0ffeb068d47eca2f2af1bcba66b0e95) Signed-off-by: Kent Yao <yao@apache.org> 03 November 2023, 17:17:09 UTC
c9c6d8a [SPARK-45751][DOCS] Update the default value for spark.executor.logs.rolling.maxRetainedFile **What changes were proposed in this pull request?** The PR updates the default value of 'spark.executor.logs.rolling.maxRetainedFiles' in configuration.html on the website **Why are the changes needed?** The default value of 'spark.executor.logs.rolling.maxRetainedFiles' is -1, but the website is wrong. **Does this PR introduce any user-facing change?** No **How was this patch tested?** It doesn't need to. **Was this patch authored or co-authored using generative AI tooling?** No Closes #43618 from chenyu-opensource/branch-SPARK-45751. Authored-by: chenyu <119398199+chenyu-opensource@users.noreply.github.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit e6b4fa835de3f6d0057bf3809ea369d785967bcd) Signed-off-by: Kent Yao <yao@apache.org> 01 November 2023, 09:13:41 UTC
ed3a74d [SPARK-45749][CORE][WEBUI] Fix `Spark History Server` to sort `Duration` column properly ### What changes were proposed in this pull request? This PR aims to fix an UI regression at Apache Spark 3.2.0 caused by SPARK-34123. From Apache Spark **3.2.0** to **3.5.0**, `Spark History Server` cannot sort `Duration` column. After this PR, Spark History Server can sort `Duration` column properly like Apache Spark 3.1.3 and before. ### Why are the changes needed? Before SPARK-34123, Apache Spark had the `title` attribute for sorting. - https://github.com/apache/spark/pull/31191 ``` <td><span title="{{durationMillisec}}">{{duration}}</span></td> ``` Without `title`, `title-numeric` doesn't work. ### Does this PR introduce _any_ user-facing change? No. This is a bug fix. ### How was this patch tested? Manual test. Please use `Safari Private Browsing ` or `Chrome Incognito` mode. <img width="96" alt="Screenshot 2023-10-31 at 5 47 34 PM" src="https://github.com/apache/spark/assets/9700541/8c8464d2-c58b-465c-8f98-edab1ec2317d"> <img width="94" alt="Screenshot 2023-10-31 at 5 47 29 PM" src="https://github.com/apache/spark/assets/9700541/03e8373d-bda3-4835-90ad-9a45670e853a"> ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43613 from dongjoon-hyun/SPARK-45749. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit f72510ca9e04ae88660346de440b231fc8225698) Signed-off-by: Kent Yao <yao@apache.org> 01 November 2023, 01:56:19 UTC
05f48d7 [SPARK-45735][PYTHON][CONNECT][TESTS] Reenable CatalogTests without Spark Connect ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/39214 that restores the original Catalog tests in PySpark. That PR mistakenly disabled the tests without Spark Connect: https://github.com/apache/spark/blob/fc6a5cca06cf15c4a952cb56720f627efdba7cce/python/pyspark/sql/tests/test_catalog.py#L489 ### Why are the changes needed? To restore the test coverage. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Reenabled unittests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43595 from HyukjinKwon/SPARK-45735. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 76d9a70932df97d8ea4cc6e279933dee29a88571) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 31 October 2023, 07:47:58 UTC
d8922f3 [SPARK-45678][CORE] Cover BufferReleasingInputStream.available/reset under tryOrFetchFailedException ### What changes were proposed in this pull request? This patch proposes to wrap `BufferReleasingInputStream.available/reset` under `tryOrFetchFailedException`. So `IOException` during `available`/`reset` call will be rethrown as `FetchFailedException`. ### Why are the changes needed? We have encountered shuffle data corruption issue: ``` Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:112) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:504) at org.xerial.snappy.Snappy.uncompress(Snappy.java:543) at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:450) at org.xerial.snappy.SnappyInputStream.available(SnappyInputStream.java:497) at org.apache.spark.storage.BufferReleasingInputStream.available(ShuffleBlockFetcherIterator.scala:1356) ``` Spark shuffle has capacity to detect corruption for a few stream op like `read` and `skip`, such `IOException` in the stack trace will be rethrown as `FetchFailedException` that will re-try the failed shuffle task. But in the stack trace it is `available` that is not covered by the mechanism. So no-retry has been happened and the Spark application just failed. As the `available`/`reset` op will also involve data decompression and throw `IOException`, we should be able to check it like `read` and `skip` do. ### Does this PR introduce _any_ user-facing change? Yes. Data corruption during `available`/`reset` op is now causing `FetchFailedException` like `read` and `skip` that can be retried instead of `IOException`. ### How was this patch tested? Added test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #43543 from viirya/add_available. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Chao Sun <sunchao@apple.com> 28 October 2023, 02:23:10 UTC
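A simplified sketch of the wrapping pattern described above (illustrative only; in Spark the rethrown exception is a `FetchFailedException`, which needs shuffle-block metadata to construct, so a plain RuntimeException stands in here):
```scala
import java.io.{IOException, InputStream}

// Wrap stream operations that may surface decompression errors as IOException,
// so the caller can turn them into a retryable fetch failure.
def tryOrRethrowAsFetchFailed[T](op: => T): T =
  try op
  catch {
    case e: IOException =>
      throw new RuntimeException("shuffle block corrupted; rethrow as fetch failure", e)
  }

def availableChecked(in: InputStream): Int = tryOrRethrowAsFetchFailed(in.available())
def resetChecked(in: InputStream): Unit    = tryOrRethrowAsFetchFailed(in.reset())
```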
9c5d7c4 [SPARK-45670][CORE][3.4] SparkSubmit does not support `--total-executor-cores` when deploying on K8s This is the cherry-pick of https://github.com/apache/spark/pull/43536 for branch-3.4 ### What changes were proposed in this pull request? Remove Kubernetes from the support list of `--total-executor-cores` in SparkSubmit ### Why are the changes needed? `--total-executor-cores` does not take effect in Spark on K8s, [the comments from original PR](https://github.com/apache/spark/pull/19717#discussion_r154568773) also proves that ### Does this PR introduce _any_ user-facing change? The output of `spark-submit --help` changed ```patch ... - Spark standalone, Mesos and Kubernetes only: + Spark standalone and Mesos only: --total-executor-cores NUM Total cores for all executors. ... ``` ### How was this patch tested? Pass GA and review. ### Was this patch authored or co-authored using generative AI tooling? No Closes #43549 from pan3793/SPARK-45670-3.4. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 October 2023, 06:25:16 UTC
75aed56 [SPARK-45652][SQL][3.4] SPJ: Handle empty input partitions after dynamic filtering This is a cherry-pick of https://github.com/apache/spark/pull/43531 to branch-3.4, with a few modifications. ### What changes were proposed in this pull request? Handle the case when input partitions become empty after V2 dynamic filtering, when SPJ is enabled. ### Why are the changes needed? Current in the situation when all input partitions are filtered out via dynamic filtering, SPJ doesn't work but instead will panic: ``` java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions$lzycompute(BatchScanExec.scala:108) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.filteredPartitions(BatchScanExec.scala:65) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:136) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:135) at org.apache.spark.sql.boson.BosonBatchScanExec.inputRDD$lzycompute(BosonBatchScanExec.scala:28) ``` This is because the `groupPartitions` method will return `None` in this scenario. We should handle the case. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a test case for this. ### Was this patch authored or co-authored using generative AI tooling? No Closes #43539 from sunchao/SPARK-45652-branch-3.4. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 October 2023, 20:54:24 UTC
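The failure in the stack trace above boils down to calling `.get` on a `None`; a plain Scala sketch of the defensive handling the fix adds (simplified, with `Seq[Int]` standing in for input partitions):
```scala
// groupPartitions() returning None when dynamic filtering removed every partition.
val grouped: Option[Seq[Seq[Int]]] = None

// grouped.get                                // java.util.NoSuchElementException: None.get
val partitions = grouped.getOrElse(Seq.empty) // treat "no groups" as an empty scan
```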
ecdb69f [SPARK-40154][PYTHON][DOCS] Correct storage level in Dataframe.cache docstring ### What changes were proposed in this pull request? Corrects the docstring `DataFrame.cache` to give the correct storage level after it changed with Spark 3.0. It seems that the docstring of `DataFrame.persist` was updated, but `cache` was forgotten. ### Why are the changes needed? The doctoring claims that `cache` uses serialised storage, but it actually uses deserialised storage. I confirmed that this is still the case with Spark 3.5.0 using the example code from the Jira ticket. ### Does this PR introduce _any_ user-facing change? Yes, the docstring changes. ### How was this patch tested? The Github actions workflow succeeded. ### Was this patch authored or co-authored using generative AI tooling? No Closes #43229 from paulstaab/SPARK-40154. Authored-by: Paul Staab <paulstaab@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 94607dd001b133a25dc9865f25b3f9e7f5a5daa3) Signed-off-by: Sean Owen <srowen@gmail.com> 25 October 2023, 12:36:34 UTC
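A quick spark-shell check of the behavior the corrected docstring describes (sketch shown from the Scala API, where `cache()` likewise defaults to `MEMORY_AND_DISK`, a deserialized level):
```scala
val df = spark.range(10).toDF("id")
df.cache()
// MEMORY_AND_DISK is a deserialized storage level:
println(df.storageLevel)  // e.g. StorageLevel(disk, memory, deserialized, 1 replicas)
```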
c730365 [SPARK-45588][SPARK-45640][SQL][TESTS][3.4] Fix flaky ProtobufCatalystDataConversionSuite ### What changes were proposed in this pull request? The pr aims to fix flaky ProtobufCatalystDataConversionSuite, include: - Fix the type check (when the random value was empty array, we didn't skip it. Original intention is to skip default values for types.) [SPARK-45588] - When data.get(0) is null, data.get(0).asInstanceOf[Array[Byte]].isEmpty will be thrown java.lang.NullPointerException. [SPARK-45640] Backport above to branch 3.4. Master branch pr: https://github.com/apache/spark/pull/43424 & https://github.com/apache/spark/pull/43493 ### Why are the changes needed? Fix flaky ProtobufCatalystDataConversionSuite. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA - Manually test ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43520 from panbingkun/branch-3.4_SPARK-45640. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 25 October 2023, 04:56:51 UTC
521c2e0 [SPARK-45430] Fix for FramelessOffsetWindowFunction when IGNORE NULLS and offset > rowCount ### What changes were proposed in this pull request? This is a fix for the failure when a function that utilizes `FramelessOffsetWindowFunctionFrame` is used with `ignoreNulls = true` and `offset > rowCount`. e.g. ``` select x, lead(x, 5) IGNORE NULLS over (order by x) from (select explode(sequence(1, 3)) x) ``` ### Why are the changes needed? Fix existing bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modified an existing unit test to cover this case ### Was this patch authored or co-authored using generative AI tooling? No Closes #43236 from vitaliili-db/SPARK-45430. Authored-by: Vitalii Li <vitalii.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 32e1e58411913517c87d7e75942437f4e1c1d40e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 October 2023, 07:15:42 UTC
b24bc5c [SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader ### What changes were proposed in this pull request? Currently, the read logical type is not checked while converting physical types INT64 into DateTime. One valid scenario where this can break is where the physical type is `timestamp_ntz`, and the logical type is `array<timestamp_ntz>`, since the logical type check does not happen, this conversion is allowed. However, the vectorized reader does not support this and will produce NPE on on-heap memory mode and SEGFAULT on off-heap memory mode. Segmentation fault on off-heap memory mode can be prevented by having an explicit boundary check on OffHeapColumnVector, but this is outside of the scope of this PR, and will be done here: https://github.com/apache/spark/pull/43452. ### Why are the changes needed? Prevent NPE or Segfault from happening. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? A new test is added in `ParquetSchemaSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43451 from majdyz/SPARK-45604. Lead-authored-by: Zamil Majdy <zamil.majdy@databricks.com> Co-authored-by: Zamil Majdy <zamil.majdy@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 13b67ee8cc377a5cc47d02b9addbc00eabfc8b6c) Signed-off-by: Max Gekk <max.gekk@gmail.com> 22 October 2023, 05:53:54 UTC
e2911e7 [MINOR][SQL] Remove signature from Hive thriftserver exception ### What changes were proposed in this pull request? Don't return expected signature to caller in Hive thriftserver exception ### Why are the changes needed? Please see private discussion ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #43402 from srowen/HiveCookieSigner. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit cf59b1f51c16301f689b4e0f17ba4dbd140e1b19) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 October 2023, 23:11:16 UTC
03b7f7d [SPARK-45568][TESTS] Fix flaky WholeStageCodegenSparkSubmitSuite ### What changes were proposed in this pull request? WholeStageCodegenSparkSubmitSuite is [flaky](https://github.com/apache/spark/actions/runs/6479534195/job/17593342589) because SHUFFLE_PARTITIONS(200) creates 200 reducers for one total core and improper stop progress causes executor launcher retries. The heavy load and retries might result in timeout test failures. ### Why are the changes needed? CI robustness ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing WholeStageCodegenSparkSubmitSuite ### Was this patch authored or co-authored using generative AI tooling? no Closes #43394 from yaooqinn/SPARK-45568. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit f00ec39542a5f9ac75d8c24f0f04a7be703c8d7c) Signed-off-by: Kent Yao <yao@apache.org> 17 October 2023, 14:23:14 UTC
d0eaebb [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access Cleaner on Java 9+ This PR adds `--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED` to our JVM flags so that we can access `jdk.internal.ref.Cleaner` in JDK 9+. This allows Spark to allocate direct memory while ignoring the JVM's MaxDirectMemorySize limit. Spark uses JDK internal APIs to directly construct DirectByteBuffers while bypassing that limit, but there is a fallback path at https://github.com/apache/spark/blob/v3.5.0/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java#L213 that is used if we cannot reflectively access the `Cleaner` API. No. Added a unit test in `PlatformUtilSuite`. No. Closes #43344 from JoshRosen/SPARK-45508. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 96bac6c033b5bb37101ebcd8436ab9a84db8e092) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 146fba1a22e3f1555f3e4494522810030f9a7854) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 13 October 2023, 16:53:53 UTC
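A hedged sketch of how an application can pass the flag through the standard extra-JVM-options settings (the configuration keys are real; whether you need them depends on your JDK and deployment mode). Note that driver options generally must be set before the driver JVM starts, e.g. via `spark-defaults.conf` or `spark-submit --conf`, rather than programmatically in client mode.

```scala
import org.apache.spark.sql.SparkSession

// The flag that lets Platform reach jdk.internal.ref.Cleaner on JDK 9+ instead of
// taking the fallback path mentioned above.
val opens = "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED"

val spark = SparkSession.builder()
  .appName("add-opens-example")
  .config("spark.driver.extraJavaOptions", opens)   // effective when set before driver launch
  .config("spark.executor.extraJavaOptions", opens) // applied to executor JVMs
  .getOrCreate()
```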
f985d71 [SPARK-45433][SQL][3.4] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat ### What changes were proposed in this pull request? This is a backport PR of #43243. Fix the bug of schema inference when timestamps do not match specified timestampFormat. Please check #43243 for detail. ### Why are the changes needed? Fix schema inference bug on 3.4. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? add new test. ### Was this patch authored or co-authored using generative AI tooling? Closes #43343 from Hisoka-X/backport-SPARK-45433-inference-schema. Authored-by: Jia Fan <fanjiaeminem@qq.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 12 October 2023, 12:09:48 UTC
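A small Scala sketch of the scenario this backport covers (the file contents and format string below are invented for illustration): one value matches the user-supplied `timestampFormat` and one does not, and with the fix the inference should fall back rather than mis-type the column.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val dir = Files.createTempDirectory("ts-infer").toString
Files.write(Paths.get(dir, "data.csv"), "2023-10-12 10:00:00\nnot-a-timestamp\n".getBytes)

val df = spark.read
  .option("inferSchema", "true")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv(dir)

// Expected with the fix: the mixed column is inferred as string instead of being mis-typed.
df.printSchema()
```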
2019af9 [SPARK-45473][SQL][3.4] Fix incorrect error message for RoundBase ### What changes were proposed in this pull request? This minor patch fixes incorrect error message of `RoundBase`. ### Why are the changes needed? Fix incorrect error message. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #43316 from viirya/minor_fix-3.4. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 October 2023, 02:50:10 UTC
96f20a1 [SPARK-45389][SQL][3.4] Correct MetaException matching rule on getting partition metadata This is the backport of https://github.com/apache/spark/pull/43191 for `branch-3.4` ### What changes were proposed in this pull request? This PR aims to fix the HMS call fallback logic introduced in SPARK-35437. ```patch try { ... hive.getPartitionNames ... hive.getPartitionsByNames } catch { - case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] => + case ex: HiveException if ex.getCause.isInstanceOf[MetaException] => ... } ``` ### Why are the changes needed? A direct method call won't throw `InvocationTargetException`; checking the code of `hive.getPartitionNames` and `hive.getPartitionsByNames`, both of them wrap the error in a `HiveException` if a `MetaException` is thrown. ### Does this PR introduce _any_ user-facing change? Yes, it should be a bug fix. ### How was this patch tested? Pass GA and code review. (I'm not sure how to construct/simulate a MetaException during the HMS thrift call with the current HMS testing infrastructure) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43281 from pan3793/SPARK-45389-3.4. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: yangjie01 <yangjie01@baidu.com> 10 October 2023, 06:40:24 UTC
b785020 [SPARK-44729][PYTHON][DOCS][3.4] Add canonical links to the PySpark docs page ### What changes were proposed in this pull request? The pr aims to add canonical links to the PySpark docs page, backport this to branch 3.4. Master branch pr: https://github.com/apache/spark/pull/42425. ### Why are the changes needed? Backport this to branch 3.4. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. - Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43285 from panbingkun/branch-3.4_SPARK-44729. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 October 2023, 20:48:37 UTC
fd825d8 [MINOR][DOCS] Update `CTAS` with `LOCATION` behavior with Spark 3.2+ ### What changes were proposed in this pull request? This PR aims to update `CTAS` with `LOCATION` behavior according to Spark 3.2+. ### Why are the changes needed? SPARK-28551 changed the behavior at Apache Spark 3.2.0. https://github.com/apache/spark/blob/24b82dfd6cfb9a658af615446be5423695830dd9/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2306-L2313 ### Does this PR introduce _any_ user-facing change? No. This is a documentation fix. ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43277 from dongjoon-hyun/minor_ctas. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 2d6d09b71e77b362a4c774170e2ca992a31fb1ea) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 October 2023, 11:45:17 UTC
e056fc8 [SPARK-44074][CORE][SQL][TESTS][3.4] Fix loglevel restore behavior of `SparkFunSuite#withLogAppender` and re-enable UT `Logging plan changes for execution` ### What changes were proposed in this pull request? The main change of this pr is to add a call to `SparkFunSuite#wupdateLoggers` after restoring the log level when the 'level' of the `withLogAppender` function is not `None`; with this change, the UT `Logging plan changes for execution` disabled in https://github.com/apache/spark/pull/43141 can be re-enabled. ### Why are the changes needed? - Fix a bug in `SparkFunSuite#withLogAppender` when 'level' is not None - Re-enable UT `Logging plan changes for execution` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual test ``` build/sbt "sql/testOnly org.apache.spark.sql.JoinHintSuite org.apache.spark.sql.execution.QueryExecutionSuite" ``` **Before** ``` [info] - Logging plan changes for execution *** FAILED *** (36 milliseconds) [info] testAppender.loggingEvents.exists(((x$10: org.apache.logging.log4j.core.LogEvent) => x$10.getMessage().getFormattedMessage().contains(expectedMsg))) was false (QueryExecutionSuite.scala:232) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) [info] at org.apache.spark.sql.execution.QueryExecutionSuite.$anonfun$new$34(QueryExecutionSuite.scala:232) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.apache.spark.sql.execution.QueryExecutionSuite.$anonfun$new$31(QueryExecutionSuite.scala:231) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) ... ``` The failure reason is that `withLogAppender(hintAppender, level = Some(Level.WARN))` is used in `JoinHintSuite`, but `SparkFunSuite#wupdateLoggers` doesn't correctly restore the log level. The test was successful before SPARK-44034 because `AdaptiveQueryExecSuite` ran between `JoinHintSuite` and `QueryExecutionSuite` and called `withLogAppender(hintAppender, level = Some(Level.DEBUG))`, but `AdaptiveQueryExecSuite` moved to the `slow sql` test group after SPARK-44034 **After** ``` [info] Run completed in 7 seconds, 485 milliseconds. [info] Total number of tests run: 32 [info] Suites: completed 2, aborted 0 [info] Tests: succeeded 32, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` Closes #43157 from LuciferYang/SPARK-44074-34. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 29 September 2023, 06:45:59 UTC
e706ba1 [SPARK-45227][CORE] Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend ### What changes were proposed in this pull request? Fix a subtle thread-safety issue with CoarseGrainedExecutorBackend where an executor process randomly gets stuck ### Why are the changes needed? For each executor, the single-threaded dispatcher can run into an "infinite loop" (as explained in the SPARK-45227). Once an executor process runs into a state, it'd stop launching tasks from the driver or reporting task status back. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ``` $ build/mvn package -DskipTests -pl core $ build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.executor.CoarseGrainedExecutorBackendSuite test ``` ### Was this patch authored or co-authored using generative AI tooling? No ****************************************************************************** **_Please feel free to skip reading unless you're interested in details_** ****************************************************************************** ### Symptom Our Spark 3 app running on EMR 6.10.0 with Spark 3.3.1 got stuck in the very last step of writing a data frame to S3 by calling `df.write`. Looking at Spark UI, we saw that an executor process hung over 1 hour. After we manually killed the executor process, the app succeeded. Note that the same EMR cluster with two worker nodes was able to run the same app without any issue before and after the incident. Below is what's observed from relevant container logs and thread dump. - A regular task that's sent to the executor, which also reported back to the driver upon the task completion. ``` $zgrep 'task 150' container_1694029806204_12865_01_000001/stderr.gz 23/09/12 18:13:55 INFO TaskSetManager: Starting task 150.0 in stage 23.0 (TID 923) (ip-10-0-185-107.ec2.internal, executor 3, partition 150, NODE_LOCAL, 4432 bytes) taskResourceAssignments Map() 23/09/12 18:13:55 INFO TaskSetManager: Finished task 150.0 in stage 23.0 (TID 923) in 126 ms on ip-10-0-185-107.ec2.internal (executor 3) (16/200) $zgrep ' 923' container_1694029806204_12865_01_000004/stderr.gz 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 923 $zgrep 'task 150' container_1694029806204_12865_01_000004/stderr.gz 23/09/12 18:13:55 INFO Executor: Running task 150.0 in stage 23.0 (TID 923) 23/09/12 18:13:55 INFO Executor: Finished task 150.0 in stage 23.0 (TID 923). 4495 bytes result sent to driver ``` - Another task that's sent to the executor but didn't get launched since the single-threaded dispatcher was stuck (presumably in an "infinite loop" as explained later). ``` $zgrep 'task 153' container_1694029806204_12865_01_000001/stderr.gz 23/09/12 18:13:55 INFO TaskSetManager: Starting task 153.0 in stage 23.0 (TID 924) (ip-10-0-185-107.ec2.internal, executor 3, partition 153, NODE_LOCAL, 4432 bytes) taskResourceAssignments Map() $zgrep ' 924' container_1694029806204_12865_01_000004/stderr.gz 23/09/12 18:13:55 INFO YarnCoarseGrainedExecutorBackend: Got assigned task 924 $zgrep 'task 153' container_1694029806204_12865_01_000004/stderr.gz >> note that the above command has no matching result, indicating that task 153.0 in stage 23.0 (TID 924) was never launched ``` - Thread dump shows that the dispatcher-Executor thread has the following stack trace. 
``` "dispatcher-Executor" #40 daemon prio=5 os_prio=0 tid=0x0000ffff98e37800 nid=0x1aff runnable [0x0000ffff73bba000] java.lang.Thread.State: RUNNABLE at scala.runtime.BoxesRunTime.equalsNumObject(BoxesRunTime.java:142) at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:131) at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123) at scala.collection.mutable.HashTable.elemEquals(HashTable.scala:365) at scala.collection.mutable.HashTable.elemEquals$(HashTable.scala:365) at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:44) at scala.collection.mutable.HashTable.findEntry0(HashTable.scala:140) at scala.collection.mutable.HashTable.findOrAddEntry(HashTable.scala:169) at scala.collection.mutable.HashTable.findOrAddEntry$(HashTable.scala:167) at scala.collection.mutable.HashMap.findOrAddEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.put(HashMap.scala:126) at scala.collection.mutable.HashMap.update(HashMap.scala:131) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:200) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox$$Lambda$323/1930826709.apply$mcV$sp(Unknown Source) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) ``` ### Relevant code paths Within an executor process, there's a [dispatcher thread](https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/rpc/netty/MessageLoop.scala#L170) dedicated to CoarseGrainedExecutorBackend(a single RPC endpoint) that launches tasks scheduled by the driver. Each task is run on a TaskRunner thread backed by a thread pool created for the executor. The TaskRunner thread and the dispatcher thread are different. However, they read and write a common object (i.e., taskResources) that's a mutable hashmap without thread-safety, in [Executor](https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/executor/Executor.scala#L561) and [CoarseGrainedExecutorBackend](https://github.com/apache/spark/blob/1fdd46f173f7bc90e0523eb0a2d5e8e27e990102/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L189), respectively. ### What's going on? Based on the above observations, our hypothesis is that the dispatcher thread runs into an "infinite loop" due to a race condition when two threads access the same hashmap object. 
For illustration purpose, let's consider the following scenario where two threads (Thread 1 and Thread 2) access a hash table without thread-safety - Thread 1 sees A.next = B, but then yields execution to Thread 2 <img src="https://issues.apache.org/jira/secure/attachment/13063040/13063040_hashtable1.png" width="400"> - Thread 2 triggers a resize operation resulting in B.next = A (Note that hashmap doesn't care about ordering), and then yields execution to Thread 1. <img src="https://issues.apache.org/jira/secure/attachment/13063041/13063041_hashtable2.png" width="400"> - After taking over CPU, Thread 1 would run into an "infinite loop" when traversing the list in the last bucket, given A.next = B and B.next = A in its view. Closes #43021 from xiongbo-sjtu/master. Authored-by: Bo Xiong <xiongbo@amazon.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 8e6b1603a66706ee27a0b16d850f5ee56d633354) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 29 September 2023, 03:54:26 UTC
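The root hazard described above is two threads mutating an unsynchronized `mutable.HashMap`. A minimal sketch of the general remedy, swapping in a concurrent map; this only illustrates the pattern and is not the literal patch applied to `CoarseGrainedExecutorBackend`:

```scala
import java.util.concurrent.ConcurrentHashMap

// taskResources-like state shared between the dispatcher thread (writer) and the
// task runner threads (reader/remover). A ConcurrentHashMap keeps its bucket
// structure consistent under concurrent resize, so a reader can never chase a
// corrupted bucket chain into an infinite loop.
final case class ResourceInfo(addresses: Seq[String])

val taskResources = new ConcurrentHashMap[Long, Map[String, ResourceInfo]]()

// Dispatcher thread: register resources when a task is launched.
taskResources.put(923L, Map("gpu" -> ResourceInfo(Seq("0"))))

// Task runner thread: look up and release when the task finishes.
Option(taskResources.remove(923L)).foreach { res =>
  println(s"released: ${res.keys.mkString(",")}")
}
```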
68db395 [SPARK-45057][CORE] Avoid acquire read lock when keepReadLock is false ### What changes were proposed in this pull request? Add a `keepReadLock` parameter to `lockNewBlockForWriting()`. When `keepReadLock` is `false`, skip `lockForReading()` to avoid blocking on the read lock or a potential deadlock. When 2 tasks try to compute the same RDD with a replication level of 2 while running on only 2 executors, a deadlock will happen; see [SPARK-45057] for details. The task thread holds the write lock and waits for replication to the remote executor, while the shuffle server thread handling the block upload request waits on `lockForReading` in [BlockInfoManager.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala#L457C24-L457C24) ### Why are the changes needed? This saves an unnecessary read lock acquisition and avoids the deadlock issue mentioned above. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT in BlockInfoManagerSuite ### Was this patch authored or co-authored using generative AI tooling? No Closes #43067 from warrenzhu25/deadlock. Authored-by: Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 0d6fda5bbee99f9d1821952195efc6764816ec2f) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 28 September 2023, 23:51:55 UTC
85bf705 [SPARK-44937][CORE] Mark connection as timedOut in TransportClient.close ### What changes were proposed in this pull request? This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in https://github.com/apache/spark/pull/9853. ### Why are the changes needed? Without this change, some of the new tests added in https://github.com/apache/spark/pull/42685 would fail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests were run in CI. Without this change, some of the new tests added in https://github.com/apache/spark/pull/42685 fail ### Was this patch authored or co-authored using generative AI tooling? No Closes #43162 from hasnain-db/spark-tls-timeout. Authored-by: Hasnain Lakhani <hasnain.lakhani@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2a88feadd4b7cec9e01bc744e589783e3390e5ce) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 28 September 2023, 23:17:13 UTC
44c79d9 [SPARK-44034][TESTS][3.4] Add a new test group for sql module ### What changes were proposed in this pull request? The purpose of this pr is to add a new test tag `SlowSQLTest` to the sql module, and identified some Suites with test cases more than 3 seconds, and apply it to GA testing task to reduce the testing pressure of the `sql others` group. ### Why are the changes needed? For a long time, the sql module UTs has only two groups: `slow` and `others`. The test cases in group `slow` are fixed, while the number of test cases in group `others` continues to increase, which has had a certain impact on the testing duration and stability of group `others`. So this PR proposes to add a new testing group to share the testing pressure of `sql others` group, which has made the testing time of the three groups more average, and hope it can improve the stability of the GA task. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Should monitor GA ### Was this patch authored or co-authored using generative AI tooling? No Closes #43141 from LuciferYang/SPARK-44034-34. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 28 September 2023, 02:10:51 UTC
20924aa [SPARK-45286][DOCS] Add back Matomo analytics ### What changes were proposed in this pull request? Add analytics to doc pages using the ASF's Matomo service ### Why are the changes needed? We had previously removed Google Analytics from the website and release docs, per ASF policy: https://github.com/apache/spark/pull/36310 We just restored analytics using the ASF-hosted Matomo service on the website: https://github.com/apache/spark-website/commit/a1548627b48a62c2e51870d1488ca3e09397bd30 This change would put the same new tracking code back into the release docs. It would let us see what docs and resources are most used, I suppose. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No Closes #43063 from srowen/SPARK-45286. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit a881438114ea3e8e918d981ef89ed1ab956d6fca) Signed-off-by: Sean Owen <srowen@gmail.com> 24 September 2023, 19:18:16 UTC
b581eeb [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB` ### What changes were proposed in this pull request? This pr changes the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB` ### Why are the changes needed? SPARK-42277 changed to use `RocksDB` for `spark.history.store.hybridStore.diskBackend` by default, but in `monitoring.md`, the default value is still documented as `LEVELDB`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #43015 from LuciferYang/SPARK-45237. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f1bc0f938162485a96de5788f53f9fa4fb37a3b1) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 September 2023, 15:47:12 UTC
796d878 [SPARK-45210][DOCS][3.4] Switch languages consistently across docs for all code snippets (Spark 3.4 and below) ### What changes were proposed in this pull request? This PR proposes to recover the availability of switching languages consistently across docs for all code snippets in Spark 3.4 and below by using the proper class selector in the jQuery code. Previously the selector was a string `.nav-link tab_python` which did not comply with multiple class selection: https://www.w3.org/TR/CSS21/selector.html#class-html. I assume it worked as a legacy behaviour somewhere. Now it uses the standard way `.nav-link.tab_python`. Note that https://github.com/apache/spark/pull/42657 works because there's only a single class assigned (after we refactored the site at https://github.com/apache/spark/pull/40269) ### Why are the changes needed? This is a regression in our documentation site. ### Does this PR introduce _any_ user-facing change? Yes, once you click the language tab, it will apply to the examples in the whole page. ### How was this patch tested? Manually tested after building the site. ![Screenshot 2023-09-19 at 12 08 17 PM](https://github.com/apache/spark/assets/6477701/09d0c117-9774-4404-8e2e-d454b7f700a3) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42989 from HyukjinKwon/SPARK-45210. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 19 September 2023, 05:51:27 UTC
e5d8287 [SPARK-44910][SQL][3.4] Encoders.bean does not support superclasses with generic type arguments ### What changes were proposed in this pull request? This pull request adds Encoders.bean support for beans having a superclass declared with generic type arguments. For example: ``` class JavaBeanWithGenericsA<T> { public T getPropertyA() { return null; } public void setPropertyA(T a) { } } class JavaBeanWithGenericBase extends JavaBeanWithGenericsA<String> { } Encoders.bean(JavaBeanWithGenericBase.class); // Exception ``` That feature had to be part of [PR 42327](https://github.com/apache/spark/commit/1f5d78b5952fcc6c7d36d3338a5594070e3a62dd) but was missing as I was focusing on nested beans only (hvanhovell ) ### Why are the changes needed? JavaTypeInference.encoderFor did not solve TypeVariable objects for superclasses so when managing a case like in the example above an exception was thrown. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests have been extended, new specific tests have been added ### Was this patch authored or co-authored using generative AI tooling? No hvanhovell this is branch-3.4 port of [PR-42634](https://github.com/apache/spark/pull/42634) Closes #42914 from gbloisi-openaire/SPARK-44910-branch-3.4. Authored-by: Giambattista Bloisi <gbloisi@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 19 September 2023, 04:38:42 UTC
66f5f96 [SPARK-45081][SQL][3.4] Encoders.bean does no longer work with read-only properties ### What changes were proposed in this pull request? This PR re-enables Encoders.bean to be called against beans having read-only properties, that is properties that have only getters and no setter method. Beans with read only properties are even used in internal tests. Setter methods of a Java bean encoder are stored within an Option wrapper because they are missing in case of read-only properties. When a java bean has to be initialized, setter methods for the bean properties have to be called: this PR filters out read-only properties from that process. ### Why are the changes needed? The changes are required to avoid an exception to the thrown by getting the value of a None option object. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? An additional regression test has been added ### Was this patch authored or co-authored using generative AI tooling? No hvanhovell this is 3.4 branch port of [PR-42829](https://github.com/apache/spark/pull/42829) Closes #42913 from gbloisi-openaire/SPARK-45081-branch-3.4. Authored-by: Giambattista Bloisi <gbloisi@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 September 2023, 20:39:54 UTC
464811a [SPARK-45078][SQL][3.4] Fix `array_insert` ImplicitCastInputTypes not work ### What changes were proposed in this pull request? This is a backport PR for https://github.com/apache/spark/pull/42951, to fix `array_insert` ImplicitCastInputTypes not working. ### Why are the changes needed? Fix incorrect behavior in `array_insert` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add new test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42960 from Hisoka-X/arrayinsert-fix-3.4. Authored-by: Jia Fan <fanjiaeminem@qq.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 September 2023, 18:55:23 UTC
b4eb947 [SPARK-45187][CORE] Fix `WorkerPage` to use the same pattern for `logPage` urls ### What changes were proposed in this pull request? This PR aims to use the same pattern for `logPage` urls of `WorkerPage` to make it work consistently when `spark.ui.reverseProxy=true`. ### Why are the changes needed? Since Apache Spark 3.2.0 (SPARK-34635, #31753), Apache Spark adds trailing slashes to reduce redirections for `logPage`. ```scala s"$workerUrlRef/logPage?driverId=$driverId&logType=stdout") s"$workerUrlRef/logPage/?driverId=$driverId&logType=stdout") ... <a href={s"$workerUrlRef/logPage?appId=${executor <a href={s"$workerUrlRef/logPage/?appId=${executor ``` This PR aims to fix a leftover in `WorkerPage` to make it work consistently in case of the reverse proxy situation via `spark.ui.reverseProxy`. Currently, in some proxy environments, `appId` link is working but `driverId` link is broken due to the redirections. This inconsistent behavior makes the users confused. ``` - <a href={s"$workerUrlRef/logPage?driverId=${driver.driverId}&logType=stdout"}>stdout</a> - <a href={s"$workerUrlRef/logPage?driverId=${driver.driverId}&logType=stderr"}>stderr</a> + <a href={s"$workerUrlRef/logPage/?driverId=${driver.driverId}&logType=stdout"}>stdout</a> + <a href={s"$workerUrlRef/logPage/?driverId=${driver.driverId}&logType=stderr"}>stderr</a> ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual tests because it requires a reverse proxy. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42959 from dongjoon-hyun/SPARK-45187. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f8f2735426ee7ad3d7a1f5bd07e72643516f4a35) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 84a053e72ac9d9cfc91bab777cea94958d3a91da) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 September 2023, 17:35:26 UTC
1202094 [SPARK-45127][DOCS] Exclude README.md from document build ### What changes were proposed in this pull request? The pr aims to exclude `README.md` from document build. ### Why are the changes needed? - Currently, our document `README.html` does not have any CSS style applied to it, as shown below: https://spark.apache.org/docs/latest/README.html <img width="1432" alt="image" src="https://github.com/apache/spark/assets/15246973/1dfe5f69-30d9-4ce4-8d82-1bba5e721ccd"> **If we do not intend to display the above page to users, we should remove it during the document build process.** - As we saw in the project `spark-website`, it has already set the following configuration: https://github.com/apache/spark-website/blob/642d1fb834817014e1799e73882d53650c1c1662/_config.yml#L7 <img width="720" alt="image" src="https://github.com/apache/spark/assets/15246973/421b7be5-4ece-407e-9d49-8e7487b74a47"> Let's stay consistent. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Manually test. After this pr, the README.html file will no longer be generated ``` (base) panbingkun:~/Developer/spark/spark-community/docs/_site$ls -al README.html ls: README.html: No such file or directory ``` - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42883 from panbingkun/SPARK-45127. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 804f741453fb146b5261084fa3baf26631badb79) Signed-off-by: Sean Owen <srowen@gmail.com> 16 September 2023, 14:04:55 UTC
7544bdb [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode' **What changes were proposed in this pull request?** The PR updates the default value of 'spark.submit.deployMode' in configuration.html on the website **Why are the changes needed?** The default value of 'spark.submit.deployMode' is 'client', but the website is wrong. **Does this PR introduce any user-facing change?** No **How was this patch tested?** It doesn't need to. **Was this patch authored or co-authored using generative AI tooling?** No Closes #42902 from chenyu-opensource/branch-SPARK-45146. Authored-by: chenyu-opensource <119398199+chenyu-opensource@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 076cb7aabac2f0ff11ca77ca530b7b8db5310a5e) Signed-off-by: Sean Owen <srowen@gmail.com> 13 September 2023, 13:48:36 UTC
5440178 [SPARK-45109][SQL][CONNECT][3.4] Fix log function in Connect ### What changes were proposed in this pull request? This is a backport PR of https://github.com/apache/spark/pull/42869 as the one-argument `log` function should point to `ln`. (Please note that the original https://github.com/apache/spark/pull/42863 doesn't need to be backported as `aes_decrypt` and `ln` are not implemented in Connect in Spark 3.4.) ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No, these Spark Connect functions haven't been released. ### How was this patch tested? Existing UTs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42872 from peter-toth/SPARK-45109-fix-log-3.4. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 September 2023, 07:23:49 UTC
ed6c1b3 [SPARK-45075][SQL][3.4] Fix alter table with invalid default value will not report error ### What changes were proposed in this pull request? This is a backporting PR to branch-3.4 from https://github.com/apache/spark/pull/42810 Changed the way the error is asserted to adapt to 3.4 ### Why are the changes needed? Fix bug on 3.4 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add new test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42876 from Hisoka-X/SPARK-45075_followup_3.4_alter_column. Authored-by: Jia Fan <fanjiaeminem@qq.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 September 2023, 06:28:20 UTC
e3b8402 Revert "[SPARK-45075][SQL] Fix alter table with invalid default value will not report error" This reverts commit 4a181e5eacfb8103cce50decaeabdd1441dca676. 11 September 2023, 16:05:29 UTC
4a181e5 [SPARK-45075][SQL] Fix alter table with invalid default value will not report error ### What changes were proposed in this pull request? This PR makes sure ALTER TABLE ALTER COLUMN with an invalid default value on DataSource V2 will report an error; before this PR, the alter would succeed. ### Why are the changes needed? Fix the incorrect behavior of the ALTER TABLE statement on DataSource V2. ### Does this PR introduce _any_ user-facing change? Yes, an invalid default value will now report an error. ### How was this patch tested? Add new test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42810 from Hisoka-X/SPARK-45075_alter_invalid_default_value_on_v2. Authored-by: Jia Fan <fanjiaeminem@qq.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 4dd4737b34b1215e374674b777c2eb8906a29ed7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 September 2023, 20:18:10 UTC
bf83dfa [SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present. The following query gets incorrect results: ``` drop table if exists t1; create table t1 using parquet as select * from values (named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2))) as (value); select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1; {"f1":[1.0,2.0,3.0],"f2":[0,0,0]} ``` The result should be: ``` {"f1":[1.0,2.0,3.0],"f2":[1,2,3]} ``` The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary: ``` java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet ... row group 1: RC:1 TS:112 OFFSET:4 ------------------------------------------------------------------------------------------------------------------------------------------------------- value: .f1: ..list: ...element: INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0] .f2: ..list: ...element: INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0] ``` The same bug also occurs when field f2 is a map. This PR fixes that case as well. No, except for fixing the correctness issue. New tests. No. Closes #42850 from bersprockets/vector_oddity. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit fac236e1350d1c71dd772251709db3af877a69c2) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 September 2023, 20:03:39 UTC
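To see why skipping the dictionary corrupts values, here is a tiny self-contained Scala model of dictionary encoding (deliberately simplified; the real `OnHeapColumnVector`/`OffHeapColumnVector` code is far more involved). The cells physically stored in the vector are dictionary ids, so a bulk getter that skips the lookup hands back ids instead of data.

```scala
// Simplified model of a dictionary-encoded int column.
val dictionary = Array(1, 2, 3)    // distinct values written by the Parquet writer
val dictionaryIds = Array(0, 1, 2) // per-row ids actually stored in the column vector

// Buggy bulk read: ignores the dictionary and returns the raw ids.
def getIntsIgnoringDict(rowId: Int, count: Int): Array[Int] =
  dictionaryIds.slice(rowId, rowId + count)

// Fixed bulk read: decodes every id through the dictionary.
def getInts(rowId: Int, count: Int): Array[Int] =
  dictionaryIds.slice(rowId, rowId + count).map(id => dictionary(id))

println(getIntsIgnoringDict(0, 3).mkString(",")) // 0,1,2 -> not the real values
println(getInts(0, 3).mkString(","))             // 1,2,3 -> correct
```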
d50a3ce [SPARK-43203][SQL][3.4] Move all Drop Table case to DataSource V2 ### What changes were proposed in this pull request? Cherry-pick of #41348 and #42056; this is a bug fix that should be included in 3.4.2 ### Why are the changes needed? Fix DROP table behavior in session catalog ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested by: - V2 table catalog tests: `org.apache.spark.sql.execution.command.v2.DropTableSuite` - V1 table catalog tests: `org.apache.spark.sql.execution.command.v1.DropTableSuiteBase` Closes #41765 from Hisoka-X/move_drop_table_v2_to_3.4.2. Lead-authored-by: Jia Fan <fanjiaeminem@qq.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 September 2023, 15:22:17 UTC
9053931 [SPARK-45100][SQL][3.4] Fix an internal error from `reflect()`on `NULL` class and method ### What changes were proposed in this pull request? In the PR, I propose to check that the `class` and `method` arguments are not a NULL in `CallMethodViaReflection`. And if they are, throw an `AnalysisException` with new error class `DATATYPE_MISMATCH.UNEXPECTED_NULL`. This is a backport of https://github.com/apache/spark/pull/42849. ### Why are the changes needed? To fix the issue demonstrated by the example: ```sql $ spark-sql (default)> select reflect('java.util.UUID', CAST(NULL AS STRING)); [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running new test: ``` $ build/sbt "test:testOnly *.MiscFunctionsSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Authored-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit fd424caf6c46e7030ac2deb2afbe3f4a5fc1095c) Closes #42855 from MaxGekk/fix-internal-error-in-reflect-3.4. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 September 2023, 15:08:13 UTC
d535462 [SPARK-45103][BUILD][3.4] Update ORC to 1.8.5 ### What changes were proposed in this pull request? This PR aims to upgrade Apache ORC to 1.8.5 for Apache Spark 3.4.2. Please note that the Apache ORC community maintains - Apache ORC 1.8.x for Apache Spark 3.4.x - Apache ORC 1.9.x for Apache Spark 3.5.x - Apache ORC 2.0.x for Apache Spark 4.0.x ### Why are the changes needed? To bring the latest bug fixes like [ORC-1482](https://issues.apache.org/jira/browse/ORC-1482). - https://orc.apache.org/news/2023/09/05/ORC-1.8.5/ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42851 from dongjoon-hyun/SPARK-45103. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 September 2023, 22:37:17 UTC
f0b4215 [SPARK-45079][SQL] Fix an internal error from `percentile_approx()`on `NULL` accuracy ### What changes were proposed in this pull request? In the PR, I propose to check the `accuracy` argument is not a NULL in `ApproximatePercentile`. And if it is, throw an `AnalysisException` with new error class `DATATYPE_MISMATCH.UNEXPECTED_NULL`. ### Why are the changes needed? To fix the issue demonstrated by the example: ```sql $ spark-sql (default)> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), NULL) FROM VALUES (0), (1), (2), (10) AS tab(col); [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running new test: ``` $ build/sbt "test:testOnly *.ApproximatePercentileQuerySuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42817 from MaxGekk/fix-internal-error-in-percentile_approx. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 24b29adcf53616067a9fa2ca201e3f4d2f54436b) Signed-off-by: Max Gekk <max.gekk@gmail.com> 06 September 2023, 07:33:09 UTC
a96804b [SPARK-45071][SQL] Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data ### What changes were proposed in this pull request? Since `BinaryArithmetic#dataType` will recursively process the datatype of each node, the driver will be very slow when multiple columns are processed. For example, the following code: ```scala import spark.implicits._ import scala.util.Random import org.apache.spark.sql.Row import org.apache.spark.sql.functions.sum import org.apache.spark.sql.types.{StructType, StructField, IntegerType} val N = 30 val M = 100 val columns = Seq.fill(N)(Random.alphanumeric.take(8).mkString) val data = Seq.fill(M)(Seq.fill(N)(Random.nextInt(16) - 5)) val schema = StructType(columns.map(StructField(_, IntegerType))) val rdd = spark.sparkContext.parallelize(data.map(Row.fromSeq(_))) val df = spark.createDataFrame(rdd, schema) val colExprs = columns.map(sum(_)) // gen a new column , and add the other 30 column df.withColumn("new_col_sum", expr(columns.mkString(" + "))) ``` This code will take a few minutes for the driver to execute in the spark3.4 version, but only takes a few seconds to execute in the spark3.2 version. Related issue: [SPARK-39316](https://github.com/apache/spark/pull/36698) ### Why are the changes needed? Optimize the processing speed of `BinaryArithmetic#dataType` when processing multi-column data ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? manual testing ### Was this patch authored or co-authored using generative AI tooling? no Closes #42804 from zzzzming95/SPARK-45071. Authored-by: zzzzming95 <505306252@qq.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 16e813cecd55490a71ef6c05fca2209fbdae078f) Signed-off-by: Yuming Wang <yumwang@ebay.com> 06 September 2023, 03:39:14 UTC
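A compact illustration of the performance cliff, using a hypothetical toy expression tree rather than Spark's `BinaryArithmetic`: when a node's type derivation re-evaluates its children's types on every access, the cost of a left-deep chain roughly doubles per extra term, while caching the result per node keeps it linear.

```scala
sealed trait Expr { def dataType: String }
final case class Leaf(name: String, dataType: String) extends Expr

// Uncached: left.dataType is evaluated twice per node, so cost explodes with depth.
final case class AddSlow(left: Expr, right: Expr) extends Expr {
  def dataType: String =
    if (left.dataType == right.dataType) left.dataType else "double"
}

// Cached: a lazy val derives the type once per node.
final case class AddFast(left: Expr, right: Expr) extends Expr {
  lazy val dataType: String =
    if (left.dataType == right.dataType) left.dataType else "double"
}

def chain(n: Int, mk: (Expr, Expr) => Expr): Expr =
  (1 to n).foldLeft(Leaf("c0", "int"): Expr)((acc, i) => mk(acc, Leaf(s"c$i", "int")))

def timeMs(e: Expr): Long = {
  val t0 = System.nanoTime(); e.dataType; (System.nanoTime() - t0) / 1000000
}

println(s"uncached: ${timeMs(chain(24, AddSlow(_, _)))} ms, cached: ${timeMs(chain(24, AddFast(_, _)))} ms")
```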
daf481d [SPARK-44940][SQL][3.4] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled ### What changes were proposed in this pull request? Backport of https://github.com/apache/spark/pull/42667 to branch-3.4. The PR improves JSON parsing when `spark.sql.json.enablePartialResults` is enabled: - Fixes the issue when using nested arrays `ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow` - Improves parsing of the nested struct fields, e.g. `{"a1": "AAA", "a2": [{"f1": "", "f2": ""}], "a3": "id1", "a4": "XXX"}` used to be parsed as `|AAA|NULL |NULL|NULL|` and now is parsed as `|AAA|[{NULL, }]|id1|XXX|`. - Improves performance of nested JSON parsing. The initial implementation would throw too many exceptions when multiple nested fields failed to parse. When the config is disabled, it is not a problem because the entire record is marked as NULL. The internal benchmarks show the performance improvement from slowdown of over 160% to an improvement of 7-8% compared to the master branch when the flag is enabled. I will create a follow-up ticket to add a benchmark for this regression. ### Why are the changes needed? Fixes some corner cases in JSON parsing and improves performance when `spark.sql.json.enablePartialResults` is enabled. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added tests to verify nested structs, maps, and arrays can be parsed without affecting the subsequent fields in the JSON. I also updated the existing tests when `spark.sql.json.enablePartialResults` is enabled because we parse more data now. I added a benchmark to check performance. Before the change (master, https://github.com/apache/spark/commit/a45a3a3d60cb97b107a177ad16bfe36372bc3e9b): ``` [info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws [info] Intel(R) Xeon(R) Platinum 8375C CPU 2.90GHz [info] Partial JSON results: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] parse invalid JSON 9537 9820 452 0.0 953651.6 1.0X ``` After the change (this PR): ``` OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws Intel(R) Xeon(R) Platinum 8375C CPU 2.90GHz Partial JSON results: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ parse invalid JSON 3100 3106 6 0.0 309967.6 1.0X ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42792 from sadikovi/SPARK-44940-3.4. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 September 2023, 21:39:06 UTC
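A hedged usage sketch (the sample JSON and schema are invented for illustration) of the behaviour the flag controls: with partial results enabled, well-formed sibling fields survive even when one nested leaf fails to parse.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.json.enablePartialResults", "true")

// a2.f2 is declared as INT but carries a string, so that leaf fails to parse.
val json = Seq("""{"a1": "AAA", "a2": [{"f1": 1, "f2": "oops"}], "a3": "id1", "a4": "XXX"}""").toDS()
val df = spark.read
  .schema("a1 STRING, a2 ARRAY<STRUCT<f1: INT, f2: INT>>, a3 STRING, a4 STRING")
  .json(json)

// Expected: a1/a3/a4 and a2.f1 are kept; only a2.f2 comes back as NULL.
df.show(truncate = false)
```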
5a1c6e6 [SPARK-44846][SQL] Convert the lower redundant Aggregate to Project in RemoveRedundantAggregates ### What changes were proposed in this pull request? This PR provides a safe way to remove a redundant `Aggregate` in rule `RemoveRedundantAggregates`. It just converts the lower redundant `Aggregate` to a `Project`. ### Why are the changes needed? If the aggregate contains complex grouping expressions after `RemoveRedundantAggregates` and `aggregateExpressions` has (if / case) branches, it is possible that `groupingExpressions` is no longer a subexpression of `aggregateExpressions` after executing the `PushFoldableIntoBranches` rule, which then causes a `boundReference` error. For example ``` SELECT c * 2 AS d FROM ( SELECT if(b > 1, 1, b) AS c FROM ( SELECT if(a < 0, 0, a) AS b FROM VALUES (-1), (1), (2) AS t1(a) ) t2 GROUP BY b ) t3 GROUP BY c ``` Before pr ``` == Optimized Logical Plan == Aggregate [if ((b#0 > 1)) 1 else b#0], [if ((b#0 > 1)) 2 else (b#0 * 2) AS d#2] +- Project [if ((a#3 < 0)) 0 else a#3 AS b#0] +- LocalRelation [a#3] ``` ``` == Error == Couldn't find b#0 in [if ((b#0 > 1)) 1 else b#0#7] java.lang.IllegalStateException: Couldn't find b#0 in [if ((b#0 > 1)) 1 else b#0#7] at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1241) at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1240) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:653) ...... ``` After pr ``` == Optimized Logical Plan == Aggregate [c#1], [(c#1 * 2) AS d#2] +- Project [if ((b#0 > 1)) 1 else b#0 AS c#1] +- Project [if ((a#3 < 0)) 0 else a#3 AS b#0] +- LocalRelation [a#3] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #42633 from zml1206/SPARK-44846-2. Authored-by: zml1206 <zhuml1206@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 32a87f03da7eef41161a5a7a3aba4a48e0421912) Signed-off-by: Yuming Wang <yumwang@ebay.com> 04 September 2023, 12:46:16 UTC
d61bfb8 [SPARK-45054][SQL] HiveExternalCatalog.listPartitions should restore partition statistics ### What changes were proposed in this pull request? Call `restorePartitionMetadata` in `listPartitions` to restore Spark SQL statistics. ### Why are the changes needed? Currently when `listPartitions` is called, it doesn't restore Spark SQL statistics stored in the metastore, such as `spark.sql.statistics.totalSize`. This means callers who rely on stats from the method call may get wrong results. In particular, when `spark.sql.statistics.size.autoUpdate.enabled` is turned on, during insert overwrite Spark will first list partitions and get old statistics, and then compare them with new statistics and see which partitions need to be updated. This issue will sometimes cause it to update all partitions instead of only those partitions that have been touched. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test. ### Was this patch authored or co-authored using generative AI tooling? Closes #42777 from sunchao/list-partition-stat. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> 02 September 2023, 03:24:00 UTC
2004837 [SPARK-44990][SQL] Reduce the frequency of get `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` ### What changes were proposed in this pull request? This PR moves the lookup of the config `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` into a lazy val of `UnivocityGenerator` to reduce how often it is fetched. As reported, the frequent lookups affect performance. ### Why are the changes needed? Reduce the frequency of fetching `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42738 from Hisoka-X/SPARK-44990_csv_null_value_config. Authored-by: Jia Fan <fanjiaeminem@qq.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit dac750b855c35a88420b6ba1b943bf0b6f0dded1) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 August 2023, 17:55:19 UTC
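The general pattern behind this fix, sketched with hypothetical stand-ins rather than the real `UnivocityGenerator` and `SQLConf`: read the flag once when the writer is constructed instead of once per written value.

```scala
// Hypothetical stand-ins, just to show the hoisting.
class FakeConf {
  // Imagine this walks config maps / thread-local state on every call.
  def nullValueWrittenAsQuotedEmptyStringCsv: Boolean = false
}
object FakeConf { def get: FakeConf = new FakeConf }

class CsvValueWriter {
  // Before the fix, the flag was looked up on the per-value path, i.e. for every cell.
  // A lazy val captures it once for the lifetime of the writer.
  private lazy val nullAsQuotedEmptyString = FakeConf.get.nullValueWrittenAsQuotedEmptyStringCsv

  def convert(value: String): String =
    if (value == null) { if (nullAsQuotedEmptyString) "\"\"" else "" } else value
}

val writer = new CsvValueWriter
println(writer.convert(null)) // "" with the default (non-legacy) setting
```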
64c26b7 [MINOR][SQL][DOC] Fix incorrect link in sql menu and typo ### What changes were proposed in this pull request? Fix incorrect link in sql menu and typo. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? run `SKIP_API=1 bundle exec jekyll build` ![image](https://github.com/apache/spark/assets/17894939/7cc564ec-41cd-4e92-b19e-d33a53188a10) ### Was this patch authored or co-authored using generative AI tooling? No Closes #42697 from wForget/doc. Authored-by: wforget <643348094@qq.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 421ff4e3c047865a1887cae94b85dbf40bb7bac9) Signed-off-by: Kent Yao <yao@apache.org> 28 August 2023, 08:09:54 UTC
fef3de3 [SPARK-44547][CORE] Ignore fallback storage for cached RDD migration ### What changes were proposed in this pull request? Fix bugs that make the RDD decommissioner never finish ### Why are the changes needed? The cached RDD decommissioner is stuck in an endless retry loop when the only viable peer is the fallback storage, which it doesn't know how to handle. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests are added and tested using Spark jobs. Closes #42155 from ukby1234/franky.SPARK-44547. Authored-by: Frank Yin <franky@ziprecruiter.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 47555da2ae292b07488ba181db1aceac8e7ddb3a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 August 2023, 05:19:59 UTC
6b6a0d5 Revert "[SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef" This reverts commit b359f6ce32cb1a8b9607c773e2534b05ecf4cdc5. 24 August 2023, 17:54:19 UTC
0d50898 [SPARK-44840][SQL][3.4] Make `array_insert()` 1-based for negative indexes ### What changes were proposed in this pull request? In the PR, I propose to make the `array_insert` function 1-based for negative indexes. So, the maximum index is -1 should point out to the last element, and the function should insert new element at the end of the given array for the index -1. The old behaviour can be restored via the SQL config `spark.sql.legacy.negativeIndexInArrayInsert`. This is a backport of https://github.com/apache/spark/pull/42564 ### Why are the changes needed? 1. To match the behaviour of functions such as `substr()` and `element_at()`. ```sql spark-sql (default)> select element_at(array('a', 'b'), -1), substr('ab', -1); b b ``` 2. To fix an inconsistency in `array_insert` in which positive indexes are 1-based, but negative indexes are 0-based. ### Does this PR introduce _any_ user-facing change? Yes. Before: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","c","b"] ``` After: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","b","c"] ``` ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "test:testOnly *CollectionExpressionsSuite" $ build/sbt "test:testOnly *DataFrameFunctionsSuite" $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit ce50a563d311ccfe36d1fcc4f0743e4e4d7d8116) Closes #42655 from MaxGekk/fix-array_insert-3.4. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 August 2023, 16:57:07 UTC
b359f6c [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef ### What changes were proposed in this pull request? Originally, when a CTE has duplicate expression IDs in its output, the rule PushdownPredicatesAndPruneColumnsForCTEDef wrongly assesses that columns in the CTE were pruned: it compares the size of the attribute set containing the union of columns (which is deduplicated) with the original output of the CTE (which contains duplicate columns) and finds that the former is smaller than the latter. This causes incorrect pruning of the CTE output, resulting in a missing reference and the error documented in the ticket. This PR changes the logic to use the needsPruning function, which checks the outputSet instead of the output, to assess whether any columns of a CTE have been pruned. ### Why are the changes needed? The incorrect behaviour of PushdownPredicatesAndPruneColumnsForCTEDef for CTEs with duplicate expression IDs in their output causes a crash when such a query is run. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? A unit test for the crashing case was added. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42635 from wenyuen-db/SPARK-44934. Authored-by: Wen Yuen Pang <wenyuen.pang@databricks.com> Signed-off-by: Peter Toth <peter.toth@gmail.com> (cherry picked from commit 3b405948ee47702e5a7250dc27430836145b0e19) Signed-off-by: Peter Toth <peter.toth@gmail.com> 24 August 2023, 15:02:26 UTC
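A standalone Scala illustration (not Spark's actual rule code) of why the old check misfired: when the CTE output contains a duplicate expression ID, the deduplicated attribute set is smaller than the raw output even though nothing was pruned, while a set-to-set comparison gives no false positive.

```scala
// The attribute strings are illustrative stand-ins for Spark attributes.
object PruningCheckSketch extends App {
  val cteOutput     = Seq("id#1", "id#1", "name#2") // duplicate expression ID
  val referencedSet = Set("id#1", "name#2")         // union of referenced columns

  // Old-style check against output size: falsely reports pruning.
  val looksPrunedOld = referencedSet.size < cteOutput.size       // true
  // Fixed-style check against the deduplicated outputSet: no false positive.
  val looksPrunedNew = referencedSet.size < cteOutput.toSet.size // false

  println(s"old check: $looksPrunedOld, fixed check: $looksPrunedNew")
}
```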
21a86b6 [SPARK-44929][TESTS] Standardize log output for console appender in tests ### What changes were proposed in this pull request? This PR set a character length limit for the error message and a stack depth limit for error stack traces to the console appender in tests. The original patterns are - %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex - %t: %m%n%ex And they're adjusted to the new consistent pattern - `%d{HH:mm:ss.SSS} %p %c: %maxLen{%m}{512}%n%ex{8}%n` ### Why are the changes needed? In testing, intentional and unintentional failures are created to generate extensive log volumes. For instance, a single FileNotFound error may be logged multiple times in the writer, task runner, task set manager, and other areas, resulting in thousands of lines per failure. For example, tests in ParquetRebaseDatetimeSuite will be run with V1 and V2 Datasource, two or more specific values, and multiple configuration pairs. I have seen the SparkUpgradeException all over the CI logs ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ``` build/sbt "sql/testOnly *ParquetRebaseDatetimeV1Suite" ``` ``` 15:59:55.446 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: Job job_202308230059551630377040190578321_1301 aborted. 15:59:55.446 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1301.0 (TID 1595) org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1. at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420) at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) 15:59:55.446 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 1301.0 (TID 1595) (10.221.97.38 executor driver): org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1. at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420) at org.apache.spark.sql.execution.datasources.... 15:59:55.446 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 1301.0 failed 1 times; aborting job 15:59:55.447 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: Aborting job 0ead031e-c9dd-446b-b20b-c76ec54978b1. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1301.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1301.0 (TID 1595) (10.221.97.38 executor driver): org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1. 
at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420) at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) 15:59:55.579 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1303.0 (TID 1597) ``` ### Was this patch authored or co-authored using generative AI tooling? no Closes #42627 from yaooqinn/SPARK-44929. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 830500150f7e3972d1fa5b47d0ab564bfa7e4b12) Signed-off-by: Kent Yao <yao@apache.org> 24 August 2023, 05:52:07 UTC
91d85c6 [SPARK-44935][K8S] Fix `RELEASE` file to have the correct information in Docker images if exists ### What changes were proposed in this pull request? This PR aims to fix `RELEASE` file to have the correct information in Docker images if `RELEASE` file exists. Please note that `RELEASE` file doesn't exists in SPARK_HOME directory when we run the K8s integration test from Spark Git repository. So, we keep the following empty `RELEASE` file generation and use `COPY` conditionally via glob syntax. https://github.com/apache/spark/blob/2a3aec1f9040e08999a2df88f92340cd2710e552/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile#L37 ### Why are the changes needed? Currently, it's an empty file in the official Apache Spark Docker images. ``` $ docker run -it --rm apache/spark:latest ls -al /opt/spark/RELEASE -rw-r--r-- 1 spark spark 0 Jun 25 03:13 /opt/spark/RELEASE $ docker run -it --rm apache/spark:v3.1.3 ls -al /opt/spark/RELEASE | tail -n1 -rw-r--r-- 1 root root 0 Feb 21 2022 /opt/spark/RELEASE ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually build image and check it with `docker run -it --rm NEW_IMAGE ls -al /opt/spark/RELEASE` I copied this `Dockerfile` into Apache Spark 3.5.0 RC2 binary distribution and tested in the following way. ``` $ cd spark-3.5.0-rc2-bin-hadoop3 $ cp /tmp/Dockerfile kubernetes/dockerfiles/spark/Dockerfile $ bin/docker-image-tool.sh -t SPARK-44935 build $ docker run -it --rm docker.io/library/spark:SPARK-44935 ls -al /opt/spark/RELEASE | tail -n1 -rw-r--r-- 1 root root 165 Aug 18 21:10 /opt/spark/RELEASE $ docker run -it --rm docker.io/library/spark:SPARK-44935 cat /opt/spark/RELEASE | tail -n2 Spark 3.5.0 (git revision 010c4a6a05) built for Hadoop 3.3.4 Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 -Phadoop-3 -Phive -Phive-thriftserver ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42636 from dongjoon-hyun/SPARK-44935. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d382c6b3aef28bde6adcdf62b7be565ff1152942) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 August 2023, 23:01:13 UTC
7a3b950 [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient() ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call. Note that the other pre-existing branches in this method were already using `await()`. ### Why are the changes needed? Uninterruptible waiting can cause problems when cancelling tasks. For details, see https://github.com/apache/spark/pull/16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42619 from JoshRosen/remove-awaitUninterruptibly. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 2137606a6686b33bed57800c6b166059b134a089) Signed-off-by: Kent Yao <yao@apache.org> 23 August 2023, 05:55:01 UTC
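A standalone Scala illustration of the behavioural difference this change is about: an interruptible wait responds to `Thread.interrupt()` (which matters when cancelling tasks), while an uninterruptible wait keeps blocking. Plain JDK primitives stand in for Netty's `await()`/`awaitUninterruptibly()`; this is not the `TransportClientFactory` code.

```scala
import java.util.concurrent.CountDownLatch

object InterruptibleWaitSketch extends App {
  val latch = new CountDownLatch(1)

  val waiter = new Thread(() => {
    try {
      latch.await()            // interruptible wait, analogous to await()
      println("wait completed")
    } catch {
      case _: InterruptedException =>
        println("wait interrupted, so task cancellation is honoured promptly")
    }
  })

  waiter.start()
  Thread.sleep(100)
  waiter.interrupt()           // simulates cancelling the waiting task
  waiter.join()
}
```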
70624e6 [SPARK-44925][K8S] K8s default service token file should not be materialized into token ### What changes were proposed in this pull request? This PR aims to stop materializing `OAuth token` from the default service token file, `/var/run/secrets/kubernetes.io/serviceaccount/token`, because the content of volumes varies which means being renewed or expired by K8s control plane. We need to read the content in a on-demand manner to be in the up-to-date status. Note the followings: - Since we use `autoConfigure` for K8s client, K8s client still uses the default service tokens if exists and needed. https://github.com/apache/spark/blob/13588c10cbc380ecba1231223425eaad2eb9ec80/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesClientFactory.scala#L91 - This PR doesn't change Spark's behavior for the user-provided token file location. Spark will load the content of the user-provided token file locations to get `OAuth token` because Spark cannot assume that the files of that locations are refreshed or not in the future. ### Why are the changes needed? [BoundServiceAccountTokenVolume](https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume) became `Stable` at K8s 1.22. - [KEP-1205 Bound Service Account Tokens](https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md#boundserviceaccounttokenvolume-1) : **BoundServiceAccountTokenVolume** Alpha | Beta | GA -- | -- | -- 1.13 | 1.21 | 1.22 - [EKS Service Account with 90 Days Expiration](https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html) > For Amazon EKS clusters, the extended expiry period is 90 days. Your Amazon EKS cluster's Kubernetes API server rejects requests with tokens that are greater than 90 days old. - As of today, [all supported EKS clusters are from 1.23 to 1.27](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html) which means we always use `BoundServiceAccountTokenVolume`. ### Does this PR introduce _any_ user-facing change? No. This fixes only the bugs caused by some outdated tokens where K8s control plane denies Spark's K8s API invocation. ### How was this patch tested? Pass the CIs with the all existing unit tests and integration tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42624 from dongjoon-hyun/SPARK-44925. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 7b1a3494107b304a93f571920fc3816cde71f706) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 August 2023, 05:46:14 UTC
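A hedged Scala sketch of the "read on demand" idea: instead of materializing the token once at startup, the mounted file is read each time a token is needed, so a token renewed by the K8s control plane is always picked up. Only the default token path comes from the commit message; the object and method names are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object ServiceAccountTokenSketch {
  // Default token location named in the commit message.
  private val tokenPath =
    Paths.get("/var/run/secrets/kubernetes.io/serviceaccount/token")

  // Re-reads the (possibly rotated) token on every call instead of caching it.
  def currentToken(): Option[String] =
    if (Files.isReadable(tokenPath)) {
      Some(new String(Files.readAllBytes(tokenPath), StandardCharsets.UTF_8).trim)
    } else {
      None
    }
}
```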
6496ca1 [SPARK-44922][TESTS] Disable o.a.p.h.InternalParquetRecordWriter logs for tests ### What changes were proposed in this pull request? This PR disable InternalParquetRecordWriter logs for SQL tests ### Why are the changes needed? The InternalParquetRecordWriter creates over 1,800 records, which equates to more than 80,000 lines of code. This accounts for 80% of the volume of the "slow tests" before the GitHub action truncates it. #### A record for example ``` 2023-08-22T13:38:14.5103112Z 13:38:14.406 WARN org.apache.parquet.hadoop.InternalParquetRecordWriter: Too much memory used: Store { 2023-08-22T13:38:14.5103259Z [dummy_col] optional int32 dummy_col { 2023-08-22T13:38:14.5103352Z r:0 bytes 2023-08-22T13:38:14.5103444Z d:0 bytes 2023-08-22T13:38:14.5103580Z data: FallbackValuesWriter{ 2023-08-22T13:38:14.5103739Z data: initial: DictionaryValuesWriter{ 2023-08-22T13:38:14.5103854Z data: initial: dict:8 2023-08-22T13:38:14.5103976Z data: initial: values:400 2023-08-22T13:38:14.5104079Z data: initial:} 2023-08-22T13:38:14.5104087Z 2023-08-22T13:38:14.5104325Z data: fallback: PLAIN CapacityByteArrayOutputStream 0 slabs, 0 bytes 2023-08-22T13:38:14.5104418Z data:} 2023-08-22T13:38:14.5104426Z 2023-08-22T13:38:14.5104695Z pages: ColumnChunkPageWriter ConcatenatingByteArrayCollector 0 slabs, 0 bytes 2023-08-22T13:38:14.5104795Z total: 400/408 2023-08-22T13:38:14.5104936Z } 2023-08-22T13:38:14.5105129Z [expected_rowIdx_col] optional int32 expected_rowIdx_col { 2023-08-22T13:38:14.5105291Z r:0 bytes 2023-08-22T13:38:14.5105383Z d:0 bytes 2023-08-22T13:38:14.5105518Z data: FallbackValuesWriter{ 2023-08-22T13:38:14.5105679Z data: initial: DictionaryValuesWriter{ 2023-08-22T13:38:14.5105797Z data: initial: dict:400 2023-08-22T13:38:14.5105919Z data: initial: values:400 2023-08-22T13:38:14.5106022Z data: initial:} 2023-08-22T13:38:14.5106030Z 2023-08-22T13:38:14.5106267Z data: fallback: PLAIN CapacityByteArrayOutputStream 0 slabs, 0 bytes 2023-08-22T13:38:14.5106356Z data:} 2023-08-22T13:38:14.5106364Z 2023-08-22T13:38:14.5106636Z pages: ColumnChunkPageWriter ConcatenatingByteArrayCollector 0 slabs, 0 bytes 2023-08-22T13:38:14.5106736Z total: 400/800 2023-08-22T13:38:14.5106820Z } 2023-08-22T13:38:14.5106942Z [id] required int64 id { 2023-08-22T13:38:14.5107037Z r:0 bytes 2023-08-22T13:38:14.5107133Z d:0 bytes 2023-08-22T13:38:14.5107275Z data: FallbackValuesWriter{ 2023-08-22T13:38:14.5107436Z data: initial: DictionaryValuesWriter{ 2023-08-22T13:38:14.5107557Z data: initial: dict:800 2023-08-22T13:38:14.5107677Z data: initial: values:400 2023-08-22T13:38:14.5107779Z data: initial:} 2023-08-22T13:38:14.5107787Z 2023-08-22T13:38:14.5108022Z data: fallback: PLAIN CapacityByteArrayOutputStream 0 slabs, 0 bytes 2023-08-22T13:38:14.5108113Z data:} 2023-08-22T13:38:14.5108121Z 2023-08-22T13:38:14.5108384Z pages: ColumnChunkPageWriter ConcatenatingByteArrayCollector 0 slabs, 0 bytes 2023-08-22T13:38:14.5108485Z total: 800/1,200 2023-08-22T13:38:14.5108570Z } 2023-08-22T13:38:14.5108655Z } 2023-08-22T13:38:14.5108664Z ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? see log volume of SQL tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #42614 from yaooqinn/log. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 34b51cc784478f65c4c9a6cf3fdc0ed4ca31f745) Signed-off-by: Kent Yao <yao@apache.org> 23 August 2023, 05:42:03 UTC
0060279 [SPARK-44871][SQL][3.4] Fix percentile_disc behaviour ### What changes were proposed in this pull request? This PR fixes the `percentile_disc()` function, which currently returns incorrect results in some cases. E.g.: ``` SELECT percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0, percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1, percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2, percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3, percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4, percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5, percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6, percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7, percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8, percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9, percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10 FROM VALUES (0), (1), (2), (3), (4) AS v(a) ``` currently returns: ``` +---+---+---+---+---+---+---+---+---+---+---+ | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| +---+---+---+---+---+---+---+---+---+---+---+ |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0| +---+---+---+---+---+---+---+---+---+---+---+ ``` but after this PR it returns the correct: ``` +---+---+---+---+---+---+---+---+---+---+---+ | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| +---+---+---+---+---+---+---+---+---+---+---+ |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0| +---+---+---+---+---+---+---+---+---+---+---+ ``` ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a correctness bug, but the old behaviour can be restored with `spark.sql.legacy.percentileDiscCalculation=true`. ### How was this patch tested? Added new UTs. Closes #42610 from peter-toth/SPARK-44871-fix-percentile-disc-behaviour-3.4. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 22 August 2023, 16:27:15 UTC
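A hedged Scala sketch reproducing one of the corrected quantiles from the Scala API, assuming a local Spark 3.4.2+ session; the expected values in the comments come from the result tables in the commit message above.

```scala
import org.apache.spark.sql.SparkSession

object PercentileDiscExample extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  spark.sql(
    """SELECT percentile_disc(0.7) WITHIN GROUP (ORDER BY a) AS p7
      |FROM VALUES (0), (1), (2), (3), (4) AS v(a)""".stripMargin).show()
  // p7 is 3.0 with the fix (it was 2.0 before)

  // The pre-fix behaviour can be restored with the legacy flag named above.
  spark.conf.set("spark.sql.legacy.percentileDiscCalculation", "true")

  spark.stop()
}
```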
c7cb844 [SPARK-44854][PYTHON] Python timedelta to DayTimeIntervalType edge case bug ### What changes were proposed in this pull request? This PR proposes to change the way that python `datetime.timedelta` objects are converted to `pyspark.sql.types.DayTimeIntervalType` objects. Specifically, it modifies the logic inside `toInternal` which returns the timedelta as a python integer (would be int64 in other languages) storing the timedelta as microseconds. The current logic inadvertently adds an extra second when doing the conversion for certain python timedelta objects, thereby returning an incorrect value. An illustrative example is as follows: ``` from datetime import timedelta from pyspark.sql.types import DayTimeIntervalType, StructField, StructType spark = ...spark session setup here... td = timedelta(days=4498031, seconds=16054, microseconds=999981) df = spark.createDataFrame([(td,)], StructType([StructField(name="timedelta_col", dataType=DayTimeIntervalType(), nullable=False)])) df.show(truncate=False) > +------------------------------------------------+ > |timedelta_col | > +------------------------------------------------+ > |INTERVAL '4498031 04:27:35.999981' DAY TO SECOND| > +------------------------------------------------+ print(str(td)) > '4498031 days, 4:27:34.999981' ``` In the above example, look at the seconds. The original python timedelta object has 34 seconds, the pyspark DayTimeIntervalType column has 35 seconds. ### Why are the changes needed? To fix a bug. It is a bug because the wrong value is returned after conversion. Adding the above timedelta entry to existing unit tests causes the test to fail. ### Does this PR introduce _any_ user-facing change? Yes. Users should now see the correct timedelta values in pyspark dataframes for similar such edge cases. ### How was this patch tested? Illustrative edge case examples were added to the unit test (`python/pyspark/sql/tests/test_types.py` the `test_daytime_interval_type` test), verified that the existing code failed the test, new code was added, and verified that the unit test now passes. ### JIRA ticket link This PR should close https://issues.apache.org/jira/browse/SPARK-44854 Closes #42541 from hdaly0/SPARK-44854. Authored-by: Ocean <haghighidaly@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9fdf65aefc552c909f6643f8a31405d0622eeb7e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 August 2023, 02:55:02 UTC
faba1ff [SPARK-44813][INFRA] The Jira Python misses our assignee when it searches users again ### What changes were proposed in this pull request? This PR creates an alternative to the assign_issue function in jira.client.JIRA. The original one has an issue that it will search users again and only choose the assignee from 20 candidates. If it's unmatched, it picks the head blindly. For example, ```python >>> assignee = asf_jira.user("yao") >>> "SPARK-44801" 'SPARK-44801' >>> asf_jira.assign_issue(issue.key, assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'issue' is not defined >>> asf_jira.assign_issue("SPARK-44801", assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 123, in wrapper result = func(*arg_list, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 1891, in assign_issue self._session.put(url, data=json.dumps(payload)) File "/Users/hzyaoqin/python/lib/python3.11/site-packages/requests/sessions.py", line 649, in put return self.request("PUT", url, data=data, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 246, in request elif raise_on_error(response, **processed_kwargs): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 71, in raise_on_error raise JIRAError( jira.exceptions.JIRAError: JiraError HTTP 400 url: https://issues.apache.org/jira/rest/api/latest/issue/SPARK-44801/assignee response text = {"errorMessages":[],"errors":{"assignee":"User 'airhot' cannot be assigned issues."}} ``` The Jira userid 'yao' fails to return my JIRA profile as a candidate(20 in total) to match. So, 'airhot' from the head replaces me as an assignee. ### Why are the changes needed? bugfix for merge_spark_pr ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test locally ```python >>> def assign_issue(client: jira.client.JIRA, issue: int, assignee: str) -> bool: ... """Assign an issue to a user. ... ... Args: ... issue (Union[int, str]): the issue ID or key to assign ... assignee (str): the user to assign the issue to. None will set it to unassigned. -1 will set it to Automatic. ... ... Returns: ... bool ... """ ... url = getattr(client, "_get_latest_url")(f"issue/{issue}/assignee") ... payload = {"name": assignee} ... getattr(client, "_session").put(url, data=json.dumps(payload)) ... return True ... >>> >>> assign_issue(asf_jira, "SPARK-44801", "yao") True ``` Closes #42496 from yaooqinn/SPARK-44813. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 8fb799d47bbd5d5ce9db35283d08ab1a31dc37b9) Signed-off-by: Kent Yao <yao@apache.org> 18 August 2023, 18:04:04 UTC
1cfa080 Revert "[SPARK-44813][INFRA] The Jira Python misses our assignee when it searches users again" This reverts commit 3c5e57d886b81808370353781bfce2b2ce20a473. 18 August 2023, 17:56:07 UTC
62b4846 [SPARK-44875][INFRA] Fix spelling for commentator to test SPARK-44813 ### What changes were proposed in this pull request? Fix a typo to verify SPARK-44813 ### Why are the changes needed? Fix a typo and verify SPARK-44813 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #42561 from yaooqinn/SPARK-44875. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 54dd18b5e0953df37e5f0937f1f79e65db70b787) Signed-off-by: Kent Yao <yao@apache.org> 18 August 2023, 17:49:21 UTC
3c5e57d [SPARK-44813][INFRA] The Jira Python misses our assignee when it searches users again ### What changes were proposed in this pull request? This PR creates an alternative to the assign_issue function in jira.client.JIRA. The original one has an issue that it will search users again and only choose the assignee from 20 candidates. If it's unmatched, it picks the head blindly. For example, ```python >>> assignee = asf_jira.user("yao") >>> "SPARK-44801" 'SPARK-44801' >>> asf_jira.assign_issue(issue.key, assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'issue' is not defined >>> asf_jira.assign_issue("SPARK-44801", assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 123, in wrapper result = func(*arg_list, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 1891, in assign_issue self._session.put(url, data=json.dumps(payload)) File "/Users/hzyaoqin/python/lib/python3.11/site-packages/requests/sessions.py", line 649, in put return self.request("PUT", url, data=data, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 246, in request elif raise_on_error(response, **processed_kwargs): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 71, in raise_on_error raise JIRAError( jira.exceptions.JIRAError: JiraError HTTP 400 url: https://issues.apache.org/jira/rest/api/latest/issue/SPARK-44801/assignee response text = {"errorMessages":[],"errors":{"assignee":"User 'airhot' cannot be assigned issues."}} ``` The Jira userid 'yao' fails to return my JIRA profile as a candidate(20 in total) to match. So, 'airhot' from the head replaces me as an assignee. ### Why are the changes needed? bugfix for merge_spark_pr ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test locally ```python >>> def assign_issue(client: jira.client.JIRA, issue: int, assignee: str) -> bool: ... """Assign an issue to a user. ... ... Args: ... issue (Union[int, str]): the issue ID or key to assign ... assignee (str): the user to assign the issue to. None will set it to unassigned. -1 will set it to Automatic. ... ... Returns: ... bool ... """ ... url = getattr(client, "_get_latest_url")(f"issue/{issue}/assignee") ... payload = {"name": assignee} ... getattr(client, "_session").put(url, data=json.dumps(payload)) ... return True ... >>> >>> assign_issue(asf_jira, "SPARK-44801", "yao") True ``` Closes #42496 from yaooqinn/SPARK-44813. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 00255bc63b1a3bbe80bedc639b88d4a8e3f88f72) Signed-off-by: Sean Owen <srowen@gmail.com> 18 August 2023, 15:03:05 UTC
a5e9175 [SPARK-44857][CORE][UI] Fix `getBaseURI` error in Spark Worker LogPage UI buttons ### What changes were proposed in this pull request? This PR aims to fix `getBaseURI` errors when we clicks Spark Worker LogPage UI buttons in Apache Spark 3.2.0+. ### Why are the changes needed? Run a Spark job and open the Spark Worker UI, http://localhost:8081 . ``` $ sbin/start-master.sh $ sbin/start-worker.sh spark://127.0.0.1:7077 $ bin/spark-shell --master spark://127.0.0.1:7077 ``` Click `stderr` and `Load New` button. The button is out of order currently due to the following error because `getBaseURI` is defined in `utils.js`. ![Screenshot 2023-08-17 at 2 38 45 PM](https://github.com/apache/spark/assets/9700541/c2358ae3-46d2-43fe-9cc1-ce343725ce4c) ### Does this PR introduce _any_ user-facing change? This will make the buttons work. ### How was this patch tested? Manual. Closes #42546 from dongjoon-hyun/SPARK-44857. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f807bd239aebe182a04f5452d6efdf458e44143c) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 August 2023, 01:07:28 UTC
6c1107e [SPARK-44859][SS] Fix incorrect property name in structured streaming doc ### What changes were proposed in this pull request? We found that one structured streaming property for asynchronous progress tracking does not match the property name used in the codebase. ### Why are the changes needed? Fix the incorrect property name in the Structured Streaming documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? None, doc change only. Closes #42544 from viirya/minor_doc_fix. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 07f70227f8bf81928d98101a88fd2885784451f5) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 August 2023, 22:52:20 UTC
a846a22 [SPARK-44745][DOCS][K8S] Document shuffle data recovery from the remounted K8s PVCs ### What changes were proposed in this pull request? This PR aims to document an example of shuffle data recovery configuration from the remounted K8s PVCs. ### Why are the changes needed? This will help the users use this feature more easily. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review because this is a doc-only change. ![Screenshot 2023-08-09 at 1 39 48 PM](https://github.com/apache/spark/assets/9700541/8cc7240b-570d-4c2e-b90a-54795c18df0a) ``` $ kubectl logs -f xxx-exec-16 | grep Kube ... 23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Try to recover shuffle data. 23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Found 192 files 23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Try to recover /data/spark-x/executor-x/blockmgr-41a810ea-9503-447b-afc7-1cb104cd03cf/11/shuffle_0_11160_0.data 23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Try to recover /data/spark-x/executor-x/blockmgr-41a810ea-9503-447b-afc7-1cb104cd03cf/0e/shuffle_0_10063_0.data 23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Try to recover /data/spark-x/executor-x/blockmgr-41a810ea-9503-447b-afc7-1cb104cd03cf/0e/shuffle_0_10283_0.data 23/08/09 21:09:21 INFO KubernetesLocalDiskShuffleExecutorComponents: Ignore a non-shuffle block file. ``` Closes #42417 from dongjoon-hyun/SPARK-44745. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 4db378fae30733cbd2be41e95a3cd8ad2184e06f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 August 2023, 22:25:53 UTC
3c4d08f [SPARK-44581][YARN] Fix the bug that ShutdownHookManager gets wrong UGI from SecurityManager of ApplicationMaster ### What changes were proposed in this pull request? I make the SecurityManager instance a lazy value ### Why are the changes needed? fix the bug in issue [SPARK-44581](https://issues.apache.org/jira/browse/SPARK-44581) **Bug:** In spark3.2 it throws the org.apache.hadoop.security.AccessControlException, but in spark2.4 this hook does not throw exception. I rebuild the hadoop-client-api.jar, and add some debug log before the hadoop shutdown hook is created, and rebuild the spark-yarn.jar to add some debug log when creating the spark shutdown hook manager, here is the screenshot of the log: ![image](https://github.com/apache/spark/assets/62563545/ea338db3-646c-432c-bf16-1f445adc2ad9) We can see from the screenshot, the ShutdownHookManager is initialized before the ApplicationManager create a new ugi. **Reason** The main cause is that ShutdownHook thread is created before we create the ugi in ApplicationMaster. When we set the config key "hadoop.security.credential.provider.path", the ApplicationMaster will try to get a filesystem when generating SSLOptions, and when initialize the filesystem during which it will generate a new thread whose ugi is inherited from the current process (yarn). After this, it will generate a new ugi (SPARK_USER) in ApplicationMaster and execute the doAs() function. Here is the chain of the call: ApplicationMaster.(ApplicationMaster.scala:83) -> org.apache.spark.SecurityManager.(SecurityManager.scala:98) -> org.apache.spark.SSLOptions$.parse(SSLOptions.scala:188) -> org.apache.hadoop.conf.Configuration.getPassword(Configuration.java:2353) -> org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:2434) -> org.apache.hadoop.security.alias.CredentialProviderFactory.getProviders(CredentialProviderFactory.java:82) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I didn't add new UnitTest for this, but I rebuild the package, and runs a program in my cluster, and turns out that the user when I delete the staging file turns to be the same with the SPARK_USER. Closes #42405 from liangyu-1/SPARK-44581. Authored-by: 余良 <yul165@chinaunicom.cn> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit e584ed4ad96a0f0573455511d7be0e9b2afbeb96) Signed-off-by: Kent Yao <yao@apache.org> 09 August 2023, 05:47:44 UTC
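A plain-Scala illustration (not the actual `ApplicationMaster` code) of why making the field a lazy val helps: a lazy value is only constructed on first use, so whatever it captures or triggers happens after later setup steps, which here stand in for the UGI switch, have already run.

```scala
object LazyInitOrderingSketch extends App {
  var currentUser = "yarn"

  class Component {
    val owner: String = currentUser      // captured at construction time
  }

  lazy val securityLike = new Component  // lazy: not constructed yet
  currentUser = "spark_user"             // analogous to the AM creating the new UGI
  println(securityLike.owner)            // prints "spark_user", not "yarn"
}
```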
246b163 [SPARK-44725][DOCS] Document `spark.network.timeoutInterval` ### What changes were proposed in this pull request? This PR aims to document `spark.network.timeoutInterval` configuration. ### Why are the changes needed? Like `spark.network.timeout`, `spark.network.timeoutInterval` exists since Apache Spark 1.3.x. https://github.com/apache/spark/blob/418bba5ad6053449a141f3c9c31ed3ad998995b8/core/src/main/scala/org/apache/spark/internal/config/Network.scala#L48-L52 Since this is a user-facing configuration like the following, we had better document it. https://github.com/apache/spark/blob/418bba5ad6053449a141f3c9c31ed3ad998995b8/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala#L91-L93 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual because this is a doc-only change. Closes #42402 from dongjoon-hyun/SPARK-44725. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3af2e77bd20cad3d9fe23cc0689eed29d5f5a537) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 August 2023, 23:03:55 UTC
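A hedged example of setting both configurations named in this commit through `SparkConf`; the values are purely illustrative, not recommended defaults, and the comments follow the `HeartbeatReceiver` usage referenced above.

```scala
import org.apache.spark.SparkConf

object NetworkTimeoutConfExample extends App {
  val conf = new SparkConf()
    .set("spark.network.timeout", "120s")         // overall network timeout
    .set("spark.network.timeoutInterval", "60s")  // how often the driver checks for timed-out hosts

  println(conf.get("spark.network.timeoutInterval"))
}
```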
6e7e50e [SPARK-44657][CONNECT] Fix incorrect limit handling in ArrowBatchWithSchemaIterator and config parsing of CONNECT_GRPC_ARROW_MAX_BATCH_SIZE Fixes the limit checking of `maxEstimatedBatchSize` and `maxRecordsPerBatch` to respect the more restrictive of the two limits, and fixes the config parsing of `CONNECT_GRPC_ARROW_MAX_BATCH_SIZE` by converting the value to bytes. Bugfix. In the arrow writer [code](https://github.com/apache/spark/blob/6161bf44f40f8146ea4c115c788fd4eaeb128769/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L154-L163), the condition does not match what the documentation says ("maxBatchSize and maxRecordsPerBatch, respect whatever smaller"): because of the || operator it actually respects whichever limit is larger, i.e. less restrictive. Further, when the `CONNECT_GRPC_ARROW_MAX_BATCH_SIZE` conf is read, the value is not converted from MiB to bytes ([example](https://github.com/apache/spark/blob/3e5203c64c06cc8a8560dfa0fb6f52e74589b583/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/SparkConnectPlanExecution.scala#L103)). No. Existing tests. Closes #42321 from vicennial/SPARK-44657. Authored-by: vicennial <venkata.gudesa@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f9d417fc17a82ddf02d6bbab82abc8e1aa264356) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 08 August 2023, 08:31:27 UTC
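A standalone Scala sketch of the two fixes described above (not the actual `ArrowConverters` code): a batch must be cut when either limit is reached, so appending may continue only while both limits still hold, and a size configured in MiB must be converted to bytes before comparing it against byte counts.

```scala
object ArrowBatchLimitSketch extends App {
  val maxRecordsPerBatch         = 10000
  val maxBatchSizeMiB            = 4L
  val maxEstimatedBatchSizeBytes = maxBatchSizeMiB * 1024 * 1024 // MiB -> bytes

  // Continue only while BOTH limits hold, i.e. respect the more restrictive one.
  def canAppend(recordsInBatch: Int, estimatedBytes: Long): Boolean =
    recordsInBatch < maxRecordsPerBatch && estimatedBytes < maxEstimatedBatchSizeBytes

  // Few records, but already over the byte budget: the batch must be flushed.
  println(canAppend(recordsInBatch = 500, estimatedBytes = 8L * 1024 * 1024)) // false
}
```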
5c7d1e2 [SPARK-44641][SQL] Incorrect result in certain scenarios when SPJ is not triggered This PR makes sure we use unique partition values when calculating the final partitions in `BatchScanExec`, to make sure no duplicated partitions are generated. When `spark.sql.sources.v2.bucketing.pushPartValues.enabled` and `spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled` are enabled, and SPJ is not triggered, currently Spark will generate incorrect/duplicated results. This is because with both configs enabled, Spark will delay the partition grouping until the time it calculates the final partitions used by the input RDD. To calculate the partitions, it uses partition values from the `KeyGroupedPartitioning` to find out the right ordering for the partitions. However, since grouping is not done when the partition values is computed, there could be duplicated partition values. This means the result could contain duplicated partitions too. No, this is a bug fix. Added a new test case for this scenario. Closes #42324 from sunchao/SPARK-44641. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> (cherry picked from commit aa1261dc129618d27a1bdc743a5fdd54219f7c01) Signed-off-by: Chao Sun <sunchao@apple.com> 08 August 2023, 02:19:50 UTC
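A plain-Scala sketch of the failure mode described above, not `BatchScanExec` itself: if the partition values used to lay out the input partitions still contain duplicates, the same group of files is emitted more than once and rows are read twice; deduplicating the values first avoids that. The keys and file names are hypothetical.

```scala
object UniquePartitionValuesSketch extends App {
  val groupedPartitions = Map("k=1" -> Seq("file-a"), "k=2" -> Seq("file-b"))
  val partitionValues   = Seq("k=1", "k=1", "k=2") // duplicates: grouping was delayed

  val withDuplicates = partitionValues.flatMap(groupedPartitions.get)
  val deduplicated   = partitionValues.distinct.flatMap(groupedPartitions.get)

  println(withDuplicates) // List(List(file-a), List(file-a), List(file-b)) -> duplicated reads
  println(deduplicated)   // List(List(file-a), List(file-b))
}
```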