da5912a Revert "[MINOR][SQL] Remove orphans in ProtoToParsedPlanTestSuite and PlanGenerationTestSuite" This reverts commit d431d4034219f2c84c105e1894eebe03745bf105. 06 August 2024, 01:44:19 UTC
d9c5902 [SPARK-49117][K8S] Fix `docker-image-tool.sh` to be up-to-date ### What changes were proposed in this pull request? This PR aims to fix `docker-image-tool.sh` to be up-to-date. ### Why are the changes needed? Apache Spark 4 dropped Java 11 support. So, we should fix the following. - https://github.com/apache/spark/pull/43005 ``` - - Build and push Java11-based image with tag "v3.4.0" to docker.io/myrepo + - Build and push Java17-based image with tag "v4.0.0" to docker.io/myrepo ``` Apache Spark 4 requires JDK instead of JRE. So, we should fix the following. - https://github.com/apache/spark/pull/45761 ``` - $0 -r docker.io/myrepo -t v3.4.0 -b java_image_tag=11-jre build + $0 -r docker.io/myrepo -t v4.0.0 -b java_image_tag=17 build ``` Lastly, `3.4.0` is too old because it was released on April 13, 2023. We had better use v4.0.0. ``` - $0 -r docker.io/myrepo -t v3.4.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build + $0 -r docker.io/myrepo -t v4.0.0 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build ``` ### Does this PR introduce _any_ user-facing change? No functional change because this is a usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47618 from dongjoon-hyun/SPARK-49117. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 06 August 2024, 01:43:21 UTC
6d32472 [SPARK-49112][CONNECT][TEST] Make `createLocalRelationProto` support `TimestampType` ### What changes were proposed in this pull request? Make `createLocalRelationProto` support relation with `TimestampType` ### Why are the changes needed? existing helper function `createLocalRelationProto` cannot create table with `TimestampType`: ``` org.apache.spark.SparkException: [INTERNAL_ERROR] Missing timezoneId where it is mandatory. SQLSTATE: XX000 at org.apache.spark.SparkException$.internalError(SparkException.scala:99) at org.apache.spark.SparkException$.internalError(SparkException.scala:103) at org.apache.spark.sql.util.ArrowUtils$.toArrowType(ArrowUtils.scala:57) at org.apache.spark.sql.util.ArrowUtils$.toArrowField(ArrowUtils.scala:139) at org.apache.spark.sql.util.ArrowUtils$.$anonfun$toArrowSchema$1(ArrowUtils.scala:181) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) ``` ### Does this PR introduce _any_ user-facing change? No, test-only ### How was this patch tested? added ut ### Was this patch authored or co-authored using generative AI tooling? no Closes #47608 from zhengruifeng/create_timestamp_localrel. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 06 August 2024, 01:00:08 UTC
de8ee94 [SPARK-49014][BUILD] Bump Apache Avro to 1.12.0 ### What changes were proposed in this pull request? This PR aims to update `Apache Avro` to 1.12.0. ### Why are the changes needed? Apache Avro 1.12.0 is the latest feature release. - https://github.com/apache/avro/releases/tag/release-1.12.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47498 from Fokko/fd-bump-avro. Authored-by: Fokko <fokko@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 August 2024, 22:21:59 UTC
5f9c870 [SPARK-49097][INFRA] Add Python3 environment detection for the `build_error_docs` method in `build_api_docs.rb` ### What changes were proposed in this pull request? This PR aims to add Python3 environment detection for the `build_error_docs` method in `build_api_docs.rb`. ### Why are the changes needed? Make the environment exception prompts more friendly for developers when generating documents. Before: <img width="1322" alt="image" src="https://github.com/user-attachments/assets/9f31c951-e63a-479a-9600-2b62e8ad9ddd"> After: <img width="1379" alt="image" src="https://github.com/user-attachments/assets/b0841f42-237a-429e-8673-5254328c6dd2"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47590 from wayneguow/inf. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 August 2024, 22:17:08 UTC
e8227a8 [SPARK-49018][SQL] Fix approx_count_distinct not working correctly with collation ### What changes were proposed in this pull request? Fix for approx_count_distinct not working correctly with collated strings. ### Why are the changes needed? approx_count_distinct was not working with any collation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test added to CollationSQLExpressionSuite. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47503 from viktorluc-db/bugfix. Authored-by: viktorluc-db <viktor.lucic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2024, 12:55:49 UTC
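To illustrate the fix above, here is a minimal sketch, assuming a running SparkSession `spark` and that `UTF8_LCASE` is an available collation name; under a case-insensitive collation, 'a' and 'A' should count as a single distinct value.

```scala
// Hedged sketch: approx_count_distinct over collated strings; the collation name is an assumption.
spark.sql(
  """SELECT approx_count_distinct(c) AS cnt
    |FROM VALUES ('a' COLLATE UTF8_LCASE), ('A' COLLATE UTF8_LCASE) AS t(c)
    |""".stripMargin).show()  // expected: 1 under a case-insensitive collation
```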
5250ff3 [SPARK-48791][CORE][FOLLOW-UP] Fix regression caused by immutable conversion on TaskMetrics#externalAccums ### What changes were proposed in this pull request? This is a followup fix for https://github.com/apache/spark/pull/47197. We found that the perf regression still exists after that fix and located the culprit is the immutable conversion on `TaskMetrics#externalAccums`. This PR fixes it by avoiding the immutable conversion, and then enforce the read lock protection during the accessing on `TaskMetrics#externalAccums` to avoid the race issue (https://github.com/apache/spark/pull/40663). ### Why are the changes needed? Fix perf regression. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Covered by existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47578 from Ngone51/SPARK-48791-followup. Lead-authored-by: Yi Wu <yi.wu@databricks.com> Co-authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 05 August 2024, 12:39:29 UTC
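For context on the fix above, a minimal sketch of the general pattern it describes — not the actual TaskMetrics code — where readers take the read lock on the mutable buffer instead of paying for an immutable copy on every access:

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.mutable.ArrayBuffer

// Illustrative holder (hypothetical class): writers append under the write lock,
// readers work on the buffer directly under the read lock, so no per-read copy is made.
class AccumulatorHolder[T] {
  private val lock = new ReentrantReadWriteLock()
  private val accums = ArrayBuffer.empty[T]

  def add(a: T): Unit = {
    lock.writeLock().lock()
    try accums += a finally lock.writeLock().unlock()
  }

  def withReadLock[R](f: Seq[T] => R): R = {
    lock.readLock().lock()
    try f(accums) finally lock.readLock().unlock()
  }
}
```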
79cb096 Revert "[SPARK-48763][TESTS][FOLLOW-UP] Update project location in PlanGenerationTestSuite" This reverts commit f1f686dbf57c101ae6c0d88c927f64d0470f2615. 05 August 2024, 12:28:29 UTC
e5b6b5f [SPARK-48338][SQL] Improve exceptions thrown from parser/interpreter ### What changes were proposed in this pull request? Introduced a new class `SqlScriptingException`, which is thrown during SQL script parsing/interpreting, and contains information about the line number on which the error occurred. ### Why are the changes needed? Users should know which line of their script caused an error. ### Does this PR introduce _any_ user-facing change? Users will now see a line number on some of their error messages. The format of the new error messages is: `{LINE:N} errorMessage` where N is the line number of the error, and errorMessage is the existing error message, before this change. ### How was this patch tested? No new tests required, existing tests are updated. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47553 from dusantism-db/dusantism-db/sql-script-exception. Authored-by: Dušan Tišma <dusan.tisma@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2024, 12:18:11 UTC
d431d40 [MINOR][SQL] Remove orphans in ProtoToParsedPlanTestSuite and PlanGenerationTestSuite ### What changes were proposed in this pull request? This PR proposes to remove orphans in ProtoToParsedPlanTestSuite and PlanGenerationTestSuite ### Why are the changes needed? To remove unused files. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Generated by below ```bash SPARK_CLEAN_ORPHANED_GOLDEN_FILES=1 build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite" SPARK_CLEAN_ORPHANED_GOLDEN_FILES=1 build/sbt "connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47603 from HyukjinKwon/minor-cleanup-orphans. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 August 2024, 10:29:52 UTC
76a1ca5 [SPARK-49060][CONNECT] Clean up Mima rules for SQL-Connect binary compatibility checks ### What changes were proposed in this pull request? This PR modifies some Mima rules which are used for checking the binary compatibility between `sql` and `connect` modules. Major changes include: - Removed unnecessary filters for specific `private[sql]` constructors - there's a wildcard rule which filters out all of them. - Removed outdated filters about APIs that are already consistent. - Add a warning about unused filters. Current output: ```bash $ ./dev/connect-jvm-client-mima-check Do connect-client-jvm module mima check ... Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.queryExecution") did not filter out any problems. Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.sqlContext") did not filter out any problems. Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.selectUntyped") did not filter out any problems. Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.rdd") did not filter out any problems. Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.toJavaRDD") did not filter out any problems. Warning: ExcludeByName[Problem]("org.apache.spark.sql.Dataset.javaRDD") did not filter out any problems. finish connect-client-jvm module mima check ... connect-client-jvm module mima check passed. ``` I manually audited all rules defined in the list. One issue I found is that all APIs in `Dataset` are not being checked at all, likely due to having a `private[sql]` companion object in `spark-core`. Changing the object's visibility from `private[sql]` to `public` will resolve this issue. The exact reason is unknown and is to be investigated. ### Why are the changes needed? Need to make sure Mima is really working. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Not needed. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47487 from xupefei/mima-refactor. Authored-by: Paddy Xu <xupaddy@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 August 2024, 10:11:34 UTC
682eb1b [SPARK-49107][SQL] `ROUTINE_ALREADY_EXISTS` supports RoutineType ### What changes were proposed in this pull request? `ROUTINE_ALREADY_EXISTS` supports RoutineType: - existingRoutineType - newRoutineType ### Why are the changes needed? To make `ROUTINE_ALREADY_EXISTS` able to contain the type information ### Does this PR introduce _any_ user-facing change? minor change in error message ### How was this patch tested? updated tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #47600 from zhengruifeng/sql_error_routine_exists. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 05 August 2024, 10:04:24 UTC
dbc8f01 [SPARK-49108][EXAMPLE] Add `submit_pi.sh` REST API example ### What changes were proposed in this pull request? This PR aims to provide `submit_pi.sh` example via REST API. ### Why are the changes needed? To provide the REST API feature more clearly via a working example. ### Does this PR introduce _any_ user-facing change? No, this is a new example. ### How was this patch tested? Manual review. 1. Start Spark Cluster and Submit the job via `submit_pi.sh` script. ``` $ SPARK_MASTER_OPTS="-Dspark.master.rest.enabled=true" sbin/start-master.sh $ sbin/start-worker.sh spark://$(hostname):7077 $ ./examples/src/main/scripts/submit_pi.sh { "action" : "CreateSubmissionResponse", "message" : "Driver successfully submitted as driver-20240804234519-0000", "serverSparkVersion" : "4.0.0-SNAPSHOT", "submissionId" : "driver-20240804234519-0000", "success" : true }% ``` 2. Visit Spark Master UI and Spark job `Executor` UI. - http://localhost:8080 <img width="712" alt="Screenshot 2024-08-04 at 23 45 50" src="https://github.com/user-attachments/assets/9e1c583b-a954-484c-8728-e4e51d7f08ed"> 3. After completion, check the status via REST API. ``` $ curl http://localhost:6066/v1/submissions/status/driver-20240804234519-0000 { "action" : "SubmissionStatusResponse", "driverState" : "FINISHED", "serverSparkVersion" : "4.0.0-SNAPSHOT", "submissionId" : "driver-20240804234519-0000", "success" : true, "workerHostPort" : "127.0.0.1:61480", "workerId" : "worker-20240804234513-127.0.0.1-61480" } ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47601 from dongjoon-hyun/SPARK-49108. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Kent Yao <yao@apache.org> 05 August 2024, 10:01:39 UTC
f1f686d [SPARK-48763][TESTS][FOLLOW-UP] Update project location in PlanGenerationTestSuite ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/47579 that updates the Spark Connect location in the test `PlanGenerationTestSuite`. ### Why are the changes needed? Just for completeness. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually ran it via: ```bash SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47605 from HyukjinKwon/SPARK-48763-followup2. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 August 2024, 09:59:29 UTC
6bf6088 [SPARK-49063][SQL] Fix Between with ScalarSubqueries ### What changes were proposed in this pull request? Fix for between with ScalarSubqueries. ### Why are the changes needed? There is a regression introduced by a previous PR https://github.com/apache/spark/pull/44299. This needs to be addressed as the between operator was completely broken with resolved ScalarSubqueries. ### Does this PR introduce _any_ user-facing change? No, the bug is not released yet. ### How was this patch tested? Tests added to golden file. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47581 from mihailom-db/fixbetween. Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2024, 08:15:04 UTC
f99291a [SPARK-48346][SQL] Support for IF ELSE statements in SQL scripts ### What changes were proposed in this pull request? This PR proposes introduction of IF/ELSE statement to SQL scripting language. To evaluate conditions in IF or ELSE IF clauses, introduction of boolean statement evaluator is required as well. Changes summary: - Grammar/parser changes: - `ifElseStatement` grammar rule - `visitIfElseStatement` rule visitor - `IfElseStatement` logical operator - `IfElseStatementExec` execution node: - Internal states - `Condition` and `Body` - Iterator implementation - iterate over conditions until the one that evaluates to `true` is found - Use `StatementBooleanEvaluator` implementation to evaluate conditions - `DataFrameEvaluator`: - Implementation of `StatementBooleanEvaluator` - Evaluates results to `true` if it is single row, single column of boolean type with value `true` - `SqlScriptingInterpreter` - add logic to transform `IfElseStatement` to `IfElseStatementExec` ### Why are the changes needed? We are gradually introducing SQL Scripting to Spark, and IF/ELSE is one of the basic control flow constructs in the SQL language. For more details, check [JIRA item](https://issues.apache.org/jira/browse/SPARK-48346). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests are introduced to all of the three scripting test suites: `SqlScriptingParserSuite`, `SqlScriptingExecutionNodeSuite` and `SqlScriptingInterpreterSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47442 from davidm-db/sql_scripting_if_else. Authored-by: David Milicevic <david.milicevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2024, 07:31:59 UTC
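To make the construct above concrete, a hedged sketch of the kind of script the new grammar is described as accepting; the table `t` is hypothetical, and executing scripts through `spark.sql` may additionally require a scripting flag that is an assumption here.

```scala
// Sketch only, based on the grammar summary above (IF ... THEN ... ELSE ... END IF).
val script =
  """BEGIN
    |  IF (SELECT count(*) FROM t) > 0 THEN
    |    SELECT 'non-empty';
    |  ELSE
    |    SELECT 'empty';
    |  END IF;
    |END""".stripMargin
spark.sql(script).show()
```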
f01eafd [SPARK-49057][SQL] Do not block the AQE loop when submitting query stages ### What changes were proposed in this pull request? We missed the fact that submitting a shuffle or broadcast query stage can be heavy, as it needs to submit subqueries and wait for the results. This blocks the AQE loop and hurts the parallelism of AQE. This PR fixes the problem by using shuffle/broadcast's own thread pool to wait for subqueries and other preparations. This PR also re-implements https://github.com/apache/spark/pull/45234 to avoid submitting the shuffle job if the query is failed and all query stages need to be cancelled. ### Why are the changes needed? better parallelism for AQE ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test case ### Was this patch authored or co-authored using generative AI tooling? no Closes #47533 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2024, 06:57:25 UTC
94f8872 [SPARK-49078][SQL] Support show columns syntax in v2 table ### What changes were proposed in this pull request? Support v2 table with show columns syntax. ### Why are the changes needed? In lakehouse formats such as Paimon and Iceberg, tables are v2 tables, which do not support the show columns syntax; this PR aims to support it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? DataSourceV2SQLSuite ### Was this patch authored or co-authored using generative AI tooling? no Closes #47568 from xuzifu666/support_show_columns_v2. Lead-authored-by: xuyu <11161569@vivo.com> Co-authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 05 August 2024, 03:25:58 UTC
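A minimal sketch of the newly supported statement, assuming a v2 catalog named `testcat` is configured; the catalog, namespace, and table names are made up.

```scala
// SHOW COLUMNS against a DataSource V2 table (identifiers are hypothetical).
spark.sql("SHOW COLUMNS FROM testcat.ns.tbl").show()
```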
b76f0b9 [SPARK-49047][PYTHON][CONNECT] Truncate the message for logging ### What changes were proposed in this pull request? Truncate the message for logging, by truncating the bytes and string fields ### Why are the changes needed? existing implementation generates too massive logging ### Does this PR introduce _any_ user-facing change? No, logging only ``` In [7]: df = spark.createDataFrame([('a B c'), ('X y Z'), ], ['abc']) In [8]: plan = df._plan.to_proto(spark._client) In [9]: spark._client._proto_to_string(plan, False) Out[9]: 'root { common { plan_id: 4 } to_df { input { common { plan_id: 3 } local_relation { data: "\\377\\377\\377\\377p\\000\\000\\000\\020\\000\\000\\000\\000\\000\\n\\000\\014\\000\\006\\000\\005\\000\\010\\000\\n\\000\\000\\000\\000\\001\\004\\000\\014\\000\\000\\000\\010\\000\\010\\000\\000\\000\\004\\000\\010\\000\\000\\000\\004\\000\\000\\000\\001\\000\\000\\000\\024\\000\\000\\000\\020\\000\\024\\000\\010\\000\\006\\000\\007\\000\\014\\000\\000\\000\\020\\000\\020\\000\\000\\000\\000\\000\\001\\005\\020\\000\\000\\000\\030\\000\\000\\000\\004\\000\\000\\000\\000\\000\\000\\000\\003\\000\\000\\000abc\\000\\004\\000\\004\\000\\004\\000\\000\\000\\000\\000\\000\\000\\377\\377\\377\\377\\230\\000\\000\\000\\024\\000\\000\\000\\000\\000\\000\\000\\014\\000\\026\\000\\006\\000\\005\\000\\010\\000\\014\\000\\014\\000\\000\\000\\000\\003\\004\\000\\030\\000\\000\\000 \\000\\000\\000\\000\\000\\000\\000\\000\\000\\n\\000\\030\\000\\014\\000\\004\\000\\010\\000\\n\\000\\000\\000L\\000\\000\\000\\020\\000\\000\\000\\002\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\003\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\014\\000\\000\\000\\000\\000\\000\\000\\020\\000\\000\\000\\000\\000\\000\\000\\n\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\001\\000\\000\\000\\002\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\005\\000\\000\\000\\n\\000\\000\\000\\000\\000\\000\\000a B cX y Z\\000\\000\\000\\000\\000\\000\\377\\377\\377\\377\\000\\000\\000\\000" schema: "{\\"fields\\":[{\\"metadata\\":{},\\"name\\":\\"abc\\",\\"nullable\\":true,\\"type\\":\\"string\\"}],\\"type\\":\\"struct\\"}" } } column_names: "abc" } }' In [10]: spark._client._proto_to_string(plan, True) Out[10]: 'root { common { plan_id: 4 } to_df { input { common { plan_id: 3 } local_relation { data: "\\377\\377\\377\\377p\\000\\000\\000[truncated]" schema: "{\\"fields\\":[{\\"metadata\\":{},\\"name\\"[truncated]" } } column_names: "abc" } }' ``` ### How was this patch tested? added UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #47554 from zhengruifeng/py_client_truncate. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 05 August 2024, 00:23:04 UTC
4e69e16 [SPARK-49104][CORE][DOCS] Document `JWSFilter` usage in Spark UI and REST API and rename parameter to `secretKey` ### What changes were proposed in this pull request? This PR aims the following. - Document `JWSFilter` and its usage in `Spark UI` and `REST API` - `Spark UI` section of `Configuration` page - `Spark Security` page - `Spark Standalone` page - Rename the parameter `key` to `secretKey` to redact it in Spark Driver UI and Spark Master UI. ### Why are the changes needed? To apply recent new security features - #47575 - #47595 ### Does this PR introduce _any_ user-facing change? No because this is a new feature of Apache Spark 4.0.0. ### How was this patch tested? Pass the CIs and manual review. - `spark-standalone.html` ![Screenshot 2024-08-03 at 22 40 53](https://github.com/user-attachments/assets/f1b95a01-c14b-4f14-96b6-3181afaf6f9f) - `security.html` ![Screenshot 2024-08-03 at 22 39 00](https://github.com/user-attachments/assets/8413f6a3-47df-4d71-87ee-25ab32171c6c) ![Screenshot 2024-08-03 at 22 39 51](https://github.com/user-attachments/assets/01546724-d5b5-40d5-a980-236f9d13ae81) - `configuration.html` ![Screenshot 2024-08-03 at 22 38 07](https://github.com/user-attachments/assets/c0845a7f-6ae1-4194-b98a-68d7442c9785) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47596 from dongjoon-hyun/SPARK-49104. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 August 2024, 23:49:07 UTC
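A small sketch of the renamed parameter in use, following the standard `spark.<filter class>.param.<name>` convention for UI filters; the key value below is the placeholder from the examples in this log, not a real secret.

```scala
import org.apache.spark.SparkConf

// Hedged sketch: enable JWSFilter on the UI and pass the (renamed) secretKey parameter.
val conf = new SparkConf()
  .set("spark.ui.filters", "org.apache.spark.ui.JWSFilter")
  .set("spark.org.apache.spark.ui.JWSFilter.param.secretKey",
    "VmlzaXQgaHR0cHM6Ly9zcGFyay5hcGFjaGUub3JnIHRvIGRvd25sb2FkIEFwYWNoZSBTcGFyay4=")
```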
250a4cc [SPARK-48834][SQL][TESTS][FOLLOWUP] Add `assume(shouldTestPandasUDFs)` to PythonUDFSuite complex variant input test ### What changes were proposed in this pull request? This is a follow-up of the following in order to add a missed `assume(shouldTestPandasUDFs)`. - #47253 ### Why are the changes needed? After this PR, all tests of `PythonUDFSuite` will have `assume(shouldTestPythonUDFs)`. ``` $ cd ./sql/core/src/test/scala/org/apache/spark/sql/execution/python/ $ grep -C1 'test(' PythonUDFSuite.scala test("SPARK-28445: PythonUDF as grouping key and aggregate expressions") { assume(shouldTestPythonUDFs) -- test("SPARK-28445: PythonUDF as grouping key and used in aggregate expressions") { assume(shouldTestPythonUDFs) -- test("SPARK-28445: PythonUDF in aggregate expression has grouping key in its arguments") { assume(shouldTestPythonUDFs) -- test("SPARK-28445: PythonUDF over grouping key is argument to aggregate function") { assume(shouldTestPythonUDFs) -- test("SPARK-39962: Global aggregation of Pandas UDF should respect the column order") { assume(shouldTestPandasUDFs) -- test("variant input to pandas grouped agg UDF") { assume(shouldTestPandasUDFs) -- test("complex variant input to pandas grouped agg UDF") { assume(shouldTestPandasUDFs) -- test("variant output to pandas grouped agg UDF") { assume(shouldTestPandasUDFs) -- test("complex variant output to pandas grouped agg UDF") { assume(shouldTestPandasUDFs) -- test("SPARK-34265: Instrument Python UDF execution using SQL Metrics") { assume(shouldTestPythonUDFs) -- test("PythonUDAF pretty name") { assume(shouldTestPandasUDFs) -- test("SPARK-48706: Negative test case for Python UDF in higher order functions") { assume(shouldTestPythonUDFs) -- test("SPARK-48666: Python UDF execution against partitioned column") { assume(shouldTestPythonUDFs) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47599 from dongjoon-hyun/SPARK-48834. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 August 2024, 23:28:32 UTC
44b10ba [MINOR][DOCS] Fix broken links in streaming docs ### What changes were proposed in this pull request? This PR aims to fix broken links in streaming docs. ### Why are the changes needed? Fix broken links. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test with re-gen docs by `SKIP_PYTHONDOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 bundle exec jekyll build --watch `. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47589 from wayneguow/dead_link. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 August 2024, 22:22:50 UTC
18db040 [SPARK-49106][DOCS] Documented `Prometheus` endpoints ### What changes were proposed in this pull request? Since SPARK-46886 enables `spark.ui.prometheus.enabled` by default, this PR aims to provide clear documentation on the endpoints exposed by that change. ![Screenshot 2024-08-04 at 15 03 34](https://github.com/user-attachments/assets/0b7ba631-68c4-43c8-8903-2068a9f7a135) ### Why are the changes needed? Provide better documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No Closes #47219 from jerryzhou196/master. Lead-authored-by: Jerry Zhou <j448zhou@uwaterloo.ca> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: Jerry Zhou <jerryzhou196@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 August 2024, 22:10:40 UTC
b6e1a0d [SPARK-48731][INFRA] Upgrade `docker/build-push-action` to v6 ### What changes were proposed in this pull request? The pr aims to upgrade `docker/build-push-action` from `v5` to `v6`. ### Why are the changes needed? https://github.com/docker/build-push-action/releases/tag/v6.5.0 ... https://github.com/docker/build-push-action/releases/tag/v6.0.0 <img width="1097" alt="image" src="https://github.com/apache/spark/assets/15246973/136a0257-6a94-4771-a97c-3f703925b269"> https://docs.docker.com/build/ci/github-actions/build-summary/ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Manual observation as following: https://github.com/panbingkun/spark/actions/runs/9689581212 <img width="1003" alt="image" src="https://github.com/apache/spark/assets/15246973/43af8e42-32d3-463c-9bbf-33cf9817bc1f"> ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47112 from panbingkun/SPARK-48731. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 August 2024, 20:16:42 UTC
dbba92a [SPARK-48949][SQL] SPJ: Runtime partition filtering ### What changes were proposed in this pull request? Introduce runtime partition filtering for SPJ. In planning, we have the list of partition values on both sides to plan the tasks. We can thus filter out partition values based on the join type. Currently the LEFT OUTER, RIGHT OUTER, and INNER join types are supported as they are more common; other join types can be optimized in a subsequent PR. ### Why are the changes needed? In some common join types (INNER, LEFT, RIGHT), we have an opportunity to greatly reduce the data scanned in SPJ. For example, a small table joining a larger table by partition key can prune out most of the partitions of the large table. There is some similarity with the concept of DPP, but that uses heuristics and this is more exact, as SPJ planning requires us anyway to list out both sides' partitioning. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests in KeyGroupedPartitioningSuite. Closes #47426 from szehon-ho/spj_partition_filter. Authored-by: Szehon Ho <szehon.apache@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 August 2024, 20:11:59 UTC
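The idea behind the runtime filtering above, as a plain Scala sketch with made-up partition values (this is not the planner code): for an INNER join on the partition key, only partition values present on both sides can produce output, so the rest can be pruned before tasks are planned.

```scala
// Illustrative only: prune partitions by intersecting the partition values of both sides.
val leftPartitions  = Seq(Seq(1), Seq(2), Seq(3))   // partition key values reported by the left scan
val rightPartitions = Seq(Seq(2), Seq(3), Seq(4))   // partition key values reported by the right scan

val common      = leftPartitions.toSet.intersect(rightPartitions.toSet)
val prunedLeft  = leftPartitions.filter(common.contains)   // Seq(Seq(2), Seq(3))
val prunedRight = rightPartitions.filter(common.contains)  // Seq(Seq(2), Seq(3))
```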
b17e508 [SPARK-49105][BUILD][TESTS] Upgrade `ojdbc11` to 23.5.0.24.07 and OracleDatabaseOnDocker docker image tag to `oracle-free:23.5-slim` ### What changes were proposed in this pull request? This PR aims to upgrade `ojdbc11` to 23.5.0.24.07 and OracleDatabaseOnDocker docker image tag to `oracle-free:23.5-slim`. ### Why are the changes needed? Keep `Oracle` related test infrastructure to the latest version. And there are a lot of bug fixes of `ojdbc11` 23.5.0.24.07 : https://download.oracle.com/otn-pub/otn_software/jdbc/23c/Bugs-fixed-in-23ai.txt ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47597 from wayneguow/ojdbc11. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 August 2024, 20:10:27 UTC
bf02b9b [SPARK-49103][CORE] Support `spark.master.rest.filters` ### What changes were proposed in this pull request? This PR aims to support `spark.master.rest.filters` configuration like the existing `spark.ui.filters` configuration. Recently, Apache Spark starts to support `JWSFilter`. We can take advantage of `JWSFilter` to protect Spark Master REST API. - #47575 ### Why are the changes needed? Like `Spark UI`, we had better provide the same capability to Apache Spark Master REST API . For example, we can protect `JWSFilter` to `Spark Master REST API` like the following. **MASTER REST API WITH JWSFilter** ``` $ build/sbt package $ cp jjwt-impl-0.12.6.jar assembly/target/scala-2.13/jars $ cp jjwt-jackson-0.12.6.jar assembly/target/scala-2.13/jars $ SPARK_NO_DAEMONIZE=1 \ SPARK_MASTER_OPTS="-Dspark.master.rest.enabled=true -Dspark.master.rest.filters=org.apache.spark.ui.JWSFilter -Dspark.org.apache.spark.ui.JWSFilter.param.key=VmlzaXQgaHR0cHM6Ly9zcGFyay5hcGFjaGUub3JnIHRvIGRvd25sb2FkIEFwYWNoZSBTcGFyay4=" \ sbin/start-master.sh ``` **AUTHORIZATION FAILURE** ``` $ curl -v -XPOST http://localhost:6066/v1/submissions/clear * Host localhost:6066 was resolved. * IPv6: ::1 * IPv4: 127.0.0.1 * Trying [::1]:6066... * connect to ::1 port 6066 from ::1 port 51705 failed: Connection refused * Trying 127.0.0.1:6066... * Connected to localhost (127.0.0.1) port 6066 > POST /v1/submissions/clear HTTP/1.1 > Host: localhost:6066 > User-Agent: curl/8.7.1 > Accept: */* > * Request completely sent off < HTTP/1.1 403 Forbidden < Date: Sat, 03 Aug 2024 22:18:03 GMT < Cache-Control: must-revalidate,no-cache,no-store < Content-Type: text/html;charset=iso-8859-1 < Content-Length: 590 < Server: Jetty(11.0.21) < <html> <head> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/> <title>Error 403 Authorization header is missing.</title> </head> <body><h2>HTTP ERROR 403 Authorization header is missing.</h2> <table> <tr><th>URI:</th><td>/v1/submissions/clear</td></tr> <tr><th>STATUS:</th><td>403</td></tr> <tr><th>MESSAGE:</th><td>Authorization header is missing.</td></tr> <tr><th>SERVLET:</th><td>org.apache.spark.deploy.rest.StandaloneClearRequestServlet-7f171159</td></tr> </table> <hr/><a href="https://eclipse.org/jetty">Powered by Jetty:// 11.0.21</a><hr/> </body> </html> * Connection #0 to host localhost left intact ``` **SUCCESS** ``` $ curl -v -H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.4EKWlOkobpaAPR0J4BE0cPQ-ZD1tRQKLZp1vtE7upPw" -XPOST http://localhost:6066/v1/submissions/clear * Host localhost:6066 was resolved. * IPv6: ::1 * IPv4: 127.0.0.1 * Trying [::1]:6066... * connect to ::1 port 6066 from ::1 port 51697 failed: Connection refused * Trying 127.0.0.1:6066... * Connected to localhost (127.0.0.1) port 6066 > POST /v1/submissions/clear HTTP/1.1 > Host: localhost:6066 > User-Agent: curl/8.7.1 > Accept: */* > Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.4EKWlOkobpaAPR0J4BE0cPQ-ZD1tRQKLZp1vtE7upPw > * Request completely sent off < HTTP/1.1 200 OK < Date: Sat, 03 Aug 2024 22:16:51 GMT < Content-Type: application/json;charset=utf-8 < Content-Length: 113 < Server: Jetty(11.0.21) < { "action" : "ClearResponse", "message" : "", "serverSparkVersion" : "4.0.0-SNAPSHOT", "success" : true * Connection #0 to host localhost left intact }% ``` ### Does this PR introduce _any_ user-facing change? No, this is a new feature which is not loaded by default. ### How was this patch tested? Pass the CIs with newly added test case. 
### Was this patch authored or co-authored using generative AI tooling? No. Closes #47595 from dongjoon-hyun/SPARK-49103. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 August 2024, 02:30:41 UTC
7612f13 [SPARK-47894][CORE][FOLLOWUP] Add a trailing slash to `MasterPage`'s `Environment` page link ### What changes were proposed in this pull request? This is a follow-up of #46111 to prevent redundant redirection by adding trailing slashes like #46157 . ### Why are the changes needed? - To remove redundant redirection. ``` $ curl -v http://localhost:8080/environment * Host localhost:8080 was resolved. * IPv6: ::1 * IPv4: 127.0.0.1 * Trying [::1]:8080... * connect to ::1 port 8080 from ::1 port 50081 failed: Connection refused * Trying 127.0.0.1:8080... * Connected to localhost (127.0.0.1) port 8080 > GET /environment HTTP/1.1 > Host: localhost:8080 > User-Agent: curl/8.7.1 > Accept: */* > * Request completely sent off < HTTP/1.1 302 Found < Date: Sat, 03 Aug 2024 02:39:02 GMT < Location: http://localhost:8080/environment/ < Content-Length: 0 < * Connection #0 to host localhost left intact ``` - Some browser doesn't preserve all HTTP header information of the original PRs when it redirects. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and do the manual test. ``` $ sbin/start-master.sh ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47594 from dongjoon-hyun/SPARK-47894. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 August 2024, 19:21:35 UTC
aaf602a [SPARK-48480][SS][CONNECT] StreamingQueryListener should not be affected by spark.interrupt() ### What changes were proposed in this pull request? This PR implements a small architecture change for the server side listenerBusListener. Before, when the first `addListener` call reaches the server, there is a thread created, and there is a latch to hold this thread long running. This is to prevent this thread from returning, which would send a `ResultComplete` to the client and close the client receiving iterator. In the client-side listener we need to keep the iterator open all the time (until the last `removeListener` call) to keep receiving events. In this PR, we delegate the sending of the final `ResultComplete` to the listener thread itself. Now the thread doesn't need to be held stuck. This would 1. remove a hanging thread running on the server and 2. shield the listener from being affected by `spark.interruptAll`. `spark.interruptAll` interrupts all spark connect threads. So before this change, the listener long-running thread is also interrupted, therefore would be affected by it and stop sending back events. Now the long-running thread is closed, so it won't be affected. ### Why are the changes needed? Spark Connect improvement. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #46929 from WweiL/listener-uninterruptible. Authored-by: Wei Liu <wei.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2024, 08:01:38 UTC
9e35d04 [SPARK-48931][SS][FOLLOWUP] Reduce Cloud Store List API cost for state store maintenance task ### What changes were proposed in this pull request? Updating migration doc for #47393 ### Why are the changes needed? Better visibility of the change. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No Closes #47507 from riyaverm-db/update-migration-doc. Authored-by: Riya Verma <riya.verma@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 03 August 2024, 06:33:21 UTC
94b572a [SPARK-49080][SQL][TEST] Upgrade `mssql-jdbc` to 12.8.0.jre11 and MsSQLServer docker image tag to `2022-CU14-ubuntu-22.04` ### What changes were proposed in this pull request? This PR aims to upgrade `mssql-jdbc` to 12.8.0.jre11 and the MsSQLServer docker image to `mcr.microsoft.com/mssql/server:2022-CU14-ubuntu-22.04`. ### Why are the changes needed? This is the latest stable version of `mssql-jdbc`, related release notes: https://github.com/microsoft/mssql-jdbc/releases/tag/v12.7.0 https://github.com/microsoft/mssql-jdbc/releases/tag/v12.7.1 https://github.com/microsoft/mssql-jdbc/releases/tag/v12.8.0 Some fixed issues: - Fix to ensure metadata returned follows JDBC data type specs https://github.com/microsoft/mssql-jdbc/pull/2326 - Added token cache map to fix use of unintended auth token for subsequent connections https://github.com/microsoft/mssql-jdbc/pull/2341 - Clear prepared statement handle before reconnect https://github.com/microsoft/mssql-jdbc/pull/2364 - Reset socketTimeout to original value after a successful connection open https://github.com/microsoft/mssql-jdbc/pull/2355 - Clear prepared statement cache when resetting statement pool connection https://github.com/microsoft/mssql-jdbc/pull/2361 - Fixed ClassLoader leak of ActivityCorrelator ThreadLocal https://github.com/microsoft/mssql-jdbc/pull/2366 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47569 from wayneguow/ms_12_8. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 August 2024, 01:14:53 UTC
5f2d3b0 [SPARK-48763][FOLLOWUP] Make `dev/lint-scala` error message more accurate ### What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/47157, to make the `dev/lint-scala` error message more accurate. ### Why are the changes needed? After the move from: `connector/connect/server` `connector/connect/common` to: `sql/connect/server` `sql/connect/common` our error message in `dev/lint-scala` should be updated synchronously. e.g.: <img width="709" alt="image" src="https://github.com/apache/spark/assets/15246973/d749e371-7621-4063-b512-279d0690d573"> <img width="1406" alt="image" src="https://github.com/user-attachments/assets/ab681963-37f7-4f48-9458-61f591477365"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47585 from panbingkun/SPARK-48763_FOLLOWUP_2. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 August 2024, 01:09:27 UTC
6631abc [SPARK-49094][SQL] Fix ignoreCorruptFiles non-functioning for hive orc impl with mergeSchema off ### What changes were proposed in this pull request? ignoreCorruptFiles now applies to all file data sources except for hive orc implementation with mergeSchema off ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #47583 from yaooqinn/SPARK-49094. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 August 2024, 00:56:48 UTC
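For reference, a hedged sketch of the configuration combination this fix targets; the path is hypothetical. With the fix, the Hive ORC reader with schema merging off should honor `ignoreCorruptFiles` like the other file sources.

```scala
// Sketch: Hive ORC implementation, mergeSchema off, corrupt files skipped instead of failing the query.
spark.conf.set("spark.sql.orc.impl", "hive")
spark.conf.set("spark.sql.orc.mergeSchema", "false")
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.read.orc("/path/to/orc/dir").count()
```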
3da31b0 [SPARK-49090][CORE] Support `JWSFilter` ### What changes were proposed in this pull request? This PR aims to support `JWSFilter` which is a servlet filter that requires `JWS`, a cryptographically signed JSON Web Token, in the header via `spark.ui.filters` configuration. - spark.ui.filters=org.apache.spark.ui.JWSFilter - spark.org.apache.spark.ui.JWSFilter.param.key=YOUR-BASE64URL-ENCODED-KEY To simply put, `JWSFilter` will check the following for all requests. - The HTTP request should have `Authorization: Bearer <jws>` header. - `<jws>` is a string with three fields, `<header>.<payload>.<signature>`. - `<header>` is supposed to be a base64url-encoded string of `{"alg":"HS256","typ":"JWT"}`. - `<payload>` is a base64url-encoded string of fully-user-defined content. - `<signature>` is a signature based on `<header>.<payload>` and a user-provided key parameter. For example, the value of `<header>` will be `eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9` always and the value of `payload` can be `e30` if the payload is empty, `{}`. The `<signature>` part is changed by the shared value of `spark.org.apache.spark.ui.JWSFilter.param.key` between the server and client. ``` jshell> java.util.Base64.getUrlEncoder().encodeToString("{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes()) $2 ==> "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9" jshell> java.util.Base64.getUrlEncoder().encodeToString("{}".getBytes()) $3 ==> "e30=" ``` ### Why are the changes needed? To provide a little better security on WebUI consistently including Spark Standalone Clusters. For example, **SETTING** ``` $ jshell | Welcome to JShell -- Version 17.0.12 | For an introduction type: /help intro jshell> java.util.Base64.getUrlEncoder().encodeToString("Visit https://spark.apache.org to download Apache Spark.".getBytes()) $1 ==> "VmlzaXQgaHR0cHM6Ly9zcGFyay5hcGFjaGUub3JnIHRvIGRvd25sb2FkIEFwYWNoZSBTcGFyay4=" ``` ``` $ cat conf/spark-defaults.conf spark.ui.filters org.apache.spark.ui.JWSFilter spark.org.apache.spark.ui.JWSFilter.param.key VmlzaXQgaHR0cHM6Ly9zcGFyay5hcGFjaGUub3JnIHRvIGRvd25sb2FkIEFwYWNoZSBTcGFyay4= ``` **SPARK-SHELL** ``` $ build/sbt package $ cp jjwt-impl-0.12.6.jar assembly/target/scala-2.13/jars $ cp jjwt-jackson-0.12.6.jar assembly/target/scala-2.13/jars $ bin/spark-shell ``` Without JWS (ErrorCode: 403 Forbidden) ``` $ curl -v http://localhost:4040/ * Host localhost:4040 was resolved. * IPv6: ::1 * IPv4: 127.0.0.1 * Trying [::1]:4040... * connect to ::1 port 4040 from ::1 port 61313 failed: Connection refused * Trying 127.0.0.1:4040... 
* Connected to localhost (127.0.0.1) port 4040 > GET / HTTP/1.1 > Host: localhost:4040 > User-Agent: curl/8.7.1 > Accept: */* > * Request completely sent off < HTTP/1.1 403 Forbidden < Date: Fri, 02 Aug 2024 01:27:23 GMT < Cache-Control: must-revalidate,no-cache,no-store < Content-Type: text/html;charset=iso-8859-1 < Content-Length: 472 < <html> <head> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/> <title>Error 403 Authorization header is missing.</title> </head> <body><h2>HTTP ERROR 403 Authorization header is missing.</h2> <table> <tr><th>URI:</th><td>/</td></tr> <tr><th>STATUS:</th><td>403</td></tr> <tr><th>MESSAGE:</th><td>Authorization header is missing.</td></tr> <tr><th>SERVLET:</th><td>org.apache.spark.ui.JettyUtils$$anon$2-3b39bee2</td></tr> </table> </body> </html> * Connection #0 to host localhost left intact ``` With JWS, ``` $ curl -v -H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.4EKWlOkobpaAPR0J4BE0cPQ-ZD1tRQKLZp1vtE7upPw" http://localhost:4040/ * Host localhost:4040 was resolved. * IPv6: ::1 * IPv4: 127.0.0.1 * Trying [::1]:4040... * connect to ::1 port 4040 from ::1 port 61311 failed: Connection refused * Trying 127.0.0.1:4040... * Connected to localhost (127.0.0.1) port 4040 > GET / HTTP/1.1 > Host: localhost:4040 > User-Agent: curl/8.7.1 > Accept: */* > Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.4EKWlOkobpaAPR0J4BE0cPQ-ZD1tRQKLZp1vtE7upPw > * Request completely sent off < HTTP/1.1 302 Found < Date: Fri, 02 Aug 2024 01:27:01 GMT < Cache-Control: no-cache, no-store, must-revalidate < X-Frame-Options: SAMEORIGIN < X-XSS-Protection: 1; mode=block < X-Content-Type-Options: nosniff < Location: http://localhost:4040/jobs/ < Content-Length: 0 < * Connection #0 to host localhost left intact ``` **SPARK MASTER** Without JWS (ErrorCode: 403 Forbidden) ``` $ curl -v http://localhost:8080/json/ * Host localhost:8080 was resolved. * IPv6: ::1 * IPv4: 127.0.0.1 * Trying [::1]:8080... * connect to ::1 port 8080 from ::1 port 61331 failed: Connection refused * Trying 127.0.0.1:8080... * Connected to localhost (127.0.0.1) port 8080 > GET /json/ HTTP/1.1 > Host: localhost:8080 > User-Agent: curl/8.7.1 > Accept: */* > * Request completely sent off < HTTP/1.1 403 Forbidden < Date: Fri, 02 Aug 2024 01:34:03 GMT < Cache-Control: must-revalidate,no-cache,no-store < Content-Type: text/html;charset=iso-8859-1 < Content-Length: 477 < <html> <head> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/> <title>Error 403 Authorization header is missing.</title> </head> <body><h2>HTTP ERROR 403 Authorization header is missing.</h2> <table> <tr><th>URI:</th><td>/json/</td></tr> <tr><th>STATUS:</th><td>403</td></tr> <tr><th>MESSAGE:</th><td>Authorization header is missing.</td></tr> <tr><th>SERVLET:</th><td>org.apache.spark.ui.JettyUtils$$anon$1-6c52101f</td></tr> </table> </body> </html> * Connection #0 to host localhost left intact ``` With JWS ``` $ curl -v -H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.4EKWlOkobpaAPR0J4BE0cPQ-ZD1tRQKLZp1vtE7upPw" http://localhost:8080/json/ * Host localhost:8080 was resolved. * IPv6: ::1 * IPv4: 127.0.0.1 * Trying [::1]:8080... * connect to ::1 port 8080 from ::1 port 61329 failed: Connection refused * Trying 127.0.0.1:8080... 
* Connected to localhost (127.0.0.1) port 8080 > GET /json/ HTTP/1.1 > Host: localhost:8080 > User-Agent: curl/8.7.1 > Accept: */* > Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.4EKWlOkobpaAPR0J4BE0cPQ-ZD1tRQKLZp1vtE7upPw > * Request completely sent off < HTTP/1.1 200 OK < Date: Fri, 02 Aug 2024 01:33:10 GMT < Cache-Control: no-cache, no-store, must-revalidate < X-Frame-Options: SAMEORIGIN < X-XSS-Protection: 1; mode=block < X-Content-Type-Options: nosniff < Content-Type: text/json;charset=utf-8 < Vary: Accept-Encoding < Content-Length: 320 < { "url" : "spark://M3-Max.local:7077", "workers" : [ ], "aliveworkers" : 0, "cores" : 0, "coresused" : 0, "memory" : 0, "memoryused" : 0, "resources" : [ ], "resourcesused" : [ ], "activeapps" : [ ], "completedapps" : [ ], "activedrivers" : [ ], "completeddrivers" : [ ], "status" : "ALIVE" * Connection #0 to host localhost left intact }% ``` ### Does this PR introduce _any_ user-facing change? No, this is a new filter. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47575 from dongjoon-hyun/SPARK-49090. Lead-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 August 2024, 14:10:31 UTC
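As a companion to the curl examples above, a hedged sketch of how a client could mint the `Authorization: Bearer` token using only JDK classes; the secret is the placeholder key from the examples, and this is not code from the PR itself.

```scala
import java.util.Base64
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// Build <header>.<payload>.<signature> as described above (HS256 over the first two parts).
val secretB64Url = "VmlzaXQgaHR0cHM6Ly9zcGFyay5hcGFjaGUub3JnIHRvIGRvd25sb2FkIEFwYWNoZSBTcGFyay4="
val enc     = Base64.getUrlEncoder().withoutPadding()
val header  = enc.encodeToString("""{"alg":"HS256","typ":"JWT"}""".getBytes("UTF-8"))
val payload = enc.encodeToString("{}".getBytes("UTF-8"))
val mac = Mac.getInstance("HmacSHA256")
mac.init(new SecretKeySpec(Base64.getUrlDecoder().decode(secretB64Url), "HmacSHA256"))
val signature = enc.encodeToString(mac.doFinal(s"$header.$payload".getBytes("UTF-8")))
val token = s"$header.$payload.$signature"  // use as: Authorization: Bearer <token>
```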
080e7eb [SPARK-49000][SQL][FOLLOWUP] Improve code style and update comments ### What changes were proposed in this pull request? Fix `RewriteDistinctAggregates` rule to deal properly with aggregation on DISTINCT literals. Physical plan for `select count(distinct 1) from t`: ``` -- count(distinct 1) == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[], functions=[count(distinct 1)], output=[count(DISTINCT 1)#2L]) +- HashAggregate(keys=[], functions=[partial_count(distinct 1)], output=[count#6L]) +- HashAggregate(keys=[], functions=[], output=[]) +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=20] +- HashAggregate(keys=[], functions=[], output=[]) +- FileScan parquet spark_catalog.default.t[] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/nikola.mandic/oss-spark/spark-warehouse/org.apache.spark.s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> ``` Problem is happening when `HashAggregate(keys=[], functions=[], output=[])` node yields one row to `partial_count` node, which then captures one row. This four-node structure is constructed by `AggUtils.planAggregateWithOneDistinct`. To fix the problem, we're adding `Expand` node which will force non-empty grouping expressions in `HashAggregateExec` nodes. This will in turn enable streaming zero rows to parent `partial_count` node, yielding correct final result. ### Why are the changes needed? Aggregation with DISTINCT literal gives wrong results. For example, when running on empty table `t`: `select count(distinct 1) from t` returns 1, while the correct result should be 0. For reference: `select count(1) from t` returns 0, which is the correct and expected result. ### Does this PR introduce _any_ user-facing change? Yes, this fixes a critical bug in Spark. ### How was this patch tested? New e2e SQL tests for aggregates with DISTINCT literals. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47565 from uros-db/SPARK-49000-followup. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Kent Yao <yao@apache.org> 02 August 2024, 08:07:19 UTC
c248b06 [SPARK-48763][CONNECT][BUILD][FOLLOW-UP] Move Spark Connect common/server into sql directory ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/47157 that moves `connect` into `sql/connect`. ### Why are the changes needed? The reasons are as follow: - There was a bit of question about moving `connect` as a standalone top level (https://github.com/apache/spark/pull/47157#issuecomment-2202337766). - Technically all Spark Connect related code have to placed under `sql` just like Hive thrift server. - Spark Connect server is 99% SQL dedicated code for now - Spark Connect server already is using a lot of `spark.sql` configurations, e.g., `spark.sql.connect.serverStacktrace.enabled` - Spark Connect common is only for SQL module. If other components have to be implemented, that common has to be placed within that directory. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI in this PR should verify it. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47579 from HyukjinKwon/SPARK-48763-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 August 2024, 07:51:04 UTC
6e66be7 [MINOR][DOCS] Fix typos in docs/sql-ref-number-pattern.md ### What changes were proposed in this pull request? Fix typos in docs/sql-ref-number-pattern.md ### Why are the changes needed? Fix typos. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the CIs. Closes #47557 from tomscut/doc-typos. Authored-by: tom03.li <tom03.li@vipshop.com> Signed-off-by: Kent Yao <yao@apache.org> 02 August 2024, 07:25:47 UTC
aca0d24 [SPARK-49093][SQL] GROUP BY with MapType nested inside complex type ### What changes were proposed in this pull request? Currently we support GROUP BY <column>, where the column is of MapType, but we don't support scenarios where the column type contains MapType nested in some complex type (e.g. ARRAY<MAP<INT,INT>>); this PR addresses that issue. ### Why are the changes needed? We are extending support for MapType columns in the GROUP BY clause. ### Does this PR introduce _any_ user-facing change? Customers will be able to use MapType nested in a complex type as part of the GROUP BY clause. ### How was this patch tested? Added tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47331 from nebojsa-db/SC-170296. Authored-by: Nebojsa Savic <nebojsa.savic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 August 2024, 06:28:20 UTC
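A minimal sketch of the newly supported shape, assuming a running SparkSession `spark`; the inline data is made up, and the grouping key has type ARRAY<MAP<INT, INT>>.

```scala
// GROUP BY on a column whose type nests MapType inside an array.
spark.sql(
  """SELECT a, count(*) AS cnt
    |FROM VALUES (array(map(1, 1))), (array(map(1, 1))), (array(map(2, 2))) AS t(a)
    |GROUP BY a
    |""".stripMargin).show()
```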
5a5390c [SPARK-49071][SQL] Remove ArraySortLike trait ### What changes were proposed in this pull request? This PR cleans up the legacy code of `SortArray` to remove `ArraySortLike` and inline `nullOrder`. The `ArraySort` has been rewritten since https://github.com/apache/spark/pull/25728, so the `SortArray` is the only implementation of `ArraySortLike`. ### Why are the changes needed? Clean up the code. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Pass CI ### Was this patch authored or co-authored using generative AI tooling? no Closes #47547 from ulysses-you/cleanup. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com> 02 August 2024, 05:28:38 UTC
cbe3633 [SPARK-49072][DOCS] Fix abnormal display of text content which contains two $ in one line but non-formula in docs ### What changes were proposed in this pull request? There are some display exceptions in some documents currently, for example: - https://spark.apache.org/docs/3.5.1/running-on-kubernetes.html#secret-management ![image](https://github.com/user-attachments/assets/5a4fa4e0-b773-4007-96d0-c036bc7e0c13) - https://spark.apache.org/docs/latest/sql-migration-guide.html ![image](https://github.com/user-attachments/assets/e5f7ea17-9573-4917-b9cd-e36fd83d35fb) The reason is that the `MathJax` javascript package will display the content between two $ as a formula. This PR aims to fix abnormal display of text content which contains two $ in one line but is not a formula in the docs. ### Why are the changes needed? Fix doc display exceptions. ### Does this PR introduce _any_ user-facing change? Yes, it improves the user experience of the docs. ### How was this patch tested? Local manual tests with command `SKIP_API=1 bundle exec jekyll build --watch`. The new results after this PR: ![image](https://github.com/user-attachments/assets/3d0b62fc-44da-45cd-a295-5d098ce3b8ec) ![image](https://github.com/user-attachments/assets/f8884a26-7029-4926-a290-a24b7ae75fa4) ![image](https://github.com/user-attachments/assets/6d7a1289-b459-4ad3-8636-e4713ead3921) ![image](https://github.com/user-attachments/assets/216dbd8d-4bd5-43b6-abc5-082aea543888) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47548 from wayneguow/latex_error. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 August 2024, 01:13:07 UTC
a9daed1 [MINOR][DOCS] Fix typos in `docs/sql-data-sources-xml.md` ### What changes were proposed in this pull request? This PR aims to fix typos in `docs/sql-data-sources-xml.md`. It seems that the relevant content was copied from the document of the `json` data source, but I forgot to modify it to `xml`. ### Why are the changes needed? Fix typos. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47542 from wayneguow/xml_typos. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 August 2024, 00:48:58 UTC
f15570d Revert "SPARK-49004" This reverts commit c8db813644f947c84ec64ca8440897e7b5756b8e. 01 August 2024, 20:47:09 UTC
a0f8de5 Revert "Touch-ups after cherry-pick" This reverts commit 275d3c27aa0d10d1d017a792de6b73e10152c6b7. 01 August 2024, 20:46:57 UTC
275d3c2 Touch-ups after cherry-pick 01 August 2024, 20:46:11 UTC
c8db813 SPARK-49004 01 August 2024, 20:33:36 UTC
6706c41 [SPARK-49070][SS][SQL] TransformWithStateExec.initialState is rewritten incorrectly to produce invalid query plan ### What changes were proposed in this pull request? This patch fixes `TransformWithStateExec` so when its `hasInitialState` is false, the `initialState` won't be rewritten by planner incorrectly to produce invalid query plan which will cause unexpected errors for extension rules that rely on the correctness of query plan. ### Why are the changes needed? [SPARK-47363](https://issues.apache.org/jira/browse/SPARK-47363) added the support for users to provide initial state for streaming query. Such query operators like `TransformWithStateExec` might have `hasInitialState` as false which means the initial state related parameters are not used. But when query planner applies rules on the query, it will still apply on the initial state query plan. When `hasInitialState` is false, some related parameters like `initialStateGroupingAttrs` are invalid and some rules will use these invalid parameters to transform the initial state query plan. For example, `EnsureRequirements` may apply invalid Sort and Exchange on the initial query plan. We encountered these invalid query plan in our extension rules. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #47546 from viirya/fix_initial_state. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: huaxingao <huaxin.gao11@gmail.com> 01 August 2024, 19:24:43 UTC
08b1fb5 [SPARK-49077][SQL][TESTS] Remove `bouncycastle-related` test dependencies from `hive-thriftserver` module ### What changes were proposed in this pull request? After SPARK-49066 merged, other than `OrcEncryptionSuite`, the test cases for writing Orc data no longer require the use of `FakeKeyProvider`. As a result, `hive-thriftserver` no longer needs these test dependencies. ### Why are the changes needed? Clean up the test dependencies that are no longer needed by `hive-thriftserver`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual Test with this pr. ``` build/mvn -Phive -Phive-thriftserver clean install -DskipTests build/mvn -Phive -Phive-thriftserver clean install -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite -pl sql/hive-thriftserver ``` ``` Run completed in 5 minutes, 14 seconds. Total number of tests run: 243 Suites: completed 2, aborted 0 Tests: succeeded 243, failed 0, canceled 0, ignored 20, pending 0 All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47563 from LuciferYang/SPARK-49077. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 01 August 2024, 15:34:17 UTC
abf9bac [SPARK-49076][SQL] Fix the outdated `logical plan name` in `AstBuilder's` comments ### What changes were proposed in this pull request? The pr aims to fix the outdated `logical plan name` in `AstBuilder's` comments. ### Why are the changes needed? - After the pr https://github.com/apache/spark/pull/33609, the name of the logical plan below has been changed: `AlterTableAddColumns` -> `AddColumns` `AlterTableRenameColumn` -> `RenameColumn` `AlterTableAlterColumn` -> `AlterColumn` `AlterTableDropColumns` -> `DropColumns` - After the pr https://github.com/apache/spark/pull/30398 The name of the logical plan `ShowPartitionsStatement` has been changed to `ShowPartitions`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Only update comments. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47562 from panbingkun/fix_astbuilder. Lead-authored-by: panbingkun <panbingkun@baidu.com> Co-authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 01 August 2024, 15:32:44 UTC
b0f92a9 [SPARK-49002][SQL] Consistently handle invalid locations in WAREHOUSE/SCHEMA/TABLE/PARTITION/DIRECTORY ### What changes were proposed in this pull request? We are now consistently handling invalid location/path values for all database objects in this pull request. Before this PR, we only checked for `null` and `""` for a small group of operations, such as `SetNamespaceLocation` and `CreateNamespace`. However, various other commands or queries involved with location did not undergo verification. Besides, we also didn't apply suitable error classes for other syntax errors like `null` and `""`. In this PR, we add a try-catch block to rethrow INVALID_LOCATION errors for `null`, `""` and all other invalid inputs. And all operations for databases, tables, partitions, raw paths are validated. ### Why are the changes needed? For better and consistent path errors ### Does this PR introduce _any_ user-facing change? Yes, IllegalArgumentException thrown by path parsing is replaced with INVALID_LOCATION error ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #47485 from yaooqinn/SPARK-49002. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 01 August 2024, 15:29:35 UTC
63c0863 [SPARK-49065][SQL] Rebasing in legacy formatters/parsers must support non-JVM-default time zones ### What changes were proposed in this pull request? Explicitly pass the overridden timezone parameter to `rebaseJulianToGregorianMicros` and `rebaseGregorianToJulianMicros`. ### Why are the changes needed? Currently, timestamp rebasing defaults to the JVM timezone, so it produces incorrect results when the explicitly overridden timezone in the TimestampFormatter library is not the same as the JVM timezone. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT to capture this scenario ### Was this patch authored or co-authored using generative AI tooling? No Closes #47541 from sumeet-db/rebase_time_zone. Authored-by: Sumeet Varma <sumeet.varma@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 August 2024, 14:29:08 UTC
c3bb64e [MINOR][PS][DOCS] Add `DataFrame.plot.kde` to API reference ### What changes were proposed in this pull request? 1, add `DataFrame.plot.kde` to API reference; 2, sort the plotting function alphabetically; ### Why are the changes needed? `DataFrame.plot.kde` is a public function in Pandas https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.kde.html And it was already implemented in PS, but never exposed to users. ### Does this PR introduce _any_ user-facing change? Yes, doc-only changes ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47564 from zhengruifeng/add_kde_to_doc. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 01 August 2024, 08:01:08 UTC
bf3ad7e [SPARK-49074][SQL] Fix variant with `df.cache()` ### What changes were proposed in this pull request? Currently, the `actualSize` method of the `VARIANT` `columnType` isn't overridden, so we use the default size of 2kb for the `actualSize`. We should define `actualSize` so the cached variant column can correctly be written to the byte buffer. Currently, if the avg per-variant size is greater than 2KB and the total column size is greater than 128KB (the default initial buffer size), an exception will be (incorrectly) thrown. ### Why are the changes needed? to fix caching larger variants (in df.cache()), such as the ones included in the UTs. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? added UT ### Was this patch authored or co-authored using generative AI tooling? no Closes #47559 from richardc-db/fix_variant_cache. Authored-by: Richard Chen <r.chen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 August 2024, 00:48:19 UTC
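As a rough illustration of the SPARK-49074 scenario above (not taken from the PR itself), a minimal PySpark sketch of caching a DataFrame with a `VARIANT` column; it assumes an active SparkSession `spark` and the `parse_json` function on master:

```python
# Shape of the scenario described above: cache a DataFrame with a VARIANT column
# and force materialization. The actual failure needed variants averaging over 2KB
# and a column larger than the 128KB initial buffer, so this is only the shape of
# a repro, not a guaranteed trigger.
df = spark.sql("SELECT parse_json('{\"a\": 1, \"b\": [1, 2, 3]}') AS v FROM range(10)")
df.cache()
df.count()  # materializes the in-memory columnar cache for the variant column
```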
06ed91a [SPARK-49066][SQL][TESTS] Refactor `OrcEncryptionSuite` and make `spark.hadoop.hadoop.security.key.provider.path` effective only within `OrcEncryptionSuite` ### What changes were proposed in this pull request? This pr moves the global scope test configuration `spark.hadoop.hadoop.security.key.provider.path`, which is configured in the parent `pom.xml` and `SparkBuild.scala`, to `OrcEncryptionSuite` to ensure that it is effective only within `OrcEncryptionSuite`. To achieve this, the pr also refactors `OrcEncryptionSuite`: 1. Overrides `beforeAll` to back up the contents of `CryptoUtils#keyProviderCache`. 2. Overrides `afterAll` to restore the contents of `CryptoUtils#keyProviderCache`. This ensures that `CryptoUtils#keyProviderCache` is isolated during the test process of `OrcEncryptionSuite`. ### Why are the changes needed? The test configuration `spark.hadoop.hadoop.security.key.provider.path` in the parent `pom.xml` and `SparkBuild.scala` is effective globally, which leads to the possibility that other Orc writing test cases, besides `OrcEncryptionSuite`, might also be affected by this configuration and use `test.org.apache.spark.sql.execution.datasources.orc.FakeKeyProvider.Factory`。 ### Does this PR introduce _any_ user-facing change? No, just for test. ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #47543 from LuciferYang/SPARK-49066. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 July 2024, 19:02:33 UTC
8b826ab [SPARK-44638][SQL][TESTS] Add test for Char/Varchar in JDBC custom options ### What changes were proposed in this pull request? Char/Varchar in the JDBC `customSchema` option broke in Spark 3.1 ~ 3.4, but appears to have been restored on master by recent work in the JDBC area; this PR adds a test to cover it. See the report in SPARK-44638 ### Why are the changes needed? Test coverage. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test added ### Was this patch authored or co-authored using generative AI tooling? no Closes #47551 from yaooqinn/SPARK-44638. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 July 2024, 15:57:21 UTC
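For context on SPARK-44638, a hedged sketch of a JDBC read that uses the `customSchema` option with Char/Varchar types; the URL, table, and column names are placeholders rather than values from the added test, and an active SparkSession `spark` plus a reachable database are assumed:

```python
# Illustrative only: override JDBC column types with CHAR/VARCHAR via `customSchema`,
# the case the new test covers. All connection details below are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/testdb")
      .option("dbtable", "people")
      .option("customSchema", "name CHAR(10), address VARCHAR(100)")
      .load())
df.printSchema()
```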
dfa2133 [SPARK-49000][SQL] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates ### What changes were proposed in this pull request? Fix `RewriteDistinctAggregates` rule to deal properly with aggregation on DISTINCT literals. Physical plan for `select count(distinct 1) from t`: ``` -- count(distinct 1) == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[], functions=[count(distinct 1)], output=[count(DISTINCT 1)#2L]) +- HashAggregate(keys=[], functions=[partial_count(distinct 1)], output=[count#6L]) +- HashAggregate(keys=[], functions=[], output=[]) +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=20] +- HashAggregate(keys=[], functions=[], output=[]) +- FileScan parquet spark_catalog.default.t[] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/nikola.mandic/oss-spark/spark-warehouse/org.apache.spark.s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> ``` Problem is happening when `HashAggregate(keys=[], functions=[], output=[])` node yields one row to `partial_count` node, which then captures one row. This four-node structure is constructed by `AggUtils.planAggregateWithOneDistinct`. To fix the problem, we're adding `Expand` node which will force non-empty grouping expressions in `HashAggregateExec` nodes. This will in turn enable streaming zero rows to parent `partial_count` node, yielding correct final result. ### Why are the changes needed? Aggregation with DISTINCT literal gives wrong results. For example, when running on empty table `t`: `select count(distinct 1) from t` returns 1, while the correct result should be 0. For reference: `select count(1) from t` returns 0, which is the correct and expected result. ### Does this PR introduce _any_ user-facing change? Yes, this fixes a critical bug in Spark. ### How was this patch tested? New e2e SQL tests for aggregates with DISTINCT literals. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47525 from nikolamand-db/SPARK-49000-spark-expand-approach. Lead-authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Co-authored-by: Nikola Mandic <nikola.mandic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 July 2024, 14:37:42 UTC
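A compact reproduction sketch of the SPARK-49000 bug described above, assuming an active SparkSession `spark` (table and column names are illustrative):

```python
# On an empty table, count(distinct 1) returned 1 before the fix; 0 afterwards.
spark.sql("CREATE TABLE t (c INT) USING parquet")
spark.sql("SELECT count(distinct 1) FROM t").show()  # incorrect 1 before the fix, correct 0 after
spark.sql("SELECT count(1) FROM t").show()           # always returned the correct 0
```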
0e07873 [SPARK-48977][SQL] Optimize string searching under UTF8_LCASE collation ### What changes were proposed in this pull request? Modify string search under UTF8_LCASE collation by utilizing UTF8String character iterator to reduce one order of algorithmic complexity. ### Why are the changes needed? Optimize implementation for `contains`, `startsWith`, `endsWith`, `locate` expressions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47444 from uros-db/optimize-search. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 July 2024, 09:16:17 UTC
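To illustrate the kind of query SPARK-48977 speeds up, a hedged sketch of a case-insensitive `contains` under UTF8_LCASE; it assumes the `collate` SQL function and the UTF8_LCASE collation available on master, plus an active SparkSession `spark`:

```python
# Collation-aware search of the kind these expressions implement: UTF8_LCASE
# compares lowercased code points, so the match below is case-insensitive.
spark.sql(
    "SELECT contains(collate('Apache Spark', 'UTF8_LCASE'), 'SPARK') AS found"
).show()  # expected: true
```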
5954ed1 [SPARK-48964][SQL][DOCS] Fix the discrepancy between implementation, comment and documentation of option `recursive.fields.max.depth` in ProtoBuf connector ### What changes were proposed in this pull request? This PR aims to fix the discrepancy between implementation, comment and documentation of option `recursive.fields.max.depth` in ProtoBuf connector ### Why are the changes needed? Unify code implementation and documentation description. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47458 from wayneguow/SPARK-48964. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 31 July 2024, 09:04:26 UTC
43c05a7 [SPARK-49064][BUILD] Upgrade Kafka to 3.8.0 ### What changes were proposed in this pull request? The pr aims to upgrade `kafka` from `3.7.1` to `3.8.0`. ### Why are the changes needed? https://downloads.apache.org/kafka/3.8.0/RELEASE_NOTES.html ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47540 from panbingkun/SPARK-49064. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 July 2024, 08:14:30 UTC
34a52ad [SPARK-49067][SQL] Move utf-8 literal into internal methods of UrlCodec class ### What changes were proposed in this pull request? Move utf-8 literals in url encode/decode functions to internal methods of UrlCodec class ### Why are the changes needed? Remove unnecessary constant expressions ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #47544 from wForget/url_decode. Authored-by: wforget <643348094@qq.com> Signed-off-by: Kent Yao <yao@apache.org> 31 July 2024, 05:36:28 UTC
bb5ed95 [SPARK-49056][SQL] ErrorClassesJsonReader cannot handle null properly ### What changes were proposed in this pull request? This PR proposes to make `ErrorClassesJsonReader` handle null properly ### Why are the changes needed? When `ErrorClassesJsonReader` takes null for `getErrorMessage` method, it cannot handle null properly so raises `INTERNAL_ERROR`. For example, given error class example below: ```json { "MISSING_PARAMETER" : { "message" : [ "Parameter ${param} is missing." ] } } ``` and run: ```scala getErrorMessage("MISSING_PARAMETER", Map("param" -> null)) ``` **Before** ``` [INTERNAL_ERROR] Undefined error message parameter for error class: 'MISSING_PARAMETER', MessageTemplate: Parameter ${param} is missing., Parameters: Map(param -> null) SQLSTATE: XX000 org.apache.spark.SparkException: [INTERNAL_ERROR] Undefined error message parameter for error class: 'MISSING_PARAMETER', MessageTemplate: Parameter ${param} is missing., Parameters: Map(param -> null) SQLSTATE: XX000 ``` **After** ``` [MISSING_PARAMETER] Parameter null is missing. ``` ### Does this PR introduce _any_ user-facing change? No API changes, but this PR improves the user-facing error message as described above. ### How was this patch tested? Added UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47531 from itholic/SPARK-49056. Authored-by: Haejoon Lee <haejoon.lee@databricks.com> Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com> 31 July 2024, 05:32:09 UTC
98c365f [SPARK-48725][SQL] Integrate CollationAwareUTF8String.lowerCaseCodePoints into string expressions ### What changes were proposed in this pull request? Use `CollationAwareUTF8String.lowerCaseCodePoints` logic to properly lowercase strings according to UTF8_LCASE collation, instead of relying on `UTF8String.toLowerCase()` method calls. ### Why are the changes needed? Avoid correctness issues with respect to code-point logic in UTF8_LCASE (arising when Java to performs string lowercasing) and ensure consistent results. ### Does this PR introduce _any_ user-facing change? Yes, collation aware string function implementations will now rely on `CollationAwareUTF8String` string lowercasing for UTF8_LCASE collation, instead of `UTF8String` logic (which resorts to Java's implementation). ### How was this patch tested? Existing tests, with some new cases in `CollationSupportSuite`. 6 expressions are affected by this change: `contains`, `instr`, `find_in_set`, `replace`, `locate`, `substr_index`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47132 from uros-db/lcase-cp. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 July 2024, 05:13:49 UTC
5929412 [SPARK-49003][SQL] Fix interpreted code path hashing to be collation aware ### What changes were proposed in this pull request? Changed hash function to be collation aware. This change is just for interpreted code path. Codegen hashing was already collation aware. ### Why are the changes needed? We were getting the wrong hash for collated strings because our hash function only used the binary representation of the string. It didn't take collation into account. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added tests to `HashExpressionsSuite.scala` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47502 from ilicmarkodb/ilicmarkodb/fix_string_hash. Lead-authored-by: Marko <marko.ilic@databricks.com> Co-authored-by: Marko Ilic <marko.ilic@databricks.com> Co-authored-by: Marko Ilić <marko.ilic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 July 2024, 04:26:35 UTC
40d94b6 [SPARK-49031] Implement validation for the TransformWithStateExec operator using OperatorStateMetadataV2 ### What changes were proposed in this pull request? Implementing validation for the TransformWithStateExec operator, so that it can't restart with a different TimeMode and OutputMode, or invalid State Variable transformations. ### Why are the changes needed? If there is an invalid change to the query after a restart, we want the query to fail. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #47508 from ericm-db/validation. Authored-by: Eric Marnadi <eric.marnadi@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 30 July 2024, 23:04:07 UTC
acb2fec [SPARK-49059][CONNECT] Move `SessionHolder.forTesting(...)` to the test package ### What changes were proposed in this pull request? Moves `SessionHolder.forTesting(...)` into the test package; the new and equivalent method lies in `SparkConnectTestUtils.createDummySessionHolder(...)`. ### Why are the changes needed? The `SessionHolder.forTesting(...)` method is widely used in several Spark Connect testing suites. However, this test-specific code is located in the source code, and incorrect use may lead to the creation of a partial/buggy Spark Connect session during production runtime. ### Does this PR introduce _any_ user-facing change? Yes. Code running in the Server JVM will no longer be able to create a dummy Session Holder during live execution by calling `SessionHolder.forTesting(...)`. Note that using this during live execution is unwanted behaviour, as it would lead to the creation of a partial `SessionHolder` instance. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47536 from vicennial/SPARK-49059. Authored-by: vicennial <venkata.gudesa@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 30 July 2024, 16:27:27 UTC
33e463e [SPARK-48762][SQL] Introduce clusterBy DataFrameWriter API for Python ### What changes were proposed in this pull request? Introduce clusterBy DataFrameWriter API for Python. Also fix the issue that `listColumns` doesn't support `V1Table`. ### Why are the changes needed? Introduce more ways for users to interact with clustered tables. ### Does this PR introduce _any_ user-facing change? Yes, it introduces a new PySpark DataFrame API to specify clustering columns during write operations. ### How was this patch tested? New unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47452 from zedtang/clusterby-python-api. Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 30 July 2024, 15:51:46 UTC
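A hedged sketch of how the new PySpark `clusterBy` writer API from SPARK-48762 might be used, assuming it accepts column names like `partitionBy` does; the table and column names are placeholders, and an active SparkSession `spark` is assumed:

```python
# Write a DataFrame as a clustered table using the new DataFrameWriter API.
df = spark.range(100).withColumnRenamed("id", "user_id")
(df.write
   .clusterBy("user_id")            # clustering column(s) for the target table
   .saveAsTable("clustered_users")) # placeholder table name
```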
e573644 [SPARK-49054][SQL] Column default value should support current_* functions ### What changes were proposed in this pull request? This is a regression between Spark 3.5.0 and Spark 4. The following queries work on Spark 3.5.0 while fails on latest master branch: ``` CREATE TABLE test_current_user(i int, s string) USING parquet; ALTER TABLE test_current_user ALTER COLUMN s SET DEFAULT current_user() ``` ``` CREATE TABLE test_current_user(i int, s string default current_user()) USING parquet INSERT INTO test_current_user (i) VALUES ((0)); ``` This PR is to complete fixing this by eagerly executing finish-analysis and constant-folding rules before checking whether the expression is foldable and resolved. ### Why are the changes needed? Bug fix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UTs ### Was this patch authored or co-authored using generative AI tooling? No Closes #47529 from gengliangwang/finishAnalysis. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 30 July 2024, 15:35:16 UTC
1ec211e [SPARK-49053][PYTHON][ML] Make model save/load helper functions accept spark session ### What changes were proposed in this pull request? Make model save/load helper functions accept spark session ### Why are the changes needed? 1, avoid unnecessary spark session creations; 2, to be consistent with scala side changes: https://github.com/apache/spark/pull/47467 and https://github.com/apache/spark/pull/47477 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47527 from zhengruifeng/py_ml_save_metadata_spark. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 30 July 2024, 13:06:34 UTC
74e200e [MINOR][DOCS] Fix typos in `docs/sql-migration-guide.md` ### What changes were proposed in this pull request? This PR aims to fix typos in `docs/sql-migration-guide.md`. ### Why are the changes needed? Fix typos. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47530 from wayneguow/sql_typos. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 30 July 2024, 13:05:30 UTC
0894310 [SPARK-48997][SS] Implement individual unloads for maintenance thread pool thread failures ### What changes were proposed in this pull request? Currently, an exception in the maintenance thread pool today can cause the entire executor to exit. This PR changes the maintenance thread pool exception logic to _only_ unload the providers that throw exceptions—the entire thread pool is not stopped. Historically, the way that we bubbled exceptions to the maintenance thread task (which manages the pool) was in the following way: - A thread in the maintenance thread pool sees an exception and it sets `threadPoolException` to be non-null - The next time that `doMaintenance()` gets called by the maintenance task (which is scheduled periodically), it checks to see whether that exception is non-null - If it is non-null, it throws that exception - It then _catches_ that exception, and stops the thread pool But now that we don't need to stop the entire maintenance thread pool and unload _all_ of the providers, when an exception is encountered in a maintenance thread pool thread, we can just have that thread unload itself. Then, we can remove the `onError` callback because it will no longer be needed. ### Why are the changes needed? Please see the JIRA for a full description of the error conditions. We don't want executors to exit because of maintenance thread pool thread failures. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - All existing UTs must pass. One test was removed, though (see below). - Added a new UT where a fake state store provider is used; this state store provider throws exceptions for partitions 0 and 1. The UT asserts that these two providers become unloaded, but any other ones are _not_. There was an existing test, `SPARK-44438: maintenance task should be shutdown on error`, which actually mentions that the `SparkUncaughtExceptionHandler` should not be invoked even if there is an exception. However, this test does _not_ load more than 1 provider. Thus, the only loaded provider is the one that experiences the exception. We know that the root-cause of this issue is that if there exists _another_ provider that is waiting on a lock (i.e. on an RPC in `verifyStateStoreInstanceActive`), then that provider will receive an `InterruptedException`, which will lead to the `SparkUncaughtExceptionHandler` firing. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47475 from neilramaswamy/spark-48997. Authored-by: Neil Ramaswamy <neil.ramaswamy@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 30 July 2024, 08:22:33 UTC
6ff93ea [SPARK-48829][BUILD] Upgrade `RoaringBitmap` to 1.2.1 ### What changes were proposed in this pull request? The pr aims to upgrade `RoaringBitmap` from `1.1.0` to `1.2.1`. ### Why are the changes needed? - The full release notes: https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/1.2.0 - The latest version has brought bug fixes and some improvements: improve: Optimize RoaringBitSet.get(int fromIndex, int toIndex) https://github.com/RoaringBitmap/RoaringBitmap/pull/727 fix: add bitmapOfRange (non-static) in https://github.com/RoaringBitmap/RoaringBitmap/pull/728 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47247 from panbingkun/SPARK-48829. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 30 July 2024, 05:28:53 UTC
76bbcc5 [SPARK-49032][SS] Add schema path in metadata table entry, verify expected version and add operator metadata related test for operator metadata format v2 ### What changes were proposed in this pull request? Add schema path in metadata table entry, verify expected version and add operator metadata related test for operator metadata format v2 ### Why are the changes needed? Changes needed for version verification and for subsequent integration with state data source reader ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests ``` ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.OperatorStateMetadataSuite, threads: Idle Worker Monitor for python3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), ForkJoinPool.commonPool-worker-2 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true) ===== [info] Run completed in 26 seconds, 651 milliseconds. [info] Total number of tests run: 11 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47510 from anishshri-db/task/SPARK-49032. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 30 July 2024, 05:13:25 UTC
f9d6315 [SPARK-48989][SQL] Fix error result of throwing an exception of `SUBSTRING_INDEX` built-in function with Codegen ### What changes were proposed in this pull request? This PR fixes the `SUBSTRING_INDEX` built-in function, which throws an exception under Codegen after the related work to make it support collated strings. ### Why are the changes needed? Fix a bug: currently, this function cannot be used because it throws an exception. ![image](https://github.com/user-attachments/assets/32aaddb8-b261-47a2-8eeb-d66ed79a7db2) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA and add related test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47481 from wayneguow/SPARK-48989. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 July 2024, 04:39:27 UTC
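For reference, normal usage of the affected function, which the Codegen path previously broke for collated strings; assumes an active SparkSession `spark`:

```python
# SUBSTRING_INDEX in ordinary use: take everything before the second '.'.
spark.sql("SELECT substring_index('www.apache.org', '.', 2) AS part").show()
# expected value: www.apache
```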
332bb0e [SPARK-49049][DOCS] Document MasterPage custom title conf and REST API server-side env variable replacements ### What changes were proposed in this pull request? This PR aims to document the following three recent improvements. - #47491 - #47509 - #47511 ### Why are the changes needed? To provide an updated documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and check the HTML manually. <img width="926" alt="Screenshot 2024-07-29 at 14 10 40" src="https://github.com/user-attachments/assets/6c904ec0-0ece-432a-8e41-aeb88f7baab8"> <img width="932" alt="Screenshot 2024-07-29 at 13 52 20" src="https://github.com/user-attachments/assets/ca3afe9a-dcfe-4258-b455-9ff4781cb4e5"> <img width="940" alt="Screenshot 2024-07-29 at 13 52 29" src="https://github.com/user-attachments/assets/ad9635d4-c66f-4320-8b93-005443d4df2e"> ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47523 from dongjoon-hyun/SPARK-49049. Lead-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 July 2024, 03:53:29 UTC
80c4432 [SPARK-48865][SQL][FOLLOWUP] Add failOnError argument to replace TryEval in TryUrlDecode ### What changes were proposed in this pull request? Add `failOnError` argument for `UrlDecode` to replace `TryEval` in `TryUrlDecode`. ### Why are the changes needed? Address https://github.com/apache/spark/pull/47294#discussion_r1681150787 > I'm not a big fan of TryEval as it catches all the exceptions, including the ones not from UrlDecode, but from its input expressions. > > We should add a boolean flag in UrlDecode to control the null-on-error behavior. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #47514 from wForget/try_url_decode. Authored-by: wforget <643348094@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 July 2024, 03:25:39 UTC
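A hedged sketch of the user-facing behavior this flag backs, assuming the `url_decode` and `try_url_decode` functions from the parent SPARK-48865 change are available, plus an active SparkSession `spark`:

```python
# url_decode raises on malformed input; try_url_decode returns NULL for the same
# decoding failure, which is the null-on-error behavior the new flag controls.
spark.sql("SELECT try_url_decode('https%3A%2F%2Fspark.apache.org') AS ok").show()  # decoded URL
spark.sql("SELECT try_url_decode('%') AS bad").show()                              # NULL, no error
```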
308669f [SPARK-48900] Add `reason` field for all internal calls for job/stage cancellation ### What changes were proposed in this pull request? The changes can be grouped into two categories: - **Add `reason` field for all internal calls for job/stage cancellation** - Cancel because of exceptions: - org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala - org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala - sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala - Cancel by user (Web UI): - org/apache/spark/ui/jobs/JobsTab.scala - Cancel when streaming terminates/query ends: - org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala - org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala _(Developers familiar with these components: would appreciate it if you could provide suggestions for more helpful cancellation reason strings! :) )_ - **API Change for `JobWaiter`** Currently, the `.cancel()` function in `JobWaiter` does not allow specifying a reason for cancellation. This limitation prevents us from reporting cancellation reasons in AQE, such as cleanup or canceling unnecessary jobs after query replanning. To address this, we should add an `reason: String` parameter to all relevant APIs along the chain. <img width="936" alt="image" src="https://github.com/user-attachments/assets/239cbcd6-8d78-446a-98d0-456e5e837494"> ### Why are the changes needed? Today it is difficult to determine why a job, stage, or job group was canceled. We should leverage existing Spark functionality to provide a reason string explaining the cancellation cause, and should add new APIs to let us provide this reason when canceling job groups. For more context, please read [this JIRA ticket](https://issues.apache.org/jira/browse/SPARK-48900). This feature can be implemented in two PRs: 1. [Modify the current SparkContext and its downstream APIs to add the reason string, such as cancelJobGroup and cancelJobsWithTag](https://github.com/apache/spark/pull/47361) 2. Add reasons for all internal calls to these methods. **Note: This is the second of the two PRs to implement this new feature** ### Does this PR introduce _any_ user-facing change? It adds reasons for jobs and stages cancelled internally, providing users with clearer explanations for cancellations by Spark, such as exceptions, end of streaming, AQE query replanning, etc. ### How was this patch tested? Tests for the API changes on the `JobWaiter` chain are added in `JobCancellationSuite.scala` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47374 from mingkangli-db/reason_internal_calls. Authored-by: Mingkang Li <mingkang.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 July 2024, 01:02:27 UTC
0a1f5ab [SPARK-48710][PYTHON][FOLLOWUP] PySpark rdd test should not fail on optional dependencies ### What changes were proposed in this pull request? This is a follow-up of #47083 to recover PySpark RDD tests. ### Why are the changes needed? `PySpark Core` test should not fail on optional dependencies. **BEFORE** ``` $ python/run-tests.py --python-executables python3 --modules pyspark-core ... File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5376, in _test import numpy as np ModuleNotFoundError: No module named 'numpy' ``` **AFTER** ``` $ python/run-tests.py --python-executables python3 --modules pyspark-core ... Tests passed in 189 seconds Skipped tests in pyspark.tests.test_memory_profiler with python3: test_assert_vanilla_mode (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_assert_vanilla_mode) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_aggregate_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_aggregate_in_pandas) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_clear (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_clear) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_cogroup_apply_in_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_cogroup_apply_in_arrow) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_cogroup_apply_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_cogroup_apply_in_pandas) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_group_apply_in_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_group_apply_in_arrow) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_group_apply_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_group_apply_in_pandas) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_map_in_pandas_not_supported (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_map_in_pandas_not_supported) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_pandas_udf (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_pandas_udf_iterator_not_supported (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf_iterator_not_supported) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_pandas_udf_window (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf_window) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_udf (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_udf_multiple_actions (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_multiple_actions) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_udf_registered (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_registered) ... skipped 'Must have memory-profiler installed.' test_memory_profiler_udf_with_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_with_arrow) ... skipped 'Must have memory-profiler installed.' 
test_profilers_clear (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_profilers_clear) ... skipped 'Must have memory-profiler installed.' test_code_map (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_code_map) ... skipped 'Must have memory-profiler installed.' test_memory_profiler (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_memory_profiler) ... skipped 'Must have memory-profiler installed.' test_profile_pandas_function_api (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_profile_pandas_function_api) ... skipped 'Must have memory-profiler installed.' test_profile_pandas_udf (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_profile_pandas_udf) ... skipped 'Must have memory-profiler installed.' test_udf_line_profiler (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_udf_line_profiler) ... skipped 'Must have memory-profiler installed.' Skipped tests in pyspark.tests.test_rdd with python3: test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock) ... skipped 'NumPy or Pandas not installed' Skipped tests in pyspark.tests.test_serializers with python3: test_statcounter_array (pyspark.tests.test_serializers.NumPyTests.test_statcounter_array) ... skipped 'NumPy not installed' test_serialize (pyspark.tests.test_serializers.SciPyTests.test_serialize) ... skipped 'SciPy not installed' Skipped tests in pyspark.tests.test_worker with python3: test_memory_limit (pyspark.tests.test_worker.WorkerMemoryTest.test_memory_limit) ... skipped "Memory limit feature in Python worker is dependent on Python's 'resource' module on Linux; however, not found or not on Linux." test_python_segfault (pyspark.tests.test_worker.WorkerSegfaultNonDaemonTest.test_python_segfault) ... skipped 'SPARK-46130: Flaky with Python 3.12' test_python_segfault (pyspark.tests.test_worker.WorkerSegfaultTest.test_python_segfault) ... skipped 'SPARK-46130: Flaky with Python 3.12' ``` ### Does this PR introduce _any_ user-facing change? No. The failure happens during testing. ### How was this patch tested? Pass the CIs and do the manual test without optional dependencies. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47526 from dongjoon-hyun/SPARK-48710. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 30 July 2024, 00:46:30 UTC
0b48d3f [SPARK-48740][SQL] Catch missing window specification error early ### What changes were proposed in this pull request? Before, aggregate queries containing a window function without a window specification (e.g. `PARTITION BY`) would return a non-descriptive internal error message: `org.apache.spark.sql.catalyst.analysis.UnresolvedException: [INTERNAL_ERROR] Invalid call to exprId on unresolved object SQLSTATE: XX000` This PR catches the user error early and returns a more accurate description of the issue: `Window specification is not defined in the WINDOW clause for <windowName>. For more information about WINDOW clauses, please refer to '<docroot>/sql-ref-syntax-qry-select-window.html'.` ### Why are the changes needed? This change produces a more descriptive error message for window functions, improving the Spark user experience. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test in `sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47129 from asl3/improve-window-partition-error. Lead-authored-by: Amanda Liu <amanda.liu@databricks.com> Co-authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 30 July 2024, 00:35:28 UTC
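A hedged sketch of the query shape SPARK-48740 improves the error for: a named window referenced without a defining WINDOW clause; table and column names are illustrative, and an active SparkSession `spark` is assumed:

```python
# Referencing window `w` without `WINDOW w AS (...)` previously surfaced the
# INTERNAL_ERROR; it now fails with the descriptive missing-window-specification message.
spark.sql("CREATE TABLE sales (amount INT, region STRING) USING parquet")
spark.sql("SELECT region, sum(amount) OVER w FROM sales")  # no WINDOW clause defines w
```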
7e529c3 [SPARK-49041][PYTHON][CONNECT] Raise proper error for `dropDuplicates` when wrong `subset` is given ### What changes were proposed in this pull request? This PR proposes to raise proper error for `dropDuplicates` when wrong `subset` is given ### Why are the changes needed? Current error message is hard to understand since it raises unrelated `INTERNAL_ERROR`: **Classic**: ```python >>> df.dropDuplicates(None) [INTERNAL_ERROR] Undefined error message parameter for error class: '_LEGACY_ERROR_TEMP_1201', MessageTemplate: Cannot resolve column name "<colName>" among (<fieldNames>)., Parameters: Map(colName -> null, fieldNames -> name, age) SQLSTATE: XX000 ``` **Connect**: ```python >>> df.dropDuplicates(None) TypeError: bad argument type for built-in operation ``` ### Does this PR introduce _any_ user-facing change? No API changes, but the user-facing error message is improved both in classic & connect: ```python >>> df.dropDuplicates(None) [NOT_STR] Argument `subset` should be a str, got NoneType. ``` ### How was this patch tested? Added UTs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47518 from itholic/SPARK-49041. Authored-by: Haejoon Lee <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 July 2024, 19:31:57 UTC
6d4d764 [SPARK-49040][SQL] Fix doc `sql-ref-syntax-aux-exec-imm.md` ### What changes were proposed in this pull request? The PR fixes the SQL in the examples of `sql-ref-syntax-aux-exec-imm.md`, which could not be executed as written. ### Why are the changes needed? - Before: <img width="1333" alt="image" src="https://github.com/user-attachments/assets/fa980b52-d233-42bf-9f84-cc55e52dbb23"> - After: <img width="630" alt="image" src="https://github.com/user-attachments/assets/3b4f4326-b628-464f-a0ef-55e66d315653"> ### Does this PR introduce _any_ user-facing change? Yes, the example SQL in `sql-ref-syntax-aux-exec-imm.md` can now be copied, pasted, and run. ### How was this patch tested? Manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47517 from panbingkun/SPARK-49040. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 July 2024, 14:13:33 UTC
db03653 [SPARK-48910][SQL] Use HashSet/HashMap to avoid linear searches in PreprocessTableCreation ### What changes were proposed in this pull request? Use `HashSet`/`HashMap` instead of doing linear searches over the `Seq`. In case of 1000s of partitions this significantly improves the performance. ### Why are the changes needed? To avoid the O(n*m) passes in the `PreprocessTableCreation` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs ### Was this patch authored or co-authored using generative AI tooling? No Closes #47484 from vladimirg-db/vladimirg-db/get-rid-of-linear-searches-preprocess-table-creation. Authored-by: Vladimir Golubev <vladimir.golubev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 July 2024, 14:07:02 UTC
efc6a75 [SPARK-48901][SPARK-48916][SS][PYTHON] Introduce clusterBy DataStreamWriter API ### What changes were proposed in this pull request? Introduce a new clusterBy DataStreamWriter API in Scala, Spark Connect, and Pyspark. ### Why are the changes needed? Provides another way for users to create clustered tables for streaming writes. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new clusterBy DataStreamWriter API in Scala, Spark Connect, and Pyspark to allow specifying the clustering columns when writing streaming DataFrames. ### How was this patch tested? See new unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47376 from chirag-s-db/cluster-by-stream. Lead-authored-by: Chirag Singh <chirag.singh@databricks.com> Co-authored-by: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 29 July 2024, 06:52:16 UTC
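A hedged sketch of the streaming side of the new `clusterBy` API from SPARK-48901/SPARK-48916, assuming `DataStreamWriter.clusterBy` takes column names like the batch writer; the source, checkpoint path, and table name are placeholders:

```python
# Stream into a clustered table; the rate source provides `timestamp` and `value` columns.
stream_df = spark.readStream.format("rate").load()
query = (stream_df.writeStream
         .clusterBy("value")                                         # clustering column
         .option("checkpointLocation", "/tmp/checkpoints/clustered") # placeholder path
         .toTable("clustered_rate_events"))                          # placeholder table
```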
112a52d [SPARK-49035][PYTHON] Eliminate TypeVar `ColumnOrName_` ### What changes were proposed in this pull request? Eliminate TypeVar `ColumnOrName_` ### Why are the changes needed? unify the usage of `ColumnOrName` ### Does this PR introduce _any_ user-facing change? No, internal change ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47512 from zhengruifeng/hint_CoN_. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 July 2024, 17:44:42 UTC
80223bb [SPARK-48985][CONNECT] Connect Compatible Expression Constructors ### What changes were proposed in this pull request? There are a number of hard coded expressions in the SparkConnectPlanner. Most of these expressions are hardcoded because they are missing a proper constructor, or because they are not registered in the FunctionRegistry. The Column API has a similar problem. This PR fixes most of these exceptions. ### Why are the changes needed? Reduce the number of hard coded expressions in the SparkConnectPlanner and the Column API. This will make it significantly easier to create an implementation agnostic Column API. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47464 from hvanhovell/SPARK-48985. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> 28 July 2024, 02:19:29 UTC
c463e07 [SPARK-49013] Change key in collationsMap for Map and Array types in scala ### What changes were proposed in this pull request? When deserializing map/array that is not part of the struct field, the key in collation map should just be `{"element": collation}` instead of `{".element": collation}`. ### Why are the changes needed? To be consistent with the behavior on the pyspark side (#46737). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47497 from stefankandic/complexTypeDeSer. Authored-by: Stefan Kandic <stefan.kandic@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 July 2024, 20:06:26 UTC
10849d9 [SPARK-49034][CORE] Support server-side `sparkProperties` replacement in REST Submission API ### What changes were proposed in this pull request? Like SPARK-49033, this PR aims to support server-side `sparkProperties` replacement in REST Submission API. - For example, ephemeral Spark clusters with server-side environment variables can provide backend-resource and information without touching client-side applications and configurations. - The place holder pattern is `{{SERVER_ENVIRONMENT_VARIABLE_NAME}}` style like the following. https://github.com/apache/spark/blob/163e512c53208301a8511310023d930d8b77db96/docs/configuration.md?plain=1#L694 https://github.com/apache/spark/blob/163e512c53208301a8511310023d930d8b77db96/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L233-L234 ### Why are the changes needed? A user can submits an environment variable holder like `{{AWS_ENDPOINT_URL}}` in order to use server-wide environment variables of Spark Master. ``` $ SPARK_MASTER_OPTS="-Dspark.master.rest.enabled=true" \ AWS_ENDPOINT_URL=ENDPOINT_FOR_THIS_CLUSTER \ sbin/start-master.sh $ sbin/start-worker.sh spark://$(hostname):7077 ``` ``` curl -s -k -XPOST http://localhost:6066/v1/submissions/create \ --header "Content-Type:application/json;charset=UTF-8" \ --data '{ "appResource": "", "sparkProperties": { "spark.master": "spark://localhost:7077", "spark.app.name": "", "spark.submit.deployMode": "cluster", "spark.hadoop.fs.s3a.endpoint": "{{AWS_ENDPOINT_URL}}", "spark.jars": "/Users/dongjoon/APACHE/spark-merge/examples/target/scala-2.13/jars/spark-examples_2.13-4.0.0-SNAPSHOT.jar" }, "clientSparkVersion": "", "mainClass": "org.apache.spark.examples.SparkPi", "environmentVariables": {}, "action": "CreateSubmissionRequest", "appArgs": [ "10000" ] }' ``` - http://localhost:4040/environment/ ![Screenshot 2024-07-26 at 22 00 26](https://github.com/user-attachments/assets/20ea5d98-2503-4969-8cdb-82938c706029) ### Does this PR introduce _any_ user-facing change? No. This is a new feature and disabled by default via `spark.master.rest.enabled (default: false)` ### How was this patch tested? Pass the CIs with newly added test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47511 from dongjoon-hyun/SPARK-49034-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 July 2024, 07:38:18 UTC
388ca1e [SPARK-45891][SQL][PYTHON][VARIANT] Add support for interval types in the Variant Spec ### What changes were proposed in this pull request? This PR adds support for the [YearMonthIntervalType](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/YearMonthIntervalType.html) and [DayTimeIntervalType](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DayTimeIntervalType.html) as new primitive types in the Variant spec. As part of this task, the PR adds support for casting between intervals and variants and support for interval types in all the relevant variant expressions. This PR also adds support for these types on the PySpark side. ### Why are the changes needed? The variant spec should be compatible with all SQL Standard data types. ### Does this PR introduce _any_ user-facing change? Yes, it allows users to cast interval types to variants and vice versa. ### How was this patch tested? Unit tests in VariantExpressionSuite.scala and test_types.py ### Was this patch authored or co-authored using generative AI tooling? Yes, I used perplexity.ai to get guidance on converting some Scala code to Java code and Java code to Python code. Generated-by: perplexity.ai Closes #47473 from harshmotw-db/variant_interval. Authored-by: Harsh Motwani <harsh.motwani@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 July 2024, 05:31:48 UTC
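A hedged sketch of the new casts SPARK-45891 enables between the interval types and `VARIANT`, assuming the cast support described above and an active SparkSession `spark`:

```python
# Year-month and day-time intervals can now round-trip through VARIANT.
spark.sql("SELECT cast(INTERVAL '1-2' YEAR TO MONTH AS variant) AS v").show()
spark.sql(
    "SELECT cast(cast(INTERVAL '3 04:05:06' DAY TO SECOND AS variant) AS INTERVAL DAY TO SECOND) AS d"
).show()
```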
d7966cf [SPARK-49010][SQL][XML] Add unit tests for XML schema inference case sensitivity ### What changes were proposed in this pull request? Currently, XML respects the case sensitivity SQLConf (default to false) in the schema inference but we lack unit tests to verify the behavior. This PR adds more unit tests to it. ### Why are the changes needed? This is a test-only change. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is a test-only change. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47494 from shujingyang-db/xml-schema-inference-case-sensitivity. Authored-by: Shujing Yang <shujing.yang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 July 2024, 05:29:41 UTC
d23ff11 [SPARK-49033][CORE] Support server-side `environmentVariables` replacement in REST Submission API ### What changes were proposed in this pull request? This PR aims to support server-side environment variable replacement in REST Submission API. - For example, ephemeral Spark clusters with server-side environment variables can provide backend-resource and information without touching client-side applications and configurations. - The place holder pattern is `{{SERVER_ENVIRONMENT_VARIABLE_NAME}}` style like the following. https://github.com/apache/spark/blob/163e512c53208301a8511310023d930d8b77db96/docs/configuration.md?plain=1#L694 https://github.com/apache/spark/blob/163e512c53208301a8511310023d930d8b77db96/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L233-L234 ### Why are the changes needed? A user can submits an environment variable holder like `{{AWS_CA_BUNDLE}}` and `{{AWS_ENDPOINT_URL}}` in order to use server-wide environment variables of Spark Master. ``` $ SPARK_MASTER_OPTS="-Dspark.master.rest.enabled=true" \ AWS_ENDPOINT_URL=ENDPOINT_FOR_THIS_CLUSTER \ sbin/start-master.sh $ sbin/start-worker.sh spark://$(hostname):7077 ``` ``` curl -s -k -XPOST http://localhost:6066/v1/submissions/create \ --header "Content-Type:application/json;charset=UTF-8" \ --data '{ "appResource": "", "sparkProperties": { "spark.master": "spark://localhost:7077", "spark.app.name": "", "spark.submit.deployMode": "cluster", "spark.jars": "/Users/dongjoon/APACHE/spark-merge/examples/target/scala-2.13/jars/spark-examples_2.13-4.0.0-SNAPSHOT.jar" }, "clientSparkVersion": "", "mainClass": "org.apache.spark.examples.SparkPi", "environmentVariables": { "AWS_ACCESS_KEY_ID": "A", "AWS_SECRET_ACCESS_KEY": "B", "AWS_ENDPOINT_URL": "{{AWS_ENDPOINT_URL}}" }, "action": "CreateSubmissionRequest", "appArgs": [ "10000" ] }' ``` - http://localhost:4040/environment/ ![Screenshot 2024-07-26 at 16 58 26](https://github.com/user-attachments/assets/c52daf4e-02ce-4015-bda6-895fb39a39a9) ### Does this PR introduce _any_ user-facing change? No. This is a new feature and disabled by default via `spark.master.rest.enabled (default: false)` ### How was this patch tested? Pass the CIs with newly added test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47509 from dongjoon-hyun/SPARK-49033. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 July 2024, 04:50:35 UTC
163e512 [SPARK-48986][CONNECT][SQL] Add ColumnNode Intermediate Representation ### What changes were proposed in this pull request? This PR introduces an intermediate representation for Column operations. It also adds a converter from this IR to Catalyst Expression. This is a first step in sharing Column API between Classic and Connect. It is not integrated with any of the pre-existing code base. ### Why are the changes needed? We want to share the Scala Column API between the Classic and Connect Scala DataFrame API implementations. For this we need to decouple the Column API from Catalyst. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a test suite that tests the conversion from `ColumnNode` to `Expression`. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47466 from hvanhovell/SPARK-48986. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> 26 July 2024, 19:33:19 UTC
7e678a0 [SPARK-49009][SQL][PYTHON] Make Column APIs and functions accept Enums ### What changes were proposed in this pull request? Make Column APIs and functions accept `Enum`s. ### Why are the changes needed? `Enum`s can be accepted in Column APIs and functions using its `value`. ```py >>> from pyspark.sql import functions as F >>> from enum import Enum >>> class A(Enum): ... X = "x" ... Y = "y" ... >>> F.lit(A.X) Column<'x'> >>> F.lit(A.X) + A.Y Column<'`+`(x, y)'> ``` ### Does this PR introduce _any_ user-facing change? Yes, Python's `Enum`s will be used as literal values. ### How was this patch tested? Added the related tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47495 from ueshin/issues/SPARK-49009/enum. Authored-by: Takuya Ueshin <ueshin@databricks.com> Signed-off-by: Takuya Ueshin <ueshin@databricks.com> 26 July 2024, 18:49:27 UTC
b02eeff [SPARK-48999][SS] Divide PythonStreamingDataSourceSimpleSuite ### What changes were proposed in this pull request? Divide PythonStreamingDataSourceSuite into PythonStreamingDataSourceSuite, PythonStreamingDataSourceSimpleSuite and PythonStreamingDataSourceWriteSuite ### Why are the changes needed? It is reported that PythonStreamingDataSourceSuite runs too long. We need to divide it to allow better parallelism. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Watch CI to cover the three new tests and pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47479 from siying/divide_python_stream_suite. Authored-by: Siying Dong <siying.dong@databricks.com> Signed-off-by: allisonwang-db <allison.wang@databricks.com> 26 July 2024, 17:44:45 UTC
d98480e [SPARK-49012][SQL][BUILD] Add bouncycastle-related test dependencies to the `hive-thriftserver` module to fix the Maven daily test ### What changes were proposed in this pull request? This PR adds bouncycastle-related test dependencies to the `hive-thriftserver` module to fix the Maven daily test. ### Why are the changes needed? `sql-on-files.sql` added the following statement in https://github.com/apache/spark/pull/47480, which caused the Maven daily test to fail: https://github.com/apache/spark/blob/2363aec0c14ead24ade2bfa23478a4914f179c00/sql/core/src/test/resources/sql-tests/inputs/sql-on-files.sql#L10 - https://github.com/apache/spark/actions/runs/10094638521/job/27943309504 - https://github.com/apache/spark/actions/runs/10095571472/job/27943298802 ``` - sql-on-files.sql *** FAILED *** "" did not contain "Exception" Exception did not match for query #6 CREATE TABLE sql_on_files.test_orc USING ORC AS SELECT 1, expected: , but got: java.sql.SQLException org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8542.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8542.0 (TID 8594) (localhost executor driver): java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider at test.org.apache.spark.sql.execution.datasources.orc.FakeKeyProvider$Factory.createProvider(FakeKeyProvider.java:127) at org.apache.hadoop.crypto.key.KeyProviderFactory.get(KeyProviderFactory.java:96) at org.apache.hadoop.crypto.key.KeyProviderFactory.getProviders(KeyProviderFactory.java:68) at org.apache.orc.impl.HadoopShimsCurrent.createKeyProvider(HadoopShimsCurrent.java:97) at org.apache.orc.impl.HadoopShimsCurrent.getHadoopKeyProvider(HadoopShimsCurrent.java:131) at org.apache.orc.impl.CryptoUtils$HadoopKeyProviderFactory.create(CryptoUtils.java:158) at org.apache.orc.impl.CryptoUtils.getKeyProvider(CryptoUtils.java:141) at org.apache.orc.impl.WriterImpl.setupEncryption(WriterImpl.java:1015) at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:164) at org.apache.orc.OrcFile.createWriter(OrcFile.java:1078) at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:49) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:89) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:180) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:165) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:391) at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:107) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:901) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:901) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374) at org.apache.spark.rdd.RDD.iterator(RDD.scala:338) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171) at org.apache.spark.scheduler.Task.run(Task.scala:146) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840) Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) ... 32 more ``` Because we have configured `hadoop.security.key.provider.path` as `test:///` in the parent `pom.xml`, https://github.com/apache/spark/blob/5ccf9ba958f492c1eb4dde22a647ba75aba63d8e/pom.xml#L3165-L3166 `KeyProviderFactory#getProviders` will use `FakeKeyProvider$Factory` to create instances of `FakeKeyProvider`. https://github.com/apache/spark/blob/5ccf9ba958f492c1eb4dde22a647ba75aba63d8e/sql/core/src/test/resources/META-INF/services/org.apache.hadoop.crypto.key.KeyProviderFactory#L18 During the initialization of `FakeKeyProvider`, it first initializes its superclass `org.apache.hadoop.crypto.key.KeyProvider`, which leads to the loading of the `BouncyCastleProvider` class. Therefore, we need to add bouncycastle-related test dependencies to the `hive-thriftserver` module. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test with this PR. ``` build/mvn -Phive -Phive-thriftserver clean install -DskipTests build/mvn -Phive -Phive-thriftserver clean install -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite -pl sql/hive-thriftserver ``` ``` Run completed in 6 minutes, 52 seconds. Total number of tests run: 243 Suites: completed 2, aborted 0 Tests: succeeded 243, failed 0, canceled 0, ignored 20, pending 0 All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47496 from LuciferYang/thrift-bouncycastle. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 July 2024, 14:46:26 UTC
5ccf9ba [SPARK-48998][ML] Meta algorithms save/load model with SparkSession ### What changes were proposed in this pull request? 1. Add SparkSession-based overloads of the following helper functions: - SharedReadWrite.saveImpl - SharedReadWrite.load - DefaultParamsWriter.getMetadataToSave - DefaultParamsReader.loadParamsInstance - DefaultParamsReader.loadParamsInstanceReader 2. Deprecate the old functions. 3. Apply the new functions in ML. ### Why are the changes needed? Meta algorithms should save/load models with SparkSession. After this PR, all `.ml` implementations save and load models with SparkSession, while the old helper functions with `sc` are still available (just deprecated) for the ecosystem. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47477 from zhengruifeng/ml_meta_spark. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 26 July 2024, 10:17:52 UTC
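A hedged Scala sketch of the overload-plus-deprecation pattern this change describes; the object and method signatures below are simplified stand-ins, not the real `SharedReadWrite`/`DefaultParamsWriter` APIs:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Simplified illustration: keep the old SparkContext entry point for existing callers,
// but route it to a new SparkSession-based implementation.
object ModelIoHelper {
  @deprecated("Use the SparkSession overload instead", "4.0.0")
  def saveImpl(path: String, sc: SparkContext): Unit =
    // The active session wraps the same SparkContext, so delegating is safe.
    saveImpl(path, SparkSession.builder().getOrCreate())

  def saveImpl(path: String, spark: SparkSession): Unit = {
    // Sketch: write metadata and model data using `spark` (omitted here).
  }
}
```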
2363aec [SPARK-49007][CORE] Improve `MasterPage` to support custom title ### What changes were proposed in this pull request? This PR aims to improve `MasterPage` to support a custom title. ### Why are the changes needed? When there are multiple Spark clusters, a custom title can be more helpful than the Spark master address because it can contain semantics like the role of the clusters. In addition, the URL field on the same page already provides the Spark master information even when we use a custom title. **BEFORE** ``` sbin/start-master.sh ``` ![Screenshot 2024-07-25 at 14 01 11](https://github.com/user-attachments/assets/7055d700-4bd6-4785-a535-2f8ce6dba47d) **AFTER** ``` SPARK_MASTER_OPTS='-Dspark.master.ui.title="Projext X Staging Cluster"' sbin/start-master.sh ``` ![Screenshot 2024-07-25 at 14 05 38](https://github.com/user-attachments/assets/f7e45fd6-fa2b-4547-ae39-1403b1e910d9) ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Pass the CIs with the newly added test case. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47491 from dongjoon-hyun/SPARK-49007. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 July 2024, 03:23:48 UTC
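A hedged sketch of the title-selection logic implied above; `spark.master.ui.title` is the configuration key shown in the commit, while the helper name and fallback text are assumptions for illustration:

```scala
import org.apache.spark.SparkConf

object MasterPageTitle {
  // Use the custom title when configured; otherwise fall back to a master-URL-based heading.
  def apply(conf: SparkConf, masterUrl: String): String =
    conf.getOption("spark.master.ui.title").getOrElse(s"Spark Master at $masterUrl")
}

// MasterPageTitle(conf, "spark://host:7077") returns the configured title if set,
// and the URL-based heading otherwise.
```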
e73ede7 [SPARK-45787][SQL] Support Catalog.listColumns for clustering columns ### What changes were proposed in this pull request? Support listColumns API for clustering columns. ### Why are the changes needed? Clustering columns should be supported, just like partition and bucket columns, for listColumns API. ### Does this PR introduce _any_ user-facing change? Yes, listColumns will now show an additional field `isCluster` to indicate whether the column is a clustering column. Old output for `spark.catalog.listColumns`: ``` +----+-----------+--------+--------+-----------+--------+ |name|description|dataType|nullable|isPartition|isBucket| +----+-----------+--------+--------+-----------+--------+ | a| null| int| true| false| false| | b| null| string| true| false| false| | c| null| int| true| false| false| | d| null| string| true| false| false| +----+-----------+--------+--------+-----------+--------+ ``` New output: ``` +----+-----------+--------+--------+-----------+--------+---------+ |name|description|dataType|nullable|isPartition|isBucket|isCluster| +----+-----------+--------+--------+-----------+--------+---------+ | a| null| int| true| false| false| false| | b| null| string| true| false| false| false| | c| null| int| true| false| false| false| | d| null| string| true| false| false| false| +----+-----------+--------+--------+-----------+--------+---------+ ``` ### How was this patch tested? New unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47451 from zedtang/list-clustering-columns. Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 July 2024, 01:44:05 UTC
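A hedged Scala usage sketch of reading the new flag; the table name is hypothetical and is assumed to have been created with a clustering (CLUSTER BY) clause:

```scala
import org.apache.spark.sql.SparkSession

object ListClusteringColumnsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("list-clustering-columns-demo").getOrCreate()
    // Assuming `demo_tbl` is an existing table defined with CLUSTER BY, each
    // catalog Column row now exposes isCluster next to isPartition and isBucket.
    spark.catalog.listColumns("demo_tbl")
      .select("name", "isPartition", "isBucket", "isCluster")
      .show()
    spark.stop()
  }
}
```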
f3b819e [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows ### What changes were proposed in this pull request? Extends previous work in https://github.com/apache/spark/pull/46839, allowing the grouping expressions to be bound to outer references. The most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`. Here, we group by `cast(T_inner.x as date)`, which is bound to an outer row. This guarantees that for every outer row there is exactly one value of `cast(T_inner.x as date)`, so it is safe to group on it. Previously, we required that only columns could be bound to outer references, thus forbidding such subqueries. ### Why are the changes needed? Extends the set of supported subqueries. ### Does this PR introduce _any_ user-facing change? Yes, previously failing queries now pass. ### How was this patch tested? Query tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #47388 from agubichev/group_by_cols. Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 July 2024, 01:37:28 UTC
78b83fa [SPARK-48996][SQL][PYTHON] Allow bare literals for __and__ and __or__ of Column ### What changes were proposed in this pull request? Allows bare literals for `__and__` and `__or__` of the Column API in Spark Classic. ### Why are the changes needed? Currently, bare literals are not allowed for `__and__` and `__or__` of the Column API in Spark Classic and need to be wrapped with the `lit()` function. They should be allowed, as they are for other similar operators. ```py >>> from pyspark.sql.functions import * >>> c = col("c") >>> c & True Traceback (most recent call last): ... py4j.Py4JException: Method and([class java.lang.Boolean]) does not exist >>> c & lit(True) Column<'and(c, true)'> ``` whereas other operators: ```py >>> c + 1 Column<'`+`(c, 1)'> >>> c + lit(1) Column<'`+`(c, 1)'> ``` Spark Connect allows this. ```py >>> c & True Column<'and(c, True)'> >>> c & lit(True) Column<'and(c, True)'> ``` ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added the related tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47474 from ueshin/issues/SPARK-48996/literal_and_or. Authored-by: Takuya Ueshin <ueshin@databricks.com> Signed-off-by: Takuya Ueshin <ueshin@databricks.com> 25 July 2024, 20:47:01 UTC
4999469 [SPARK-48849][SS] Create OperatorStateMetadataV2 for the TransformWithStateExec operator ### What changes were proposed in this pull request? Introducing the OperatorStateMetadataV2 format that integrates with the TransformWithStateExec operator. It is used to keep information about the TWS operator and will be used to enforce invariants between query runs. Each OperatorStateMetadataV2 has a pointer to the StateSchemaV3 file for the corresponding operator. Purging will be introduced in this PR: https://github.com/apache/spark/pull/47286 ### Why are the changes needed? This is needed for State Metadata integration with the TransformWithState operator. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Added unit tests to StateStoreSuite and TransformWithStateSuite ### Was this patch authored or co-authored using generative AI tooling? No Closes #47445 from ericm-db/metadata-v2. Authored-by: Eric Marnadi <eric.marnadi@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 25 July 2024, 12:42:05 UTC
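A hedged sketch of the shape such operator metadata could take; the case class, its fields, and the validation helper are hypothetical, intended only to illustrate "metadata that points at a StateSchemaV3 file and is checked between runs":

```scala
// Hypothetical shape: per-operator metadata persisted across query runs, carrying a
// pointer to the state schema file so invariants can be enforced on restart.
case class OperatorStateMetadataSketch(
    operatorId: Long,
    operatorName: String,
    stateStoreNames: Seq[String],
    stateSchemaFilePath: String)

object OperatorStateMetadataSketch {
  // Minimal invariant check between runs: the operator wiring must not change.
  def validate(previous: OperatorStateMetadataSketch, current: OperatorStateMetadataSketch): Unit = {
    require(previous.operatorName == current.operatorName,
      s"Operator changed from ${previous.operatorName} to ${current.operatorName}")
    require(previous.stateStoreNames == current.stateStoreNames,
      "State store layout changed between query runs")
  }
}
```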
cf95e75 [MINOR][DOCS] Update doc `sql/README.md` ### What changes were proposed in this pull request? This PR aims to update the doc `sql/README.md`. ### Why are the changes needed? After https://github.com/apache/spark/pull/41426, we have added a subproject `API` to our `SQL module`, so we need to update the doc `sql/README.md` accordingly. ### Does this PR introduce _any_ user-facing change? Yes, it makes the doc clearer and more accurate. ### How was this patch tested? Manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47476 from panbingkun/minor_docs. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 July 2024, 11:29:21 UTC
5c19505 [SPARK-48844][FOLLOWUP][TESTS] Cleanup duplicated data resource files in hive-thriftserver test ### What changes were proposed in this pull request? A follow-up of SPARK-48844 to clean up duplicated data resource files in the hive-thriftserver tests. ### Why are the changes needed? Code refactoring. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #47480 from yaooqinn/SPARK-48844-F. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 25 July 2024, 09:39:11 UTC