https://github.com/apache/spark

c293958 Preparing Spark release v3.5.0-rc4 05 September 2023, 20:15:44 UTC
dc6af11 [SPARK-45082][DOC] Review and fix issues in API docs for 3.5.0 ### What changes were proposed in this pull request? Compare the 3.4 API doc with the 3.5 RC3 cut. Fix the following issues: - Remove the leaking class/object in the API doc ### Why are the changes needed? Fix the issues in the Spark 3.5.0 release API docs. ### Does this PR introduce _any_ user-facing change? No, API doc changes only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42819 from xuanyuanking/SPARK-45082. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Yuanjian Li <yuanjian.li@databricks.com> (cherry picked from commit e0a6af988df3f52e95d46ac4c333825d2940065f) Signed-off-by: Yuanjian Li <yuanjian.li@databricks.com> 05 September 2023, 19:45:59 UTC
9389a2c [SPARK-45072][CONNECT] Fix outer scopes for ammonite classes ### What changes were proposed in this pull request? Ammonite places all user code inside Helper classes which are nested inside the class it creates for each command. This PR adds a custom code class wrapper for the Ammonite REPL. It makes sure the Helper classes generated by ammonite are always registered as an outer scope immediately. This way we can instantiate classes defined inside the Helper class, even when we execute Spark code as part of the Helper's constructor. ### Why are the changes needed? When you currently define a class and execute a Spark command using that class inside the same cell/line this will fail with an NullPointerException. The reason for that is that we cannot resolve the outer scope needed to instantiate the class. This PR fixes that issue. The following code will now execute successfully (include the curly braces): ```scala { case class Thing(val value: String) val r = (0 to 10).map( value => Thing(value.toString) ) spark.createDataFrame(r) } ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added more tests to the `ReplE2ESuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42807 from hvanhovell/SPARK-45072. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 40943c2748fdd28d970d017cb8ee86c294ee62df) Signed-off-by: Herman van Hovell <herman@databricks.com> 05 September 2023, 13:35:23 UTC
0dea7db [SPARK-44940][SQL][3.5] Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled ### What changes were proposed in this pull request? Backport of https://github.com/apache/spark/pull/42667 to branch-3.5. The PR improves JSON parsing when `spark.sql.json.enablePartialResults` is enabled: - Fixes the issue when using nested arrays `ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow` - Improves parsing of the nested struct fields, e.g. `{"a1": "AAA", "a2": [{"f1": "", "f2": ""}], "a3": "id1", "a4": "XXX"}` used to be parsed as `|AAA|NULL |NULL|NULL|` and now is parsed as `|AAA|[{NULL, }]|id1|XXX|`. - Improves performance of nested JSON parsing. The initial implementation would throw too many exceptions when multiple nested fields failed to parse. When the config is disabled, it is not a problem because the entire record is marked as NULL. The internal benchmarks show the performance improvement from slowdown of over 160% to an improvement of 7-8% compared to the master branch when the flag is enabled. I will create a follow-up ticket to add a benchmark for this regression. ### Why are the changes needed? Fixes some corner cases in JSON parsing and improves performance when `spark.sql.json.enablePartialResults` is enabled. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added tests to verify nested structs, maps, and arrays can be parsed without affecting the subsequent fields in the JSON. I also updated the existing tests when `spark.sql.json.enablePartialResults` is enabled because we parse more data now. I added a benchmark to check performance. Before the change (master, https://github.com/apache/spark/commit/a45a3a3d60cb97b107a177ad16bfe36372bc3e9b): ``` [info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws [info] Intel(R) Xeon(R) Platinum 8375C CPU 2.90GHz [info] Partial JSON results: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] parse invalid JSON 9537 9820 452 0.0 953651.6 1.0X ``` After the change (this PR): ``` OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws Intel(R) Xeon(R) Platinum 8375C CPU 2.90GHz Partial JSON results: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ parse invalid JSON 3100 3106 6 0.0 309967.6 1.0X ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42790 from sadikovi/SPARK-44940-3.5. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 September 2023, 21:37:32 UTC
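A minimal sketch (not from the PR; the schema and sample record below are assumptions) of how the `spark.sql.json.enablePartialResults` flag behaves from the Scala API: with the flag on, a record whose nested field fails to parse keeps its valid sibling fields instead of the whole row being nulled.
```scala
import org.apache.spark.sql.SparkSession

object PartialJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Enable partial results so a parse failure in one field does not null the record.
    spark.conf.set("spark.sql.json.enablePartialResults", "true")

    // "f2" is declared INT but holds a string, so that array element fails to parse fully.
    val records = Seq("""{"a1": "AAA", "a2": [{"f1": "", "f2": ""}], "a3": "id1", "a4": "XXX"}""").toDS()
    val df = spark.read
      .schema("a1 STRING, a2 ARRAY<STRUCT<f1: STRING, f2: INT>>, a3 STRING, a4 STRING")
      .json(records)

    // With the flag enabled, a1/a3/a4 (and the parsable parts of a2) are retained.
    df.show(truncate = false)
    spark.stop()
  }
}
```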
950b2f2 [SPARK-45042][BUILD][3.5] Upgrade jetty to 9.4.52.v20230823 ### What changes were proposed in this pull request? This PR aims to upgrade jetty from 9.4.51.v20230217 to 9.4.52.v20230823. (Backport to Spark 3.5.0) ### Why are the changes needed? - This is a release of https://github.com/eclipse/jetty.project/issues/7958 that was sponsored by a [support contract from Webtide.com](mailto:sales@webtide.com) - The newest version fixes a possible security issue: this release provides a workaround for Security Advisory https://github.com/advisories/GHSA-58qw-p7qm-5rvh - The release notes are here: https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.52.v20230823 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42795 from panbingkun/branch-3.5_SPARK-45042. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> 04 September 2023, 14:01:50 UTC
5c801fc [SPARK-44846][SQL] Convert the lower redundant Aggregate to Project in RemoveRedundantAggregates ### What changes were proposed in this pull request? This PR provides a safe way to remove a redundant `Aggregate` in rule `RemoveRedundantAggregates`. Just convert the lower redundant `Aggregate` to `Project`. ### Why are the changes needed? The aggregate contains complex grouping expressions after `RemoveRedundantAggregates`, if `aggregateExpressions` has (if / case) branches, it is possible that `groupingExpressions` is no longer a subexpression of `aggregateExpressions` after execute `PushFoldableIntoBranches` rule, Then cause `boundReference` error. For example ``` SELECT c * 2 AS d FROM ( SELECT if(b > 1, 1, b) AS c FROM ( SELECT if(a < 0, 0, a) AS b FROM VALUES (-1), (1), (2) AS t1(a) ) t2 GROUP BY b ) t3 GROUP BY c ``` Before pr ``` == Optimized Logical Plan == Aggregate [if ((b#0 > 1)) 1 else b#0], [if ((b#0 > 1)) 2 else (b#0 * 2) AS d#2] +- Project [if ((a#3 < 0)) 0 else a#3 AS b#0] +- LocalRelation [a#3] ``` ``` == Error == Couldn't find b#0 in [if ((b#0 > 1)) 1 else b#0#7] java.lang.IllegalStateException: Couldn't find b#0 in [if ((b#0 > 1)) 1 else b#0#7] at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:466) at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1241) at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1240) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:653) ...... ``` After pr ``` == Optimized Logical Plan == Aggregate [c#1], [(c#1 * 2) AS d#2] +- Project [if ((b#0 > 1)) 1 else b#0 AS c#1] +- Project [if ((a#3 < 0)) 0 else a#3 AS b#0] +- LocalRelation [a#3] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #42633 from zml1206/SPARK-44846-2. Authored-by: zml1206 <zhuml1206@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 32a87f03da7eef41161a5a7a3aba4a48e0421912) Signed-off-by: Yuming Wang <yumwang@ebay.com> 04 September 2023, 12:23:54 UTC
6112d78 [SPARK-45052][SQL][PYTHON][CONNECT][3.5] Make function aliases output column name consistent with SQL ### What changes were proposed in this pull request? backport https://github.com/apache/spark/pull/42775 to 3.5 ### Why are the changes needed? to make `func(col)` consistent with `expr(func(col))` ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #42786 from zhengruifeng/try_column_name_35. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 04 September 2023, 08:24:43 UTC
fe3a20a [SPARK-44876][PYTHON][FOLLOWUP][3.5] Fix Arrow-optimized Python UDF to delay wrapping the function with fail_on_stopiteration ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/42784. Fixes Arrow-optimized Python UDF to delay wrapping the function with `fail_on_stopiteration`. Also removed unnecessary verification `verify_result_type`. ### Why are the changes needed? For Arrow-optimized Python UDF, `fail_on_stopiteration` can be applied to only the wrapped function to avoid unnecessary overhead. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added the related test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42785 from ueshin/issues/SPARK-44876/3.5/fail_on_stopiteration. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 04 September 2023, 07:25:33 UTC
44ab0fc [SPARK-45045][SS] Revert back the behavior of idle progress for StreamingQuery API from SPARK-43183 ### What changes were proposed in this pull request? This PR proposes to revert back the behavior of idle progress for StreamingQuery API from [SPARK-43183](https://issues.apache.org/jira/browse/SPARK-43183), to avoid breakage of tests from 3rd party data sources. ### Why are the changes needed? We indicated that the behavioral change from SPARK-43183 broke many tests in 3rd party data sources. (Short summary of SPARK-43183: we changed the behavior of idle progress to only provide idle event callback, instead of making progress update callback as well as adding progress for StreamingQuery API to provide as recent progresses/last progress.) The main rationale of SPARK-43183 was to avoid making progress update callback for idle event, which had been confused users. That is more about streaming query listener, and not necessarily had to change the behavior of StreamingQuery API as well. ### Does this PR introduce _any_ user-facing change? Yes, but the user-facing change is technically reduced before this PR, as we revert back the behavioral change partially from SPARK-43183, which wasn't released yet. ### How was this patch tested? Modified tests. Also manually ran 3rd party data source tests which were broken with Spark 3.5.0 RC which succeeded with this change. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42773 from HeartSaVioR/SPARK-45045. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit cf0a5cb472efebb4350e48bd82a4f834e8607333) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 04 September 2023, 02:42:11 UTC
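As a rough illustration (assuming an active `spark` session, e.g. in `spark-shell`; query name and sink are placeholders), the surface this revert restores is the `StreamingQuery` progress accessors, which now keep reporting idle progress as in 3.4, while the listener-side idle callback introduced by SPARK-43183 is kept.
```scala
// Start a trivial rate-source query; names and sink are illustrative only.
val query = spark.readStream.format("rate").load()
  .writeStream.format("console").queryName("demo").start()

// StreamingQuery API surface: behaves as in 3.4 again after this revert.
println(query.lastProgress)                        // most recent progress, or null before the first batch
query.recentProgress.foreach(p => println(p.batchId))

query.stop()
```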
fb5495f [SPARK-45061][SS][CONNECT] Clean up running Python StreamingQueryListener processes when session expires ### What changes were proposed in this pull request? Clean up all running Python StreamingQueryListener processes when the session expires. ### Why are the changes needed? Improvement ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test will be added in SPARK-44462. Currently there is no way to test this because the session will never expire. This is because the started Python listener process (on the server) will establish a connection with the server process with the same session id and ping it all the time. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42687 from WweiL/SPARK-44433-followup-listener-cleanup. Authored-by: Wei Liu <wei.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7a01ba65b7408bc3b907aa7b0b27279913caafe9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 September 2023, 00:36:58 UTC
215f2d1 [SPARK-45054][SQL] HiveExternalCatalog.listPartitions should restore partition statistics ### What changes were proposed in this pull request? Call `restorePartitionMetadata` in `listPartitions` to restore Spark SQL statistics. ### Why are the changes needed? Currently when `listPartitions` is called, it doesn't restore Spark SQL statistics stored in the metastore, such as `spark.sql.statistics.totalSize`. This means callers who rely on stats from the method call may get wrong results. In particular, when `spark.sql.statistics.size.autoUpdate.enabled` is turned on, during insert overwrite Spark will first list partitions and get old statistics, and then compare them with new statistics and see which partitions need to be updated. This issue will sometimes cause it to update all partitions instead of only those partitions that have been touched. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test. ### Was this patch authored or co-authored using generative AI tooling? Closes #42777 from sunchao/list-partition-stat. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> 02 September 2023, 03:23:25 UTC
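A small sketch of the scenario described above (table name and layout are hypothetical; assumes an active `spark` session backed by a Hive metastore): with size auto-update enabled, an insert overwrite of one partition should only need to refresh that partition's `spark.sql.statistics.totalSize`, which relies on `listPartitions` returning the stored statistics.
```scala
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")

spark.sql("CREATE TABLE part_t (v INT, p INT) USING parquet PARTITIONED BY (p)")
spark.sql("INSERT INTO part_t PARTITION (p = 1) VALUES (1)")
spark.sql("INSERT INTO part_t PARTITION (p = 2) VALUES (2)")

// Overwrite only p = 1; with the fix, only that partition's size statistic
// should be detected as changed and updated, not every partition's.
spark.sql("INSERT OVERWRITE TABLE part_t PARTITION (p = 1) VALUES (10)")
```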
16c9f86 [SPARK-44750][PYTHON][CONNECT][TESTS][FOLLOW-UP] Avoid creating session twice in `SparkConnectSessionWithOptionsTest` ### What changes were proposed in this pull request? Avoid creating session twice in `SparkConnectSessionWithOptionsTest` ### Why are the changes needed? the session created in `ReusedConnectTestCase#setUpClass` is not used, so no need to inherit ### Does this PR introduce _any_ user-facing change? no, test-only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #42747 from zhengruifeng/minor_test_ut. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 8c27de68756d4b0e5940211340a0b323d808aead) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 September 2023, 17:01:43 UTC
b41fd4c [SPARK-44577][SQL] Fix INSERT BY NAME returns nonsensical error message ### What changes were proposed in this pull request? Fix INSERT BY NAME returns nonsensical error message on v1 datasource. eg: ```scala CREATE TABLE bug(c1 INT); INSERT INTO bug BY NAME SELECT 1 AS c2; ==> Multi-part identifier cannot be empty. ``` After PR: ```scala [INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_FIND_DATA] Cannot write incompatible data for the table `spark_catalog`.`default`.`bug`: Cannot find data for the output column `c1`. ``` Also fixed the same issue when throwing other INCOMPATIBLE_DATA_FOR_TABLE type errors ### Why are the changes needed? Fix the error msg nonsensical. ### Does this PR introduce _any_ user-facing change? Yes, the error msg in v1 insert by name will be changed. ### How was this patch tested? add new test. Closes #42220 from Hisoka-X/SPARK-44577_insert_by_name_bug_fix. Authored-by: Jia Fan <fanjiaeminem@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 00f66994c802faf9ccc0d40ed4f6ff32992ba00f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 September 2023, 12:27:35 UTC
b2b948d [SPARK-45029][CONNECT][TESTS] Ignore `from_protobuf messageClassName/from_protobuf messageClassName options` in `PlanGenerationTestSuite` ### What changes were proposed in this pull request? This pr aims ignore `from_protobuf messageClassName` and `from_protobuf messageClassName options` in `PlanGenerationTestSuite` and remove the related golden files, after this change `from_protobuf_messageClassName` and `from_protobuf_messageClassName_options` in `ProtoToParsedPlanTestSuite` be ignored too. ### Why are the changes needed? SPARK-43646 | (https://github.com/apache/spark/pull/42236) makes both Maven and SBT use the shaded `spark-protobuf` module when testing the connect module, this allows `mvn clean install` and `mvn package test` to successfully pass tests. But if `mvn clean test` is executed directly, an error `package org.sparkproject.spark_protobuf.protobuf does not exist` will occur. This is because `mvn clean test` directly uses the classes file of the `spark-protobuf` module for testing, without the 'package', hence it does not `shade` and `relocate` protobuf. On the other hand, the change of SPARK-43646 breaks the usability of importing Spark as a Maven project into IDEA(https://github.com/apache/spark/pull/42236#issuecomment-1700493815). So https://github.com/apache/spark/pull/42746 revert the change of [SPARK-43646](https://issues.apache.org/jira/browse/SPARK-43646). It's difficult to find a perfect solution to solve this maven test issues now, as in certain scenarios tests would use the `shaded spark-protobuf jar`, like `mvn package test`, while in some other scenarios it will use the `unshaded classes directory`, such as `mvn clean test`. so this pr ignores the relevant tests first and leaves a TODO(SPARK-45030), to re-enable these tests when we find a better solution. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #42751 from LuciferYang/SPARK-45029. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 10f31904636da727e439d5f1792c3ab7e8e1d24b) Signed-off-by: yangjie01 <yangjie01@baidu.com> 01 September 2023, 02:53:03 UTC
bc215ab Revert "[SPARK-43646][CONNECT][TESTS] Make both SBT and Maven use `spark-proto` uber jar to test the `connect` module" ### What changes were proposed in this pull request? This reverts commit df63adf734370f5c2d71a348f9d36658718b302c. ### Why are the changes needed? As [reported](https://github.com/apache/spark/pull/42236#issuecomment-1700493815) by MaxGekk , the solution for https://github.com/apache/spark/pull/42236 is not perfect, and it breaks the usability of importing Spark as a Maven project into idea. On the other hand, if `mvn clean test` is executed, test failures will also occur like ``` [ERROR] [Error] /tmp/spark-3.5.0/connector/connect/server/target/generated-test-sources/protobuf/java/org/apache/spark/sql/protobuf/protos/TestProto.java:9:46: error: package org.sparkproject.spark_protobuf.protobuf does not exist ``` Therefore, this pr will revert the change of SPARK-43646, and `from_protobuf messageClassName` and `from_protobuf messageClassName options` in `PlanGenerationTestSuite` will be ignored in a follow-up. At present, it is difficult to make the maven testing of the `spark-protobuf` function in the `connect` module as good as possible. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #42746 from LuciferYang/Revert-SPARK-43646. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 723a0aa30f9a901140d0f97d580d39db56b0729f) Signed-off-by: yangjie01 <yangjie01@baidu.com> 31 August 2023, 11:43:00 UTC
86c1be5 [SPARK-45014][CONNECT] Clean up fileserver when cleaning up files, jars and archives in SparkContext This PR proposes to clean up the files, jars and archives added via Spark Connect sessions. In [SPARK-44348](https://issues.apache.org/jira/browse/SPARK-44348), we clean up Spark Context's added files but we don't clean up the ones in fileserver. Yes, it will avoid slowly growing memory within the file server. Manually tested. Also existing tests should not be broken. No. Closes #42731 from HyukjinKwon/SPARK-45014. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9a023c479c6a91a602f96ccabba398223c04b3d1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 31 August 2023, 06:22:02 UTC
7be69bf [SPARK-44971][PYTHON] StreamingQueryProgress event fromJson bug fix ### What changes were proposed in this pull request? The `fromJson` method for `StreamingQueryProgress` expects the field `batchDuration` to be in the dict. That method is used internally for converting a json representation of `StreamingQueryProgress` into a Python object, commonly created by the Scala-side `json` method of the same object. But the `batchDuration` field is not there before https://github.com/apache/spark/pull/42077, which is only merged to 4.0. Therefore we add a catch there to prevent this method from failing. ### Why are the changes needed? Necessary bug fix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #42686 from WweiL/SPARK-44971-fromJson-bugfix. Lead-authored-by: Wei Liu <wei.liu@databricks.com> Co-authored-by: Wei Liu <z920631580@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 31 August 2023, 00:28:24 UTC
61679cb [SPARK-44742][PYTHON][DOCS][FOLLOWUP] Upgrade `pydata_sphinx_theme` to 0.8.0 in `spark-rm` Dockerfile ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/42428. ### Why are the changes needed? To fix an issue: when our `pydata_sphinx_theme` version is `0.4.1`, some configuration items may not be recognized. <img width="927" alt="image" src="https://github.com/apache/spark/assets/15246973/7ec54bb2-8c28-4863-8374-b3a5369873fc"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual testing. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42730 from panbingkun/SPARK-44742_FOLLOWUP. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 171251518b469824589c498c5f202cc55dacb128) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 August 2023, 18:26:05 UTC
5d12625 [SPARK-45016][PYTHON][CONNECT] Add missing `try_remote_functions` annotations ### What changes were proposed in this pull request? Add missing `try_remote_functions` annotations ### Why are the changes needed? to enable these functions in Connect ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? reused doctest ### Was this patch authored or co-authored using generative AI tooling? NO Closes #42734 from zhengruifeng/add_missing_annotation. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit caceb888510a34b9684259914470448fab29493b) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 August 2023, 18:20:03 UTC
627c1ce [SPARK-44990][SQL] Reduce the frequency of get `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` ### What changes were proposed in this pull request? This PR moves reading the config `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` into a lazy val of `UnivocityGenerator` to reduce how often it is fetched. As reported, the frequent lookups affect performance. ### Why are the changes needed? Reduce the frequency of reading `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42738 from Hisoka-X/SPARK-44990_csv_null_value_config. Authored-by: Jia Fan <fanjiaeminem@qq.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit dac750b855c35a88420b6ba1b943bf0b6f0dded1) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 August 2023, 17:55:09 UTC
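An illustrative sketch of the pattern applied here, with hypothetical names rather than the real `UnivocityGenerator` code: a per-value configuration lookup is hoisted into a `lazy val` so it is resolved at most once per writer instance.
```scala
// `readFlag` stands in for the SQLConf lookup; in the real code it is the
// spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv setting.
class QuotedNullCsvWriter(readFlag: () => Boolean, nullValue: String) {
  // Before the change the flag was fetched for every null value written;
  // now it is cached once for the lifetime of the generator.
  private lazy val nullAsQuotedEmptyString: Boolean = readFlag()

  def renderNull(): String =
    if (nullAsQuotedEmptyString) "\"\"" else nullValue
}
```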
daf36e9 [SPARK-45021][BUILD] Remove `antlr4-maven-plugin` configuration from `sql/catalyst/pom.xml` ### What changes were proposed in this pull request? SPARK-44475(https://github.com/apache/spark/pull/41928) has already moved the `antlr4-maven-plugin` relevant configuration to `sql/api/pom.xml`, so the configuration in the `catalyst` module is unused now, so this pr remove it from `sql/catalyst/pom.xml` ### Why are the changes needed? Clean up unused Maven `antlr4-maven-plugin` from catalyst module. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual verification ``` build/mvn clean install -DskipTests -pl sql/catalyst -am ``` Before We can see ``` [INFO] [INFO] --- antlr4-maven-plugin:4.9.3:antlr4 (default) spark-catalyst_2.12 --- [INFO] No ANTLR 4 grammars to compile in /Users/yangjie01/SourceCode/git/spark-mine-12/sql/catalyst/src/main/antlr4 [INFO] ``` After no relevant messages. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42739 from LuciferYang/remove-antlr4-catalyst. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 5df4e16546095a8931e7e87998470603e01c6695) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 August 2023, 15:40:55 UTC
24bd29c [SPARK-43438][SQL] Error on missing input columns in `INSERT` ### What changes were proposed in this pull request? In the PR, I propose to raise an error when an user uses V1 `INSERT` without a list of columns, and the number of inserting columns doesn't match to the number of actual table columns. At the moment Spark inserts data successfully in such case after the PR https://github.com/apache/spark/pull/41262 which changed the behaviour of Spark 3.4.x. ### Why are the changes needed? 1. To conform the SQL standard which requires the number of columns must be the same: ![Screenshot 2023-08-07 at 11 01 27 AM](https://github.com/apache/spark/assets/1580697/c55badec-5716-490f-a83a-0bb6b22c84c7) Apparently, the insertion below must not succeed: ```sql spark-sql (default)> CREATE TABLE tabtest(c1 INT, c2 INT); spark-sql (default)> INSERT INTO tabtest SELECT 1; ``` 2. To have the same behaviour as **Spark 3.4**: ```sql spark-sql (default)> INSERT INTO tabtest SELECT 1; `spark_catalog`.`default`.`tabtest` requires that the data to be inserted have the same number of columns as the target table: target table has 2 column(s) but the inserted data has 1 column(s), including 0 partition column(s) having constant value(s). ``` ### Does this PR introduce _any_ user-facing change? Yes. After the changes: ```sql spark-sql (default)> INSERT INTO tabtest SELECT 1; [INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS] Cannot write to `spark_catalog`.`default`.`tabtest`, the reason is not enough data columns: Table columns: `c1`, `c2`. Data columns: `1`. ``` ### How was this patch tested? By running the modified tests: ``` $ build/sbt "test:testOnly *InsertSuite" $ build/sbt "test:testOnly *ResolveDefaultColumnsSuite" $ build/sbt -Phive "test:testOnly *HiveQuerySuite" ``` Closes #42393 from MaxGekk/fix-num-cols-insert. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit a7eef2116919bd0c1a1b52adaf49de903e8c9c46) Signed-off-by: Max Gekk <max.gekk@gmail.com> 29 August 2023, 20:04:56 UTC
40e65d3 [SPARK-44981][PYTHON][CONNECT][FOLLOW-UP] Explicitly pass runtime configurations only ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/42694 that only allows to pass runtime configurations. ### Why are the changes needed? Excluding static SQL configurations cannot exclude core configurations. For example, if you pass `spark.jars` with `local-cluster` mode, it shows unneccesary warnings as below: ```bash ./bin/pyspark --remote "local-cluster[1,2,1024]" ``` it shows warnings as below: ``` 23/08/29 16:58:08 ERROR ErrorUtils: Spark Connect RPC error during: config. UserId: hyukjin.kwon. SessionId: 5c331d52-bf65-4f1c-9416-899e00d4a7d9. org.apache.spark.sql.AnalysisException: [CANNOT_MODIFY_CONFIG] Cannot modify the value of the Spark config: "spark.jars". See also 'https://spark.apache.org/docs/latest/sql-migration-guide.html#ddl-statements'. at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfSparkConfigError(QueryCompilationErrors.scala:3233) at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:166) at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1(SparkConnectConfigHandler.scala:67) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1$adapted(SparkConnectConfigHandler.scala:65) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handleSet(SparkConnectConfigHandler.scala:65) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handle(SparkConnectConfigHandler.scala:40) at org.apache.spark.sql.connect.service.SparkConnectService.config(SparkConnectService.scala:120) at org.apache.spark.connect.proto.SparkConnectServiceGrpc$MethodHandlers.invoke(SparkConnectServiceGrpc.java:751) at org.sparkproject.connect.grpc.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) at org.sparkproject.connect.grpc.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346) at org.sparkproject.connect.grpc.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860) at org.sparkproject.connect.grpc.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) at org.sparkproject.connect.grpc.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) /Users/hyukjin.kwon/workspace/forked/spark/python/pyspark/sql/connect/session.py:186: UserWarning: [CANNOT_MODIFY_CONFIG] Cannot modify the value of the Spark config: "spark.jars". See also 'https://spark.apache.org/docs/latest/sql-migration-guide.html#ddl-statements'. warnings.warn(str(e)) 23/08/29 16:58:08 ERROR ErrorUtils: Spark Connect RPC error during: config. UserId: hyukjin.kwon. SessionId: 5c331d52-bf65-4f1c-9416-899e00d4a7d9. org.apache.spark.sql.AnalysisException: [CANNOT_MODIFY_CONFIG] Cannot modify the value of the Spark config: "spark.jars". See also 'https://spark.apache.org/docs/latest/sql-migration-guide.html#ddl-statements'. 
at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfSparkConfigError(QueryCompilationErrors.scala:3233) at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:166) at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1(SparkConnectConfigHandler.scala:67) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1$adapted(SparkConnectConfigHandler.scala:65) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handleSet(SparkConnectConfigHandler.scala:65) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handle(SparkConnectConfigHandler.scala:40) at org.apache.spark.sql.connect.service.SparkConnectService.config(SparkConnectService.scala:120) at org.apache.spark.connect.proto.SparkConnectServiceGrpc$MethodHandlers.invoke(SparkConnectServiceGrpc.java:751) at org.sparkproject.connect.grpc.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) at org.sparkproject.connect.grpc.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346) at org.sparkproject.connect.grpc.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860) at org.sparkproject.connect.grpc.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) at org.sparkproject.connect.grpc.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` ### Does this PR introduce _any_ user-facing change? No, the original change has not been released out yet. ### How was this patch tested? Manually tested as described above. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42718 from HyukjinKwon/SPARK-44981-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 281f174304a5b1d9a146502dfdfd000d15924327) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 August 2023, 13:45:10 UTC
cecd79a Preparing development version 3.5.1-SNAPSHOT 29 August 2023, 05:57:11 UTC
9f137aa Preparing Spark release v3.5.0-rc3 29 August 2023, 05:57:06 UTC
bbe12e1 Revert "[SPARK-44742][PYTHON][DOCS] Add Spark version drop down to the PySpark doc site" This reverts commit 319dff11c373cc872aab4e7d55745561ee5d7b0e. 29 August 2023, 05:43:50 UTC
179aaab [SPARK-43646][CONNECT][TESTS] Make both SBT and Maven use `spark-proto` uber jar to test the `connect` module ### What changes were proposed in this pull request? Before this pr, when we tested the `connect` module, Maven used the shaded `spark-protobuf` jar for testing, while SBT used the original jar for testing, which also led to inconsistent testing behavior. So some tests passed when using SBT, but failed when using Maven: run ``` build/mvn clean install -DskipTests build/mvn test -pl connector/connect/server ``` there will be two test failed as follows: ``` - from_protobuf_messageClassName *** FAILED *** org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could not load Protobuf class with name org.apache.spark.connect.proto.StorageLevel. org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with Protobuf classes needs to be shaded (com.google.protobuf.* --> org.sparkproject.spark_protobuf.protobuf.*). at org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417) at org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193) at org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151) at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58) at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57) at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43) at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194) at org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) - from_protobuf_messageClassName_options *** FAILED *** org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could not load Protobuf class with name org.apache.spark.connect.proto.StorageLevel. org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with Protobuf classes needs to be shaded (com.google.protobuf.* --> org.sparkproject.spark_protobuf.protobuf.*). 
at org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417) at org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193) at org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151) at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58) at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57) at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43) at org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194) at org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) ``` So this pr make SBT also use `spark-proto` uber jar(`spark-protobuf-assembly-**-SNAPSHOT.jar`) for the above tests and refactor the test cases to make them pass both SBT and Maven after this pr. ### Why are the changes needed? Make connect server module can test pass using both SBT and maven. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass Github Actions - Manual check ``` build/mvn clean install -DskipTests build/mvn test -pl connector/connect/server ``` all test passed after this pr. Closes #42236 from LuciferYang/protobuf-test. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit df63adf734370f5c2d71a348f9d36658718b302c) Signed-off-by: yangjie01 <yangjie01@baidu.com> 29 August 2023, 03:15:39 UTC
fd07239 [SPARK-44832][CONNECT] Make transitive dependencies work properly for Scala Client ### What changes were proposed in this pull request? This PR cleans up the Maven build for the Spark Connect Client and Spark Connect Common. The most important change is that we move `sql-api` from a `provided` to `compile` dependency. The net effect of this is that when a user takes a dependency on the client, all of its required (transitive) dependencies are automatically added. Please note that this does not address concerns around creating an Ă¼berjar and shading. That is for a different day :) ### Why are the changes needed? When you take a dependency on the connect scala client you need to manually add the `sql-api` module as a dependency. This is rather poor UX. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually running maven, checking dependency tree, ... Closes #42518 from hvanhovell/SPARK-44832. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 50d9a56f824ae51d10543f4573753ff60dc9053b) Signed-off-by: Herman van Hovell <herman@databricks.com> 28 August 2023, 17:53:46 UTC
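For example, after this change an sbt build that depends only on the client should pull `sql-api` (and the rest of the required dependencies) transitively; the coordinates below assume the 3.5.0 artifacts and are illustrative only.
```scala
// build.sbt sketch: no explicit sql-api dependency needed anymore.
libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.0"
```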
12964c2 [SPARK-44867][CONNECT][DOCS] Refactor Spark Connect Docs to incorporate Scala setup ### What changes were proposed in this pull request? This PR refactors the Spark Connect overview docs to include an Interactive (shell/REPL) section and a Standalone application section as well as incorporates new Scala documentation into each of these sections. ### Why are the changes needed? Currently, there isn't much Scala-relevant documentation available to set up the Scala shell/project/application. ### Does this PR introduce _any_ user-facing change? Yes, the documentation for the Spark Connect [overview](https://spark.apache.org/docs/latest/spark-connect-overview.html) page is updated. ### How was this patch tested? Manually generating the docs locally. Closes #42556 from vicennial/sparkConnectDocs. Authored-by: vicennial <venkata.gudesa@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit d95e8f3c65e5ae0bf39c0ccc477b7b0910513066) Signed-off-by: Herman van Hovell <herman@databricks.com> 28 August 2023, 14:38:34 UTC
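A minimal standalone-application sketch of the kind the refactored docs describe (connection string and object name are assumptions; a Spark Connect server must already be running):
```scala
import org.apache.spark.sql.SparkSession

object ConnectQuickstart {
  def main(args: Array[String]): Unit = {
    // Connect to a running Spark Connect endpoint instead of creating a local driver.
    val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
    spark.range(5).show()
    spark.close()
  }
}
```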
c230a50 [SPARK-44974][CONNECT] Null out SparkSession/Dataset/KeyValueGroupedDatset on serialization ### What changes were proposed in this pull request? This PR changes the serialization for connect `SparkSession`, `Dataset`, and `KeyValueGroupedDataset`. While these were marked as serializable they were not, because they refer to bits and pieces that are not serializable. Even if we were to fix this, then we still have a class clash problem with server side classes that have the same name, but have different structure. the latter can be fixed with serialization proxies, but I am going to hold that until someone actually needs/wants this. After this PR these classes are serialized as null. This is a somewhat suboptimal solution compared to throwing exceptions on serialization, however this is more compatible compared to the old situation, and makes accidental capture of these classes less of an issue for UDFs. ### Why are the changes needed? More compatible with the old situation. Improved UX when working with UDFs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests to `ClientDatasetSuite`, `KeyValueGroupedDatasetE2ETestSuite`, `SparkSessionSuite`, and `UserDefinedFunctionE2ETestSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42688 from hvanhovell/SPARK-44974. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit f0b04286022e0774d78b9adcf4aeabc181a3ec89) Signed-off-by: Herman van Hovell <herman@databricks.com> 28 August 2023, 13:05:29 UTC
c831bd7 [SPARK-44982][CONNECT] Mark Spark Connect server configurations as static This PR proposes to mark all Spark Connect server configurations as static configurations. They are already static configurations, and cannot be set in runtime configuration (by default), see also https://github.com/apache/spark/blob/4a4856207d414ba88a8edabeb70e20765460ef1a/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L164-L167 No, they are already static configurations. Existing unittests. No. Closes #42695 from HyukjinKwon/SPARK-44982. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5b69dfd67e35f8be742a58cbd55f33088b4c7704) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 August 2023, 08:37:00 UTC
b86a4a8 [SPARK-44981][PYTHON][CONNECT] Filter out static configurations used in local mode ### What changes were proposed in this pull request? This PR is a kind of a followup of https://github.com/apache/spark/pull/42548. This PR proposes to filter static configurations out in remote=local mode. ### Why are the changes needed? Otherwise, it shows a bunch of warnings as below: ``` 23/08/28 11:39:42 ERROR ErrorUtils: Spark Connect RPC error during: config. UserId: hyukjin.kwon. SessionId: 424674ef-af95-4b12-b10e-86479413f9fd. org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.connect.copyFromLocalToFs.allowDestLocal. at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfStaticConfigError(QueryCompilationErrors.scala:3227) at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:162) at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1(SparkConnectConfigHandler.scala:67) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1$adapted(SparkConnectConfigHandler.scala:65) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handleSet(SparkConnectConfigHandler.scala:65) at org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handle(SparkConnectConfigHandler.scala:40) at org.apache.spark.sql.connect.service.SparkConnectService.config(SparkConnectService.scala:120) at org.apache.spark.connect.proto.SparkConnectServiceGrpc$MethodHandlers.invoke(SparkConnectServiceGrpc.java:751) at org.sparkproject.connect.grpc.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) at org.sparkproject.connect.grpc.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346) at org.sparkproject.connect.grpc.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860) at org.sparkproject.connect.grpc.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) at org.sparkproject.connect.grpc.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` In fact, we do support to set static configurations (and all other configurations) when `remote` is specific to `local`. ### Does this PR introduce _any_ user-facing change? No, the main change has not been released out yet. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42694 from HyukjinKwon/SPARK-44981. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 64636aff61aa473c8fc81c0bb3311e1fe824dc20) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 August 2023, 08:34:42 UTC
c9e5e62 [MINOR][SQL][DOC] Fix incorrect link in sql menu and typo ### What changes were proposed in this pull request? Fix incorrect link in sql menu and typo. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? run `SKIP_API=1 bundle exec jekyll build` ![image](https://github.com/apache/spark/assets/17894939/7cc564ec-41cd-4e92-b19e-d33a53188a10) ### Was this patch authored or co-authored using generative AI tooling? No Closes #42697 from wForget/doc. Authored-by: wforget <643348094@qq.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 421ff4e3c047865a1887cae94b85dbf40bb7bac9) Signed-off-by: Kent Yao <yao@apache.org> 28 August 2023, 08:09:35 UTC
f33a13c [SPARK-44980][PYTHON][CONNECT] Fix inherited namedtuples to work in createDataFrame ### What changes were proposed in this pull request? This PR fixes the bug in createDataFrame with Python Spark Connect client. Now it respects inherited namedtuples as below: ```python from collections import namedtuple MyTuple = namedtuple("MyTuple", ["zz", "b", "a"]) class MyInheritedTuple(MyTuple): pass df = spark.createDataFrame([MyInheritedTuple(1, 2, 3), MyInheritedTuple(11, 22, 33)]) df.collect() ``` Before: ``` [Row(zz=None, b=None, a=None), Row(zz=None, b=None, a=None)] ``` After: ``` [Row(zz=1, b=2, a=3), Row(zz=11, b=22, a=33)] ``` ### Why are the changes needed? This is already supported without Spark Connect. We should match the behaviour for consistent API support. ### Does this PR introduce _any_ user-facing change? Yes, as described above. It fixes a bug, ### How was this patch tested? Manually tested as described above, and unittests were added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42693 from HyukjinKwon/SPARK-44980. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5291c6c9274aaabd4851d70e4c1baad629e12cca) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 August 2023, 06:47:07 UTC
5cc990f [SPARK-44091][YARN][TESTS] Introduce `withResourceTypes` to `ResourceRequestTestHelper` to restore `resourceTypes` as default value after testing ### What changes were proposed in this pull request? This pr convert `ResourceRequestTestHelper` from `object` to `trait` and introduce a new function named `withResourceTypes` to `ResourceRequestTestHelper` to restore `resourceTypes` as default value after testing. ### Why are the changes needed? When test yarn module with command `build/sbt "yarn/test" -Pyarn -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest`, there will some test failed in `YarnClusterSuite` like ``` [info] YarnClusterSuite: [info] - run Spark in yarn-client mode *** FAILED *** (3 seconds, 125 milliseconds) [info] FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:238) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) [info] at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.checkResult(BaseYarnClusterSuite.scala:238) [info] at org.apache.spark.deploy.yarn.YarnClusterSuite.testBasicYarnApp(YarnClusterSuite.scala:350) [info] at org.apache.spark.deploy.yarn.YarnClusterSuite.$anonfun$new$1(YarnClusterSuite.scala:95) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.$anonfun$test$1(BaseYarnClusterSuite.scala:77) [info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127) [info] at org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282) [info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231) [info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230) [info] at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69) [info] at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) ... ``` and the following error logs will in `unit-tests.log`: ``` 23/06/20 16:56:38.056 IPC Server handler 10 on default port 49553 WARN Server: IPC Server handler 10 on default port 49553, call Call#3 Retry#0 org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 127.0.0.1:49561 org.apache.hadoop.yarn.exceptions.ResourceNotFoundException: The resource manager encountered a problem that should not occur under normal circumstances. 
Please report this error to the Hadoop community by opening a JIRA ticket at http://issues.apache.org/jira and including the following information: * Resource type requested: custom-resource-type-1 * Resource object: <memory:-1, vCores:-1> * The stack trace for this exception: java.lang.Exception at org.apache.hadoop.yarn.exceptions.ResourceNotFoundException.<init>(ResourceNotFoundException.java:47) at org.apache.hadoop.yarn.api.records.Resource.getResourceInformation(Resource.java:264) at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.setResourceInformation(ResourcePBImpl.java:213) at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:463) at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationResourceUsageReportPBImpl.convertToProtoFormat(ApplicationResourceUsageReportPBImpl.java:289) at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationResourceUsageReportPBImpl.mergeLocalToBuilder(ApplicationResourceUsageReportPBImpl.java:91) at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationResourceUsageReportPBImpl.mergeLocalToProto(ApplicationResourceUsageReportPBImpl.java:122) at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationResourceUsageReportPBImpl.getProto(ApplicationResourceUsageReportPBImpl.java:63) at org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:247) at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationReportPBImpl.convertToProtoFormat(ApplicationReportPBImpl.java:560) at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationReportPBImpl.setApplicationResourceUsageReport(ApplicationReportPBImpl.java:100) at org.apache.hadoop.yarn.server.utils.BuilderUtils.newApplicationReport(BuilderUtils.java:406) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:779) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:429) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:247) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:615) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589) at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026) ... 
``` then mvn test will also have similar failure results ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest ``` ``` YarnClusterSuite: 2023-06-20 09:47:45.531:INFO::ScalaTest-main-running-DiscoverySuite: Logging initialized 7792ms to org.eclipse.jetty.util.log.StdErrLog - run Spark in yarn-client mode *** FAILED *** FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:238) - run Spark in yarn-cluster mode *** FAILED *** FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:238) - run Spark in yarn-client mode with unmanaged am - run Spark in yarn-client mode with different configurations, ensuring redaction - run Spark in yarn-cluster mode with different configurations, ensuring redaction *** FAILED *** FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:238) - yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) *** FAILED *** FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:238) ``` The call to `ResourceUtils.reinitializeResources` will fill private static variable contents of `ResourceUtils`, like: ``` private static final Map<String, Integer> RESOURCE_NAME_TO_INDEX = new ConcurrentHashMap<String, Integer>(); private static volatile Map<String, ResourceInformation> resourceTypes; private static volatile Map<String, ResourceInformation> nonCountableResourceTypes; private static volatile ResourceInformation[] resourceTypesArray; private static volatile Map<String, ResourceInformation> readOnlyNodeResources; ``` and these static variable will not be cleaned up automatically, this may cause different test cases to misuse these shared variables and unexpectedly fail in certain specific scenarios, so this pr use the new function `withResourceTypes` to restore `resourceTypes` as default value after testing. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual checked ``` build/sbt "yarn/test" -Pyarn -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest ``` and ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest ``` can run successfully after this pr Closes #41673 from LuciferYang/SPARK-44091. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit de96a86d9dc3b32d87deb5a49a4a2d0f6add98a0) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 August 2023, 04:35:36 UTC
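A sketch of the loan pattern this helper introduces; the method names inside the trait are assumptions standing in for the real YARN test utilities, but the shape (register custom resource types, run the body, restore defaults in `finally`) matches the description above.
```scala
trait ResourceRequestTestHelper {
  // Stand-ins for the real helpers that call YARN's ResourceUtils.reinitializeResources.
  protected def registerResourceTypes(types: Seq[String]): Unit
  protected def restoreDefaultResourceTypes(): Unit

  /** Runs `body` with the given resource types registered, always restoring defaults. */
  final def withResourceTypes(types: Seq[String])(body: => Unit): Unit = {
    registerResourceTypes(types)
    try body finally restoreDefaultResourceTypes()
  }
}
```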
89a608b [SPARK-42944][PYTHON][FOLLOW-UP][3.5] Rename tests from foreachBatch to foreach_batch This PR cherry-picks https://github.com/apache/spark/pull/42675 to branch-3.5. --- ### What changes were proposed in this pull request? This PR proposes to rename tests from foreachBatch to foreach_batch. ### Why are the changes needed? Non-API names should follow the snake_case naming rule per PEP 8. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? CI in this PR should test it out. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42692 from HyukjinKwon/SPARK-42944-3.5. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 August 2023, 04:32:44 UTC
2f4a712 [SPARK-44897][SQL] Propagating local properties to subquery broadcast exec ### What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-32748 previously proposed propagating these local properties to the subquery broadcast exec threads but was then reverted since it was said that local properties would already be propagated to the broadcast threads. I believe this is not always true. In the scenario where a separate `BroadcastExchangeExec` is the first to compute the broadcast, this is fine. However, in the scenario where the `SubqueryBroadcastExec` is the first to compute the broadcast, then the local properties that are propagated to the broadcast threads would not have been propagated correctly. This is because the local properties from the subquery broadcast exec were not propagated to its Future thread. It is difficult to write a unit test that reproduces this behavior because usually `BroadcastExchangeExec` is the first computing the broadcast variable. However, by adding a `Thread.sleep(10)` to `SubqueryBroadcastExec.doPrepare` after `relationFuture` is initialized, the added test will consistently fail. ### Why are the changes needed? Local properties are not propagated correctly to `SubqueryBroadcastExec` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Following test can reproduce the bug and test the solution by adding sleep to `SubqueryBroadcastExec.doPrepare` ``` protected override def doPrepare(): Unit = { relationFuture Thread.sleep(10) } ``` ```test("SPARK-44897 propagate local properties to subquery broadcast execuction thread") { withSQLConf(StaticSQLConf.BROADCAST_EXCHANGE_MAX_THREAD_THRESHOLD.key -> "1") { withTable("a", "b") { val confKey = "spark.sql.y" val confValue1 = UUID.randomUUID().toString() val confValue2 = UUID.randomUUID().toString() Seq((confValue1, "1")).toDF("key", "value") .write .format("parquet") .partitionBy("key") .mode("overwrite") .saveAsTable("a") val df1 = spark.table("a") def generateBroadcastDataFrame(confKey: String, confValue: String): Dataset[String] = { val df = spark.range(1).mapPartitions { _ => Iterator(TaskContext.get.getLocalProperty(confKey)) }.filter($"value".contains(confValue)).as("c") df.hint("broadcast") } // set local property and assert val df2 = generateBroadcastDataFrame(confKey, confValue1) spark.sparkContext.setLocalProperty(confKey, confValue1) val checkDF = df1.join(df2).where($"a.key" === $"c.value").select($"a.key", $"c.value") val checks = checkDF.collect() assert(checks.forall(_.toSeq == Seq(confValue1, confValue1))) // change local property and re-assert Seq((confValue2, "1")).toDF("key", "value") .write .format("parquet") .partitionBy("key") .mode("overwrite") .saveAsTable("b") val df3 = spark.table("b") val df4 = generateBroadcastDataFrame(confKey, confValue2) spark.sparkContext.setLocalProperty(confKey, confValue2) val checks2DF = df3.join(df4).where($"b.key" === $"c.value").select($"b.key", $"c.value") val checks2 = checks2DF.collect() assert(checks2.forall(_.toSeq == Seq(confValue2, confValue2))) assert(checks2.nonEmpty) } } } ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #42587 from ChenMichael/SPARK-44897-local-property-propagation-to-subquery-broadcast-exec. 
Authored-by: Michael Chen <mike.chen@workday.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4a4856207d414ba88a8edabeb70e20765460ef1a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 August 2023, 02:57:11 UTC
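The fix in the SPARK-44897 entry above boils down to capturing the caller thread's local properties before scheduling the broadcast future and re-applying them on the worker thread. A minimal sketch of that capture-and-propagate pattern follows; the `LocalProps` store is a stand-in for SparkContext's local properties, not Spark's internal code.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Stand-in for thread-local "local properties".
object LocalProps {
  private val props = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map.empty
  }
  def set(k: String, v: String): Unit = props.set(props.get + (k -> v))
  def get(k: String): Option[String] = props.get.get(k)
  def snapshot(): Map[String, String] = props.get
  def restore(s: Map[String, String]): Unit = props.set(s)
}

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1))

LocalProps.set("spark.sql.y", "some-value")
val captured = LocalProps.snapshot()      // capture eagerly, on the caller thread

val relationFuture = Future {
  LocalProps.restore(captured)            // re-apply inside the worker thread
  LocalProps.get("spark.sql.y")           // now visible to the broadcast computation
}
```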
9b00d36 Preparing development version 3.5.1-SNAPSHOT 27 August 2023, 23:45:57 UTC
d5423e7 Preparing Spark release v3.5.0-rc3 27 August 2023, 23:45:52 UTC
fa2e53f [SPARK-44784][CONNECT] Make SBT testing hermetic ### What changes were proposed in this pull request? This PR makes a bunch of changes to connect testing for the scala client: - We do not start the connect server with the `SPARK_DIST_CLASSPATH` environment variable. This is set by the build system, but its value for SBT and Maven is different. For SBT it also contained the client code. - We use dependency upload to add the dependencies needed for the tests. Currently this entails: the compiled test classes (class files), scalatest jars, and scalactic jars. - The use of classfile sync unearthed an issue with stubbing and the `ExecutorClassLoader`. If they load classes in the same namespace then stubbing will generate stubs for classes that can be loaded by the `ExecutorClassLoader`. Since this is mostly a testing issue I decided to move the test code to a different namespace. We should definitely fix this later on. - A bunch of tiny fixes. ### Why are the changes needed? SBT testing for connect leaked client-side code into the server. This is a problem because tests pass and we sign off on features that do not work well in a normal environment. Stubbing was an example of this. Maven did not have this problem and was therefore more correct. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? These are mostly tests. ### Was this patch authored or co-authored using generative AI tooling? No. I write my own code thank you... Closes #42591 from hvanhovell/investigate-stubbing. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 9326615592eac14c7cab3dd126b3c21222b7778f) Signed-off-by: yangjie01 <yangjie01@baidu.com> 27 August 2023, 03:29:12 UTC
2a66771 [SPARK-44968][BUILD] Downgrade ivy from 2.5.2 to 2.5.1 ### What changes were proposed in this pull request? After upgrading Ivy from 2.5.1 to 2.5.2 in SPARK-44914, daily tests for Java 11 and Java 17 began to experience ABORTED in the `HiveExternalCatalogVersionsSuite` test. Java 11 - https://github.com/apache/spark/actions/runs/5953716283/job/16148657660 - https://github.com/apache/spark/actions/runs/5966131923/job/16185159550 Java 17 - https://github.com/apache/spark/actions/runs/5956925790/job/16158714165 - https://github.com/apache/spark/actions/runs/5969348559/job/16195073478 ``` 2023-08-23T23:00:49.6547573Z [info] 2023-08-23 16:00:48.209 - stdout> : java.lang.RuntimeException: problem during retrieve of org.apache.spark#spark-submit-parent-4c061f04-b951-4d06-8909-cde5452988d9: java.lang.RuntimeException: Multiple artifacts of the module log4j#log4j;1.2.17 are retrieved to the same file! Update the retrieve pattern to fix this error. 2023-08-23T23:00:49.6548745Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:238) 2023-08-23T23:00:49.6549572Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:89) 2023-08-23T23:00:49.6550334Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.ivy.Ivy.retrieve(Ivy.java:551) 2023-08-23T23:00:49.6551079Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1464) 2023-08-23T23:00:49.6552024Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:138) 2023-08-23T23:00:49.6552884Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) 2023-08-23T23:00:49.6553755Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:138) 2023-08-23T23:00:49.6554705Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:65) 2023-08-23T23:00:49.6555637Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:64) 2023-08-23T23:00:49.6556554Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:443) 2023-08-23T23:00:49.6557340Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:356) 2023-08-23T23:00:49.6558187Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:71) 2023-08-23T23:00:49.6559061Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:70) 2023-08-23T23:00:49.6559962Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:224) 2023-08-23T23:00:49.6560766Z [info] 2023-08-23 16:00:48.209 - stdout> at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) 2023-08-23T23:00:49.6561584Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) 2023-08-23T23:00:49.6562510Z [info] 2023-08-23 
16:00:48.209 - stdout> at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:224) 2023-08-23T23:00:49.6563435Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150) 2023-08-23T23:00:49.6564323Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140) 2023-08-23T23:00:49.6565340Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:45) 2023-08-23T23:00:49.6566321Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:60) 2023-08-23T23:00:49.6567363Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:118) 2023-08-23T23:00:49.6568372Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:118) 2023-08-23T23:00:49.6569393Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:490) 2023-08-23T23:00:49.6570685Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:155) 2023-08-23T23:00:49.6571842Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) 2023-08-23T23:00:49.6572932Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) 2023-08-23T23:00:49.6573996Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) 2023-08-23T23:00:49.6575045Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) 2023-08-23T23:00:49.6576066Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) 2023-08-23T23:00:49.6576937Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) 2023-08-23T23:00:49.6577807Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) 2023-08-23T23:00:49.6578620Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) 2023-08-23T23:00:49.6579432Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) 2023-08-23T23:00:49.6580357Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) 2023-08-23T23:00:49.6581331Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93) 2023-08-23T23:00:49.6582239Z [info] 2023-08-23 16:00:48.209 - stdout> at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) 2023-08-23T23:00:49.6583101Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) 2023-08-23T23:00:49.6584088Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) 2023-08-23T23:00:49.6585236Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) 2023-08-23T23:00:49.6586519Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) 2023-08-23T23:00:49.6587686Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) 2023-08-23T23:00:49.6588898Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) 2023-08-23T23:00:49.6590014Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30) 2023-08-23T23:00:49.6590993Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457) 2023-08-23T23:00:49.6591930Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:93) 2023-08-23T23:00:49.6592914Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:80) 2023-08-23T23:00:49.6593856Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:78) 2023-08-23T23:00:49.6594687Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:219) 2023-08-23T23:00:49.6595379Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99) 2023-08-23T23:00:49.6596103Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) 2023-08-23T23:00:49.6596807Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96) 2023-08-23T23:00:49.6597520Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618) 2023-08-23T23:00:49.6598276Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) 2023-08-23T23:00:49.6599022Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613) 2023-08-23T23:00:49.6599819Z [info] 2023-08-23 16:00:48.209 - stdout> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2023-08-23T23:00:49.6600723Z [info] 2023-08-23 16:00:48.209 - stdout> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) 2023-08-23T23:00:49.6601707Z [info] 2023-08-23 16:00:48.209 - stdout> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2023-08-23T23:00:49.6602513Z [info] 2023-08-23 16:00:48.209 - 
stdout> at java.base/java.lang.reflect.Method.invoke(Method.java:568) 2023-08-23T23:00:49.6603272Z [info] 2023-08-23 16:00:48.209 - stdout> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) 2023-08-23T23:00:49.6604007Z [info] 2023-08-23 16:00:48.209 - stdout> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) 2023-08-23T23:00:49.6604724Z [info] 2023-08-23 16:00:48.209 - stdout> at py4j.Gateway.invoke(Gateway.java:282) 2023-08-23T23:00:49.6605416Z [info] 2023-08-23 16:00:48.209 - stdout> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) 2023-08-23T23:00:49.6606209Z [info] 2023-08-23 16:00:48.209 - stdout> at py4j.commands.CallCommand.execute(CallCommand.java:79) 2023-08-23T23:00:49.6606969Z [info] 2023-08-23 16:00:48.209 - stdout> at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) 2023-08-23T23:00:49.6607743Z [info] 2023-08-23 16:00:48.209 - stdout> at py4j.ClientServerConnection.run(ClientServerConnection.java:106) 2023-08-23T23:00:49.6608415Z [info] 2023-08-23 16:00:48.209 - stdout> at java.base/java.lang.Thread.run(Thread.java:833) 2023-08-23T23:00:49.6609288Z [info] 2023-08-23 16:00:48.209 - stdout> Caused by: java.lang.RuntimeException: Multiple artifacts of the module log4j#log4j;1.2.17 are retrieved to the same file! Update the retrieve pattern to fix this error. 2023-08-23T23:00:49.6610288Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:426) 2023-08-23T23:00:49.6611332Z [info] 2023-08-23 16:00:48.209 - stdout> at org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:122) 2023-08-23T23:00:49.6612046Z [info] 2023-08-23 16:00:48.209 - stdout> ... 66 more 2023-08-23T23:00:49.6612498Z [info] 2023-08-23 16:00:48.209 - stdout> ``` So this pr downgrade ivy from 2.5.2 to 2.5.1 to restore Java 11/17 daily tests. ### Why are the changes needed? To restore Java 11/17 daily tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By changing the default Java version in `build_and_test.yml` to 17 for verification, the tests succeed after downgrading the Ivy to 2.5.1. - https://github.com/LuciferYang/spark/actions/runs/5972232677/job/16209970934 <img width="1116" alt="image" src="https://github.com/apache/spark/assets/1475305/cd4002d8-893d-4845-8b2e-c01ff3106f7f"> ### Was this patch authored or co-authored using generative AI tooling? No Closes #42668 from LuciferYang/test-java17. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 4f8a1991e793bba2a6620760b6ee2cdc8f3ff21d) Signed-off-by: yangjie01 <yangjie01@baidu.com> 26 August 2023, 09:31:21 UTC
074c92f [SPARK-44840][SQL][FOLLOWUP] Change the version from 3.5.0 to 3.4.2 for `spark.sql.legacy.negativeIndexInArrayInsert` ### What changes were proposed in this pull request? After the PR https://github.com/apache/spark/pull/42655, the earliest version in which the SQL config `spark.sql.legacy.negativeIndexInArrayInsert` appears is `3.4.2`. This PR updates the config's version according to the recent changes. ### Why are the changes needed? To avoid confusing users. The doc should contain accurate info about the earliest version. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By CI. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42681 from MaxGekk/fix-array_insert-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bf3bef1664f431d5e951b9f6682b3df57c6a0143) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 August 2023, 08:59:24 UTC
cf3d101 [SPARK-44930][SQL] Deterministic ApplyFunctionExpression should be foldable ### What changes were proposed in this pull request? Currently, ApplyFunctionExpression is unfoldable because it inherits the default value from Expression. However, a deterministic ApplyFunctionExpression should be foldable. ### Why are the changes needed? This helps optimize V2 UDFs applied to constant expressions. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42629 from ConeyLiu/constant-fold-v2-udf. Authored-by: xianyangliu <xianyangliu@tencent.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 994389f42a40d292a72482e3d76d29bada82d8ec) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 August 2023, 07:01:51 UTC
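The SPARK-44930 change above follows the usual Catalyst convention that a deterministic expression whose inputs are all foldable can itself be constant-folded. A simplified sketch of that convention (a toy model, not the actual ApplyFunctionExpression source):

```scala
// Minimal expression hierarchy to show the foldability rule.
trait Expr {
  def children: Seq[Expr]
  def deterministic: Boolean = children.forall(_.deterministic)
  def foldable: Boolean = false // Expression's conservative default
}

case class Literal(value: Any) extends Expr {
  def children: Seq[Expr] = Nil
  override def foldable: Boolean = true
}

case class ApplyFn(children: Seq[Expr], udfDeterministic: Boolean) extends Expr {
  override def deterministic: Boolean = udfDeterministic && children.forall(_.deterministic)
  // The gist of the change: foldable when deterministic and all inputs are foldable,
  // which lets the optimizer replace the call with its computed constant.
  override def foldable: Boolean = deterministic && children.forall(_.foldable)
}
```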
e986578 [SPARK-44957][PYTHON][SQL][TESTS] Make PySpark (pyspark-sql module) tests passing without any dependency ### What changes were proposed in this pull request? This PR proposes to fix the tests to properly run or skip when there aren't optional dependencies installed. ### Why are the changes needed? Currently, it fails as below: ``` Running PySpark tests. Output is in /.../spark/python/unit-tests.log Will test against the following Python executables: ['python3'] Will test the following Python modules: ['pyspark-sql'] python3 python_implementation is CPython python3 version is: Python 3.10.12 Starting test(python3): pyspark.sql.tests.pandas.test_pandas_grouped_map_with_state (temp output: /.../spark/python/target/8e530108-4d5e-46e4-88fb-8f0dfb7b47e2/python3__pyspark.sql.tests.pandas.test_pandas_grouped_map_with_state__jggatex7.log) Starting test(python3): pyspark.sql.tests.pandas.test_pandas_grouped_map (temp output: /.../spark/python/target/3b6e9e5a-c479-408c-9365-8286330e8e7c/python3__pyspark.sql.tests.pandas.test_pandas_grouped_map__1lrovmur.log) Starting test(python3): pyspark.sql.tests.pandas.test_pandas_cogrouped_map (temp output: /.../spark/python/target/68c7cf56-ed7a-453e-8d6d-3a0eb519d997/python3__pyspark.sql.tests.pandas.test_pandas_cogrouped_map__sw2875dr.log) Starting test(python3): pyspark.sql.tests.pandas.test_pandas_map (temp output: /.../spark/python/target/90712186-a104-4491-ae0d-2b5ab973991b/python3__pyspark.sql.tests.pandas.test_pandas_map__ysp4911q.log) Traceback (most recent call last): File "/.../miniconda3/envs/vanilla-3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/.../miniconda3/envs/vanilla-3.10/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/.../workspace/forked/spark/python/pyspark/sql/tests/pandas/test_pandas_map.py", line 27, in <module> from pyspark.testing.sqlutils import ( File "/.../workspace/forked/spark/python/pyspark/testing/__init__.py", line 19, in <module> from pyspark.testing.pandasutils import assertPandasOnSparkEqual File "/.../workspace/forked/spark/python/pyspark/testing/pandasutils.py", line 22, in <module> import pandas as pd ModuleNotFoundError: No module named 'pandas' ``` PySpark tests should pass without optional dependencies. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually ran as described above. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42670 from HyukjinKwon/SPARK-44957. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit fb45476d58c7936518cea1b9510145ecd5ec6fd1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 August 2023, 06:39:47 UTC
e8405a7 [SPARK-44547][CORE] Ignore fallback storage for cached RDD migration ### What changes were proposed in this pull request? Fix a bug that makes the RDD decommissioner never finish ### Why are the changes needed? The cached RDD decommissioner gets stuck in an endless retry loop when the only viable peer is the fallback storage, which it doesn't know how to handle. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests were added and verified using Spark jobs. Closes #42155 from ukby1234/franky.SPARK-44547. Authored-by: Frank Yin <franky@ziprecruiter.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 47555da2ae292b07488ba181db1aceac8e7ddb3a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 August 2023, 05:19:48 UTC
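A sketch of the peer-filtering idea behind SPARK-44547: exclude the fallback-storage endpoint from the migration candidates so the decommissioner can stop when no real peer exists. The `fallback` marker and type names below are placeholders for illustration, not necessarily Spark's actual constants.

```scala
// Placeholder model of a block manager peer.
case class BlockManagerPeer(host: String, executorId: String)

val FallbackExecutorId = "fallback" // illustrative marker for the fallback storage

// Cached RDD blocks can only migrate to real executors, so drop the fallback entry.
def migrationCandidates(peers: Seq[BlockManagerPeer]): Seq[BlockManagerPeer] =
  peers.filterNot(_.executorId == FallbackExecutorId)

// If the fallback storage is the only "peer", there is nowhere to migrate cached
// blocks to, so the decommissioner should finish instead of retrying forever.
def shouldKeepRetrying(peers: Seq[BlockManagerPeer]): Boolean =
  migrationCandidates(peers).nonEmpty
```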
bd076eb [SPARK-44822][PYTHON][FOLLOW-UP] Make Python UDTFs by default non-deterministic ### What changes were proposed in this pull request? This PR is a follow-up for SPARK-44822. It modifies one more default value so that Python UDTFs are non-deterministic by default. ### Why are the changes needed? To prevent future issues. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42665 from allisonwang-db/spark-44822-follow-up. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 492064dd95eaa50bd30c363c97d3a703fd39c872) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 August 2023, 02:32:32 UTC
79c466c [SPARK-44820][DOCS] Switch languages consistently across docs for all code snippets ### What changes were proposed in this pull request? This PR fixes a bug so that languages are switched consistently across docs for all code snippets. ### Why are the changes needed? When a user chooses a different language for a code snippet, all code snippets on that page should switch to the chosen language. This was the behavior in, for example, the Spark 2.0 docs: https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html But it was broken in later docs, for example the Spark 3.4.1 docs: https://spark.apache.org/docs/latest/quick-start.html We should fix this behavior change and possibly add test cases to prevent future regressions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test: ``` cd docs SKIP_API=1 bundle exec jekyll serve --watch ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42657 from panbingkun/SPARK-44820. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 4748d858b4478ea7503b792050d4735eae83b3cd) Signed-off-by: Gengliang Wang <gengliang@apache.org> 25 August 2023, 00:48:31 UTC
6c2da61 Revert "[SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener" This reverts commit 311a497224db4a00b5cdf928cba8ef30545ee911. 24 August 2023, 17:35:20 UTC
66af3b7 [SPARK-44941][SQL][TESTS] Turn off hive.conf.validation in tests ### What changes were proposed in this pull request? This PR turns off hive.conf.validation in tests to remove the noisy HiveConf logs for removed Hive ConfVars. For example, a single test in hive.SQLQuerySuite repeats four times. ```java [info] SQLQuerySuite: 14:49:45.855 WARN org.apache.spark.sql.internal.WithTestConf$$anon$4: The SQL config 'spark.sql.hive.convertCTAS' has been deprecated in Spark v3.1 and may be removed in the future. Set 'spark.sql.legacy.createHiveTableByDefault' to false instead. 14:49:45.983 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 14:49:45.983 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist 14:49:47.651 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0 14:49:47.651 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore hzyaoqin127.0.0.1 14:49:47.656 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException 14:49:47.922 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 14:49:47.998 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 14:49:47.998 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 14:49:47.998 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist 14:49:48.003 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: file:/Users/hzyaoqin/spark/sql/hive/target/tmp/hive_execution_test_group/warehouse-cd8178a1-6b68-426f-aa19-d47d211df0c9/explodetest specified for non-external table:explodetest 14:49:49.264 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException 14:49:49.408 WARN org.apache.spark.sql.internal.SQLConf: The SQL config 'spark.sql.hive.convertCTAS' has been deprecated in Spark v3.1 and may be removed in the future. Set 'spark.sql.legacy.createHiveTableByDefault' to false instead. 14:49:49.808 WARN org.apache.spark.sql.internal.WithTestConf$$anon$4: The SQL config 'spark.sql.hive.convertCTAS' has been deprecated in Spark v3.1 and may be removed in the future. Set 'spark.sql.legacy.createHiveTableByDefault' to false instead. [info] - logical.Project should not be resolved if it contains aggregates or generators (6 seconds, 33 milliseconds) 14:49:49.997 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 14:49:49.997 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 14:49:49.997 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist 14:49:50.004 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 
14:49:50.113 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 14:49:50.113 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 14:49:50.113 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist ``` ### Why are the changes needed? The matcher for `org.apache.hadoop.hive.conf.HiveConf: HiveConf of name * does not exist` exceeds 5000 lines. It can be saved. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? run a simple hive test locally and see the log gone. ``` [info] SQLQuerySuite: 14:57:13.274 WARN org.apache.spark.sql.internal.WithTestConf$$anon$4: The SQL config 'spark.sql.hive.convertCTAS' has been deprecated in Spark v3.1 and may be removed in the future. Set 'spark.sql.legacy.createHiveTableByDefault' to false instead. 14:57:15.155 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0 14:57:15.155 WARN org.apache.hadoop.hive.metastore.ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore hzyaoqin127.0.0.1 14:57:15.162 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException 14:57:15.419 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 14:57:15.499 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: file:/Users/hzyaoqin/spark/sql/hive/target/tmp/hive_execution_test_group/warehouse-06b4c2ad-728d-4460-ada3-bc1f80b7be0b/explodetest specified for non-external table:explodetest 14:57:16.664 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException 14:57:16.809 WARN org.apache.spark.sql.internal.SQLConf: The SQL config 'spark.sql.hive.convertCTAS' has been deprecated in Spark v3.1 and may be removed in the future. Set 'spark.sql.legacy.createHiveTableByDefault' to false instead. 14:57:17.199 WARN org.apache.spark.sql.internal.WithTestConf$$anon$4: The SQL config 'spark.sql.hive.convertCTAS' has been deprecated in Spark v3.1 and may be removed in the future. Set 'spark.sql.legacy.createHiveTableByDefault' to false instead. [info] - logical.Project should not be resolved if it contains aggregates or generators (5 seconds, 953 milliseconds) 14:57:17.393 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. [info] Run completed in 9 seconds, 203 milliseconds. [info] Total number of tests run: 1 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 140 s (02:20), completed 2023-8-24 14:57:17 ``` ### Was this patch authored or co-authored using generative AI tooling? no Closes #42647 from yaooqinn/SPARK-44941. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit c6d4912341fdd70a145c5cf0d7fd1bfc36ede092) Signed-off-by: Kent Yao <yao@apache.org> 24 August 2023, 15:23:16 UTC
f90a4f8 [SPARK-44934][SQL] Use outputSet instead of output to check if column pruning occurred in PushdownPredicateAndPruneColumnsForCTEDef ### What changes were proposed in this pull request? Originally, when a CTE has duplicate expression IDs in its output, the rule PushdownPredicatesAndPruneColumnsForCTEDef wrongly assesses that the columns in the CTE were pruned, as it compares the size of the attribute set containing the union of columns (which is unique) and the original output of the CTE (which contains duplicate columns) and notices that the former is less than the latter. This causes incorrect pruning of the CTE output, resulting in a missing reference and causing the error as documented in the ticket. This PR changes the logic to use the needsPruning function to assess whether a CTE has been pruned, which uses the outputSet to check if any columns has been pruned instead of the output. ### Why are the changes needed? The incorrect behaviour of PushdownPredicatesAndPruneColumnsForCTEDef in CTEs with duplicate expression IDs in its output causes a crash when such a query is run. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test for the crashing case was added. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42635 from wenyuen-db/SPARK-44934. Authored-by: Wen Yuen Pang <wenyuen.pang@databricks.com> Signed-off-by: Peter Toth <peter.toth@gmail.com> (cherry picked from commit 3b405948ee47702e5a7250dc27430836145b0e19) Signed-off-by: Peter Toth <peter.toth@gmail.com> 24 August 2023, 15:01:45 UTC
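The reasoning in the SPARK-44934 entry can be illustrated with a toy model of duplicate expression IDs: a size comparison against the raw output falsely reports pruning, while comparing against the deduplicated output set does not. This is illustrative only, not Catalyst's AttributeSet implementation.

```scala
// Toy attribute with an expression ID, as a stand-in for Catalyst attributes.
case class Attr(name: String, exprId: Long)

val cteOutput  = Seq(Attr("c", 1L), Attr("c", 1L)) // CTE exposes the same attribute twice
val referenced = Set(Attr("c", 1L))                // everything is still referenced

// Buggy check: compares against the output Seq (with duplicates); 1 < 2 looks "pruned"
// and triggers an incorrect rewrite of the CTE definition.
val looksPrunedBuggy = referenced.size < cteOutput.size          // true

// Fixed check: compare against the deduplicated output set instead.
val looksPrunedFixed = referenced.size < cteOutput.toSet.size    // false
```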
311a497 [SPARK-44435][SS][CONNECT] Tests for foreachBatch and Listener ### What changes were proposed in this pull request? Add several new test cases for streaming foreachBatch and streaming query listener events to test various scenarios. ### Why are the changes needed? More tests are better. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test-only change. Closes #42521 from WweiL/SPARK-44435-tests-foreachBatch-listener. Authored-by: Wei Liu <wei.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 2d44848f12cf818a0fe54fb03075cd9cca485ecb) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 August 2023, 10:18:52 UTC
37a6d87 [SPARK-44121][CONNECT][TESTS] Renable Arrow-based connect tests in Java 21 ### What changes were proposed in this pull request? This PR aims to re-enable Arrow-based connect tests in Java 21. This depends on #42181. ### Why are the changes needed? To have Java 21 test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` $ java -version openjdk version "21-ea" 2023-09-19 OpenJDK Runtime Environment (build 21-ea+32-2482) OpenJDK 64-Bit Server VM (build 21-ea+32-2482, mixed mode, sharing) $ build/sbt "connect/test" -Phive ... [info] Run completed in 14 seconds, 136 milliseconds. [info] Total number of tests run: 858 [info] Suites: completed 20, aborted 0 [info] Tests: succeeded 858, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 44 s, completed Aug 23, 2023, 9:42:53 PM $ build/sbt "connect-client-jvm/test" -Phive ... [info] Run completed in 1 minute, 24 seconds. [info] Total number of tests run: 1220 [info] Suites: completed 24, aborted 0 [info] Tests: succeeded 1220, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [info] Passed: Total 1222, Failed 0, Errors 0, Passed 1222 ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42643 from dongjoon-hyun/SPARK-44121. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a824a6de89fdd2ecc119a9bb48bca64da5db72bd) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 August 2023, 07:53:42 UTC
a9db96e [SPARK-44928][PYTHON][DOCS][3.5] Replace the module alias 'sf' instead of 'F' in pyspark.sql import functions ### What changes were proposed in this pull request? Cherry-pick of https://github.com/apache/spark/pull/42628 for 3.5. ### Why are the changes needed? For better documentation. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #42640 from zhengruifeng/replace_F_35. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 24 August 2023, 06:57:55 UTC
af5a24c [SPARK-43943][FOLLOWUP] Correct a function alias ### What changes were proposed in this pull request? Correct a function alias ### Why are the changes needed? it should be `sign` ### Does this PR introduce _any_ user-facing change? actually no, since `pyspark.sql.connect.function` shares the same namespace with `pyspark.sql.function` also manually check (before this PR) ``` Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 4.0.0.dev0 /_/ Using Python version 3.10.11 (main, May 17 2023 14:30:36) Client connected to the Spark Connect server at localhost SparkSession available as 'spark'. In [1]: from pyspark.sql import functions as sf In [2]: sf.sign Out[2]: <function pyspark.sql.functions.signum(col: 'ColumnOrName') -> pyspark.sql.column.Column> In [3]: sf.sigh --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Cell In[3], line 1 ----> 1 sf.sigh AttributeError: module 'pyspark.sql.functions' has no attribute 'sigh' ``` ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? NO Closes #42642 from zhengruifeng/spark_43943_followup. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit c660eb367c0b1447230025bb9165a1bbc00b6fc3) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 24 August 2023, 06:56:17 UTC
e07e291 [SPARK-44929][TESTS] Standardize log output for console appender in tests ### What changes were proposed in this pull request? This PR set a character length limit for the error message and a stack depth limit for error stack traces to the console appender in tests. The original patterns are - %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex - %t: %m%n%ex And they're adjusted to the new consistent pattern - `%d{HH:mm:ss.SSS} %p %c: %maxLen{%m}{512}%n%ex{8}%n` ### Why are the changes needed? In testing, intentional and unintentional failures are created to generate extensive log volumes. For instance, a single FileNotFound error may be logged multiple times in the writer, task runner, task set manager, and other areas, resulting in thousands of lines per failure. For example, tests in ParquetRebaseDatetimeSuite will be run with V1 and V2 Datasource, two or more specific values, and multiple configuration pairs. I have seen the SparkUpgradeException all over the CI logs ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ``` build/sbt "sql/testOnly *ParquetRebaseDatetimeV1Suite" ``` ``` 15:59:55.446 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: Job job_202308230059551630377040190578321_1301 aborted. 15:59:55.446 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1301.0 (TID 1595) org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1. at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420) at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) 15:59:55.446 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 1301.0 (TID 1595) (10.221.97.38 executor driver): org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1. at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420) at org.apache.spark.sql.execution.datasources.... 15:59:55.446 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 1301.0 failed 1 times; aborting job 15:59:55.447 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: Aborting job 0ead031e-c9dd-446b-b20b-c76ec54978b1. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1301.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1301.0 (TID 1595) (10.221.97.38 executor driver): org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/Users/hzyaoqin/spark/target/tmp/spark-67cce58e-dfb2-4811-a9c0-50ec4c90d1f1. 
at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:765) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:420) at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) 15:59:55.579 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1303.0 (TID 1597) ``` ### Was this patch authored or co-authored using generative AI tooling? no Closes #42627 from yaooqinn/SPARK-44929. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 830500150f7e3972d1fa5b47d0ab564bfa7e4b12) Signed-off-by: Kent Yao <yao@apache.org> 24 August 2023, 05:51:41 UTC
e1b7a26 [SPARK-44750][PYTHON][CONNECT] Apply configuration to sparksession during creation ### What changes were proposed in this pull request? `SparkSession.Builder` now applies configuration options to the created `SparkSession`. ### Why are the changes needed? It is reasonable to expect PySpark connect `SparkSession.Builder` to behave in the same way as other `SparkSession.Builder`s in Spark Connect. The `SparkSession.Builder` should apply the provided configuration options to the created `SparkSession`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests were added to verify that configuration options were applied to the `SparkSession`. Closes #42548 from michaelzhan-db/SPARK-44750. Lead-authored-by: Michael Zhang <m.zhang@databricks.com> Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit c2e3171f3d3887302227edc39ee124bd61561b7d) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 24 August 2023, 00:37:18 UTC
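For reference, the behavior being aligned in SPARK-44750 is the one the classic Scala builder already has: options passed to the builder are visible on the created session's runtime conf. A minimal (non-Connect) sketch, assuming a local master:

```scala
import org.apache.spark.sql.SparkSession

// Options supplied at build time should be readable from the created session.
val spark = SparkSession.builder()
  .master("local[1]")
  .config("spark.sql.shuffle.partitions", "4")
  .getOrCreate()

assert(spark.conf.get("spark.sql.shuffle.partitions") == "4")
spark.stop()
```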
2785207 [SPARK-44935][K8S] Fix `RELEASE` file to have the correct information in Docker images if exists ### What changes were proposed in this pull request? This PR aims to fix `RELEASE` file to have the correct information in Docker images if `RELEASE` file exists. Please note that `RELEASE` file doesn't exists in SPARK_HOME directory when we run the K8s integration test from Spark Git repository. So, we keep the following empty `RELEASE` file generation and use `COPY` conditionally via glob syntax. https://github.com/apache/spark/blob/2a3aec1f9040e08999a2df88f92340cd2710e552/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile#L37 ### Why are the changes needed? Currently, it's an empty file in the official Apache Spark Docker images. ``` $ docker run -it --rm apache/spark:latest ls -al /opt/spark/RELEASE -rw-r--r-- 1 spark spark 0 Jun 25 03:13 /opt/spark/RELEASE $ docker run -it --rm apache/spark:v3.1.3 ls -al /opt/spark/RELEASE | tail -n1 -rw-r--r-- 1 root root 0 Feb 21 2022 /opt/spark/RELEASE ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually build image and check it with `docker run -it --rm NEW_IMAGE ls -al /opt/spark/RELEASE` I copied this `Dockerfile` into Apache Spark 3.5.0 RC2 binary distribution and tested in the following way. ``` $ cd spark-3.5.0-rc2-bin-hadoop3 $ cp /tmp/Dockerfile kubernetes/dockerfiles/spark/Dockerfile $ bin/docker-image-tool.sh -t SPARK-44935 build $ docker run -it --rm docker.io/library/spark:SPARK-44935 ls -al /opt/spark/RELEASE | tail -n1 -rw-r--r-- 1 root root 165 Aug 18 21:10 /opt/spark/RELEASE $ docker run -it --rm docker.io/library/spark:SPARK-44935 cat /opt/spark/RELEASE | tail -n2 Spark 3.5.0 (git revision 010c4a6a05) built for Hadoop 3.3.4 Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 -Phadoop-3 -Phive -Phive-thriftserver ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42636 from dongjoon-hyun/SPARK-44935. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d382c6b3aef28bde6adcdf62b7be565ff1152942) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 August 2023, 23:01:03 UTC
2d7e05d [SPARK-44816][CONNECT] Improve error message when UDF class is not found ### What changes were proposed in this pull request? Improve the error messaging on the connect client when using a UDF whose corresponding class has not been sync'ed with the spark connect service. Prior to this change, the client receives a cryptic error: ``` Exception in thread "main" org.apache.spark.SparkException: Main$ ``` With this change, the message is improved to be: ``` Exception in thread "main" org.apache.spark.SparkException: Failed to load class: Main$. Make sure the artifact where the class is defined is installed by calling session.addArtifact. ``` ### Why are the changes needed? This change makes it clear to the user on what the error is. ### Does this PR introduce _any_ user-facing change? Yes. The error message is improved. See details above. ### How was this patch tested? Manually by running a connect server and client. Closes #42500 from nija-at/improve-error. Authored-by: Niranjan Jayakar <nija@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 2d0a0a00cb5dde6bcb8e561278357b6bb8b76dcc) Signed-off-by: Herman van Hovell <herman@databricks.com> 23 August 2023, 16:42:15 UTC
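The improvement in SPARK-44816 amounts to wrapping the bare ClassNotFoundException with an actionable message. A rough sketch with illustrative names (not the actual connect server internals):

```scala
// Translate a bare ClassNotFoundException into a message that tells the user
// how to fix it (by uploading the artifact that defines the class).
def loadUdfClass(loader: ClassLoader, className: String): Class[_] =
  try loader.loadClass(className)
  catch {
    case e: ClassNotFoundException =>
      throw new RuntimeException(
        s"Failed to load class: $className. Make sure the artifact where the class " +
          "is defined is installed by calling session.addArtifact.", e)
  }
```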
a7941f1 [SPARK-44861][CONNECT] jsonignore SparkListenerConnectOperationStarted.planRequest ### What changes were proposed in this pull request? Add `JsonIgnore` to `SparkListenerConnectOperationStarted.planRequest` ### Why are the changes needed? `SparkListenerConnectOperationStarted` was added as part of [SPARK-43923](https://issues.apache.org/jira/browse/SPARK-43923). `SparkListenerConnectOperationStarted.planRequest` cannot be serialized & deserialized from json as it has recursive objects which causes failures when attempting these operations. ``` com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Direct self-reference leading to cycle (through reference chain: org.apache.spark.sql.connect.service.SparkListenerConnectOperationStarted["planRequest"]->org.apache.spark.connect.proto.ExecutePlanRequest["unknownFields"]->grpc_shaded.com.google.protobuf.UnknownFieldSet["defaultInstanceForType"]) at com.fasterxml.jackson.databind.exc.InvalidDefinitionException.from(InvalidDefinitionException.java:77) at com.fasterxml.jackson.databind.SerializerProvider.reportBadDefinition(SerializerProvider.java:1308) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit Closes #42550 from jdesjean/SPARK-44861. Authored-by: jdesjean <jf.gauthier@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit dd6cda5b614b4ede418afb4c5b1fdeea9613d32c) Signed-off-by: Herman van Hovell <herman@databricks.com> 23 August 2023, 16:40:25 UTC
aee36e0 [SPARK-44914][BUILD] Upgrade `Apache ivy` from 2.5.1 to 2.5.2 ### What changes were proposed in this pull request? Upgrade Apache ivy from 2.5.1 to 2.5.2 [Release notes](https://lists.apache.org/thread/9gcz4xrsn8c7o9gb377xfzvkb8jltffr) ### Why are the changes needed? [CVE-2022-46751](https://www.cve.org/CVERecord?id=CVE-2022-46751) The fix https://github.com/apache/ant-ivy/commit/2be17bc18b0e1d4123007d579e43ba1a4b6fab3d ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42613 from bjornjorgensen/ivy-2.5.2. Authored-by: Bjørn Jørgensen <bjornjorgensen@gmail.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 611e17e89260cd8d2b12edfc060f31a73773fa02) Signed-off-by: yangjie01 <yangjie01@baidu.com> 23 August 2023, 12:58:26 UTC
40ccabf [SPARK-44908][ML][CONNECT] Fix cross validator foldCol param functionality ### What changes were proposed in this pull request? Fix cross validator foldCol param functionality. In the main branch the code calls `df.rdd` APIs, which are not supported in Spark Connect. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42605 from WeichenXu123/fix-tuning-connect-foldCol. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> (cherry picked from commit 0d1b5975b2d308c616312d53b9f7ad754348a266) Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 23 August 2023, 10:19:32 UTC
4f61662 [SPARK-44909][ML] Skip starting torch distributor log streaming server when it is not available ### What changes were proposed in this pull request? Skip starting torch distributor log streaming server when it is not available. In some cases, e.g., in a Databricks connect cluster, there is a network limitation that causes the log streaming server to fail to start, but this does not need to break the torch distributor training routine. In this PR, the exception raised from the log server's `start` method is captured, and the server port is set to -1 if `start` failed. ### Why are the changes needed? In some cases, e.g., in a Databricks connect cluster, there is a network limitation that causes the log streaming server to fail to start, but this does not need to break the torch distributor training routine. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42606 from WeichenXu123/fix-torch-log-server-in-connect-mode. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> (cherry picked from commit 80668dc1a36ac0def80f3c18f981fbdacfb2904d) Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 23 August 2023, 07:31:18 UTC
aea9421 [SPARK-44921][SQL] Remove SqlBaseLexer.tokens from codebase ### What changes were proposed in this pull request? https://github.com/apache/spark/commit/8ff6b7a04cbaef9c552789ad5550ceab760cb078#diff-f4df4ce19570230091c3b2432e3c84cd2db7059c7b2a03213d272094bd940454 refactored the ANTLR4 files to `sql/api` but checked in `SqlBaseLexer.tokens`. This file is generated, so we do not need to check it in. ### Why are the changes needed? Remove a file that does not need to be checked in. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42620 from amaliujia/remove_checked_in_token_file. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 3d83978fc4d2d11b7477c5b0e3a164a37ca23d01) Signed-off-by: Kent Yao <yao@apache.org> 23 August 2023, 05:59:00 UTC
2b935a9 [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient() ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TransportClientFactory.createClient()` (executed when the connection creation timeout is non-positive). This PR replaces that call with an interruptible `await()` call. Note that the other pre-existing branches in this method were already using `await()`. ### Why are the changes needed? Uninterruptible waiting can cause problems when cancelling tasks. For details, see https://github.com/apache/spark/pull/16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42619 from JoshRosen/remove-awaitUninterruptibly. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 2137606a6686b33bed57800c6b166059b134a089) Signed-off-by: Kent Yao <yao@apache.org> 23 August 2023, 05:54:35 UTC
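The motivation in SPARK-44920 can be shown with a conceptual example (not the Netty or Spark code itself): when a task is cancelled its thread is interrupted, and only an interruptible wait actually stops blocking, while an uninterruptible wait swallows the interrupt and keeps waiting.

```scala
import java.util.concurrent.CountDownLatch

val connected = new CountDownLatch(1) // stands in for "connection established"

// Interruptible: throws InterruptedException when the waiting task is cancelled,
// so the caller can abort promptly.
def waitInterruptibly(): Unit =
  connected.await()

// Uninterruptible: the interrupt is swallowed and the thread keeps blocking
// until the latch is released; cancellation cannot unblock it.
def waitUninterruptibly(): Unit = {
  var interrupted = false
  var done = false
  while (!done) {
    try { connected.await(); done = true }
    catch { case _: InterruptedException => interrupted = true }
  }
  if (interrupted) Thread.currentThread().interrupt() // restore the flag afterwards
}
```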
786a218 [SPARK-44925][K8S] K8s default service token file should not be materialized into token ### What changes were proposed in this pull request? This PR aims to stop materializing `OAuth token` from the default service token file, `/var/run/secrets/kubernetes.io/serviceaccount/token`, because the content of volumes varies which means being renewed or expired by K8s control plane. We need to read the content in a on-demand manner to be in the up-to-date status. Note the followings: - Since we use `autoConfigure` for K8s client, K8s client still uses the default service tokens if exists and needed. https://github.com/apache/spark/blob/13588c10cbc380ecba1231223425eaad2eb9ec80/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/SparkKubernetesClientFactory.scala#L91 - This PR doesn't change Spark's behavior for the user-provided token file location. Spark will load the content of the user-provided token file locations to get `OAuth token` because Spark cannot assume that the files of that locations are refreshed or not in the future. ### Why are the changes needed? [BoundServiceAccountTokenVolume](https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume) became `Stable` at K8s 1.22. - [KEP-1205 Bound Service Account Tokens](https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md#boundserviceaccounttokenvolume-1) : **BoundServiceAccountTokenVolume** Alpha | Beta | GA -- | -- | -- 1.13 | 1.21 | 1.22 - [EKS Service Account with 90 Days Expiration](https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html) > For Amazon EKS clusters, the extended expiry period is 90 days. Your Amazon EKS cluster's Kubernetes API server rejects requests with tokens that are greater than 90 days old. - As of today, [all supported EKS clusters are from 1.23 to 1.27](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html) which means we always use `BoundServiceAccountTokenVolume`. ### Does this PR introduce _any_ user-facing change? No. This fixes only the bugs caused by some outdated tokens where K8s control plane denies Spark's K8s API invocation. ### How was this patch tested? Pass the CIs with the all existing unit tests and integration tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42624 from dongjoon-hyun/SPARK-44925. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 7b1a3494107b304a93f571920fc3816cde71f706) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 August 2023, 05:46:02 UTC
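The on-demand read described in SPARK-44925 can be sketched as follows; the helper name is illustrative and the default path is the one quoted in the entry above. Reading the file at call time ensures a token rotated or expired by the control plane is always picked up.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

val DefaultTokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// Read the current token each time it is needed, instead of materializing it once
// at startup and reusing a possibly stale value.
def currentServiceAccountToken(path: String = DefaultTokenPath): Option[String] = {
  val p = Paths.get(path)
  if (Files.exists(p)) Some(new String(Files.readAllBytes(p), StandardCharsets.UTF_8).trim)
  else None
}
```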
5ffefd0 [SPARK-44922][TESTS] Disable o.a.p.h.InternalParquetRecordWriter logs for tests ### What changes were proposed in this pull request? This PR disable InternalParquetRecordWriter logs for SQL tests ### Why are the changes needed? The InternalParquetRecordWriter creates over 1,800 records, which equates to more than 80,000 lines of code. This accounts for 80% of the volume of the "slow tests" before the GitHub action truncates it. #### A record for example ``` 2023-08-22T13:38:14.5103112Z 13:38:14.406 WARN org.apache.parquet.hadoop.InternalParquetRecordWriter: Too much memory used: Store { 2023-08-22T13:38:14.5103259Z [dummy_col] optional int32 dummy_col { 2023-08-22T13:38:14.5103352Z r:0 bytes 2023-08-22T13:38:14.5103444Z d:0 bytes 2023-08-22T13:38:14.5103580Z data: FallbackValuesWriter{ 2023-08-22T13:38:14.5103739Z data: initial: DictionaryValuesWriter{ 2023-08-22T13:38:14.5103854Z data: initial: dict:8 2023-08-22T13:38:14.5103976Z data: initial: values:400 2023-08-22T13:38:14.5104079Z data: initial:} 2023-08-22T13:38:14.5104087Z 2023-08-22T13:38:14.5104325Z data: fallback: PLAIN CapacityByteArrayOutputStream 0 slabs, 0 bytes 2023-08-22T13:38:14.5104418Z data:} 2023-08-22T13:38:14.5104426Z 2023-08-22T13:38:14.5104695Z pages: ColumnChunkPageWriter ConcatenatingByteArrayCollector 0 slabs, 0 bytes 2023-08-22T13:38:14.5104795Z total: 400/408 2023-08-22T13:38:14.5104936Z } 2023-08-22T13:38:14.5105129Z [expected_rowIdx_col] optional int32 expected_rowIdx_col { 2023-08-22T13:38:14.5105291Z r:0 bytes 2023-08-22T13:38:14.5105383Z d:0 bytes 2023-08-22T13:38:14.5105518Z data: FallbackValuesWriter{ 2023-08-22T13:38:14.5105679Z data: initial: DictionaryValuesWriter{ 2023-08-22T13:38:14.5105797Z data: initial: dict:400 2023-08-22T13:38:14.5105919Z data: initial: values:400 2023-08-22T13:38:14.5106022Z data: initial:} 2023-08-22T13:38:14.5106030Z 2023-08-22T13:38:14.5106267Z data: fallback: PLAIN CapacityByteArrayOutputStream 0 slabs, 0 bytes 2023-08-22T13:38:14.5106356Z data:} 2023-08-22T13:38:14.5106364Z 2023-08-22T13:38:14.5106636Z pages: ColumnChunkPageWriter ConcatenatingByteArrayCollector 0 slabs, 0 bytes 2023-08-22T13:38:14.5106736Z total: 400/800 2023-08-22T13:38:14.5106820Z } 2023-08-22T13:38:14.5106942Z [id] required int64 id { 2023-08-22T13:38:14.5107037Z r:0 bytes 2023-08-22T13:38:14.5107133Z d:0 bytes 2023-08-22T13:38:14.5107275Z data: FallbackValuesWriter{ 2023-08-22T13:38:14.5107436Z data: initial: DictionaryValuesWriter{ 2023-08-22T13:38:14.5107557Z data: initial: dict:800 2023-08-22T13:38:14.5107677Z data: initial: values:400 2023-08-22T13:38:14.5107779Z data: initial:} 2023-08-22T13:38:14.5107787Z 2023-08-22T13:38:14.5108022Z data: fallback: PLAIN CapacityByteArrayOutputStream 0 slabs, 0 bytes 2023-08-22T13:38:14.5108113Z data:} 2023-08-22T13:38:14.5108121Z 2023-08-22T13:38:14.5108384Z pages: ColumnChunkPageWriter ConcatenatingByteArrayCollector 0 slabs, 0 bytes 2023-08-22T13:38:14.5108485Z total: 800/1,200 2023-08-22T13:38:14.5108570Z } 2023-08-22T13:38:14.5108655Z } 2023-08-22T13:38:14.5108664Z ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? see log volume of SQL tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #42614 from yaooqinn/log. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 34b51cc784478f65c4c9a6cf3fdc0ed4ca31f745) Signed-off-by: Kent Yao <yao@apache.org> 23 August 2023, 05:41:41 UTC
a8f78ac [SPARK-44840][SQL][3.5] Make `array_insert()` 1-based for negative indexes ### What changes were proposed in this pull request? In the PR, I propose to make the `array_insert` function 1-based for negative indexes. So, the maximum negative index, -1, should point to the last element, and the function should insert a new element at the end of the given array for index -1. The old behaviour can be restored via the SQL config `spark.sql.legacy.negativeIndexInArrayInsert`. This is a backport of https://github.com/apache/spark/pull/42564 ### Why are the changes needed? 1. To match the behaviour of functions such as `substr()` and `element_at()`. ```sql spark-sql (default)> select element_at(array('a', 'b'), -1), substr('ab', -1); b b ``` 2. To fix an inconsistency in `array_insert` in which positive indexes are 1-based, but negative indexes are 0-based. ### Does this PR introduce _any_ user-facing change? Yes. Before: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","c","b"] ``` After: ```sql spark-sql (default)> select array_insert(array('a', 'b'), -1, 'c'); ["a","b","c"] ``` ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "test:testOnly *CollectionExpressionsSuite" $ build/sbt "test:testOnly *DataFrameFunctionsSuite" $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` Closes #42616 from MaxGekk/fix-array_insert-3.5-2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 August 2023, 04:42:56 UTC
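A hypothetical PySpark sketch of the behaviour change described in SPARK-44840 above; it assumes an active Spark 3.5 session named `spark`, and the DataFrame and column names are illustrative only:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([(["a", "b"],)], ["arr"])

# New 1-based semantics for negative indexes: -1 appends at the end -> ["a", "b", "c"]
df.select(F.array_insert("arr", -1, F.lit("c")).alias("res")).show()

# Restore the pre-3.5, 0-based behaviour for negative indexes if a workload relies on it.
spark.conf.set("spark.sql.legacy.negativeIndexInArrayInsert", "true")
df.select(F.array_insert("arr", -1, F.lit("c")).alias("res")).show()  # ["a", "c", "b"]
```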
fc1e004 [SPARK-44905][SQL] Stateful lastRegex causes NullPointerException on eval for regexp_replace ### What changes were proposed in this pull request? This PR resets lastRegex to null on failure so that subsequent calls regenerate the pattern value ### Why are the changes needed? Fixes the NPE caused by accessing a null pattern value. See https://github.com/apache/spark/pull/42481#issuecomment-1687474449 for more information ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #42601 from yaooqinn/SPARK-44905. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 13588c10cbc380ecba1231223425eaad2eb9ec80) Signed-off-by: Kent Yao <yao@apache.org> 23 August 2023, 02:36:57 UTC
319dff1 [SPARK-44742][PYTHON][DOCS] Add Spark version drop down to the PySpark doc site ### What changes were proposed in this pull request? The PR aims to add a Spark version dropdown to the PySpark doc site. ### Why are the changes needed? Currently, the PySpark documentation does not have a version dropdown. While by default we want people to land on the latest version, a dropdown makes it easier for people to find the docs for the version they use. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual testing. ``` cd python/docs make html ``` Closes #42428 from panbingkun/SPARK-44742. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 49ba2443c118fc0322daf903b6c3370b73bcb0ce) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 23 August 2023, 02:31:31 UTC
2f162e5 [SPARK-44907][PYTHON][CONNECT] `DataFrame.join` should throw IllegalArgumentException for invalid join types ### What changes were proposed in this pull request? `DataFrame.join` should throw IllegalArgumentException for invalid join types ### Why are the changes needed? All valid join types are already supported; for an unknown one, an `IllegalArgumentException` should now be thrown ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? enabled UT ### Was this patch authored or co-authored using generative AI tooling? NO Closes #42603 from zhengruifeng/test_df_join_type. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 7e5250a797e385dea65bdc3315e9bcfd827afbfb) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 23 August 2023, 00:01:29 UTC
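A small, hedged sketch of the behaviour described in SPARK-44907 above; it assumes an active `spark` session (classic or Spark Connect), and the DataFrames are illustrative:

```python
from pyspark.errors import IllegalArgumentException

left = spark.createDataFrame([(1, "a")], ["id", "v1"])
right = spark.createDataFrame([(1, "b")], ["id", "v2"])

left.join(right, on="id", how="left").show()  # valid join type

try:
    # collect() forces evaluation, so both eager and lazy clients surface the error
    left.join(right, on="id", how="not_a_join_type").collect()
except IllegalArgumentException as e:
    print("invalid join type rejected:", e)
```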
99c9e63 [SPARK-44871][SQL] Fix percentile_disc behaviour This PR fixes the `percentile_disc()` function, as it currently returns incorrect results in some cases. E.g.: ``` SELECT percentile_disc(0.0) WITHIN GROUP (ORDER BY a) as p0, percentile_disc(0.1) WITHIN GROUP (ORDER BY a) as p1, percentile_disc(0.2) WITHIN GROUP (ORDER BY a) as p2, percentile_disc(0.3) WITHIN GROUP (ORDER BY a) as p3, percentile_disc(0.4) WITHIN GROUP (ORDER BY a) as p4, percentile_disc(0.5) WITHIN GROUP (ORDER BY a) as p5, percentile_disc(0.6) WITHIN GROUP (ORDER BY a) as p6, percentile_disc(0.7) WITHIN GROUP (ORDER BY a) as p7, percentile_disc(0.8) WITHIN GROUP (ORDER BY a) as p8, percentile_disc(0.9) WITHIN GROUP (ORDER BY a) as p9, percentile_disc(1.0) WITHIN GROUP (ORDER BY a) as p10 FROM VALUES (0), (1), (2), (3), (4) AS v(a) ``` currently returns: ``` +---+---+---+---+---+---+---+---+---+---+---+ | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| +---+---+---+---+---+---+---+---+---+---+---+ |0.0|0.0|0.0|1.0|1.0|2.0|2.0|2.0|3.0|3.0|4.0| +---+---+---+---+---+---+---+---+---+---+---+ ``` but after this PR it returns the correct: ``` +---+---+---+---+---+---+---+---+---+---+---+ | p0| p1| p2| p3| p4| p5| p6| p7| p8| p9|p10| +---+---+---+---+---+---+---+---+---+---+---+ |0.0|0.0|0.0|1.0|1.0|2.0|2.0|3.0|3.0|4.0|4.0| +---+---+---+---+---+---+---+---+---+---+---+ ``` Bugfix. Yes, fixes a correctness bug, but the old behaviour can be restored with `spark.sql.legacy.percentileDiscCalculation=true`. Added new UTs. Closes #42559 from peter-toth/SPARK-44871-fix-percentile-disc-behaviour. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Peter Toth <peter.toth@gmail.com> (cherry picked from commit bd8fbf465796038b1975c228325b354888eb9c0b) Signed-off-by: Peter Toth <peter.toth@gmail.com> 22 August 2023, 12:19:08 UTC
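A sketch of how to try the comparison above from PySpark and how to opt back into the old behaviour; it assumes an active `spark` session, and the commented values simply mirror the tables in the commit message:

```python
# Default (fixed) behaviour: p7 = 3.0, per the "after" table above.
spark.sql("""
    SELECT percentile_disc(0.7) WITHIN GROUP (ORDER BY a) AS p7
    FROM VALUES (0), (1), (2), (3), (4) AS v(a)
""").show()

# Legacy behaviour (p7 = 2.0, per the "before" table) can be restored if needed:
spark.conf.set("spark.sql.legacy.percentileDiscCalculation", "true")
```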
97f2081 [MINOR][PYTHON][DOCS] Remove duplicated versionchanged per versionadded ### What changes were proposed in this pull request? This PR addresses all the cases of duplicated `versionchanged` directives with `versionadded` directives, see also https://github.com/apache/spark/pull/42597. Also, this PR mentions that all functions support Spark Connect from Apache Spark 3.5.0. ### Why are the changes needed? To remove duplicated information in docstring. ### Does this PR introduce _any_ user-facing change? Yes, it removes duplicated information in PySpark API Reference page. ### How was this patch tested? CI in this PR should validate them. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42602 from HyukjinKwon/minor-versionchanges. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 65b8ca2694c2443b4f97963de9398ac0ff779d0c) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 August 2023, 11:09:17 UTC
b59db1e [SPARK-44885][SQL] NullPointerException is thrown when column with ROWID type contains NULL values ### What changes were proposed in this pull request? If a `rowid` column is `null`, do not call `toString` on it. ### Why are the changes needed? A column with the `rowid` type may contain NULL values. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42576 from tindzk/fix/rowid-null. Authored-by: Tim Nieradzik <tim@sparse.tech> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 16607a5fd03f562dc8ea3825e90c40e80e8063e6) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 August 2023, 10:48:18 UTC
4823e03 [SPARK-44786][SQL][CONNECT] Convert common Spark exceptions - Convert common Spark exceptions - Extend common Spark exceptions to support single message parameter constructor - Achieve similar exception conversion coverage as [Python Client](https://github.com/apache/spark/blob/master/python/pyspark/errors/exceptions/connect.py#L57-L89) No - Existing tests Closes #42472 from heyihong/SPARK-44786. Authored-by: Yihong He <yihong.he@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit dc900b47556dc432f494ad465abdd59fc645734d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 August 2023, 10:45:53 UTC
d52582b [SPARK-42768][SQL] Enable cached plan apply AQE by default ### What changes were proposed in this pull request? This PR enables `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` by default. ### Why are the changes needed? We have fixed all known issues with cache + AQE since SPARK-42101. There is no reason to skip AQE optimization of cached plans. ### Does this PR introduce _any_ user-facing change? Yes, the default value of the config changed ### How was this patch tested? Pass CI Closes #40390 from ulysses-you/SPARK-42768. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1569ab543d2b560fa8b59ec9df17b3cf76cf1d7f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 August 2023, 08:02:49 UTC
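A minimal sketch of what the new default means and how to fall back to the old behaviour; it assumes an active `spark` session, and the cached DataFrame is purely illustrative:

```python
# With SPARK-42768, the flag below defaults to true, so AQE may re-optimize cached plans.
df = spark.range(1000).selectExpr("id % 10 AS k").groupBy("k").count()
df.cache()
df.count()      # materializes the cache; the cached plan can now be shaped by AQE
df.explain()

# Opt out and restore the previous behaviour if a workload depends on it.
spark.conf.set("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "false")
```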
57353c2 [SPARK-44904][PYTHON][DOCS] Correct the `versionadded` of `sql.functions.approx_percentile` to 3.5.0 ### What changes were proposed in this pull request? `sql.functions.approx_percentile` was introduced in SPARK-43941 (https://github.com/apache/spark/pull/41588). That was a PR for Spark 3.5.0 and does not belong to Spark 3.4.0. Therefore, this PR corrects the `versionadded` of `sql.functions.approx_percentile` to 3.5.0 and removes `versionchanged`. ### Why are the changes needed? Correct the `versionadded` of `sql.functions.approx_percentile`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #42597 from LuciferYang/SPARK-44904. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 534f9ea31196bd447449f5ea9dc9b5a80a4c4699) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 August 2023, 06:09:39 UTC
bf5ef23 [SPARK-44854][PYTHON] Python timedelta to DayTimeIntervalType edge case bug ### What changes were proposed in this pull request? This PR proposes to change the way that python `datetime.timedelta` objects are converted to `pyspark.sql.types.DayTimeIntervalType` objects. Specifically, it modifies the logic inside `toInternal` which returns the timedelta as a python integer (would be int64 in other languages) storing the timedelta as microseconds. The current logic inadvertently adds an extra second when doing the conversion for certain python timedelta objects, thereby returning an incorrect value. An illustrative example is as follows: ``` from datetime import timedelta from pyspark.sql.types import DayTimeIntervalType, StructField, StructType spark = ...spark session setup here... td = timedelta(days=4498031, seconds=16054, microseconds=999981) df = spark.createDataFrame([(td,)], StructType([StructField(name="timedelta_col", dataType=DayTimeIntervalType(), nullable=False)])) df.show(truncate=False) > +------------------------------------------------+ > |timedelta_col | > +------------------------------------------------+ > |INTERVAL '4498031 04:27:35.999981' DAY TO SECOND| > +------------------------------------------------+ print(str(td)) > '4498031 days, 4:27:34.999981' ``` In the above example, look at the seconds. The original python timedelta object has 34 seconds, the pyspark DayTimeIntervalType column has 35 seconds. ### Why are the changes needed? To fix a bug. It is a bug because the wrong value is returned after conversion. Adding the above timedelta entry to existing unit tests causes the test to fail. ### Does this PR introduce _any_ user-facing change? Yes. Users should now see the correct timedelta values in pyspark dataframes for similar such edge cases. ### How was this patch tested? Illustrative edge case examples were added to the unit test (`python/pyspark/sql/tests/test_types.py` the `test_daytime_interval_type` test), verified that the existing code failed the test, new code was added, and verified that the unit test now passes. ### JIRA ticket link This PR should close https://issues.apache.org/jira/browse/SPARK-44854 Closes #42541 from hdaly0/SPARK-44854. Authored-by: Ocean <haghighidaly@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9fdf65aefc552c909f6643f8a31405d0622eeb7e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 August 2023, 02:54:48 UTC
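A standalone Python sketch of the conversion that SPARK-44854 above fixes: the internal value must be computed from days, seconds, and microseconds without rounding. The helper name here is hypothetical; the actual logic lives in PySpark's `DayTimeIntervalType.toInternal`.

```python
from datetime import timedelta

def to_internal_micros(td: timedelta) -> int:
    # timedelta normalizes its fields, so this integer sum is exact (no float rounding):
    return (td.days * 86_400 + td.seconds) * 1_000_000 + td.microseconds

td = timedelta(days=4498031, seconds=16054, microseconds=999981)
assert to_internal_micros(td) == td // timedelta(microseconds=1)
print(to_internal_micros(td))  # round-trips to '4498031 days, 4:27:34.999981', not ...:35
```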
8ad4ddd [SPARK-44776][CONNECT] Add ProducedRowCount to SparkListenerConnectOperationFinished ### What changes were proposed in this pull request? Add a ProducedRowCount field to SparkListenerConnectOperationFinished ### Why are the changes needed? Needed to show the number of rows produced ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test Closes #42454 from gjxdxh/SPARK-44776. Authored-by: Lingkai Kong <lingkai.kong@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4646991abd7f4a47a1b8712e2017a2fae98f7c5a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 August 2023, 01:06:58 UTC
83556c4 [SPARK-44891][PYTHON][CONNECT] Enable Doctests of `rand`, `randn` and `log` ### What changes were proposed in this pull request? I roughly went through all the skipped doctests in `pyspark.sql.functions` and found that we can enable the doctests of `rand`, `randn` and `log` by making them deterministic: - specify the `numPartitions` in `spark.range` for `rand` and `randn`; - change the input values for `log` ### Why are the changes needed? Enable the doctests of `rand`, `randn` and `log`, and improve test coverage ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? enabled doctests ### Was this patch authored or co-authored using generative AI tooling? No Closes #42584 from zhengruifeng/make_doctest_deterministic. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 290b6327faadb5bbe25e9243955d3cf0c4ca4cfa) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 21 August 2023, 23:48:09 UTC
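A sketch of the pattern used to make those doctests deterministic (assumes an active `spark` session; the seed and inputs are illustrative): pin the number of partitions and the seed for `rand`/`randn`, and pick `log` inputs with exact results.

```python
from pyspark.sql import functions as F

# spark.range(start, end, step, numPartitions): one partition + fixed seed => stable output
spark.range(0, 2, 1, 1).withColumn("rand", F.rand(seed=42)).show()
spark.range(0, 2, 1, 1).withColumn("randn", F.randn(seed=42)).show()

# log(base, value) with an exact result: log base 2 of 8 is 3.0
spark.range(1).select(F.log(2.0, F.lit(8.0)).alias("log2_8")).show()
```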
38d244c [SPARK-44879][PYTHON][DOCS][3.5] Refine the docstring of spark.createDataFrame ### What changes were proposed in this pull request? Cherry-picked from 536ac30d3ca4bc81dca6a31d1211e61f25cbbc14. This PR refines the examples in the docstring of `spark.createDataFrame`. It also removes the examples using RDDs, as RDDs are outdated and Spark Connect won't support RDD: pyspark.errors.exceptions.base.PySparkNotImplementedError: [NOT_IMPLEMENTED] sparkContext() is not implemented. ### Why are the changes needed? To improve PySpark documentation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Doctest. Closes #42590 from allisonwang-db/spark-44879-3.5. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 21 August 2023, 23:46:50 UTC
12fd2a0 [SPARK-44889][PYTHON][CONNECT] Fix docstring of `monotonically_increasing_id` ### What changes were proposed in this pull request? Fix the docstring of `monotonically_increasing_id` ### Why are the changes needed? 1. use `from pyspark.sql import functions as F` to avoid an implicit wildcard import; 2. use DataFrame APIs instead of RDDs, so the docstring can be reused in Connect. After this fix, all docstrings are reused between vanilla PySpark and the Spark Connect Python Client ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #42582 from zhengruifeng/fix_monotonically_increasing_id_docstring. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 72c62b6596d21e975c5597f8fff84b1a9d070a02) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 21 August 2023, 10:21:52 UTC
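A sketch in the style the fixed docstring adopts (assumes an active `spark` session): module alias for functions and DataFrame APIs instead of RDDs.

```python
from pyspark.sql import functions as F

# With two partitions, ids in the second partition start at 1 << 33 = 8589934592.
spark.range(0, 4, 1, 2).select(F.monotonically_increasing_id().alias("id")).show()
```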
b179c0c [SPARK-44785][SQL][CONNECT] Convert common alreadyExistsExceptions and noSuchExceptions ### What changes were proposed in this pull request? - Convert common alreadyExistsExceptions and noSuchExceptions - Extend common alreadyExistsExceptions and noSuchExceptions to support (message, cause) constructors ### Why are the changes needed? - Better compatibility with the existing control flow - Better readability of errors ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite"` Closes #42471 from heyihong/SPARK-44785. Authored-by: Yihong He <yihong.he@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e73487c4348c47571a3ea083a0903a7997b64a47) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 August 2023, 03:14:02 UTC
71fdba4 [SPARK-44877][CONNECT][PYTHON] Support python protobuf functions for Spark Connect ### What changes were proposed in this pull request? Support python protobuf functions for Spark Connect ### Why are the changes needed? Support python protobuf functions for Spark Connect ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? added doctest and did manual test ``` bo.gaoPF2WXGJ3KT spark % bin/pyspark --remote "local[*]" --jars connector/protobuf/target/scala-2.12/spark-protobuf_2.12-4.0.0-SNAPSHOT.jar Python 3.9.6 (default, May 7 2023, 23:32:44) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin Type "help", "copyright", "credits" or "license" for more information. 23/08/18 10:47:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). /Users/bo.gao/workplace/spark/python/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched. warnings.warn( Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 4.0.0.dev0 /_/ Using Python version 3.9.6 (default, May 7 2023 23:32:44) Client connected to the Spark Connect server at localhost SparkSession available as 'spark'. >>> from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf >>> import tempfile >>> data = [([(2, "Alice", 13093020)])] >>> ddl_schema = "value struct<age: INTEGER, name: STRING, score: LONG>" >>> df = spark.createDataFrame(data, ddl_schema) >>> desc_hex = str('0ACE010A41636F6E6E6563746F722F70726F746F6275662F7372632F746573742F726' ... '5736F75726365732F70726F746F6275662F7079737061726B5F746573742E70726F746F121D6F72672E61' ... '70616368652E737061726B2E73716C2E70726F746F627566224B0A0D53696D706C654D657373616765121' ... '00A03616765180120012805520361676512120A046E616D6518022001280952046E616D6512140A057363' ... '6F7265180320012803520573636F72654215421353696D706C654D65737361676550726F746F736206707' ... '26F746F33') >>> with tempfile.TemporaryDirectory() as tmp_dir: ... desc_file_path = "%s/pyspark_test.desc" % tmp_dir ... with open(desc_file_path, "wb") as f: ... _ = f.write(bytearray.fromhex(desc_hex)) ... f.flush() ... message_name = 'SimpleMessage' ... proto_df = df.select( ... to_protobuf(df.value, message_name, desc_file_path).alias("value")) ... proto_df.show(truncate=False) ... proto_df_1 = proto_df.select( # With file name for descriptor ... from_protobuf(proto_df.value, message_name, desc_file_path).alias("value")) ... proto_df_1.show(truncate=False) ... proto_df_2 = proto_df.select( # With binary for descriptor ... from_protobuf(proto_df.value, message_name, ... binaryDescriptorSet = bytearray.fromhex(desc_hex)) ... .alias("value")) ... proto_df_2.show(truncate=False) ... 
+-------------------------------------------+ |value | +-------------------------------------------+ |[08 02 12 05 41 6C 69 63 65 18 9C 91 9F 06]| +-------------------------------------------+ +--------------------+ |value | +--------------------+ |{2, Alice, 13093020}| +--------------------+ +--------------------+ |value | +--------------------+ |{2, Alice, 13093020}| +--------------------+ ``` ``` >>> data = [([(1668035962, 2020)])] >>> ddl_schema = "value struct<seconds: LONG, nanos: INT>" >>> df = spark.createDataFrame(data, ddl_schema) >>> message_class_name = "org.sparkproject.spark_protobuf.protobuf.Timestamp" >>> to_proto_df = df.select(to_protobuf(df.value, message_class_name).alias("value")) >>> from_proto_df = to_proto_df.select( ... from_protobuf(to_proto_df.value, message_class_name).alias("value")) >>> from_proto_df.show(truncate=False) +------------------+ |value | +------------------+ |{1668035962, 2020}| +------------------+ ``` ``` >>> import tempfile >>> data = [([(2, "Alice", 13093020)])] >>> ddl_schema = "value struct<age: INTEGER, name: STRING, score: LONG>" >>> df = spark.createDataFrame(data, ddl_schema) >>> desc_hex = str('0ACE010A41636F6E6E6563746F722F70726F746F6275662F7372632F746573742F726' ... '5736F75726365732F70726F746F6275662F7079737061726B5F746573742E70726F746F121D6F72672E61' ... '70616368652E737061726B2E73716C2E70726F746F627566224B0A0D53696D706C654D657373616765121' ... '00A03616765180120012805520361676512120A046E616D6518022001280952046E616D6512140A057363' ... '6F7265180320012803520573636F72654215421353696D706C654D65737361676550726F746F736206707' ... '26F746F33') >>> with tempfile.TemporaryDirectory() as tmp_dir: ... desc_file_path = "%s/pyspark_test.desc" % tmp_dir ... with open(desc_file_path, "wb") as f: ... _ = f.write(bytearray.fromhex(desc_hex)) ... f.flush() ... message_name = 'SimpleMessage' ... proto_df = df.select( # With file name for descriptor ... to_protobuf(df.value, message_name, desc_file_path).alias("suite")) ... proto_df.show(truncate=False) ... proto_df_2 = df.select( # With binary for descriptor ... to_protobuf(df.value, message_name, ... binaryDescriptorSet=bytearray.fromhex(desc_hex)) ... .alias("suite")) ... proto_df_2.show(truncate=False) ... +-------------------------------------------+ |suite | +-------------------------------------------+ |[08 02 12 05 41 6C 69 63 65 18 9C 91 9F 06]| +-------------------------------------------+ +-------------------------------------------+ |suite | +-------------------------------------------+ |[08 02 12 05 41 6C 69 63 65 18 9C 91 9F 06]| +-------------------------------------------+ ``` ``` >>> data = [([(1668035962, 2020)])] >>> ddl_schema = "value struct<seconds: LONG, nanos: INT>" >>> df = spark.createDataFrame(data, ddl_schema) >>> message_class_name = "org.sparkproject.spark_protobuf.protobuf.Timestamp" >>> proto_df = df.select(to_protobuf(df.value, message_class_name).alias("suite")) >>> proto_df.show(truncate=False) +----------------------------+ |suite | +----------------------------+ |[08 FA EA B0 9B 06 10 E4 0F]| +----------------------------+ ``` Closes #42563 from bogao007/python-connect-protobuf. Authored-by: bogao007 <bo.gao@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 5151b5ba9630b042c659e98b9fd3f7bdb6fc19bd) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 21 August 2023, 01:12:32 UTC
b0de706 [SPARK-44858][PYTHON][DOCS] Refine docstring of DataFrame.isEmpty ### What changes were proposed in this pull request? This PR refines the docstring of DataFrame.isEmpty and adds more examples. ### Why are the changes needed? To make PySpark documentation better ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? doctest Closes #42547 from allisonwang-db/spark-44858-refine-is-empty. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 52c4673ed04f377b49299f078835437a1deeaf99) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 21 August 2023, 00:31:24 UTC
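A short sketch of the `DataFrame.isEmpty` behaviour the refined docstring documents (assumes an active `spark` session; the schemas are illustrative):

```python
df = spark.range(3)
print(df.isEmpty())                                    # False
print(df.filter("id > 100").isEmpty())                 # True: no rows satisfy the filter
print(spark.createDataFrame([], "id INT").isEmpty())   # True: DataFrame with no rows
```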
6465620 [SPARK-44876][PYTHON] Fix Arrow-optimized Python UDF on Spark Connect ### What changes were proposed in this pull request? Fixes Arrow-optimized Python UDF on Spark Connect. Also enables the missing test `pyspark.sql.tests.connect.test_parity_arrow_python_udf`. ### Why are the changes needed? `pyspark.sql.tests.connect.test_parity_arrow_python_udf` is not listed in `dev/sparktestsupport/modules.py`, and it fails when running manually. ``` ====================================================================== ERROR [0.072s]: test_register (pyspark.sql.tests.connect.test_parity_arrow_python_udf.ArrowPythonUDFParityTests) ---------------------------------------------------------------------- Traceback (most recent call last): ... pyspark.errors.exceptions.base.PySparkRuntimeError: [SCHEMA_MISMATCH_FOR_PANDAS_UDF] Result vector from pandas_udf was not the required length: expected 1, got 38. ``` The failure had not been captured because the test is missing in the `module.py` file. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #42568 from ueshin/issues/SPARK-44876/test_parity_arrow_python_udf. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 75c0b8b61ca53c3763c8e43e83b93b34688ea246) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 August 2023, 00:20:26 UTC
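A hedged sketch of the Arrow-optimized Python UDF path that the re-enabled parity test exercises; it assumes Spark 3.5 with PyArrow installed and an active `spark` session (classic or Connect), and the UDF itself is illustrative:

```python
from pyspark.sql.functions import udf

@udf(returnType="int", useArrow=True)  # useArrow=True opts into Arrow serialization
def plus_one(x):
    return x + 1

spark.range(5).select(plus_one("id").alias("v")).show()

# Registration path, similar to what test_register covers.
spark.udf.register("plus_one_arrow", plus_one)
spark.sql("SELECT plus_one_arrow(id) FROM range(3)").show()
```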
950ca5d [SPARK-44882][PYTHON][CONNECT] Remove function uuid/random/chr from PySpark ### What changes were proposed in this pull request? Remove function uuid/random/chr from PySpark ### Why are the changes needed? The three functions are controversial and needs further discussion, we can add them back in the future. - `uuid` and `random` are also the names of [built-in modules](https://docs.python.org/3/library/index.html), - `chr` is the name of a [built-in function](https://docs.python.org/3/library/functions.html) This PR is to avoid namespace conflict which may break existing workloads, e.g. ``` import uuid from pyspark.sql.functions import * print(uuid.uuid4()) ``` ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? updated CI ### Was this patch authored or co-authored using generative AI tooling? NO Closes #42573 from zhengruifeng/del_functions. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 169af1191fcea318087ab8a528a7c98b16e2aea7) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 August 2023, 00:16:21 UTC
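A sketch of the import style that sidesteps the name clash motivating the removal above (assumes an active `spark` session): alias the functions module instead of wildcard-importing it, so stdlib names stay intact while the SQL functions remain reachable.

```python
import uuid                               # Python's standard-library module
from pyspark.sql import functions as F    # aliased import, no name shadowing

print(uuid.uuid4())                       # stdlib uuid is untouched
spark.range(1).select(F.expr("uuid()")).show()   # the SQL uuid() function via expr()
```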
ec8c22b [SPARK-44880][UI] Remove unnecessary right curly brace at the end of the thread locks info ### What changes were proposed in this pull request? Remove unnecessary curly braces at the end of the thread locks info. ### Why are the changes needed? <img width="1405" alt="image" src="https://github.com/apache/spark/assets/8326978/34bb71e9-1982-4856-bb55-e10cd6fcf707"> the right curly brace is dangling. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing ut ### Was this patch authored or co-authored using generative AI tooling? no Closes #42571 from yaooqinn/SPARK-44880. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 3f380b9ecc8b27f6965b554061572e0990f05132) Signed-off-by: Yuming Wang <yumwang@ebay.com> 20 August 2023, 01:39:39 UTC
cb8de35 [SPARK-44873] Support alter view with nested columns in Hive client ### What changes were proposed in this pull request? Previously, if a view's schema contained a nested struct, alterTable using the Hive client would fail. This change supports views with nested structs. The mechanism is to store an empty schema when we call the Hive client, since we already store the actual schema in table properties. This fix is similar to https://github.com/apache/spark/pull/37364 ### Why are the changes needed? This supports using views with nested structs in the Hive metastore. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test. Closes #42532 from kylerong-db/hive_view. Authored-by: kylerong-db <kyle.rong@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit c7d63eac48d3a81099456360f1c30e6049824749) Signed-off-by: Gengliang Wang <gengliang@apache.org> 18 August 2023, 18:07:30 UTC
b24dbc9 [SPARK-44813][INFRA] The Jira Python misses our assignee when it searches users again ### What changes were proposed in this pull request? This PR creates an alternative to the assign_issue function in jira.client.JIRA. The original one has an issue that it will search users again and only choose the assignee from 20 candidates. If it's unmatched, it picks the head blindly. For example, ```python >>> assignee = asf_jira.user("yao") >>> "SPARK-44801" 'SPARK-44801' >>> asf_jira.assign_issue(issue.key, assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'issue' is not defined >>> asf_jira.assign_issue("SPARK-44801", assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 123, in wrapper result = func(*arg_list, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 1891, in assign_issue self._session.put(url, data=json.dumps(payload)) File "/Users/hzyaoqin/python/lib/python3.11/site-packages/requests/sessions.py", line 649, in put return self.request("PUT", url, data=data, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 246, in request elif raise_on_error(response, **processed_kwargs): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 71, in raise_on_error raise JIRAError( jira.exceptions.JIRAError: JiraError HTTP 400 url: https://issues.apache.org/jira/rest/api/latest/issue/SPARK-44801/assignee response text = {"errorMessages":[],"errors":{"assignee":"User 'airhot' cannot be assigned issues."}} ``` The Jira userid 'yao' fails to return my JIRA profile as a candidate(20 in total) to match. So, 'airhot' from the head replaces me as an assignee. ### Why are the changes needed? bugfix for merge_spark_pr ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test locally ```python >>> def assign_issue(client: jira.client.JIRA, issue: int, assignee: str) -> bool: ... """Assign an issue to a user. ... ... Args: ... issue (Union[int, str]): the issue ID or key to assign ... assignee (str): the user to assign the issue to. None will set it to unassigned. -1 will set it to Automatic. ... ... Returns: ... bool ... """ ... url = getattr(client, "_get_latest_url")(f"issue/{issue}/assignee") ... payload = {"name": assignee} ... getattr(client, "_session").put(url, data=json.dumps(payload)) ... return True ... >>> >>> assign_issue(asf_jira, "SPARK-44801", "yao") True ``` Closes #42496 from yaooqinn/SPARK-44813. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 8fb799d47bbd5d5ce9db35283d08ab1a31dc37b9) Signed-off-by: Kent Yao <yao@apache.org> 18 August 2023, 18:03:45 UTC
7864697 Revert "[SPARK-44813][INFRA] The Jira Python misses our assignee when it searches users again" This reverts commit f7dd0a95727259ff4b7a2f849798f8a93cf78b69. 18 August 2023, 17:54:42 UTC
7ffb4a1 [SPARK-44875][INFRA] Fix spelling for commentator to test SPARK-44813 ### What changes were proposed in this pull request? Fix a typo to verify SPARK-44813 ### Why are the changes needed? Fix a typo and verify SPARK-44813 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #42561 from yaooqinn/SPARK-44875. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 54dd18b5e0953df37e5f0937f1f79e65db70b787) Signed-off-by: Kent Yao <yao@apache.org> 18 August 2023, 17:49:00 UTC
c380da1 [SPARK-44433][3.5X] Terminate foreach batch runner when streaming query terminates [This is the 3.5x port of #42460 in master. It resolves a couple of conflicts.] This terminates the Python worker created for `foreachBatch` when the streaming query terminates. All of the tracking is done inside the Connect server (inside `StreamingForeachBatchHelper`). How this works: * (A) The helper class returns a cleaner (an `AutoCloseable`) to the Connect server when the foreachBatch function is set up (happens before starting the query). * (B) If the query fails to start, the server directly invokes the cleaner. * (C) If the query starts up, the server registers the cleaner with `streamingRunnerCleanerCache` in the `SessionHolder`. * (D) The cache keeps a mapping of query to cleaner. * It registers a streaming listener (only once per session), which invokes the cleaner when a query terminates. * There is also a final cleanup when the SessionHolder expires. This ensures the Python process created for a streaming query is properly terminated when the query terminates. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Unit tests are added for `CleanerCache` - Existing unit tests for foreachBatch. - Manual test to verify the Python process is terminated in different cases. - Unit tests don't really verify that the process is terminated. There will be a follow-up PR to verify this. Closes #42555 from rangadi/pr-terminate-3.5x. Authored-by: Raghu Angadi <raghu.angadi@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 18 August 2023, 17:39:42 UTC
f7dd0a9 [SPARK-44813][INFRA] The Jira Python misses our assignee when it searches users again ### What changes were proposed in this pull request? This PR creates an alternative to the assign_issue function in jira.client.JIRA. The original one has an issue that it will search users again and only choose the assignee from 20 candidates. If it's unmatched, it picks the head blindly. For example, ```python >>> assignee = asf_jira.user("yao") >>> "SPARK-44801" 'SPARK-44801' >>> asf_jira.assign_issue(issue.key, assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'issue' is not defined >>> asf_jira.assign_issue("SPARK-44801", assignee.name) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 123, in wrapper result = func(*arg_list, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/client.py", line 1891, in assign_issue self._session.put(url, data=json.dumps(payload)) File "/Users/hzyaoqin/python/lib/python3.11/site-packages/requests/sessions.py", line 649, in put return self.request("PUT", url, data=data, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 246, in request elif raise_on_error(response, **processed_kwargs): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/hzyaoqin/python/lib/python3.11/site-packages/jira/resilientsession.py", line 71, in raise_on_error raise JIRAError( jira.exceptions.JIRAError: JiraError HTTP 400 url: https://issues.apache.org/jira/rest/api/latest/issue/SPARK-44801/assignee response text = {"errorMessages":[],"errors":{"assignee":"User 'airhot' cannot be assigned issues."}} ``` The Jira userid 'yao' fails to return my JIRA profile as a candidate(20 in total) to match. So, 'airhot' from the head replaces me as an assignee. ### Why are the changes needed? bugfix for merge_spark_pr ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test locally ```python >>> def assign_issue(client: jira.client.JIRA, issue: int, assignee: str) -> bool: ... """Assign an issue to a user. ... ... Args: ... issue (Union[int, str]): the issue ID or key to assign ... assignee (str): the user to assign the issue to. None will set it to unassigned. -1 will set it to Automatic. ... ... Returns: ... bool ... """ ... url = getattr(client, "_get_latest_url")(f"issue/{issue}/assignee") ... payload = {"name": assignee} ... getattr(client, "_session").put(url, data=json.dumps(payload)) ... return True ... >>> >>> assign_issue(asf_jira, "SPARK-44801", "yao") True ``` Closes #42496 from yaooqinn/SPARK-44813. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 00255bc63b1a3bbe80bedc639b88d4a8e3f88f72) Signed-off-by: Sean Owen <srowen@gmail.com> 18 August 2023, 15:02:54 UTC
366e741 [SPARK-44729][PYTHON][DOCS] Add canonical links to the PySpark docs page ### What changes were proposed in this pull request? The pr aims to add canonical links to the PySpark docs page. ### Why are the changes needed? We should add the canonical link to the PySpark docs page https://spark.apache.org/docs/latest/api/python/index.html so that the search engine can return the latest PySpark docs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual testing. ``` cd python/docs make html ``` Closes #42425 from panbingkun/SPARK-44729. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit c88ced88af9a502a9e5352e31bb2963506ecb172) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 18 August 2023, 12:37:04 UTC
94ccbf2 [SPARK-44740][CONNECT][FOLLOW] Fix metadata values for Artifacts ### What changes were proposed in this pull request? This is a followup for a previous fix where we did not properly propagate the metadata from the main client into the dependent stubs. ### Why are the changes needed? compatibility ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #42537 from grundprinzip/spark-44740-follow. Authored-by: Martin Grund <martin.grund@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit b37daf5695e59ef2f29c6e084230ac89153cca26) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 18 August 2023, 12:31:25 UTC
3f1a0a5 [SPARK-44853][PYTHON][DOCS] Refine docstring of DataFrame.columns property ### What changes were proposed in this pull request? This PR refines the docstring of `df.columns` and adds more examples. ### Why are the changes needed? To make PySpark documentation better. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? doctest Closes #42540 from allisonwang-db/spark-44853-refine-df-columns. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit fc0be7ebace3aaf22954f1311532db5c33f4d8fa) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 18 August 2023, 09:31:42 UTC
2682cfb Preparing development version 3.5.1-SNAPSHOT 18 August 2023, 08:38:03 UTC
010c4a6 Preparing Spark release v3.5.0-rc2 18 August 2023, 08:37:59 UTC
41e7234 [SPARK-44834][PYTHON][SQL][TESTS][FOLLOW-UP] Update the analyzer results of the udtf tests ### What changes were proposed in this pull request? This is a follow up for https://github.com/apache/spark/pull/42517. We need to re-generate the analyzer results for udtf tests after https://github.com/apache/spark/pull/42519 is merged. Also updated PythonUDTFSuite after https://github.com/apache/spark/pull/42520 is merged. ### Why are the changes needed? To fix test failures ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test only change Closes #42543 from allisonwang-db/spark-44834-fix. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit bb41cd889efdd0602385e70b4c8f1c93740db332) Signed-off-by: Yuming Wang <yumwang@ebay.com> 18 August 2023, 08:32:38 UTC
7786d0b [SPARK-43205][DOC] identifier clause docs ### What changes were proposed in this pull request? Document the IDENTIFIER() clause ### Why are the changes needed? Docs are good! ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? See attached <img width="892" alt="Screenshot 2023-08-15 at 4 26 27 PM" src="https://github.com/apache/spark/assets/3514644/55823375-8d1a-4473-bf19-74796d273416"> <img width="747" alt="Screenshot 2023-08-15 at 4 45 23 PM" src="https://github.com/apache/spark/assets/3514644/0ee852a9-6a11-4c87-bed9-43531c55fc31"> Closes #42506 from srielau/SPARK-43205-3.5-IDENTIFIER-clause-docs. Authored-by: srielau <serge@rielau.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 August 2023, 01:26:32 UTC
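A hedged sketch of the IDENTIFIER() clause the new docs cover; it assumes an active Spark 3.5 `spark` session, and the table name, parameter marker, and `args` usage are illustrative only (see the documented clause for exact coverage):

```python
# A constant string expression -- including a named parameter marker -- is
# interpreted as a SQL object name, avoiding hand-spliced SQL strings.
spark.sql("CREATE TABLE IDENTIFIER(:tab) (c1 INT) USING parquet", args={"tab": "my_tbl"})
spark.sql("INSERT INTO IDENTIFIER(:tab) VALUES (1)", args={"tab": "my_tbl"})
spark.sql("SELECT c1 FROM IDENTIFIER('my_' || 'tbl')").show()
```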