https://github.com/apache/spark

f39ad61 Preparing Spark release v3.4.0-rc5 30 March 2023, 02:18:27 UTC
ce36692 [SPARK-42631][CONNECT][FOLLOW-UP] Expose Column.expr to extensions ### What changes were proposed in this pull request? This PR is a follow-up to https://github.com/apache/spark/pull/40234, which makes it possible for extensions to create custom `Dataset`s and `Column`s. It exposes `Dataset.plan`, but unfortunately it does not expose `Column.expr`. This means that extensions cannot build custom `Column`s that take a user-provided `Column` as input. ### Why are the changes needed? See above. ### Does this PR introduce _any_ user-facing change? No. This only adds a change for a Developer API. ### How was this patch tested? Existing tests to make sure nothing breaks. Closes #40590 from tomvanbussel/SPARK-42631. Authored-by: Tom van Bussel <tom.vanbussel@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit c3716c4ec68c2dea07e8cd896d79bd7175517a31) Signed-off-by: Herman van Hovell <herman@databricks.com> 29 March 2023, 19:28:21 UTC
4e20467 [SPARK-42957][INFRA][FOLLOWUP] Use 'cyclonedx' instead of file extensions ### What changes were proposed in this pull request? This PR is a follow-up of #40585 which aims to use `cyclonedx` instead of file extension. ### Why are the changes needed? When we use file extensions `xml` and `json`, `maven-metadata-local.xml` are missed. ``` spark-rm1ea0f8a3e397:/opt/spark-rm/output/spark/spark-repo-glCsK/org/apache/spark$ find . | grep xml ./spark-core_2.13/3.4.1-SNAPSHOT/maven-metadata-local.xml ./spark-core_2.13/3.4.1-SNAPSHOT/spark-core_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-core_2.13/maven-metadata-local.xml ``` We need to use `cyclonedx` specifically. ``` spark-rm1ea0f8a3e397:/opt/spark-rm/output/spark/spark-repo-glCsK/org/apache/spark$ find . -type f |grep -v \.jar |grep -v \.pom ./spark-catalyst_2.13/3.4.1-SNAPSHOT/spark-catalyst_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-catalyst_2.13/3.4.1-SNAPSHOT/spark-catalyst_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-core_2.13/3.4.1-SNAPSHOT/spark-core_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-core_2.13/3.4.1-SNAPSHOT/spark-core_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-graphx_2.13/3.4.1-SNAPSHOT/spark-graphx_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-graphx_2.13/3.4.1-SNAPSHOT/spark-graphx_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-kvstore_2.13/3.4.1-SNAPSHOT/spark-kvstore_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-kvstore_2.13/3.4.1-SNAPSHOT/spark-kvstore_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-launcher_2.13/3.4.1-SNAPSHOT/spark-launcher_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-launcher_2.13/3.4.1-SNAPSHOT/spark-launcher_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-mllib-local_2.13/3.4.1-SNAPSHOT/spark-mllib-local_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-mllib-local_2.13/3.4.1-SNAPSHOT/spark-mllib-local_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-network-common_2.13/3.4.1-SNAPSHOT/spark-network-common_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-network-common_2.13/3.4.1-SNAPSHOT/spark-network-common_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-network-shuffle_2.13/3.4.1-SNAPSHOT/spark-network-shuffle_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-network-shuffle_2.13/3.4.1-SNAPSHOT/spark-network-shuffle_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-parent_2.13/3.4.1-SNAPSHOT/spark-parent_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-parent_2.13/3.4.1-SNAPSHOT/spark-parent_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-sketch_2.13/3.4.1-SNAPSHOT/spark-sketch_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-sketch_2.13/3.4.1-SNAPSHOT/spark-sketch_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-streaming_2.13/3.4.1-SNAPSHOT/spark-streaming_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-streaming_2.13/3.4.1-SNAPSHOT/spark-streaming_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-tags_2.13/3.4.1-SNAPSHOT/spark-tags_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-tags_2.13/3.4.1-SNAPSHOT/spark-tags_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ./spark-unsafe_2.13/3.4.1-SNAPSHOT/spark-unsafe_2.13-3.4.1-SNAPSHOT-cyclonedx.json ./spark-unsafe_2.13/3.4.1-SNAPSHOT/spark-unsafe_2.13-3.4.1-SNAPSHOT-cyclonedx.xml ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #40587 from dongjoon-hyun/SPARK-42957-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7aa0195e46792ddfd2ec295ad439dab70f8dbe1d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 29 March 2023, 08:09:48 UTC
dc834d4 [SPARK-42895][CONNECT] Improve error messages for stopped Spark sessions ### What changes were proposed in this pull request? This PR improves error messages when users attempt to invoke session operations on a stopped Spark session. ### Why are the changes needed? To make the error messages more user-friendly. For example: ```python spark.stop() spark.sql("select 1") ``` Before this PR, this code will throw two exceptions: ``` ValueError: Cannot invoke RPC: Channel closed! During handling of the above exception, another exception occurred: Traceback (most recent call last): ... return e.code() == grpc.StatusCode.UNAVAILABLE AttributeError: 'ValueError' object has no attribute 'code' ``` After this PR, it will show this exception: ``` [NO_ACTIVE_SESSION] No active Spark session found. Please create a new Spark session before running the code. ``` ### Does this PR introduce _any_ user-facing change? Yes. This PR modifies the error messages. ### How was this patch tested? New unit test. Closes #40536 from allisonwang-db/spark-42895-stopped-session. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e9a87825f737211c2ab0fa6d02c1c6f2a47b0024) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 March 2023, 08:02:18 UTC
2256459 [SPARK-42957][INFRA] `release-build.sh` should not remove SBOM artifacts ### What changes were proposed in this pull request? This PR aims to prevent `release-build.sh` from removing SBOM artifacts. ### Why are the changes needed? According to the snapshot publishing result, we are publishing `.json` and `.xml` files successfully. - https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.12/3.4.1-SNAPSHOT/spark-core_2.12-3.4.1-20230324.001223-34-cyclonedx.json - https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.12/3.4.1-SNAPSHOT/spark-core_2.12-3.4.1-20230324.001223-34-cyclonedx.xml However, `release-build.sh` removes them during release. The following is the result of Apache Spark 3.4.0 RC4. - https://repository.apache.org/content/repositories/orgapachespark-1438/org/apache/spark/spark-core_2.12/3.4.0/ ### Does this PR introduce _any_ user-facing change? Yes, the users will see the SBOM on released artifacts. ### How was this patch tested? This should be tested during release process. Closes #40585 from dongjoon-hyun/SPARK-42957. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit f5c5124c30ecf987b3114f0a991c9ee9831ce42a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 29 March 2023, 06:30:25 UTC
a9cacc1 [SPARK-42946][SQL] Redact sensitive data which is nested by variable substitution ### What changes were proposed in this pull request? Redact sensitive data which is nested by variable substitution #### Case 1 by SET syntax's key part ```sql spark-sql> set ${spark.ssl.keyPassword}; abc <undefined> ``` #### Case 2 by SELECT as String literal ```sql spark-sql> set spark.ssl.keyPassword; spark.ssl.keyPassword *********(redacted) Time taken: 0.009 seconds, Fetched 1 row(s) spark-sql> select '${spark.ssl.keyPassword}'; abc ``` ### Why are the changes needed? data security ### Does this PR introduce _any_ user-facing change? yes, sensitive data can not be extracted by variable substitution ### How was this patch tested? new tests Closes #40576 from yaooqinn/SPARK-42946. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit c227d789a5fedb2178858768e1fe425169f489d2) Signed-off-by: Kent Yao <yao@apache.org> 29 March 2023, 04:37:41 UTC
3124470 [SPARK-42927][CORE] Change the access scope of `o.a.spark.util.Iterators#size` to `private[util]` ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/37353 introduced `o.a.spark.util.Iterators#size` to speed up getting the `Iterator` size when using Scala 2.13. It will only be used by `o.a.spark.util.Utils#getIteratorSize`, and will disappear when Spark only supports Scala 2.13. It should not be public, so this PR changes its access scope to `private[util]`. ### Why are the changes needed? `o.a.spark.util.Iterators#size` should not be public. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions Closes #40556 from LuciferYang/SPARK-42927. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 6e4c352d5f91f8343cec748fea4723178d5ae9af) Signed-off-by: Sean Owen <srowen@gmail.com> 28 March 2023, 14:06:58 UTC
a2dd949 [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true ### What changes were proposed in this pull request? Change `PlanSubqueries` to set `shouldBroadcast` to true when instantiating an `InSubqueryExec` instance. ### Why are the changes needed? The below left outer join gets an error: ``` create or replace temp view v1 as select * from values (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, value9, value10); create or replace temp view v2 as select * from values (1, 2), (3, 8), (7, 9) as v2(a, b); create or replace temp view v3 as select * from values (3), (8) as v3(col1); set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100 set spark.sql.adaptive.enabled=false; select * from v1 left outer join v2 on key = a and key in (select col1 from v3); ``` The join fails during predicate codegen: ``` 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to interpreter mode java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) at org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278) at scala.collection.immutable.List.map(List.scala:293) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33) at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73) at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70) at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51) at org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86) at org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40) ``` It fails again after fallback to interpreter mode: ``` 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) at org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151) at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52) at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205) ``` Both the predicate codegen and the evaluation fail for the same reason: `PlanSubqueries` creates `InSubqueryExec` with `shouldBroadcast=false`. The driver waits for the subquery to finish, but it's the executor that uses the results of the subquery (for predicate codegen or evaluation). Because `shouldBroadcast` is set to false, the result is stored in a transient field (`InSubqueryExec#result`), so the result of the subquery is not serialized when the `InSubqueryExec` instance is sent to the executor. The issue occurs, as far as I can tell, only when both whole stage codegen is disabled and adaptive execution is disabled. When wholestage codegen is enabled, the predicate codegen happens on the driver, so the subquery's result is available. When adaptive execution is enabled, `PlanAdaptiveSubqueries` always sets `shouldBroadcast=true`, so the subquery's result is available on the executor, if needed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #40569 from bersprockets/join_subquery_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5b20f3d94095f54017be3d31d11305e597334d8b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 March 2023, 12:32:14 UTC
0620b56 [SPARK-42928][SQL] Make resolvePersistentFunction synchronized ### What changes were proposed in this pull request? This PR makes the function `resolvePersistentFunctionInternal` synchronized. ### Why are the changes needed? To make function resolution thread-safe. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs. Closes #40557 from allisonwang-db/SPARK-42928-sync-func. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f62ffde045771b3275acf3dfb24573804e7daf93) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 March 2023, 08:43:19 UTC
0c4ad50 [SPARK-42908][PYTHON] Raise RuntimeError when SparkContext is required but not initialized ### What changes were proposed in this pull request? Raise RuntimeError when SparkContext is required but not initialized. ### Why are the changes needed? Error improvement. ### Does this PR introduce _any_ user-facing change? Error type and message change. Raise a RuntimeError with a clear message (rather than an AssertionError) when SparkContext is required but not initialized yet. ### How was this patch tested? Unit test. Closes #40534 from xinrong-meng/err_msg. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 70f6206dbcd3c5ff0f4618cf179b7fcf75ae672c) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 March 2023, 05:35:45 UTC
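A minimal sketch of the behavior change described above, assuming an API such as `pyspark.sql.functions.lit` is one of the call sites that requires an active SparkContext; the printed message is illustrative, not quoted from the release:

```python
from pyspark.sql import functions as F

# No SparkSession/SparkContext has been created in this process yet.
try:
    F.lit(1)  # building a Column requires an active SparkContext
except RuntimeError as e:
    # Previously this surfaced as a bare AssertionError; now a RuntimeError
    # with a clear message tells the user to create a SparkSession first.
    print(e)
```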
2cd341f [SPARK-42922][SQL] Move from Random to SecureRandom ### What changes were proposed in this pull request? Most uses of `Random` in spark are either in testcases or where we need a pseudo random number which is repeatable. Use `SecureRandom`, instead of `Random` for the cases where it impacts security. ### Why are the changes needed? Use of `SecureRandom` in more security sensitive contexts. This was flagged in our internal scans as well. ### Does this PR introduce _any_ user-facing change? Directly no. Would improve security posture of Apache Spark. ### How was this patch tested? Existing unit tests Closes #40568 from mridulm/SPARK-42922. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 744434358cb0c687b37d37dd62f2e7d837e52b2d) Signed-off-by: Sean Owen <srowen@gmail.com> 28 March 2023, 03:48:13 UTC
61293fa [SPARK-41876][CONNECT][PYTHON] Implement DataFrame.toLocalIterator ### What changes were proposed in this pull request? Implements `DataFrame.toLocalIterator`. The argument `prefetchPartitions` won't take effect for Spark Connect. ### Why are the changes needed? Missing API. ### Does this PR introduce _any_ user-facing change? `DataFrame.toLocalIterator` will be available. ### How was this patch tested? Enabled the related tests. Closes #40570 from ueshin/issues/SPARK-41876/toLocalIterator. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 31965a06c9f85abf2296971237b1f88065eb67c2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 March 2023, 03:36:04 UTC
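For context, a small usage sketch of the API being enabled here; `toLocalIterator` has the same shape in vanilla PySpark, and, as noted above, `prefetchPartitions` is accepted but has no effect over Spark Connect:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # plain local session for illustration
df = spark.range(1000)

# Rows are streamed to the driver partition by partition instead of
# materializing the whole DataFrame at once with collect().
for row in df.toLocalIterator():
    if row.id % 250 == 0:
        print(row.id)
```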
46866aa [SPARK-42936][SQL] Fix LCA bug when the having clause can be resolved directly by its child Aggregate ### What changes were proposed in this pull request? The PR fixes the following bug in LCA + having resolution: ```sql select sum(value1) as total_1, total_1 from values(1, 'name', 100, 50) AS data(id, name, value1, value2) having total_1 > 0 SparkException: [INTERNAL_ERROR] Found the unresolved operator: 'UnresolvedHaving (total_1#353L > cast(0 as bigint)) ``` To trigger the issue, the having condition need to be (can be resolved by) an attribute in the select. Without the LCA `total_1`, the query works fine. #### Root cause of the issue `UnresolvedHaving` with `Aggregate` as child can use both the `Aggregate`'s output and the `Aggregate`'s child's output to resolve the having condition. If using the latter, `ResolveReferences` rule will replace the unresolved attribute with a `TempResolvedColumn`. For a `UnresolvedHaving` that actually can be resolved directly by its child `Aggregate`, there will be no `TempResolvedColumn` after the rule `ResolveReferences` applies. This `UnresolvedHaving` still needs to be transformed to `Filter` by rule `ResolveAggregateFunctions`. This rule recognizes the shape: `UnresolvedHaving - Aggregate`. However, the current condition (the plan should not contain `TempResolvedColumn`) that prevents LCA rule to apply between `ResolveReferences` and `ResolveAggregateFunctions` does not cover the above case. It can insert `Project` in the middle and break the shape can be matched by `ResolveAggregateFunctions`. #### Fix The PR adds another condition for LCA rule to apply: the plan should not contain any `UnresolvedHaving`. ### Why are the changes needed? See above reasoning to fix the bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing and added tests. Closes #40558 from anchovYu/lca-having-bug-fix. Authored-by: Xinyi Yu <xinyi.yu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 2319a31fbf391c87bf8a1eef8707f46bef006c0f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 March 2023, 03:08:49 UTC
604fee6 [SPARK-42906][K8S] Replace a starting digit with `x` in resource name prefix ### What changes were proposed in this pull request? Change the generated resource name prefix to meet K8s requirements > DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name', or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?') ### Why are the changes needed? In current implementation, the following app name causes error ``` bin/spark-submit \ --master k8s://https://*.*.*.*:6443 \ --deploy-mode cluster \ --name 你好_187609 \ ... ``` ``` Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://*.*.*.*:6443/api/v1/namespaces/spark/services. Message: Service "187609-f19020870d12c349-driver-svc" is invalid: metadata.name: Invalid value: "187609-f19020870d12c349-driver-svc": a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name', or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?'). ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT. Closes #40533 from pan3793/SPARK-42906. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 0b9a3017005ccab025b93d7b545412b226d4e63c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 March 2023, 22:31:31 UTC
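A quick way to see why the starting digit matters, checking names against the DNS-1035 validation regex quoted in the error message (the service name is taken from the example above):

```python
import re

# Validation regex from the Kubernetes error message: must start with a letter.
DNS_1035 = re.compile(r"^[a-z]([-a-z0-9]*[a-z0-9])?$")

print(bool(DNS_1035.match("187609-f19020870d12c349-driver-svc")))  # False: starts with a digit
print(bool(DNS_1035.match("x87609-f19020870d12c349-driver-svc")))  # True: digit replaced with 'x'
```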
d7f2a6b [SPARK-42934][BUILD] Add `spark.hadoop.hadoop.security.key.provider.path` to `scalatest-maven-plugin` ### What changes were proposed in this pull request? When testing `OrcEncryptionSuite` using maven, all test suites are always skipped. So this pr add `spark.hadoop.hadoop.security.key.provider.path` to `systemProperties` of `scalatest-maven-plugin` to make `OrcEncryptionSuite` can test by maven. ### Why are the changes needed? Make `OrcEncryptionSuite` can test by maven. ### Does this PR introduce _any_ user-facing change? No, just for maven test ### How was this patch tested? - Pass GitHub Actions - Manual testing: run ``` build/mvn clean install -pl sql/core -DskipTests -am build/mvn test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite ``` **Before** ``` Discovery starting. Discovery completed in 3 seconds, 218 milliseconds. Run starting. Expected test count is: 4 OrcEncryptionSuite: 21:57:58.344 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable - Write and read an encrypted file !!! CANCELED !!! [] was empty org.apache.orc.impl.NullKeyProvider5af5d76f doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:37) - Write and read an encrypted table !!! CANCELED !!! [] was empty org.apache.orc.impl.NullKeyProvider5ad6cc21 doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:65) - SPARK-35325: Write and read encrypted nested columns !!! CANCELED !!! [] was empty org.apache.orc.impl.NullKeyProvider691124ee doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:116) - SPARK-35992: Write and read fully-encrypted columns with default masking !!! CANCELED !!! [] was empty org.apache.orc.impl.NullKeyProvider5403799b doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:166) 21:58:00.035 WARN org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.orc.OrcEncryptionSuite, threads: rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true) ===== Run completed in 5 seconds, 41 milliseconds. Total number of tests run: 0 Suites: completed 2, aborted 0 Tests: succeeded 0, failed 0, canceled 4, ignored 0, pending 0 No tests were executed. ``` **After** ``` Discovery starting. Discovery completed in 3 seconds, 185 milliseconds. Run starting. Expected test count is: 4 OrcEncryptionSuite: 21:58:46.540 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable - Write and read an encrypted file - Write and read an encrypted table - SPARK-35325: Write and read encrypted nested columns - SPARK-35992: Write and read fully-encrypted columns with default masking 21:58:51.933 WARN org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.orc.OrcEncryptionSuite, threads: rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true) ===== Run completed in 8 seconds, 708 milliseconds. Total number of tests run: 4 Suites: completed 2, aborted 0 Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #40566 from LuciferYang/SPARK-42934-2. 
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a3d9e0ae0f95a55766078da5d0bf0f74f3c3cfc3) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 March 2023, 16:42:48 UTC
c701859 [SPARK-42930][CORE][SQL] Change the access scope of `ProtobufSerDe` related implementations to `private[protobuf]` ### What changes were proposed in this pull request? After [SPARK-41053](https://issues.apache.org/jira/browse/SPARK-41053), Spark supports serializing/deserializing Live UI data to RocksDB using protobuf, but these are internal implementation details, so this PR changes the access scope of the `ProtobufSerDe`-related implementations to `private[protobuf]`. ### Why are the changes needed? Narrow the access scope of Spark internal implementation details. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes #40560 from LuciferYang/SPARK-42930. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b8f16bc6c3400dce13795c6dfa176dd793341df0) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 March 2023, 16:27:09 UTC
dde9de6 [SPARK-42924][SQL][CONNECT][PYTHON] Clarify the comment of parameterized SQL args ### What changes were proposed in this pull request? In the PR, I propose to clarify the comment of `args` in parameterized `sql()`. ### Why are the changes needed? To make the comment clearer and highlight that input strings are parsed (not evaluated) and considered as SQL literal expressions. Also, while parsing, fragments with SQL comments in the string values are skipped. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By checking coding style: ``` $ ./dev/lint-python $ ./dev/scalastyle ``` Closes #40508 from MaxGekk/parameterized-sql-doc. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit c55c7ea6fc92c3733543d5f3d99eb00921cbe564) Signed-off-by: Max Gekk <max.gekk@gmail.com> 27 March 2023, 05:54:30 UTC
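A hedged illustration of the `args` behavior the comment now spells out, assuming the Spark 3.4 form where named `:marker` placeholders are bound from a dict of strings that are parsed as SQL literal expressions (the parameter names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each value is parsed as a SQL literal expression, not evaluated as Python:
# "DATE'2023-01-01'" becomes a DATE literal and "10" an INT literal.
spark.sql(
    "SELECT :start AS start_date, :max_id AS max_id",
    args={"start": "DATE'2023-01-01'", "max_id": "10"},
).show()
```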
aba1c3b [SPARK-42899][SQL][FOLLOWUP] Project.reconcileColumnType should use KnownNotNull instead of AssertNotNull ### What changes were proposed in this pull request? This is a follow-up of #40526. `Project.reconcileColumnType` should use `KnownNotNull` instead of `AssertNotNull`, also only when `col.nullable`. ### Why are the changes needed? There is a better expression, `KnownNotNull`, for this kind of issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #40546 from ueshin/issues/SPARK-42899/KnownNotNull. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 62b9763a6fd9437647021bbb4433034566ba0a42) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 March 2023, 03:19:45 UTC
31ede73 [SPARK-42920][CONNECT][PYTHON] Enable tests for UDF with UDT ### What changes were proposed in this pull request? Enables tests for UDF with UDT. ### Why are the changes needed? Now that UDF with UDT should work, the related tests should be enabled to see if it works. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enabled/modified the related tests. Closes #40549 from ueshin/issues/SPARK-42920/udf_with_udt. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 80f8664e8278335788d8fa1dd00654f3eaec8ed6) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 March 2023, 00:35:41 UTC
1b95b4d [SPARK-42911][PYTHON][3.4] Introduce more basic exceptions ### What changes were proposed in this pull request? Introduces more basic exceptions. - ArithmeticException - ArrayIndexOutOfBoundsException - DateTimeException - NumberFormatException - SparkRuntimeException ### Why are the changes needed? There are more exceptions that Spark throws but PySpark doesn't capture. We should introduce more basic exceptions; otherwise we still see `Py4JJavaError` or `SparkConnectGrpcException`. ```py >>> spark.conf.set("spark.sql.ansi.enabled", True) >>> spark.sql("select 1/0") DataFrame[(1 / 0): double] >>> spark.sql("select 1/0").show() Traceback (most recent call last): ... py4j.protocol.Py4JJavaError: An error occurred while calling o44.showString. : org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == SQL(line 1, position 8) == select 1/0 ^^^ at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:225) ... JVM's stacktrace ``` ```py >>> spark.sql("select 1/0").show() Traceback (most recent call last): ... pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkArithmeticException) [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == SQL(line 1, position 8) == select 1/0 ^^^ ``` ### Does this PR introduce _any_ user-facing change? The error message is more readable. ```py >>> spark.sql("select 1/0").show() Traceback (most recent call last): ... pyspark.errors.exceptions.captured.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == SQL(line 1, position 8) == select 1/0 ^^^ ``` or ```py >>> spark.sql("select 1/0").show() Traceback (most recent call last): ... pyspark.errors.exceptions.connect.ArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. == SQL(line 1, position 8) == select 1/0 ^^^ ``` ### How was this patch tested? Added the related tests. Closes #40547 from ueshin/issues/SPARK-42911/3.4/exceptions. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 March 2023, 00:26:47 UTC
594c8fe [SPARK-42884][CONNECT] Add Ammonite REPL integration ### What changes were proposed in this pull request? This PR adds Ammonite REPL integration for Spark Connect. This has a couple of benefits: - It makes it a lot less cumbersome for users to start a Spark Connect REPL. You don't have to add custom scripts, and you can use `coursier` to launch a fully functional REPL for you. - It adds REPL integration to the actual build. This makes it easier to validate that the code we add actually works. ### Why are the changes needed? A REPL is arguably the first entry point for a lot of users. ### Does this PR introduce _any_ user-facing change? Yes, it adds REPL integration. ### How was this patch tested? Added tests for the command line parsing. Manually tested the REPL. Closes #40515 from hvanhovell/SPARK-42884. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit acf3065cc9e91e73db335a050289153582c45ce5) Signed-off-by: Herman van Hovell <herman@databricks.com> 24 March 2023, 17:16:01 UTC
f5f53e4 [SPARK-42917][SQL] Correct getUpdateColumnNullabilityQuery for DerbyDialect ### What changes were proposed in this pull request? Fix nullability clause for derby dialect, according to the official derby lang ref guide. ### Why are the changes needed? To fix bugs like: ``` spark-sql ()> create table src2(ID INTEGER NOT NULL, deptno INTEGER NOT NULL); spark-sql ()> alter table src2 ALTER COLUMN ID drop not null; java.sql.SQLSyntaxErrorException: Syntax error: Encountered "NULL" at line 1, column 42. ``` ### Does this PR introduce _any_ user-facing change? yes, but a necessary bugfix ### How was this patch tested? Test manually. Closes #40544 from yaooqinn/SPARK-42917. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit bcadbb69be6ac9759cb79fc4ee7bd93a37a63277) Signed-off-by: Kent Yao <yao@apache.org> 24 March 2023, 14:18:40 UTC
3122d4f [SPARK-42891][CONNECT][PYTHON][3.4] Implement CoGrouped Map API ### What changes were proposed in this pull request? Implement CoGrouped Map API: `applyInPandas`. The PR is a cherry-pick of https://github.com/apache/spark/commit/1fbc7948e57cbf05a46cb0c7fb2fad4ec25540e6, with minor changes on test class names to adapt to branch-3.4. ### Why are the changes needed? Parity with vanilla PySpark. ### Does this PR introduce _any_ user-facing change? Yes. CoGrouped Map API is supported as shown below. ```sh >>> import pandas as pd >>> df1 = spark.createDataFrame( ... [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)], ("time", "id", "v1")) >>> >>> df2 = spark.createDataFrame( ... [(20000101, 1, "x"), (20000101, 2, "y")], ("time", "id", "v2")) >>> >>> def asof_join(l, r): ... return pd.merge_asof(l, r, on="time", by="id") ... >>> df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas( ... asof_join, schema="time int, id int, v1 double, v2 string" ... ).show() +--------+---+---+---+ | time| id| v1| v2| +--------+---+---+---+ |20000101| 1|1.0| x| |20000102| 1|3.0| x| |20000101| 2|2.0| y| |20000102| 2|4.0| y| +--------+---+---+---+ ``` ### How was this patch tested? Parity unit tests. Closes #40539 from xinrong-meng/cogroup_map3.4. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Xinrong Meng <xinrong@apache.org> 24 March 2023, 08:39:21 UTC
b74f792 [SPARK-42861][SQL] Use private[sql] instead of protected[sql] to avoid generating API doc ### What changes were proposed in this pull request? This is the only issue I found during SQL module API auditing via https://github.com/apache/spark-website/pull/443/commits/615986022c573aedaff8d2b917a0d2d9dc2b67ef . Somehow `protected[sql]` also generates API doc which is unexpected. `private[sql]` solves the problem and I generated doc locally to verify it. Another API issue has been fixed by https://github.com/apache/spark/pull/40499 ### Why are the changes needed? fix api doc ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #40541 from cloud-fan/auditing. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit f7421b498a15ea687eaf811a1b2c77091945ef90) Signed-off-by: Kent Yao <yao@apache.org> 24 March 2023, 06:45:58 UTC
d44c7c0 [SPARK-42904][SQL] Char/Varchar Support for JDBC Catalog ### What changes were proposed in this pull request? Add type mappings from Spark char/varchar to JDBC types. ### Why are the changes needed? Standard JDBC 1.0 and other modern databases define char/varchar normatively. Currently, DDLs on JDBC catalogs fail with errors like ``` Cause: org.apache.spark.SparkIllegalArgumentException: Can't get JDBC type for varchar(10). [info] at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGetJdbcTypeError(QueryExecutionErrors.scala:1005) ``` ### Does this PR introduce _any_ user-facing change? Yes, char/varchar are now allowed for JDBC catalogs. ### How was this patch tested? New UT. Closes #40531 from yaooqinn/SPARK-42904. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 18cd8012085510f9febc65e6e35ab79822076089) Signed-off-by: Kent Yao <yao@apache.org> 24 March 2023, 06:43:07 UTC
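A hedged sketch of the kind of DDL that used to hit the error above, assuming a JDBC catalog registered through the usual `spark.sql.catalog.*` settings; the catalog name, H2 URL, schema, and table are placeholders, and the H2 driver is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Placeholder JDBC catalog backed by an in-memory H2 database.
    .config("spark.sql.catalog.h2", "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")
    .config("spark.sql.catalog.h2.driver", "org.h2.Driver")
    .getOrCreate()
)

# With the char/varchar mapping in place, this no longer fails with
# "Can't get JDBC type for varchar(10)".
spark.sql("CREATE TABLE h2.test.people (name VARCHAR(10), code CHAR(3))")
```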
f20a269 [SPARK-42202][CONNECT][TEST][FOLLOWUP] Loop around command entry in SimpleSparkConnectService ### What changes were proposed in this pull request? With the while loop around service startup, any ENTER hit in the SimpleSparkConnectService console made it loop around, try to start the service anew, and fail with address already in use. Change the loop to be around the `StdIn.readline()` entry. ### Why are the changes needed? Better testing / development / debugging with SparkConnect ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual use of `connector/connect/bin/spark-connect` Closes #40537 from juliuszsompolski/SPARK-42202-followup. Authored-by: Juliusz Sompolski <julek@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 0fde146e8676ab9a4aeafebb1684eb7a44660524) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 March 2023, 01:56:37 UTC
88eaaea [SPARK-42903][PYTHON][DOCS] Avoid documenting None as a return value in docstrings ### What changes were proposed in this pull request? Avoid documenting None as a return value in docstrings. ### Why are the changes needed? In Python, it's idiomatic not to document the return value of functions that simply return None. ### Does this PR introduce _any_ user-facing change? No. Closes #40532 from xinrong-meng/py_audit. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ece156f64fdbe0367b1594b1f8ee657234551a03) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 23 March 2023, 12:29:20 UTC
4d06299 [SPARK-42900][CONNECT][PYTHON] Fix createDataFrame to respect inference and column names ### What changes were proposed in this pull request? Fixes `createDataFrame` to respect inference and column names. ### Why are the changes needed? Currently when a column name list is provided as a schema, the type inference result is not taken care of. As a result, `createDataFrame` from UDT objects with column name list doesn't take the UDT type. For example: ```py >>> from pyspark.ml.linalg import Vectors >>> df = spark.createDataFrame([(1.0, 1.0, Vectors.dense(0.0, 5.0)), (0.0, 2.0, Vectors.dense(1.0, 2.0))], ["label", "weight", "features"]) >>> df.printSchema() root |-- label: double (nullable = true) |-- weight: double (nullable = true) |-- features: struct (nullable = true) | |-- type: byte (nullable = false) | |-- size: integer (nullable = true) | |-- indices: array (nullable = true) | | |-- element: integer (containsNull = false) | |-- values: array (nullable = true) | | |-- element: double (containsNull = false) ``` , which should be: ```py >>> df.printSchema() root |-- label: double (nullable = true) |-- weight: double (nullable = true) |-- features: vector (nullable = true) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added the related tests. Closes #40527 from ueshin/issues/SPARK-42900/cols. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 39e87d66b07beff91aebed6163ee82a35fbd1fcf) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 23 March 2023, 08:45:56 UTC
827eeb4 [SPARK-42903][PYTHON][DOCS] Avoid documenting None as a return value in docstrings ### What changes were proposed in this pull request? This PR proposes to remove None as a return value in docstrings. ### Why are the changes needed? To be consistent with the current documentation. Also, it's idiomatic not to document the return value of functions that simply `return None`. ### Does this PR introduce _any_ user-facing change? Yes, it changes the user-facing documentation. ### How was this patch tested? Doc build in the CI should verify them. Closes #40530 from HyukjinKwon/SPARK-42903. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 07301b4ab96365cfee1c6b7725026ef3f68e7ca1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 23 March 2023, 06:29:36 UTC
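A small sketch of the convention being enforced, using a made-up helper: a function that only ever returns `None` documents its parameters but omits a `Returns` section:

```python
def log_progress(message: str) -> None:
    """Write a progress message to the job log.

    Parameters
    ----------
    message : str
        The text to record.
    """
    # No ``Returns`` section above: the implicit None return is left undocumented.
    print(message)
```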
a25c0ea [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options It turns out that `spark.read.option.table` is a valid call chain and the `table` API does accept options when opening a table. The existing Spark Connect implementation did not take this into account. Feature parity. No UT. Closes #40498 from amaliujia/name_table_support_options. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 0e29c8d5eda77ef085269f86b08c0a27420ac1f2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 23 March 2023, 06:29:05 UTC
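A short sketch of the call chain this commit wires up on the Connect side, assuming an active `spark` session; the option key and table name are hypothetical, since which options are honored depends on the underlying data source:

```python
# Options set on the reader are now forwarded when resolving the table,
# rather than being dropped by the Spark Connect client.
df = (
    spark.read
    .option("mergeSchema", "true")   # hypothetical per-read option
    .table("catalog.db.events")      # hypothetical table name
)
df.show()
```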
253cb7a [SPARK-42901][CONNECT][PYTHON] Move `StorageLevel` into a separate file to avoid potential `file recursively imports` https://github.com/apache/spark/pull/40510 introduce `message StorageLevel` to `base.proto`, but if we try to import `base.proto` in `catalog.proto` to reuse `StorageLevel` in `message CacheTable` and run `build/sbt "connect-common/compile" to compile, there will be following message in compile log: ``` spark/connect/base.proto:23:1: File recursively imports itself: spark/connect/base.proto -> spark/connect/commands.proto -> spark/connect/relations.proto -> spark/connect/catalog.proto -> spark/connect/base.proto spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not found or had errors. spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import. spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import. spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was not found or had errors. spark/connect/relations.proto:84:5: "Catalog" is not defined. spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was not found or had errors. spark/connect/commands.proto:63:3: "Relation" is not defined. spark/connect/commands.proto:81:3: "Relation" is not defined. spark/connect/commands.proto:142:3: "Relation" is not defined. spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not found or had errors. spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not found or had errors. .... ``` So this pr move `message StorageLevel` to a separate file to avoid this potential file recursively imports. To avoid potential file recursively imports. No - Pass GitHub Actions - Manual check: - Add `import "spark/connect/common.proto";` to `catalog.proto` - run `build/sbt "connect-common/compile"` No compilation logs related to `File recursively imports itself` . Closes #40518 from LuciferYang/SPARK-42889-FOLLOWUP. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 88cc2395786fd2e06f77b897288ac8a48c33c15e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 23 March 2023, 03:44:15 UTC
a1b853c [SPARK-42899][SQL] Fix DataFrame.to(schema) to handle the case where there is a non-nullable nested field in a nullable field Fixes `DataFrame.to(schema)` to handle the case where there is a non-nullable nested field in a nullable field. `DataFrame.to(schema)` fails when it contains non-nullable nested field in nullable field: ```scala scala> val df = spark.sql("VALUES (1, STRUCT(1 as i)), (NULL, NULL) as t(a, b)") df: org.apache.spark.sql.DataFrame = [a: int, b: struct<i: int>] scala> df.printSchema() root |-- a: integer (nullable = true) |-- b: struct (nullable = true) | |-- i: integer (nullable = false) scala> df.to(df.schema) org.apache.spark.sql.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `b`.`i` is nullable while it's required to be non-nullable. ``` No. Added the related tests. Closes #40526 from ueshin/issues/SPARK-42899/to_schema. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4052058f9d1f60f1539dd227614cd459bfdfd31f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 23 March 2023, 02:15:43 UTC
f73b5c5 [SPARK-42894][CONNECT] Support `cache`/`persist`/`unpersist`/`storageLevel` for Spark connect jvm client ### What changes were proposed in this pull request? This PR follows SPARK-42889 to support `cache`/`persist`/`unpersist`/`storageLevel` for the Spark Connect JVM client. ### Why are the changes needed? Adds Spark Connect JVM client API coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Add new test Closes #40516 from LuciferYang/SPARK-42894. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b58df45d97fce6997d42063f9536a4fceb58125b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 March 2023, 16:53:05 UTC
56d9ccf [SPARK-42893][PYTHON][3.4] Block Arrow-optimized Python UDFs ### What changes were proposed in this pull request? Block the usage of Arrow-optimized Python UDFs in Apache Spark 3.4.0. ### Why are the changes needed? Considering the upcoming improvements on the result inconsistencies between traditional Pickled Python UDFs and Arrow-optimized Python UDFs, we'd better block the feature, otherwise, users who try out the feature will expect behavior changes in the next release. In addition, since Spark Connect Python Client(SCPC) has been introduced in Spark 3.4, we'd better ensure the feature is ready in both vanilla PySpark and SCPC at the same time for compatibility. ### Does this PR introduce _any_ user-facing change? Yes. Arrow-optimized Python UDFs are blocked. ### How was this patch tested? Existing tests. Closes #40513 from xinrong-meng/3.4block. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 March 2023, 07:05:35 UTC
5622981 [SPARK-42889][CONNECT][PYTHON] Implement cache, persist, unpersist, and storageLevel ### What changes were proposed in this pull request? Implements `DataFrame.cache`, `persist`, `unpersist`, and `storageLevel`. ### Why are the changes needed? Missing APIs. ### Does this PR introduce _any_ user-facing change? `DataFrame.cache`, `persist`, `unpersist`, and `storageLevel` will be available. ### How was this patch tested? Added/enabled the related tests. Closes #40510 from ueshin/issues/SPARK-42889/cache. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit b4f02248972c357cc2af6881b10565315ea15cb4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 March 2023, 01:18:19 UTC
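A brief usage sketch of the four APIs added to the Python client here; the shapes mirror vanilla PySpark:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # plain local session for illustration
df = spark.range(100)

df.cache()                            # cache with the default storage level
print(df.storageLevel)                # inspect the effective storage level

df.unpersist()
df.persist(StorageLevel.MEMORY_ONLY)  # or pick an explicit level
```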
61b85ae [SPARK-42816][CONNECT] Support Max Message size up to 128MB ### What changes were proposed in this pull request? This change lifts the default message size of 4MB to 128MB and makes it configurable. While 128MB is a "random number" it supports creating DataFrames from reasonably sized local data without failing. ### Why are the changes needed? Usability ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual Closes #40447 from grundprinzip/SPARK-42816. Lead-authored-by: Martin Grund <martin.grund@databricks.com> Co-authored-by: Martin Grund <grundprinzip@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit da19e0de05b4fbccae2c21385e67256ff31b1f1a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 March 2023, 01:13:04 UTC
ca260cc [SPARK-42888][BUILD] Upgrade `gcs-connector` to 2.2.11 ### What changes were proposed in this pull request? Upgrade the [GCS Connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/tree/v2.2.11/gcs) bundled in the Spark distro from version 2.2.7 to 2.2.11. ### Why are the changes needed? The new release contains multiple bug fixes and enhancements discussed in the [Release Notes](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.2.11/gcs/CHANGES.md). Notable changes include: * Improved socket timeout handling. * Trace logging capabilities. * Fix bug that prevented usage of GCS as a [Hadoop Credential Provider](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html). * Dependency upgrades. * Support OAuth2 based client authentication. ### Does this PR introduce _any_ user-facing change? Distributions built with `-Phadoop-cloud` now include GCS connector 2.2.11 instead of 2.2.7. ``` cnaurothcnauroth-2-1-m:~/spark-3.5.0-SNAPSHOT-bin-custom-spark$ ls -lrt jars/gcs* -rw-r--r-- 1 cnauroth cnauroth 36497606 Mar 21 00:42 jars/gcs-connector-hadoop3-2.2.11-shaded.jar ``` ### How was this patch tested? **Build** I built a custom distro with `-Phadoop-cloud`: ``` ./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-3 -Phadoop-cloud -Pscala-2.12 ``` **Run** I ran a PySpark job that successfully reads and writes using GCS: ``` from pyspark.sql import SparkSession def main() -> None: # Create SparkSession. spark = (SparkSession.builder .appName('copy-shakespeare') .getOrCreate()) # Read. df = spark.read.text('gs://dataproc-datasets-us-central1/shakespeare') # Write. df.write.text('gs://cnauroth-hive-metastore-proxy-dist/output/copy-shakespeare') spark.stop() if __name__ == '__main__': main() ``` Authored-by: Chris Nauroth <cnaurothapache.org> Closes #40511 from cnauroth/SPARK-42888. Authored-by: Chris Nauroth <cnauroth@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit f9017cbe521f7696128b8c9edcb825c79f16768b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 March 2023, 01:00:43 UTC
d123bc6 [MINOR][DOCS] Remove SparkSession constructor invocation in the example ### What changes were proposed in this pull request? This PR proposes to remove the SparkSession constructor invocation in the example. While I am here, I piggyback and add an example of Spark Connect. ### Why are the changes needed? SparkSession's constructor is not meant to be exposed to end users. It is also hidden on the Scala side. ### Does this PR introduce _any_ user-facing change? Yes, it removes the usage of the SparkSession constructor in the user-facing docs. ### How was this patch tested? Linters should verify the changes in the CI. Closes #40505 from HyukjinKwon/minor-docs-session. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 54afa2bfffb46d80cbe9f815a37c3d3325648ef3) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 March 2023, 00:19:59 UTC
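For reference, the builder pattern the docs now show instead of calling the constructor directly; the commented Spark Connect variant uses a placeholder endpoint:

```python
from pyspark.sql import SparkSession

# Obtain (or reuse) a session via the builder, never SparkSession(...) directly.
spark = SparkSession.builder.appName("example").getOrCreate()

# Spark Connect variant (placeholder endpoint):
# spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
```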
6d5414f [SPARK-42851][SQL] Guard EquivalentExpressions.addExpr() with supportedExpression() ### What changes were proposed in this pull request? In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`. ### Why are the changes needed? This fixes a regression caused by https://github.com/apache/spark/pull/39010 which added the `supportedExpression()` to `addExprTree()` and `getExprState()` but not `addExpr()`. One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions go out of sync, eventually resulting in query execution error (or whole-stage codegen error). ### Does this PR introduce _any_ user-facing change? This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions. Example running the SQL command: ```sql select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2) ``` example error message before the fix: ``` java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3] ``` after the fix this error is gone. ### How was this patch tested? Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of user-visible symptom. Closes #40473 from rednaxelafx/spark-42851. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ef0a76eeea30fabb04499908b04124464225f5fd) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 March 2023, 13:28:07 UTC
8cffa5c [SPARK-42876][SQL] DataType's physicalDataType should be private[sql] ### What changes were proposed in this pull request? `physicalDataType` should not be a public API but be private[sql]. ### Why are the changes needed? This is to limit API scope to not expose unnecessary API to be public. ### Does this PR introduce _any_ user-facing change? No since we have not released Spark 3.4.0 yet. ### How was this patch tested? N/A Closes #40499 from amaliujia/change_scope_of_physical_data_type. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c9a530e38e7f4ff3a491245c1d3ecaa1755c87ad) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 March 2023, 08:27:09 UTC
ed797bb [SPARK-42340][CONNECT][PYTHON][3.4] Implement Grouped Map API ### What changes were proposed in this pull request? Implement Grouped Map API:`GroupedData.applyInPandas` and `GroupedData.apply`. ### Why are the changes needed? Parity with vanilla PySpark. ### Does this PR introduce _any_ user-facing change? Yes. `GroupedData.applyInPandas` and `GroupedData.apply` are supported now, as shown below. ```sh >>> df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],("id", "v")) >>> def normalize(pdf): ... v = pdf.v ... return pdf.assign(v=(v - v.mean()) / v.std()) ... >>> df.groupby("id").applyInPandas(normalize, schema="id long, v double").show() +---+-------------------+ | id| v| +---+-------------------+ | 1|-0.7071067811865475| | 1| 0.7071067811865475| | 2|-0.8320502943378437| | 2|-0.2773500981126146| | 2| 1.1094003924504583| +---+-------------------+ ``` ```sh >>> pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP) ... def normalize(pdf): ... v = pdf.v ... return pdf.assign(v=(v - v.mean()) / v.std()) ... >>> df.groupby("id").apply(normalize).show() /Users/xinrong.meng/spark/python/pyspark/sql/connect/group.py:228: UserWarning: It is preferred to use 'applyInPandas' over this API. This API will be deprecated in the future releases. See SPARK-28264 for more details. warnings.warn( +---+-------------------+ | id| v| +---+-------------------+ | 1|-0.7071067811865475| | 1| 0.7071067811865475| | 2|-0.8320502943378437| | 2|-0.2773500981126146| | 2| 1.1094003924504583| +---+-------------------+ ``` ### How was this patch tested? (Parity) Unit tests. Closes #40486 from xinrong-meng/group_map3.4. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 March 2023, 07:42:43 UTC
5222cfd [SPARK-42864][ML][3.4] Make `IsotonicRegression.PointsAccumulator` private ### What changes were proposed in this pull request? Make `IsotonicRegression.PointsAccumulator` private, which was introduced in https://github.com/apache/spark/commit/3d05c7e037eff79de8ef9f6231aca8340bcc65ef ### Why are the changes needed? `PointsAccumulator` is implementation details, should not be exposed ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing UT Closes #40500 from zhengruifeng/isotonicRegression_private. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Xinrong Meng <xinrong@apache.org> 21 March 2023, 04:55:44 UTC
602aaff [SPARK-42875][CONNECT][PYTHON] Fix toPandas to handle timezone and map types properly ### What changes were proposed in this pull request? Fix `DataFrame.toPandas()` to handle timezone and map types properly. ### Why are the changes needed? Currently `DataFrame.toPandas()` doesn't handle timezone for timestamp type, and map types properly. For example: ```py >>> schema = StructType().add("ts", TimestampType()) >>> spark.createDataFrame([(datetime(1969, 1, 1, 1, 1, 1),), (datetime(2012, 3, 3, 3, 3, 3),), (datetime(2100, 4, 4, 4, 4, 4),)], schema).toPandas() ts 0 1969-01-01 01:01:01-08:00 1 2012-03-03 03:03:03-08:00 2 2100-04-04 03:04:04-08:00 ``` which should be: ```py ts 0 1969-01-01 01:01:01 1 2012-03-03 03:03:03 2 2100-04-04 04:04:04 ``` ### Does this PR introduce _any_ user-facing change? The result of `DataFrame.toPandas()` with timestamp type and map type will be the same as PySpark. ### How was this patch tested? Enabled the related tests. Closes #40497 from ueshin/issues/SPARK-42875/timestamp. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 61035129a354d0b31c66908106238b12b1f2f7b0) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 21 March 2023, 01:43:32 UTC
a28b1ab [SPARK-42557][CONNECT][FOLLOWUP] Remove `broadcast` `ProblemFilters.exclude` rule from mima check ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/40275 has implemented `functions#broadcast`, so this PR removes the corresponding `ProblemFilters.exclude` rule from `CheckConnectJvmClientCompatibility` ### Why are the changes needed? Remove the now-unnecessary `ProblemFilters.exclude` rule. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually checked that `dev/connect-jvm-client-mima-check` passed Closes #40463 from LuciferYang/SPARK-42557-FOLLOWUP. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 6ad8bf40a0e36c9e432b1b5c7670bb8af857ea8b) Signed-off-by: Herman van Hovell <herman@databricks.com> 20 March 2023, 13:04:10 UTC
57c9691 [MINOR][TEST] Fix spelling of 'regex' for RegexFilter ### What changes were proposed in this pull request? Fix RegexFilter's attribute 'regex' spelling issue ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? test-only ### How was this patch tested? existing tests Closes #40483 from yaooqinn/minor. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ad06a0f22763a565a12dd9e0755d1503eaf7cbf9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 March 2023, 10:41:36 UTC
666eb65 [SPARK-42852][SQL] Revert NamedLambdaVariable related changes from EquivalentExpressions ### What changes were proposed in this pull request? This PR reverts the follow-up PR of SPARK-41468: https://github.com/apache/spark/pull/39046 ### Why are the changes needed? These changes are not needed and actually might cause performance regression due to preventing higher order function subexpression elimination in `EquivalentExpressions`. Please find related conversation here: https://github.com/apache/spark/pull/40473#issuecomment-1474848224 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #40475 from peter-toth/SPARK-42852-revert-namedlambdavariable-changes. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ce3b03d7bd1964cbd8dd6b87edc024b38feaaffb) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 March 2023, 00:55:17 UTC
ba132ef [SPARK-42247][CONNECT][PYTHON] Fix UserDefinedFunction to have returnType ### What changes were proposed in this pull request? Fix `UserDefinedFunction` to have `returnType`. ### Why are the changes needed? Currently `UserDefinedFunction` doesn't have `returnType` attribute. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enabled/modified the related tests. Closes #40472 from ueshin/issues/SPARK-42247/returnType. Lead-authored-by: Takuya UESHIN <ueshin@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 708bda3e288c8bf07a490d873d45c4b4667a6655) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 March 2023, 00:29:37 UTC
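A minimal sketch of the behavior this fix enables, assuming an active PySpark or Spark Connect session; the lambda and return type below are illustrative only:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# After this fix, the Connect UserDefinedFunction exposes the same
# returnType attribute as classic PySpark, so this check passes in both.
plus_one = udf(lambda x: x + 1, IntegerType())
assert plus_one.returnType == IntegerType()
```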
a0993ba [SPARK-41818][SPARK-41843][CONNECT][PYTHON][TESTS] Enable more parity tests ### What changes were proposed in this pull request? Enables more parity tests. ### Why are the changes needed? We can enable more parity tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enabled related parity tests. Closes #40470 from ueshin/issues/SPARK-41818/parity. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e911c5efea7d12800fdbe3cb8effbc0d5ce763f0) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 March 2023, 00:26:04 UTC
613be4b [SPARK-42848][CONNECT][PYTHON] Implement DataFrame.registerTempTable ### What changes were proposed in this pull request? Implements `DataFrame.registerTempTable`. ### Why are the changes needed? Missing API. ### Does this PR introduce _any_ user-facing change? `DataFrame.registerTempTable` will be available. ### How was this patch tested? Enabled a related test. Closes #40469 from ueshin/issues/SPARK-42848/registerTempTable. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 71ea3909157684915fd6944e2707017d0ea32b5a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 March 2023, 00:25:27 UTC
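A short usage sketch, assuming an active session named `spark`; the view name is a placeholder:

```python
df = spark.range(3)

# registerTempTable is the legacy alias for createOrReplaceTempView,
# now available in Spark Connect for API parity.
df.registerTempTable("people_tmp")
spark.sql("SELECT count(*) AS n FROM people_tmp").show()
```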
8804803 [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect ### What changes were proposed in this pull request? Supports `UserDefinedType` in Spark Connect. ### Why are the changes needed? Currently Spark Connect doesn't support UDTs. ### Does this PR introduce _any_ user-facing change? Yes, UDTs will be available in Spark Connect. ### How was this patch tested? Enabled the related tests. Closes #40402 from ueshin/issues/SPARK-42020/udt. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit dc6cf74dc2db54d935cf54cb3e4829a468dcdf78) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 March 2023, 00:24:34 UTC
5bd7b09 [SPARK-42778][SQL][3.4] QueryStageExec should respect supportsRowBased This PR is for branch-3.4: https://github.com/apache/spark/pull/40407 ### What changes were proposed in this pull request? Make `QueryStageExec` respect plan.supportsRowBased ### Why are the changes needed? It is a long-standing issue that if a plan supports both columnar and row-based execution, an unnecessary `ColumnarToRow` would be added ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? added test Closes #40417 from ulysses-you/SPARK-42778-3.4. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 March 2023, 03:38:57 UTC
c29cf34 [SPARK-42823][SQL] `spark-sql` shell supports multipart namespaces for initialization ### What changes were proposed in this pull request? Currently, we only support initializing the spark-sql shell with a single-part schema, which is also forced to go through the session catalog. #### case 1, specifying the catalog field for the v1 session catalog ```sql bin/spark-sql --database spark_catalog.default Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spark_catalog.default' not found ``` #### case 2, setting the default catalog to another one ```sql bin/spark-sql -c spark.sql.defaultCatalog=testcat -c spark.sql.catalog.testcat=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog -c spark.sql.catalog.testcat.url='jdbc:derby:memory:testcat;create=true' -c spark.sql.catalog.testcat.driver=org.apache.derby.jdbc.AutoloadedDriver -c spark.sql.catalogImplementation=in-memory --database SYS 23/03/16 18:40:49 WARN ObjectStore: Failed to get database sys, returning NoSuchObjectException Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'sys' not found ``` In this PR, we switch to a USE statement to support multipart namespaces, which helps us resolve the catalog correctly. ### Why are the changes needed? Make the spark-sql shell better support the v2 catalog framework. ### Does this PR introduce _any_ user-facing change? Yes, the `--database` option supports multipart namespaces and works for v2 catalogs now, and you will see this behavior on the Spark web UI. ### How was this patch tested? new UT Closes #40457 from yaooqinn/SPARK-42823. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 2000d5f8db838db62967a45d574728a8bf2aaf6b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 March 2023, 03:29:24 UTC
ca75340 [SPARK-42824][CONNECT][PYTHON] Provide a clear error message for unsupported JVM attributes ### What changes were proposed in this pull request? This pull request proposes an improvement to the error message when trying to access a JVM attribute that is not supported in Spark Connect. Specifically, it adds a more informative error message that clearly indicates which attribute is not supported due to Spark Connect's lack of dependency on the JVM. ### Why are the changes needed? Currently, when attempting to access an unsupported JVM attribute in Spark Connect, the error message is not very clear, making it difficult for users to understand the root cause of the issue. This improvement aims to provide more helpful information to users to address this problem as below: **Before** ```python >>> spark._jsc Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'SparkSession' object has no attribute '_jsc' ``` **After** ```python >>> spark._jsc Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/haejoon.lee/Desktop/git_store/spark/python/pyspark/sql/connect/session.py", line 490, in _jsc raise PySparkAttributeError( pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsc` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, use the original PySpark instead of Spark Connect. ``` ### Does this PR introduce _any_ user-facing change? This PR does not introduce any user-facing change in terms of functionality. However, it improves the error message, which could potentially affect the user experience in a positive way. ### How was this patch tested? This patch was tested by adding new unit tests that specifically target the error message related to unsupported JVM attributes. The tests were run locally on a development environment. Closes #40458 from itholic/SPARK-42824. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit deac481304489f9b8ecd24ec6f3aed1e0c0d75eb) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 March 2023, 02:13:07 UTC
833599c [SPARK-42826][PS][DOCS] Add migration notes for update to supported pandas version ### What changes were proposed in this pull request? This PR proposes to add a migration note for update to supported pandas version. ### Why are the changes needed? Some APIs have been deprecated or removed from SPARK-42593 to follow pandas 2.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review is required. Closes #40459 from itholic/SPARK-42826. Lead-authored-by: itholic <haejoon.lee@databricks.com> Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1b40565f99b1a3888be46cfdf673b60198bf47a2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 March 2023, 00:43:20 UTC
62d6a3b [SPARK-42817][CORE] Logging the shuffle service name once in ApplicationMaster ### What changes were proposed in this pull request? Removed the logging of the shuffle service name multiple times in the driver log. It gets logged every time a new executor is allocated. ### Why are the changes needed? This is needed because currently the driver log gets polluted by these messages: ``` 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for shuffle service using name 'spark_shuffle_311' ``` ### Does this PR introduce _any_ user-facing change? Yes, the shuffle service name will be logged just once in the driver. ### How was this patch tested? Tested manually since it only changes logging. With this change, the following is logged once in the driver logs: `23/03/15 16:50:54 INFO ApplicationMaster: Initializing service data for shuffle service using name 'spark_shuffle_311'` Closes #40448 from otterc/SPARK-42817. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit f025d5eb1c2c9a6f7933679aa80752e806df9d2a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 March 2023, 21:27:39 UTC
60320f0 [SPARK-42812][CONNECT] Add client_type to AddArtifactsRequest protobuf message ### What changes were proposed in this pull request? The missing `client_type` is added to the `AddArtifactsRequest` protobuf message. ### Why are the changes needed? Consistency with the other RPCs. ### Does this PR introduce _any_ user-facing change? Yes, new field in proto message. ### How was this patch tested? N/A Closes #40443 from vicennial/SPARK-42812. Authored-by: vicennial <venkata.gudesa@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 43b64635ac1a018fa32f7150478deb63955ee8be) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 March 2023, 06:30:46 UTC
89e7e3d [SPARK-42820][BUILD] Update ORC to 1.8.3 ### What changes were proposed in this pull request? This PR aims to update ORC to 1.8.3. ### Why are the changes needed? This will bring the following bug fixes. - https://orc.apache.org/news/2023/03/15/ORC-1.8.3/ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #40453 from williamhyun/orc183. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6216c7094aedd6e97e8d4eaaa51d28a5cea38fc5) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 March 2023, 05:00:31 UTC
9f1e8af [SPARK-42767][CONNECT][TESTS] Add a precondition to start connect server fallback with `in-memory` and auto ignored some tests strongly depend on hive ### What changes were proposed in this pull request? This pr adds a precondition before `RemoteSparkSession` starts connect server to check whether `spark-hive-**.jar` exists in the `assembly/target/scala-*/jars` directory, and will fallback to using `spark.sql.catalogImplementation=in-memory` to start the connect server if `spark-hive-**.jar` doesn't exist. When using `spark.sql.catalogImplementation=in-memory` to start connect server, some test cases that strongly rely on the hive module will be ignored rather than fail rudely. At the same time, developers can see the following message on the terminal: ``` [info] ClientE2ETestSuite: Will start Spark Connect server with `spark.sql.catalogImplementation=in-memory`, some tests that rely on Hive will be ignored. If you don't want to skip them: 1. Test with maven: run `build/mvn install -DskipTests -Phive` before testing 2. Test with sbt: run test with `-Phive` profile ``` ### Why are the changes needed? Avoid rough failure of connect client module UTs due to lack of hive-related dependency. ### Does this PR introduce _any_ user-facing change? No, just for test ### How was this patch tested? - Manual checked test with `-Phive` is same as before - Manual test: - Maven run ``` build/mvn clean install -DskipTests build/mvn test -pl connector/connect/client/jvm ``` **Before** ``` Run completed in 14 seconds, 999 milliseconds. Total number of tests run: 684 Suites: completed 12, aborted 0 Tests: succeeded 678, failed 6, canceled 0, ignored 1, pending 0 *** 6 TESTS FAILED *** ``` **After** ``` Discovery starting. Discovery completed in 761 milliseconds. Run starting. Expected test count is: 684 ClientE2ETestSuite: Will start Spark Connect server with `spark.sql.catalogImplementation=in-memory`, some tests that rely on Hive will be ignored. If you don't want to skip them: 1. Test with maven: run `build/mvn install -DskipTests -Phive` before testing 2. Test with sbt: run test with `-Phive` profile ... Run completed in 15 seconds, 994 milliseconds. Total number of tests run: 682 Suites: completed 12, aborted 0 Tests: succeeded 682, failed 0, canceled 2, ignored 1, pending 0 All tests passed. ``` - SBT run `build/sbt clean "connect-client-jvm/test"` **Before** ``` [info] ClientE2ETestSuite: [info] org.apache.spark.sql.ClientE2ETestSuite *** ABORTED *** (1 minute, 3 seconds) [info] java.lang.RuntimeException: Failed to start the test server on port 15960. 
[info] at org.apache.spark.sql.connect.client.util.RemoteSparkSession.beforeAll(RemoteSparkSession.scala:129) [info] at org.apache.spark.sql.connect.client.util.RemoteSparkSession.beforeAll$(RemoteSparkSession.scala:120) [info] at org.apache.spark.sql.ClientE2ETestSuite.beforeAll(ClientE2ETestSuite.scala:37) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.sql.ClientE2ETestSuite.run(ClientE2ETestSuite.scala:37) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:750) ``` **After** ``` [info] ClientE2ETestSuite: Will start Spark Connect server with `spark.sql.catalogImplementation=in-memory`, some tests that rely on Hive will be ignored. If you don't want to skip them: 1. Test with maven: run `build/mvn install -DskipTests -Phive` before testing 2. Test with sbt: run test with `-Phive` profile .... [info] Run completed in 22 seconds, 44 milliseconds. [info] Total number of tests run: 682 [info] Suites: completed 11, aborted 0 [info] Tests: succeeded 682, failed 0, canceled 2, ignored 1, pending 0 [info] All tests passed. ``` Closes #40389 from LuciferYang/spark-hive-available. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 83b9cbddc0ce1d594b718b061e82c231092db4a7) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 March 2023, 04:04:32 UTC
fc43fa1 [SPARK-42818][CONNECT][PYTHON][FOLLOWUP] Add versionchanged ### What changes were proposed in this pull request? Follow-up of #40450. Adds `versionchanged` to the docstring. ### Why are the changes needed? The `versionchanged` is missing in the API newly supported in Spark Connect. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #40451 from ueshin/issues/SPARK-42818/fup. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 23eaa38f8d459a2d43b181127bc4eee22ac81efa) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 16 March 2023, 03:18:09 UTC
07324b8 [SPARK-42818][CONNECT][PYTHON] Implement DataFrameReader/Writer.jdbc ### What changes were proposed in this pull request? Implements `DataFrameReader/Writer.jdbc`. ### Why are the changes needed? Missing API. ### Does this PR introduce _any_ user-facing change? Yes, `DataFrameReader/Writer.jdbc` will be available. ### How was this patch tested? Added related tests. Closes #40450 from ueshin/issues/SPARK-42818/jdbc. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8b2f28bd53d0eacbac7555c3a09af908bc682e41) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 March 2023, 01:38:31 UTC
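A hedged sketch of the newly available calls, assuming an active session `spark`, a reachable JDBC endpoint, and a suitable driver on the classpath; the URL, table names, and properties are placeholders:

```python
props = {"user": "sa", "password": ""}

# Read a table over JDBC ...
df = spark.read.jdbc("jdbc:derby:memory:demo;create=true", "src_table", properties=props)

# ... and write the result back to another table.
df.write.jdbc("jdbc:derby:memory:demo", "dst_table", mode="overwrite", properties=props)
```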
0f71caa [SPARK-42496][CONNECT][DOCS][FOLLOW-UP] Addressing feedback to remove last ">>>" and adding type(spark) example ### What changes were proposed in this pull request? Removing the last ">>>" in a Python code example based on feedback and adding type(spark) as an example of checking whether a session is Spark Connect. ### Why are the changes needed? To help readers determine whether a session is Spark Connect + removing unnecessary extra line for cleaner reading. ### Does this PR introduce _any_ user-facing change? Yes, updating user-facing documentation ### How was this patch tested? Built the doc website locally and checked the pages. PRODUCTION=1 SKIP_RDOC=1 bundle exec jekyll build Closes #40435 from allanf-db/connect_docs. Authored-by: Allan Folting <allan.folting@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8860f69455e5a722626194c4797b4b42cccd4510) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 March 2023, 12:04:39 UTC
fb729ad [MINOR][PYTHON] Change TypeVar to private symbols ### What changes were proposed in this pull request? I've converted internal typing symbols in `__init__.py` to private. When these changes are agreed upon, I can expand this PR to the other `TypeVar`s this applies to. ### Why are the changes needed? Editors consider all public symbols in a library importable. A common pattern is to use a shorthand for pyspark functions: ```python import pyspark.sql.functions as F F.col(...) ``` Since `pyspark.F` is a valid symbol according to `__init__.py`, editors will suggest this to users, while it is not a valid use case of pyspark. This change is in line with Pyright's [Typing Guidance for Python Libraries](https://github.com/microsoft/pyright/blob/main/docs/typed-libraries.md#generic-classes-and-functions) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Verified with PyCharm auto-importing. Closes #40338 from MaicoTimmerman/master. Authored-by: Maico Timmerman <maico.timmerman@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6d3587ad5ba676b0c82a2c75ccd00c370b592563) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 March 2023, 11:38:32 UTC
492aa38 [SPARK-42799][BUILD] Update SBT build `xercesImpl` version to match with `pom.xml` ### What changes were proposed in this pull request? This PR aims to update the `XercesImpl` version from `2.12.0` to `2.12.2` in order to match the version in `pom.xml`. https://github.com/apache/spark/blob/149e020a5ca88b2db9c56a9d48e0c1c896b57069/pom.xml#L1429-L1433 ### Why are the changes needed? When we updated this version via SPARK-39183, we missed updating `SparkBuild.scala`. - https://github.com/apache/spark/pull/36544 ### Does this PR introduce _any_ user-facing change? No, this is a dev-only change because the release artifacts' dependencies are managed by Maven. ### How was this patch tested? Pass the CIs. Closes #40431 from dongjoon-hyun/SPARK-42799. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 049aa380b8b1361c2898bc499e64613d329c6f72) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 March 2023, 07:39:34 UTC
9c1cb47 [SPARK-42706][SQL][DOCS][3.4] Document the Spark SQL error classes in user-facing documentation ### What changes were proposed in this pull request? Cherry-pick for https://github.com/apache/spark/pull/40336. This PR proposes to document Spark SQL error classes in the [Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html). - Error Conditions <img width="1077" alt="Screen Shot 2023-03-08 at 8 54 43 PM" src="https://user-images.githubusercontent.com/44108233/223706823-7817b57d-c032-4817-a440-7f79119fa0b4.png"> - SQLSTATE Codes <img width="1139" alt="Screen Shot 2023-03-08 at 8 54 54 PM" src="https://user-images.githubusercontent.com/44108233/223706860-3f64b00b-fa0d-47e0-b154-0d7be92b8637.png"> - Error Classes that include sub-error classes (`INVALID_FORMAT` as an example) <img width="1045" alt="Screen Shot 2023-03-08 at 9 10 22 PM" src="https://user-images.githubusercontent.com/44108233/223709925-74144f41-8836-45dc-b851-5d96ac8aa38c.png"> ### Why are the changes needed? To improve the usability of error messages for Spark SQL. ### Does this PR introduce _any_ user-facing change? No API change, but yes, it's user-facing documentation. ### How was this patch tested? Manually built the docs and checked the contents one by one against [error-classes.json](https://github.com/apache/spark/blob/master/core/src/main/resources/error/error-classes.json). Closes #40336 from itholic/SPARK-42706. Authored-by: itholic <haejoon.lee@databricks.com> Closes #40433 from itholic/SPARK-42706-3.4. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 15 March 2023, 06:49:23 UTC
ab7c4f8 [SPARK-42801][CONNECT][TESTS] Ignore flaky `write jdbc` test of `ClientE2ETestSuite` on Java 8 ### What changes were proposed in this pull request? This PR aims to ignore the flaky `write jdbc` test of `ClientE2ETestSuite` on Java 8 ![Screenshot 2023-03-14 at 10 56 34 PM](https://user-images.githubusercontent.com/9700541/225219845-94eaea79-ade6-435d-9d03-19fc73cb8617.png) ### Why are the changes needed? Currently, this happens on `branch-3.4` with Java 8 only. **BRANCH-3.4** https://github.com/apache/spark/commits/branch-3.4 ![Screenshot 2023-03-14 at 10 55 29 PM](https://user-images.githubusercontent.com/9700541/225219670-f8a68dc0-5aa6-428f-9c02-ae41580a38bc.png) **JAVA 8** 1. Currently, `Connect` server is using `Hive` catalog during testing and uses `Derby` with disk store when it creates a table 2. `Connect Client` is trying to use `Derby` with `mem` store and it fails with `No suitable driver` at the first attempt. ``` $ bin/spark-shell -c spark.sql.catalogImplementation=hive Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/03/14 21:50:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Spark context available as 'sc' (master = local[64], app id = local-1678855843831). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.4.1-SNAPSHOT /_/ Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 1.8.0_312) Type in expressions to have them evaluated. Type :help for more information. scala> sc.setLogLevel("INFO") scala> sql("CREATE TABLE t(a int)") 23/03/14 21:51:08 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir. 23/03/14 21:51:08 INFO SharedState: Warehouse path is 'file:/Users/dongjoon/APACHE/spark-merge/spark-warehouse'. 23/03/14 21:51:10 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead. 23/03/14 21:51:10 INFO HiveUtils: Initializing HiveMetastoreConnection version 2.3.9 using Spark classes. 23/03/14 21:51:11 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.9) is file:/Users/dongjoon/APACHE/spark-merge/spark-warehouse res1: org.apache.spark.sql.DataFrame = [] scala> java.sql.DriverManager.getConnection("jdbc:derby:memory:1234;create=true").createStatement().execute("CREATE TABLE s(a int)"); java.sql.SQLException: No suitable driver found for jdbc:derby:memory:1234;create=true at java.sql.DriverManager.getConnection(DriverManager.java:689) at java.sql.DriverManager.getConnection(DriverManager.java:270) ... 47 elided ``` **JAVA 11** ``` $ bin/spark-shell -c spark.sql.catalogImplementation=hive Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/03/14 21:57:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1678856279685). Spark session available as 'spark'. 
Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.4.1-SNAPSHOT /_/ Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.18) Type in expressions to have them evaluated. Type :help for more information. scala> sql("CREATE TABLE hive_t2(a int)") 23/03/14 21:58:06 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead. 23/03/14 21:58:06 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 23/03/14 21:58:06 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist 23/03/14 21:58:07 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0 23/03/14 21:58:07 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore dongjoon127.0.0.1 23/03/14 21:58:07 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 23/03/14 21:58:07 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 23/03/14 21:58:07 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 23/03/14 21:58:07 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist 23/03/14 21:58:07 WARN HiveMetaStore: Location: file:/Users/dongjoon/APACHE/spark-merge/spark-warehouse/hive_t2 specified for non-external table:hive_t2 res0: org.apache.spark.sql.DataFrame = [] scala> java.sql.DriverManager.getConnection("jdbc:derby:memory:1234;create=true").createStatement().execute("CREATE TABLE derby_t2(a int)"); res1: Boolean = false scala> :quit ``` ### Does this PR introduce _any_ user-facing change? No. This is a test only PR. ### How was this patch tested? Pass the CIs. Closes #40434 from dongjoon-hyun/SPARK-42801. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit ec1d3b354c95e09d28c163a6d3550047c73e15c8) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 March 2023, 06:28:02 UTC
d92e5a5 [SPARK-42765][CONNECT][PYTHON] Enable importing `pandas_udf` from `pyspark.sql.connect.functions` ### What changes were proposed in this pull request? Enable users to import pandas_udf via `pyspark.sql.connect.functions.pandas_udf`. Previously, only `pyspark.sql.functions.pandas_udf` is supported. ### Why are the changes needed? Usability. ### Does this PR introduce _any_ user-facing change? Yes. Now users can import pandas_udf via `pyspark.sql.connect.functions.pandas_udf`. Previously only `pyspark.sql.functions.pandas_udf` is supported in Connect; importing `pyspark.sql.connect.functions.pandas_udf` raises an error instead, as shown below ```sh >>> pyspark.sql.connect.functions.pandas_udf() Traceback (most recent call last): ... NotImplementedError: pandas_udf() is not implemented. ``` Now, `pyspark.sql.connect.functions.pandas_udf` point to `pyspark.sql.functions.pandas_udf`, as shown below, ```sh >>> from pyspark.sql.connect import functions as CF >>> from pyspark.sql import functions as SF >>> getattr(CF, "pandas_udf") <function pandas_udf at 0x7f9c88812700> >>> getattr(SF, "pandas_udf") <function pandas_udf at 0x7f9c88812700> ``` ### How was this patch tested? Unit test. Closes #40388 from xinrong-meng/rmv_path. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 149e020a5ca88b2db9c56a9d48e0c1c896b57069) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 March 2023, 03:35:09 UTC
cc66287 [SPARK-42797][CONNECT][DOCS] Grammatical improvements for Spark Connect content ### What changes were proposed in this pull request? Grammatical improvements to the Spark Connect content as a follow-up on https://github.com/apache/spark/pull/40324/ ### Why are the changes needed? To improve readability of the pages. ### Does this PR introduce _any_ user-facing change? Yes, user-facing documentation is updated. ### How was this patch tested? Built the doc website locally and checked the updates. PRODUCTION=1 SKIP_RDOC=1 bundle exec jekyll build Closes #40428 from allanf-db/connect_overview_doc. Authored-by: Allan Folting <allan.folting@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 88d5c752829722b0b42f2c91fd57fb3e8fa17339) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 March 2023, 03:33:53 UTC
4c17885 [SPARK-42796][SQL] Support accessing TimestampNTZ columns in CachedBatch ### What changes were proposed in this pull request? Support accessing TimestampNTZ columns in CachedBatch ### Why are the changes needed? Implement a missing feature for the TimestampNTZ type ### Does this PR introduce _any_ user-facing change? No, the TimestampNTZ type is not released yet. ### How was this patch tested? New UT Closes #40426 from gengliangwang/ColumnAccessor. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e251bfaea3fa8bc8da00ac226ffa23e0b677ab71) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 March 2023, 01:48:21 UTC
bf9c4b9 [SPARK-42793][CONNECT] `connect` module requires `build_profile_flags` ### What changes were proposed in this pull request? This PR aims to add `build_profile_flags` to `connect` module. ### Why are the changes needed? SPARK-42656 added `connect` profile. https://github.com/apache/spark/blob/4db8e7b7944302a3929dd6a1197ea1385eecc46a/assembly/pom.xml#L155-L164 ### Does this PR introduce _any_ user-facing change? No. This is a dev-only change. ### How was this patch tested? Pass the CIs. Closes #40424 from dongjoon-hyun/SPARK-42793. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f55db2850ab2fbbc6d4973da552635f15375a4a5) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 March 2023, 23:56:11 UTC
3b4fd1d [SPARK-42757][CONNECT] Implement textFile for DataFrameReader ### What changes were proposed in this pull request? The pr aims to implement textFile for DataFrameReader. ### Why are the changes needed? API coverage. ### Does this PR introduce _any_ user-facing change? New method. ### How was this patch tested? Add new UT. Closes #40377 from panbingkun/connect_textFile. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 0b8c1482a9504fd3d4ac0245b068072df4ebf427) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 March 2023, 23:41:34 UTC
1c7d780 [SPARK-42731][CONNECT][DOCS] Document Spark Connect configurations ### What changes were proposed in this pull request? This PR proposes to document the configuration of Spark Connect defined in https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala ### Why are the changes needed? To let users know which configuration are supported for Spark Connect. ### Does this PR introduce _any_ user-facing change? Yes, it documents the configurations for Spark Connect. ### How was this patch tested? Linters in CI should verify this change. Also manually built the docs as below: ![Screen Shot 2023-03-14 at 8 24 51 PM](https://user-images.githubusercontent.com/6477701/224986645-3e3abfe3-4f6b-4810-8887-24cf24532f5e.png) Closes #40416 from HyukjinKwon/SPARK-42731. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e986fb0767eb64f1db6cf30f8c9f3e01c192171d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 March 2023, 23:40:38 UTC
24cdae8 [SPARK-42754][SQL][UI] Fix backward compatibility issue in nested SQL execution ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/39268 / [SPARK-41752](https://issues.apache.org/jira/browse/SPARK-41752) added a new non-optional `rootExecutionId: Long` field to the SparkListenerSQLExecutionStart case class. When JsonProtocol deserializes this event it uses the "ignore missing properties" Jackson deserialization option, causing the rootExecutionField to be initialized with a default value of 0. The value 0 is a legitimate execution ID, so in the deserialized event we have no ability to distinguish between the absence of a value and a case where all queries have the first query as the root. Thanks JoshRosen for reporting and investigating this issue. ### Why are the changes needed? Bug fix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #40403 from linhongliu-db/fix-nested-execution. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 4db8e7b7944302a3929dd6a1197ea1385eecc46a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 March 2023, 16:07:01 UTC
f36325d [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass ### What changes were proposed in this pull request? Running `LocalDateTime.now()` and `Instant.now()` with Java 8 & 11 always yields microsecond precision on both Linux and MacOS, but Java 17 behaves differently: it returns full nanosecond precision on Linux, while still returning microseconds on MacOS. On Linux(CentOs) ``` jshell> java.time.LocalDateTime.now() $1 ==> 2023-03-13T18:09:12.498162194 jshell> java.time.Instant.now() $2 ==> 2023-03-13T10:09:16.013993186Z ``` On MacOS ``` jshell> java.time.LocalDateTime.now() $1 ==> 2023-03-13T17:13:47.485897 jshell> java.time.Instant.now() $2 ==> 2023-03-13T09:15:12.031850Z ``` At present, Spark always converts them to microseconds; this causes `test implicit encoder resolution` in `SQLImplicitsTestSuite` to fail when using Java 17 on Linux, so this PR adds `truncatedTo(ChronoUnit.MICROS)` when testing on Linux using Java 17 to ensure the test input data also has microsecond precision. ### Why are the changes needed? Make the Java 17 daily test GA task run successfully. The Java 17 daily test GA task failed as follows: ``` [info] - test implicit encoder resolution *** FAILED *** (1 second, 329 milliseconds) 4429[info] 2023-03-02T23:00:20.404434 did not equal 2023-03-02T23:00:20.404434875 (SQLImplicitsTestSuite.scala:63) 4430[info] org.scalatest.exceptions.TestFailedException: 4431[info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) 4432[info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) 4433[info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) 4434[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) 4435[info] at org.apache.spark.sql.SQLImplicitsTestSuite.testImplicit$1(SQLImplicitsTestSuite.scala:63) 4436[info] at org.apache.spark.sql.SQLImplicitsTestSuite.$anonfun$new$2(SQLImplicitsTestSuite.scala:133) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually checked with Java 17 Closes #40395 from LuciferYang/SPARK-42770. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3a2571c3978e7271388b5e56267e3bd60c5a4712) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 March 2023, 15:53:22 UTC
d5a62e9 [SPARK-42785][K8S][CORE] When spark submit without `--deploy-mode`, avoid facing NPE in Kubernetes Case ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/37880, when a user runs spark-submit without `--deploy-mode XXX` or `--conf spark.submit.deployMode=XXXX`, they may face an NPE in this code. ### Why are the changes needed? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#164 ```scala args.deployMode.equals("client") && ``` Of course, submitting without `deployMode` is not allowed and will throw an exception that terminates the application, but we should leave it to the later logic to give an appropriate hint instead of throwing an NPE. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ![popo_2023-03-14 17-50-46](https://user-images.githubusercontent.com/52876270/224965310-ba9ec82f-e668-4a06-b6ff-34c3e80ca0b4.jpg) Closes #40414 from zwangsheng/SPARK-42785. Authored-by: zwangsheng <2213335496@qq.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 767253bb6219f775a8a21f1cdd0eb8c25fa0b9de) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 March 2023, 15:49:22 UTC
f6b5fd9 [SPARK-42733][CONNECT][FOLLOWUP] Write without path or table ### What changes were proposed in this pull request? Fixes `DataFrameWriter.save` to work without a path or table parameter, and adds support for the jdbc method in the writer, as it is one of the implementations that does not take a path or table. This is the follow-up fix for the Scala client of https://github.com/apache/spark/pull/40356. ### Why are the changes needed? `DataFrameWriter.save` should work without a path parameter because some data sources, such as jdbc and noop, work without those parameters. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit and E2E tests Closes #40358 from zhenlineo/write-without-path-table. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 93334e295483a0ba66e22d8398512ad970a3ea80) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 March 2023, 09:32:34 UTC
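To illustrate the case being fixed, here is the equivalent call in PySpark terms (a hedged sketch only, since the patch itself targets the Scala client): the `noop` source needs neither a path nor a table, so a bare `save()` should succeed.

```python
# The "noop" data source discards all rows, so no path or table is required.
spark.range(5).write.format("noop").mode("overwrite").save()
```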
e93b59f [SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation ### What changes were proposed in this pull request? Adding a Spark Connect overview page to the Spark 3.4 documentation and a short section on the Spark overview page with a link to it. ### Why are the changes needed? The first version of Spark Connect is released as part of Spark 3.4.0 and this adds an overview for it to the documentation. ### Does this PR introduce _any_ user-facing change? Yes, the user facing documentation is updated. ### How was this patch tested? Built the doc website locally and tested the pages. SKIP_SCALADOC=1 SKIP_RDOC=1 bundle exec jekyll build index.html ![index html](https://user-images.githubusercontent.com/112507318/224800134-7d0093bb-261a-4abe-a3da-b1e8784854c1.png) spark-connect-overview.html ![spark-connect-overview html](https://user-images.githubusercontent.com/112507318/224800167-8f6308e0-2e0e-4b15-81a8-5bd9da12d0ce.png) Closes #40324 from allanf-db/spark_connect_docs. Authored-by: Allan Folting <allan.folting@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c7b9b42efdd577cb7ea41752fb6e73444462d0ce) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 March 2023, 06:38:54 UTC
a352507 [SPARK-42777][SQL] Support converting TimestampNTZ catalog stats to plan stats ### What changes were proposed in this pull request? When `spark.sql.cbo.planStats.enabled` or `spark.sql.cbo.enabled` is enabled, the logical plan will fetch row counts and column statistics from catalog. This PR is to support converting TimestampNTZ catalog stats to plan stats. ### Why are the changes needed? Implement a missing piece of the TimestampNTZ type. ### Does this PR introduce _any_ user-facing change? No, TimestampNTZ is not released yet. ### How was this patch tested? New UT Closes #40404 from gengliangwang/fromExternalString. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit c3ac782450583e6073b88d940af60714eb4cdf44) Signed-off-by: Gengliang Wang <gengliang@apache.org> 14 March 2023, 04:00:11 UTC
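A rough sketch of how the converted statistics surface in a plan, assuming a table `events` with a TIMESTAMP_NTZ column whose statistics were already computed via ANALYZE; the table name is hypothetical:

```python
# With CBO plan stats enabled, the logical plan pulls row counts and
# column statistics (now including TIMESTAMP_NTZ columns) from the catalog.
spark.conf.set("spark.sql.cbo.planStats.enabled", "true")
spark.table("events").explain(mode="cost")  # the Statistics(...) line reflects catalog stats
```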
dadb5a5 [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect ### What changes were proposed in this pull request? Changing the 3.4.0 version change message for PySpark functionality from "Support Spark Connect" to "Supports Spark Connect". ### Why are the changes needed? Grammatical improvement. ### Does this PR introduce _any_ user-facing change? Yes, these messages are shown in the user-facing PySpark documentation. ### How was this patch tested? Built Spark and the documentation successfully on my computer and checked the PySpark documentation. ./build/sbt -Phive clean package PRODUCTION=1 SKIP_RDOC=1 bundle exec jekyll build ![supports_spark_connect_version_change_message](https://user-images.githubusercontent.com/112507318/224786569-d600ba83-483e-4a41-83a2-b3bf99d38af1.png) Closes #40401 from allanf-db/spark_docs_info_message. Authored-by: Allan Folting <allan.folting@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit c4f7f137ed2262a782414e217520b29bca77f960) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 14 March 2023, 02:24:18 UTC
ea25bda [SPARK-42756][CONNECT][PYTHON] Helper function to convert proto literal to value in Python Client ### What changes were proposed in this pull request? Helper function to convert proto literal to value in Python Client ### Why are the changes needed? needed in .ml ### Does this PR introduce _any_ user-facing change? no, dev-only ### How was this patch tested? added ut Closes #40376 from zhengruifeng/connect_literal_to_value. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 617c3554b2737a3cc3f9edc8e2685e94662c5251) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 13 March 2023, 06:31:36 UTC
def02cb [SPARK-42755][CONNECT] Factor literal value conversion out to `connect-common` ### What changes were proposed in this pull request? Factor literal value conversion out to `connect-common`. ### Why are the changes needed? When trying to build protos for a literal array on the server side for ML, I found we have two implementations: `LiteralExpressionProtoConverter.toConnectProtoValue` in the server module, which doesn't support arrays, and `LiteralProtoConverter.toLiteralProto` in the client module, which supports more types. We'd better factor this out into the common module so that both client and server can leverage it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing UT Closes #40375 from zhengruifeng/connect_mv_literal_common. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 43caae31dfa05b3d237acfa3115bd0e7b4e540ed) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 13 March 2023, 06:29:37 UTC
2e4238c [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema ### What changes were proposed in this pull request? Fixes `spark.createDataFrame` to apply the given schema so that it works with non-nullable data types. ### Why are the changes needed? Currently `spark.createDataFrame` won't work with a non-nullable schema as below: ```py >>> from pyspark.sql.types import * >>> schema_false = StructType([StructField("id", IntegerType(), False)]) >>> spark.createDataFrame([[1]], schema=schema_false) Traceback (most recent call last): ... pyspark.errors.exceptions.connect.AnalysisException: [NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's required to be non-nullable. ``` whereas it works fine with a nullable schema: ```py >>> from pyspark.sql.types import * >>> schema_true = StructType([StructField("id", IntegerType(), True)]) >>> spark.createDataFrame([[1]], schema=schema_true) DataFrame[id: int] ``` ### Does this PR introduce _any_ user-facing change? `spark.createDataFrame` with a non-nullable schema will work. ### How was this patch tested? Added related tests. Closes #40382 from ueshin/issues/SPARK-42679/non-nullable. Lead-authored-by: panbingkun <pbk1982@gmail.com> Co-authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit fa5ca7fe87290c81ccd2ba214c8478beefe0c5ec) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 13 March 2023, 03:11:21 UTC
cb7ae04 [SPARK-42747][ML] Fix incorrect internal status of LoR and AFT ### What changes were proposed in this pull request? Add a hook `onParamChange` in `Params.{set, setDefault, clear}`, so that subclasses can update their internal status within it. ### Why are the changes needed? In 3.1, we added internal auxiliary variables in LoR and AFT to optimize prediction/transformation. In LoR, when users call `model.{setThreshold, setThresholds}`, the internal status will be correctly updated. But users can still call `model.set(model.threshold, value)`, and then the status will not be updated. Also, when users call `model.clear(model.threshold)`, the status should be updated with the default threshold value 0.5. For example: ``` import org.apache.spark.ml.linalg._ import org.apache.spark.ml.classification._ val df = Seq((1.0, 1.0, Vectors.dense(0.0, 5.0)), (0.0, 2.0, Vectors.dense(1.0, 2.0)), (1.0, 3.0, Vectors.dense(2.0, 1.0)), (0.0, 4.0, Vectors.dense(3.0, 3.0))).toDF("label", "weight", "features") val lor = new LogisticRegression().setWeightCol("weight") val model = lor.fit(df) val vec = Vectors.dense(0.0, 5.0) val p0 = model.predict(vec) // return 0.0 model.setThreshold(0.05) // change status val p1 = model.set(model.threshold, 0.5).predict(vec) // return 1.0; but should be 0.0 val p2 = model.clear(model.threshold).predict(vec) // return 1.0; but should be 0.0 ``` What makes it even worse is that `pyspark.ml` always sets params via `model.set(model.threshold, value)`, so the internal status easily gets out of sync; see the example in [SPARK-42747](https://issues.apache.org/jira/browse/SPARK-42747) ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? added ut Closes #40367 from zhengruifeng/ml_param_hook. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 5a702f22f49ca6a1b6220ac645e3fce70ec5189d) Signed-off-by: Sean Owen <srowen@gmail.com> 11 March 2023, 14:46:00 UTC
d3fd9ff [SPARK-42691][CONNECT][PYTHON] Implement Dataset.semanticHash ### What changes were proposed in this pull request? Implement `Dataset.semanticHash` for scala and python API of Spark connect. ### Why are the changes needed? Implement `Dataset.semanticHash` for scala and python API of Spark connect. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New test cases. Closes #40366 from beliefer/SPARK-42691. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 79b5abed8bdcd5f9657b4bcff2c5ea0c767d0bf6) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 11 March 2023, 09:04:05 UTC
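A small sketch of the new API from Python, assuming an active session `spark`; two DataFrames built the same way should canonicalize to the same plan and therefore produce the same hash:

```python
from pyspark.sql.functions import col

df1 = spark.range(10).withColumn("v", col("id") * 2)
df2 = spark.range(10).withColumn("v", col("id") * 2)

# semanticHash is derived from the canonicalized plan, so these should match.
assert df1.semanticHash() == df2.semanticHash()
```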
67ccd8f [SPARK-42721][CONNECT][FOLLOWUP] Apply scalafmt to LoggingInterceptor ### What changes were proposed in this pull request? This is a follow-up to fix Scala linter failure at `LoggingInterceptor.scala`. ### Why are the changes needed? To recover CI. - **master**: https://github.com/apache/spark/actions/runs/4389407261/jobs/7686936044 - **branch-3.4**: https://github.com/apache/spark/actions/runs/4389407870/jobs/7686935027 ``` The scalafmt check failed on connector/connect at following occurrences: Requires formatting: LoggingInterceptor.scala Before submitting your change, please make sure to format your code using the following command: ./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl connector/connect/common -pl connector/connect/server -pl connector/connect/client/jvm Error: Process completed with exit code 1. ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the CI. Closes #40374 from dongjoon-hyun/SPARK-42721. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 967ce371ba1645d9a24dbf01a1b64faf569e8863) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 March 2023, 03:43:02 UTC
516a202 [SPARK-42721][CONNECT] RPC logging interceptor ### What changes were proposed in this pull request? This adds an gRPC interceptor in spark-connect server. It logs all the incoming RPC requests and responses. - How to enable: Set interceptor config. e.g. ./sbin/start-connect-server.sh --conf spark.connect.grpc.interceptor.classes=org.apache.spark.sql.connect.service.LoggingInterceptor --jars connector/connect/server/target/spark-connect_*-SNAPSHOT.jar - Sample output: 23/03/08 10:54:37 INFO LoggingInterceptor: Received RPC Request spark.connect.SparkConnectService/ExecutePlan (id 1868663481): { "client_id": "6844bc44-4411-4481-8109-a10e3a836f97", "user_context": { "user_id": "raghu" }, "plan": { "root": { "common": { "plan_id": "37" }, "show_string": { "input": { "common": { "plan_id": "36" }, "read": { "data_source": { "format": "csv", "schema": "", "paths": ["file:///tmp/x-in"] } } }, "num_rows": 20, "truncate": 20 } } }, "client_type": "_SPARK_CONNECT_PYTHON" } ### Why are the changes needed? This is useful in development. It might be useful to debug some problems in production as well. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - Manually in development - Unit test Closes #40342 from rangadi/logging-interceptor. Authored-by: Raghu Angadi <raghu.angadi@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 19cb8d7014e03d828794a637bc67d09fc84650ad) Signed-off-by: Herman van Hovell <herman@databricks.com> 10 March 2023, 23:52:02 UTC
d79d102 [SPARK-42667][CONNECT][FOLLOW-UP] SparkSession created by newSession should not share the channel ### What changes were proposed in this pull request? A SparkSession created by newSession should not share the channel. This is because `stop` might be called on a SparkSession, which shuts down the channel it uses. If the channel is shared, other active SparkSessions sharing that channel would break. ### Why are the changes needed? This fixes the issue where stopping one SparkSession causes other active SparkSessions to stop working in Spark Connect. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #40346 from amaliujia/rw-session-do-not-share-channel. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit e5f56e51dcbffb1f79dc00e8493e946ce1209cdc) Signed-off-by: Herman van Hovell <herman@databricks.com> 10 March 2023, 23:38:31 UTC
6d8ea2f [SPARK-42398][SQL][FOLLOWUP] DelegatingCatalogExtension should override the new createTable method ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/40049 to fix a small issue: `DelegatingCatalogExtension` should also override the new `createTable` function and call the session catalog, instead of using the default implementation. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A, too trivial. Closes #40369 from cloud-fan/api. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 061bd92375ae9232c9b901ab0760f9712790c26f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 March 2023, 21:00:21 UTC
7e71378 Revert "[SPARK-41498] Propagate metadata through Union" This reverts commit 827ca9b82476552458e8ba7b01b90001895e8384. ### What changes were proposed in this pull request? After more thinking, it's a bit fragile to propagate metadata columns through Union. We have added quite some new fields in the file source `_metadata` metadata column such as `row_index`, `block_start`, etc. Some are parquet only. The same thing may happen in other data sources as well. If one day one table under Union adds a new metadata column (or add a new field if the metadata column is a struct type), but other tables under Union do not have this new column, then Union can't propagate metadata columns and the query will suddenly fail to analyze. To be future-proof, let's revert this support. ### Why are the changes needed? to make the analysis behavior more robust. ### Does this PR introduce _any_ user-facing change? Yes, but propagating metadata columns through Union is not released yet. ### How was this patch tested? N/A Closes #40371 from cloud-fan/revert. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 164db5ba3c39614017f5ef6428194a442d79b425) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 March 2023, 18:30:30 UTC
24f3d4d [SPARK-42743][SQL] Support analyze TimestampNTZ columns ### What changes were proposed in this pull request? Support analyzing TimestampNTZ columns: ``` ANALYZE TABLE table_name [ PARTITION clause ] COMPUTE STATISTICS [ NOSCAN | FOR COLUMNS col1 [, ...] | FOR ALL COLUMNS ] ``` ### Why are the changes needed? Support computing statistics of TimestampNTZ columns, which can be used for optimizations. ### Does this PR introduce _any_ user-facing change? No, the TimestampNTZ type is not released yet. ### How was this patch tested? Updated existing UT Closes #40362 from gengliangwang/analyzeColumn. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com> 10 March 2023, 17:41:19 UTC
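A short end-to-end sketch of the statement above with a `TIMESTAMP_NTZ` column; the table name and data source are placeholders, assuming the TIMESTAMP_NTZ type as it ships in 3.4:

```scala
// Create a table with a TIMESTAMP_NTZ column, analyze it, then inspect the
// collected column-level statistics.
spark.sql("CREATE TABLE events (ts TIMESTAMP_NTZ, v INT) USING parquet")
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS ts")
spark.sql("DESC EXTENDED events ts").show(truncate = false)
```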
4bbdcbc [SPARK-42745][SQL] Improved AliasAwareOutputExpression works with DSv2 ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/37525 (SPARK-40086 / SPARK-42049) the following simple query containing a subselect expression: ``` select (select sum(id) from t1) ``` fails with: ``` 09:48:57.645 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch$lzycompute(BatchScanExec.scala:47) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.batch(BatchScanExec.scala:47) at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.hashCode(BatchScanExec.scala:60) at scala.runtime.Statics.anyHash(Statics.java:122) ... at org.apache.spark.sql.catalyst.trees.TreeNode.hashCode(TreeNode.scala:249) at scala.runtime.Statics.anyHash(Statics.java:122) at scala.collection.mutable.HashTable$HashUtils.elemHashCode(HashTable.scala:416) at scala.collection.mutable.HashTable$HashUtils.elemHashCode$(HashTable.scala:416) at scala.collection.mutable.HashMap.elemHashCode(HashMap.scala:44) at scala.collection.mutable.HashTable.addEntry(HashTable.scala:149) at scala.collection.mutable.HashTable.addEntry$(HashTable.scala:148) at scala.collection.mutable.HashMap.addEntry(HashMap.scala:44) at scala.collection.mutable.HashTable.init(HashTable.scala:110) at scala.collection.mutable.HashTable.init$(HashTable.scala:89) at scala.collection.mutable.HashMap.init(HashMap.scala:44) at scala.collection.mutable.HashMap.readObject(HashMap.scala:195) ... at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:85) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1520) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) ``` when DSv2 is enabled. This PR proposes to fix `BatchScanExec` so that its `equals()` and `hashCode()` don't throw an NPE under any circumstances. But if we dig deeper we realize that the NPE occurs since https://github.com/apache/spark/pull/37525 and the root cause of the problem is changing `AliasAwareOutputExpression.aliasMap` from immutable to mutable. The mutable map deserialization invokes the `hashCode()` of the keys while that is not the case with immutable maps. In this case the key is a subquery expression whose plan contains the `BatchScanExec`. Please note that the mutability of `aliasMap` shouldn't be an issue as it is a `private` field of `AliasAwareOutputExpression` (though adding a simple `.toMap` would also help to avoid the NPE). Based on the above findings this PR also proposes making `aliasMap` transient as it isn't needed on executors.
A side question is whether adding any subquery expressions to `AliasAwareOutputExpression.aliasMap` makes sense at all, because `AliasAwareOutputExpression.projectExpression()` mainly projects `child.outputPartitioning` and `child.outputOrdering`, which can't contain subquery expressions. But there are a few exceptions (`SortAggregateExec`, `TakeOrderedAndProjectExec`) where `AliasAwareQueryOutputOrdering.orderingExpressions` doesn't come from the `child`, and leaving those expressions in the map does no harm. ### Why are the changes needed? To fix a regression introduced by https://github.com/apache/spark/pull/37525. ### Does this PR introduce _any_ user-facing change? Yes, the query works again. ### How was this patch tested? Added new UT. Closes #40364 from peter-toth/SPARK-42745-improved-aliasawareoutputexpression-with-dsv2. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 93d5816b3f1460b405c9828ed5ae646adfa236aa) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 March 2023, 12:58:54 UTC
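A rough reproduction sketch of the failing shape under DSv2; the path and the conf toggle below are assumptions, used only to route the built-in Parquet source through `BatchScanExec`:

```scala
// Force V2 code paths for built-in file sources, then run the scalar-subquery
// query from the description above; before this fix the serialized task could
// hit the NPE in BatchScanExec.hashCode during aliasMap deserialization.
spark.conf.set("spark.sql.sources.useV1SourceList", "")
spark.range(10).write.mode("overwrite").parquet("/tmp/t1")
spark.read.parquet("/tmp/t1").createOrReplaceTempView("t1")
spark.sql("select (select sum(id) from t1)").show()
```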
0357c9f [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE ### What changes were proposed in this pull request? This PR fixes a few issues of parameterized query: 1. replace placeholders in CTE/subqueries 2. don't replace placeholders in non-DML commands as it may store the original SQL text with placeholders and we can't resolve it later (e.g. CREATE VIEW). ### Why are the changes needed? Make the parameterized query feature complete. ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? New tests Closes #40333 from cloud-fan/parameter. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f8966e7eee1d7f2db7b376d557d5ff6658c80653) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 March 2023, 06:08:55 UTC
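A hedged example of a placeholder inside a CTE, the case this change completes, assuming the 3.4 parameterized `sql` overload that takes named parameters as a map of SQL literal strings (the argument representation changed in later releases):

```scala
// The :threshold marker is substituted inside the CTE and the scalar subquery.
// Passing "5" as a SQL literal string follows the 3.4 API as an assumption.
val df = spark.sql(
  """WITH limits AS (SELECT :threshold AS max_id)
    |SELECT id FROM range(10) WHERE id < (SELECT max_id FROM limits)""".stripMargin,
  Map("threshold" -> "5"))
df.show()
```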
bc16710 Preparing development version 3.4.1-SNAPSHOT 10 March 2023, 03:26:54 UTC
4000d68 Preparing Spark release v3.4.0-rc4 10 March 2023, 03:26:48 UTC
49cf58e [SPARK-42739][BUILD] Ensure release tag to be pushed to release branch ### What changes were proposed in this pull request? In the release script, add a check to ensure the release tag is pushed to the release branch. ### Why are the changes needed? To ensure the success of an RC cut. Otherwise, release conductors have to check that manually. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. ``` ~/spark [_d_branch] $ git commit -am '_d_commmit' ... ~/spark [_d_branch] $ git tag '_d_tag' ~/spark [_d_branch] $ git push origin _d_tag ~/spark [_d_branch] $ git branch -r --contains tags/_d_tag | grep origin ~/spark [_d_branch] $ echo $? 1 ~/spark [_d_branch] $ git push origin HEAD:_d_branch ... ~/spark [_d_branch] $ git branch -r --contains tags/_d_tag | grep origin origin/_d_branch ~/spark [_d_branch] $ echo $? 0 ``` Closes #40357 from xinrong-meng/chk_release. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Xinrong Meng <xinrong@apache.org> (cherry picked from commit 785188dd8b5e74510c29edbff5b9991d88855e43) Signed-off-by: Xinrong Meng <xinrong@apache.org> 10 March 2023, 03:04:58 UTC
a01f4d6 [SPARK-42725][CONNECT][PYTHON] Make LiteralExpression support array params ### What changes were proposed in this pull request? Make LiteralExpression support arrays. ### Why are the changes needed? MLlib requires literals to carry array params, like `IntArrayParam`, `DoubleArrayArrayParam`. Note that this PR doesn't affect the existing `functions.lit` method, which applies the unresolved `CreateArray` expression to support array input. ### Does this PR introduce _any_ user-facing change? No, dev-only ### How was this patch tested? added UT Closes #40349 from zhengruifeng/connect_py_ml_lit. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit d6d0fc74d36567c5163878656de787d6fb418604) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 10 March 2023, 02:18:58 UTC
12c7e75 Revert "[SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE" This reverts commit 80127286c5fd9cd472c868e0bf8ebcec4cf399dc. 10 March 2023, 01:49:02 UTC
8012728 [SPARK-42702][SPARK-42623][SQL] Support parameterized query in subquery and CTE ### What changes were proposed in this pull request? This PR fixes a few issues of parameterized query: 1. replace placeholders in CTE/subqueries 2. don't replace placeholders in non-DML commands as it may store the original SQL text with placeholders and we can't resolve it later (e.g. CREATE VIEW). ### Why are the changes needed? make the parameterized query feature complete ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? new tests Closes #40333 from cloud-fan/parameter. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a7807038d5be5e46634d5bf807dd12fa63546b33) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 March 2023, 01:31:40 UTC
94a2afc [SPARK-42726][CONNECT][PYTHON] Implement `DataFrame.mapInArrow` ### What changes were proposed in this pull request? Implement `DataFrame.mapInArrow`. ### Why are the changes needed? Parity with vanilla PySpark. ### Does this PR introduce _any_ user-facing change? Yes. `DataFrame.mapInArrow` is supported as shown below. ``` >>> import pyarrow >>> df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age")) >>> def filter_func(iterator): ... for batch in iterator: ... pdf = batch.to_pandas() ... yield pyarrow.RecordBatch.from_pandas(pdf[pdf.id == 1]) ... >>> df.mapInArrow(filter_func, df.schema).show() +---+---+ | id|age| +---+---+ | 1| 21| +---+---+ ``` ### How was this patch tested? Unit tests. Closes #40350 from xinrong-meng/mapInArrowImpl. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f35c2cbdae1c7d35f61b437d056bd363cddbea61) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 10 March 2023, 01:08:57 UTC
7e4f870 [SPARK-42733][CONNECT][PYTHON] Fix DataFrameWriter.save to work without path parameter ### What changes were proposed in this pull request? Fixes `DataFrameWriter.save` to work without a path parameter. ### Why are the changes needed? `DataFrameWriter.save` should work without a path parameter because some data sources, such as jdbc and noop, work without one. ```py >>> print(spark.range(10).write.format("noop").mode("append").save()) Traceback (most recent call last): ... AssertionError: Invalid configuration of WriteCommand, neither path or table present. ``` ### Does this PR introduce _any_ user-facing change? Data sources that don't need a path parameter will now work. ```py >>> print(spark.range(10).write.format("noop").mode("append").save()) None ``` ### How was this patch tested? Added a test. Closes #40356 from ueshin/issues/SPARK-42733/save. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit c2ee08baa98f21f664ec1966d63578346e4eebd8) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 10 March 2023, 00:51:45 UTC
e38e619 [SPARK-42630][CONNECT][PYTHON] Introduce UnparsedDataType and delay parsing DDL string until SparkConnectClient is available ### What changes were proposed in this pull request? Introduces `UnparsedDataType` and delays parsing the DDL string for Python UDFs until `SparkConnectClient` is available. `UnparsedDataType` carries the DDL string and parses it on the server side. It should not be enclosed in other data types. Also changes `createDataFrame` to use the proto `DDLParse`. ### Why are the changes needed? Currently `parse_data_type` depends on `PySparkSession`, which creates a local PySpark session, but that won't be available on the client side. When `SparkConnectClient` is available, we can use the new proto `DDLParse` to parse the data types given as strings. ### Does this PR introduce _any_ user-facing change? The UDF's `returnType` attribute could be a string in Spark Connect if it is provided as a string. ### How was this patch tested? Existing tests. Closes #40260 from ueshin/issues/SPARK-42630/ddl_parse. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c0b1735c0bfeb1ff645d146e262d7ccd036a590e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 March 2023, 11:14:19 UTC