https://github.com/apache/spark

759511b Preparing Spark release v3.4.0-rc2 02 March 2023, 06:25:32 UTC
4fa4d2f [SPARK-41823][CONNECT][FOLLOW-UP][TESTS] Disable ANSI mode in ProtoToParsedPlanTestSuite ### What changes were proposed in this pull request? This PR proposes to disable ANSI mode in `ProtoToParsedPlanTestSuite`. ### Why are the changes needed? The plan suite is independent from ANSI mode as it does not check the result itself, and some tests fail when ANSI mode is on (https://github.com/apache/spark/actions/runs/4299081862/jobs/7493923661): ``` [info] - function_to_date_with_format *** FAILED *** (12 milliseconds) [info] Expected and actual plans do not match: [info] [info] === Expected Plan === [info] Project [cast(gettimestamp(s#0, yyyy-MM-dd, TimestampType, Some(America/Los_Angeles), false) as date) AS to_date(s, yyyy-MM-dd)#0] [info] +- LocalRelation <empty>, [d#0, t#0, s#0, x#0L, wt#0] [info] [info] [info] === Actual Plan === [info] Project [cast(gettimestamp(s#0, yyyy-MM-dd, TimestampType, Some(America/Los_Angeles), true) as date) AS to_date(s, yyyy-MM-dd)#0] [info] +- LocalRelation <empty>, [d#0, t#0, s#0, x#0L, wt#0] (ProtoToParsedPlanTestSuite.scala:129) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564) [info] at org.scalatest.Assertions.fail(Assertions.scala:933) [info] at org.scalatest.Assertions.fail$(Assertions.scala:929) [info] at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1564) [info] at org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite.$anonfun$createTest$2(ProtoToParsedPlanTestSuite.scala:129) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) ``` ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. Closes #40245 from HyukjinKwon/SPARK-41823. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 41d3103f4d69a9ec25d9f78f3f94ff5f3b64ef78) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 March 2023, 04:47:23 UTC
d8fa508 [SPARK-42521][SQL] Add NULLs for INSERTs with user-specified lists of fewer columns than the target table ### What changes were proposed in this pull request? Add NULLs for INSERTs with user-specified lists of fewer columns than the target table. This is done by updating the semantics of the `USE_NULLS_FOR_MISSING_DEFAULT_COLUMN_VALUES` SQLConf to only apply for INSERTs with explicit user-specified column lists, and changing it to true by default. ### Why are the changes needed? This behavior is consistent with other query engines. ### Does this PR introduce _any_ user-facing change? Yes, per above. ### How was this patch tested? Unit test coverage in `InsertSuite`. Closes #40229 from dtenedor/defaults-insert-nulls. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit d2a527a545c57c7fe1a736016b4e26666d2e571b) Signed-off-by: Gengliang Wang <gengliang@apache.org> 02 March 2023, 04:34:22 UTC
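A minimal sketch of the new default behaviour described in the commit above (the table name `insert_demo` and the values are illustrative, not from the PR): an INSERT with an explicit column list covering only some columns fills the remaining columns with NULL (or their declared DEFAULT).

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE TABLE insert_demo (age INT, name STRING) USING parquet")
spark.sql("INSERT INTO insert_demo (age) VALUES (42)")  # no value supplied for `name`
spark.sql("SELECT * FROM insert_demo").show()
# Expected with the new default behaviour: `name` comes back as null
# +---+----+
# |age|name|
# +---+----+
# | 42|null|
# +---+----+
```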
a1d5e89 [SPARK-42631][CONNECT] Support custom extensions in Scala client ### What changes were proposed in this pull request? This PR adds public interfaces for creating `Dataset` and `Column` instances, and for executing commands. These interfaces only allow creating `Plan`s and `Expression`s that contain an `extension` to limit what we need to expose. ### Why are the changes needed? Required to implement extensions to the Scala Spark Connect client. ### Does this PR introduce _any_ user-facing change? Yes, adds new public interfaces (see above). ### How was this patch tested? Added unit tests. Closes #40234 from tomvanbussel/SPARK-34827. Authored-by: Tom van Bussel <tom.vanbussel@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit a9c5efa413335c02621d79242e83595c4b932bf0) Signed-off-by: Herman van Hovell <herman@databricks.com> 02 March 2023, 00:42:37 UTC
e194260 [SPARK-42632][CONNECT] Fix scala paths in integration tests ### What changes were proposed in this pull request? We use the current scala version to figure out which jar to load. ### Why are the changes needed? The jar resolution in the connect client tests can resolve the jar for the wrong scala version if you are working with multiple scala versions. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It is a test. Closes #40235 from hvanhovell/SPARK-42632. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1a457f9ed14810667b611155b586ebda5a95fece) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 March 2023, 00:16:18 UTC
93289f2 [SPARK-42637][CONNECT] Add SparkSession.stop() ### What changes were proposed in this pull request? Add `SparkSession.stop()` to SparkSession. ### Why are the changes needed? API parity. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Manually tested it. Closes #40239 from hvanhovell/SPARK-42637. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1667d3152603c1f6f0fb691e0899839898090ec6) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 March 2023, 00:14:41 UTC
aefebe2 [SPARK-42458][CONNECT][PYTHON] Fixes createDataFrame to support DDL string as schema ### What changes were proposed in this pull request? Fixes `createDataFrame` to support DDL string as schema. ### Why are the changes needed? Currently DDL string as schema is ignored when the data is Python objects and the inference fails: ```py >>> spark.createDataFrame([(100, None)], "age INT, name STRING").show() Traceback (most recent call last): ... ValueError: Some of types cannot be determined after inferring, a StructType Schema is required in this case ``` ### Does this PR introduce _any_ user-facing change? The DDL string as schema will not be ignored when the schema can't be inferred from the given data. ```py >>> spark.createDataFrame([(100, None)], "age INT, name STRING").show() +---+----+ |age|name| +---+----+ |100|null| +---+----+ ``` ### How was this patch tested? Enabled related tests. Closes #40240 from ueshin/issues/SPARK-42458/schema_str. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4ccc88c49f330d366d15b6e9abeee6e97504006f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 March 2023, 00:13:53 UTC
1e0d88b [SPARK-42593][PS] Deprecate & remove the APIs that will be removed in pandas 2.0 ### What changes were proposed in this pull request? This PR proposes to mark the APIs as deprecated or remove the APIs that will be deprecated or removed in the upcoming pandas 2.0.0 release. See [What's new in 2.0.0](https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html#removal-of-prior-version-deprecations-changes) for more detail. ### Why are the changes needed? We should match the behavior to the pandas API. ### Does this PR introduce _any_ user-facing change? Yes, some APIs will be removed, so they will no longer be available. ### How was this patch tested? Fixed UTs where necessary. Closes #40216 from itholic/SPARK-42593. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9d2fe90c9c88e5cc781b8058087a1cb1bf94f22d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 01 March 2023, 09:31:41 UTC
48874d9 [SPARK-42628][SQL][DOCS] Add a migration note for bloom filter join ### What changes were proposed in this pull request? This PR aims to add a migration note for bloom filter join. ### Why are the changes needed? SPARK-38841 enabled bloom filter joins by default in Apache Spark 3.4.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. Closes #40231 from dongjoon-hyun/SPARK-42628. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 69d15c3a0e0184d5d2b1a5587d7a030969509cb6) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 March 2023, 09:29:09 UTC
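For readers who want to opt out of this 3.4 default, a one-line sketch follows; the config key `spark.sql.optimizer.runtime.bloomFilter.enabled` is the one associated with SPARK-38841, but it should be double-checked against the migration note this commit adds.

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Disable the runtime bloom filter join optimization that SPARK-38841 turns on by default in 3.4.
spark.conf.set("spark.sql.optimizer.runtime.bloomFilter.enabled", "false")
```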
0cebb4b [SPARK-41870][CONNECT][PYTHON] Fix createDataFrame to handle duplicated column names ### What changes were proposed in this pull request? Fixes `createDataFrame` to handle duplicated column names. ### Why are the changes needed? Currently the following command returns a wrong result: ```py >>> spark.createDataFrame([(1, 2)], ["c", "c"]).show() +---+---+ | c| c| +---+---+ | 2| 2| +---+---+ ``` ### Does this PR introduce _any_ user-facing change? Duplicated column names will work: ```py >>> spark.createDataFrame([(1, 2)], ["c", "c"]).show() +---+---+ | c| c| +---+---+ | 1| 2| +---+---+ ``` ### How was this patch tested? Enabled the related test. Closes #40227 from ueshin/issues/SPARK-41870/dup_cols. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 1ff93ae93d87ed22281aa68fb82ea869754f67c1) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 01 March 2023, 08:59:56 UTC
330b03c [SPARK-41725][CONNECT] Eager Execution of DF.sql() ### What changes were proposed in this pull request? This patch allows for eager execution of SQL statements using the Spark Connect DataFrame API. The implementation of the patch is as follows: When `spark.sql` is called, the client sends a command to the server including the SQL statement. The server will evaluate the query and execute the side-effects if necessary. If the query was a command, it will return the results as a `Relation.LocalRelation` back to the client; otherwise it will return a `Relation.SQL` to the client. The client then simply forwards the received payload to the next operator. ### Why are the changes needed? Compatibility ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #40160 from grundprinzip/eager_sql_v2. Authored-by: Martin Grund <martin.grund@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 51a87ac549120d9fe1fe4503ca8825785d9e886d) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 01 March 2023, 08:56:21 UTC
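A hedged sketch of what this looks like from a Spark Connect client (the remote URL and table name are placeholders, and a running Spark Connect server is assumed): commands take effect as soon as `spark.sql` returns, while plain queries remain lazy.

```py
from pyspark.sql import SparkSession

# `remote()` requires a running Spark Connect server; the URL here is a placeholder.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# A command: the side effect happens on the server as soon as spark.sql() returns.
spark.sql("CREATE TABLE IF NOT EXISTS eager_demo (id INT) USING parquet")

# A plain query: still returned as a lazy plan and only executed by an action.
df = spark.sql("SELECT * FROM eager_demo")
df.show()
```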
09abea0 [SPARK-42611][SQL] Insert char/varchar length checks for inner fields during resolution ### What changes were proposed in this pull request? This PR adds char/varchar length checks for inner fields during resolution when struct fields are reordered. ### Why are the changes needed? These checks are needed to handle nested char/varchar columns correctly. Prior to this change, we would lose the raw type information when constructing nested attributes. As a result, we will not insert proper char/varchar length checks. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR comes with tests that would previously fail. Closes #40206 from aokolnychyi/spark-42611. Authored-by: aokolnychyi <aokolnychyi@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d7d8af0dbb47e152b280226a7afcf0771b5a5ae8) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 March 2023, 07:50:41 UTC
ab2862e [SPARK-42616][SQL] SparkSQLCLIDriver shall only close started hive sessionState ### What changes were proposed in this pull request? the hive sessionState initiated in SparkSQLCLIDriver will be started later in HiveClient during communicating with HMS if necessary. There are some cases that it will not get started: - fail early before reaching HiveClient - HiveClient is not used, e.g., v2 catalog only - ... ### Why are the changes needed? Bugfix, an app will end up with unexpected states, e.g., ```java bin/spark-sql -c spark.sql.catalogImplementation=in-memory -e "select 1" 23/02/28 13:40:22 WARN Utils: Your hostname, hulk.local resolves to a loopback address: 127.0.0.1; using 10.221.102.180 instead (on interface en0) 23/02/28 13:40:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/02/28 13:40:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Spark master: local[*], Application Id: local-1677562824027 1 Time taken: 2.578 seconds, Fetched 1 row(s) 23/02/28 13:40:28 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 23/02/28 13:40:28 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist 23/02/28 13:40:29 WARN Hive: Failed to register all functions. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1742) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:83) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:133) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3607) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3659) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3639) at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3901) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:395) at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:339) at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:319) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:288) at org.apache.hadoop.hive.ql.session.SessionState.unCacheDataNucleusClassLoaders(SessionState.java:1596) at org.apache.hadoop.hive.ql.session.SessionState.close(SessionState.java:1586) at org.apache.hadoop.hive.cli.CliSessionState.close(CliSessionState.java:66) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.$anonfun$main$2(SparkSQLCLIDriver.scala:153) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2079) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
scala.util.Try$.apply(Try.scala:213) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1740) ... 31 more Caused by: MetaException(message:Version information not found in metastore. ) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:83) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:92) at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6902) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:162) at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70) ... 36 more Caused by: MetaException(message:Version information not found in metastore. ) at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:7810) at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:7788) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101) at com.sun.proxy.$Proxy37.verifySchema(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMSForConf(HiveMetaStore.java:595) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:588) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:655) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:431) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:79) ... 40 more ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? locally verified Closes #40211 from yaooqinn/SPARK-42616. 
Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 41c26b61e64ba260915d2d271cb49ad6d82d4213) Signed-off-by: Kent Yao <yao@apache.org> 01 March 2023, 01:48:04 UTC
e67d39f [SPARK-41868][CONNECT][PYTHON] Fix createDataFrame to support durations ### What changes were proposed in this pull request? Fixes `createDataFrame` to support durations. ### Why are the changes needed? Currently the following command: ```py spark.createDataFrame(pd.DataFrame({"a": [timedelta(microseconds=123)]})) ``` raises an error: ``` [UNSUPPORTED_ARROWTYPE] Unsupported arrow type Duration(NANOSECOND). ``` because Arrow takes a different type for `timedelta` objects from what Spark expects. ### Does this PR introduce _any_ user-facing change? `timedelta` objects will be properly converted to `DayTimeIntervalType`. ### How was this patch tested? Enabled the related test. Closes #40226 from ueshin/issues/SPARK-41868/duration. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit b43b1337066a2d48d6de1df9f3955cddc105f57a) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 01 March 2023, 01:42:10 UTC
93612f4 [SPARK-42624][PYTHON][TESTS] Reorganize imports in test_functions ### What changes were proposed in this pull request? Reorganizes imports in `python/pyspark/sql/tests/test_functions.py`. ### Why are the changes needed? Currently, the imports in the file `python/pyspark/sql/tests/test_functions.py` are not well organized. There are individual imports in test functions, or imports by function names or module names, etc. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Modified tests. Closes #40223 from ueshin/issues/SPARK-42624/imports. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8dbc47f1b987249df398435fc86cdb86edaee8c8) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 01 March 2023, 00:43:43 UTC
6a8d281 [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client ### What changes were proposed in this pull request? This PR introduces a new client-streaming RPC service "AddArtifacts" to handle the transfer of artifacts from the client to the server. New message types `AddArtifactsRequest` and `AddArtifactsResponse` are added that specify the format of artifact transfer. An artifact is defined by its `name` and `data` fields. - `name` - The name of the artifact is expected in the form of a "Relative Path" that is made up of a sequence of directories and the final file element. - Examples of "Relative Path"s: `jars/test.jar`, `classes/xyz.class`, `abc.xyz`, `a/b/X.jar`. - The server is expected to maintain the hierarchy of files as defined by their name. (i.e The relative path of the file on the server's filesystem will be the same as the name of the provided artifact). - `data` - The raw data of the artifact. The intention behind the `name` format is to add extensibility to the approach. Through this scheme, the server can maintain the hierarchy/grouping of files in any way the client specifies as well as transfer different "forms" of artifacts without needing any updates to the protocol/code itself. The protocol supports batching and chunking (due to gRPC size limits) of artifacts as required. ### Why are the changes needed? In the decoupled client-server architecture of Spark Connect, a remote client may use a local JAR or a new class in their UDF that may not be present on the server. To handle these cases of missing "artifacts", a protocol for artifact transfer is needed to move the required artifacts from the client side over to the server side. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #40147 from vicennial/artifactProtocol. Authored-by: vicennial <venkata.gudesa@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit d110d8a4e23882d53bad87e4992b62b42ff1de23) Signed-off-by: Herman van Hovell <herman@databricks.com> 01 March 2023, 00:25:02 UTC
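An illustrative, hypothetical sketch (not the actual client code) of the batching/chunking idea described in the commit above: an artifact is identified by a relative-path name such as `jars/test.jar`, and its bytes are split into chunks small enough for individual gRPC messages. The names `CHUNK_SIZE` and `make_requests` are invented for this sketch.

```py
# Illustrative sketch only: chunk an artifact's bytes so each piece can be
# wrapped into one AddArtifactsRequest-sized message.
CHUNK_SIZE = 32 * 1024  # assumed chunk size; the real limit comes from the gRPC configuration


def make_requests(name: str, data: bytes):
    """Yield (name, chunk) pairs for a single artifact such as 'jars/test.jar'."""
    for offset in range(0, len(data), CHUNK_SIZE):
        yield name, data[offset:offset + CHUNK_SIZE]


# The relative-path name lets the server reproduce the same directory layout.
for part in make_requests("jars/test.jar", b"\x00" * 100_000):
    pass  # each `part` would become the payload of one request in the client stream
```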
3d95593 [SPARK-42569][CONNECT] Throw exceptions for unsupported session API ### What changes were proposed in this pull request? Throw exceptions for unsupported session API: 1. newSession 2. getActiveSession 3. getDefaultSession 4. active ### Why are the changes needed? Better indicate to users that an API is not supported. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? N/A Closes #40184 from amaliujia/add_newsession. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 5b9b1142a99cdbba52a1e9ff5b80b497f3b19cf6) Signed-off-by: Herman van Hovell <herman@databricks.com> 01 March 2023, 00:20:33 UTC
dc5efe1 [SPARK-42615][CONNECT][FOLLOW-UP] Implement correct version API in SparkSession for Scala client ### What changes were proposed in this pull request? Following up on https://github.com/apache/spark/pull/40210, add correct `version` in the scala client side. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40222 from amaliujia/improve_sparksession. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 41bc4e6f8068571bc58186f090717bc645914105) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 February 2023, 21:47:55 UTC
8f81d40 [SPARK-42614][CONNECT] Make constructors private[sql] ### What changes were proposed in this pull request? Tiny PR to make most of the scala client classes have a private[sql] constructor. ### Why are the changes needed? Consistency, safety for the future. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No Closes #40207 from hvanhovell/SPARK-42614. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 43baed1bbe5077128385dc1acfc61a573ec1e361) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 February 2023, 18:11:36 UTC
57aa3d1 Revert "[SPARK-40034][SQL] PathOutputCommitters to support dynamic partitions" This reverts commit 5a599dec507786139fb2ecb7ce1a44c83fd06b0d. 28 February 2023, 17:27:00 UTC
937da6d [SPARK-42615][CONNECT][PYTHON] Refactor the AnalyzePlan RPC and add `session.version` ### What changes were proposed in this pull request? Refactor the AnalyzePlan RPC and add `session.version` ### Why are the changes needed? The existing implementation always returns the schema, explain string, input files, etc. together, but in most cases we only need the schema, so we should separate them to avoid unnecessary analysis, optimization and IO (required for `input files`). ### Does this PR introduce _any_ user-facing change? Yes, new session API ``` >>> spark.version '3.5.0-SNAPSHOT' >>> ``` ### How was this patch tested? Updated tests and added tests. Closes #40210 from zhengruifeng/connect_refactor_analyze. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 0c19ea4773b8eaeedc4031cb1c5c3e97f2c2d3b9) Signed-off-by: Herman van Hovell <herman@databricks.com> 28 February 2023, 16:39:44 UTC
2ea7097 [SPARK-42608][SQL] Use full inner field names in resolution errors ### What changes were proposed in this pull request? This PR makes `TableOutputResolver` use full names for inner fields in resolution errors. ### Why are the changes needed? These changes are needed to avoid confusion when there are multiple inner fields with the same name. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR comes with tests. Closes #40202 from aokolnychyi/spark-42608. Authored-by: aokolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 74410ca2f1318177e558f1e719e0cac0f0196807) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 February 2023, 09:00:31 UTC
1cd031a [SPARK-42427][SQL][TESTS][FOLLOW-UP] Disable ANSI for one more conv test case in MathFunctionsSuite ### What changes were proposed in this pull request? This PR proposes to disable ANSI for one more conv test cases in `MathFunctionsSuite`. They are intentionally testing the behaviours when ANSI is disabled. This is another followup of https://github.com/apache/spark/pull/40117. ### Why are the changes needed? To make the ANSI tests pass. It currently fails (https://github.com/apache/spark/actions/runs/4277973597/jobs/7447263656): ``` 2023-02-28T02:34:53.0298317Z [info] - SPARK-36229 inconsistently behaviour where returned value is above the 64 char threshold (105 milliseconds) 2023-02-28T02:34:53.0631723Z 02:34:53.062 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 146.0 (TID 268) 2023-02-28T02:34:53.0632672Z org.apache.spark.SparkArithmeticException: [ARITHMETIC_OVERFLOW] Overflow in function conv(). If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. 2023-02-28T02:34:53.0633557Z at org.apache.spark.sql.errors.QueryExecutionErrors$.arithmeticOverflowError(QueryExecutionErrors.scala:643) 2023-02-28T02:34:53.0634361Z at org.apache.spark.sql.errors.QueryExecutionErrors$.overflowInConvError(QueryExecutionErrors.scala:315) 2023-02-28T02:34:53.0635124Z at org.apache.spark.sql.catalyst.util.NumberConverter$.encode(NumberConverter.scala:68) 2023-02-28T02:34:53.0711747Z at org.apache.spark.sql.catalyst.util.NumberConverter$.convert(NumberConverter.scala:158) 2023-02-28T02:34:53.0712298Z at org.apache.spark.sql.catalyst.util.NumberConverter.convert(NumberConverter.scala) 2023-02-28T02:34:53.0712925Z at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:38) 2023-02-28T02:34:53.0713547Z at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 2023-02-28T02:34:53.0714098Z at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) 2023-02-28T02:34:53.0714552Z at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) 2023-02-28T02:34:53.0715094Z at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) 2023-02-28T02:34:53.0715466Z at org.apache.spark.util.Iterators$.size(Iterators.scala:29) 2023-02-28T02:34:53.0715829Z at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1944) 2023-02-28T02:34:53.0716195Z at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1266) 2023-02-28T02:34:53.0716555Z at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1266) 2023-02-28T02:34:53.0716963Z at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2303) 2023-02-28T02:34:53.0717400Z at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) 2023-02-28T02:34:53.0717857Z at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 2023-02-28T02:34:53.0718332Z at org.apache.spark.scheduler.Task.run(Task.scala:139) 2023-02-28T02:34:53.0718743Z at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) 2023-02-28T02:34:53.0719152Z at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1520) 2023-02-28T02:34:53.0719548Z at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) 2023-02-28T02:34:53.0720001Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 2023-02-28T02:34:53.0720481Z at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
2023-02-28T02:34:53.0720848Z at java.lang.Thread.run(Thread.java:750) 2023-02-28T02:34:53.0721492Z 02:34:53.065 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 146.0 (TID 268) (localhost executor driver): org.apache.spark.SparkArithmeticException: [ARITHMETIC_OVERFLOW] Overflow in function conv(). If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. 2023-02-28T02:34:53.0722264Z at org.apache.spark.sql.errors.QueryExecutionErrors$.arithmeticOverflowError(QueryExecutionErrors.scala:643) 2023-02-28T02:34:53.0722821Z at org.apache.spark.sql.errors.QueryExecutionErrors$.overflowInConvError(QueryExecutionErrors.scala:315) 2023-02-28T02:34:53.0723337Z at org.apache.spark.sql.catalyst.util.NumberConverter$.encode(NumberConverter.scala:68) 2023-02-28T02:34:53.0723963Z at org.apache.spark.sql.catalyst.util.NumberConverter$.convert(NumberConverter.scala:158) 2023-02-28T02:34:53.0724474Z at org.apache.spark.sql.catalyst.util.NumberConverter.convert(NumberConverter.scala) 2023-02-28T02:34:53.0725128Z at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:38) 2023-02-28T02:34:53.0725826Z at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 2023-02-28T02:34:53.0726376Z at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) 2023-02-28T02:34:53.0726827Z at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) 2023-02-28T02:34:53.0727189Z at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) 2023-02-28T02:34:53.0727556Z at org.apache.spark.util.Iterators$.size(Iterators.scala:29) 2023-02-28T02:34:53.0727931Z at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1944) 2023-02-28T02:34:53.0728346Z at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1266) 2023-02-28T02:34:53.0728701Z at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1266) 2023-02-28T02:34:53.0729096Z at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2303) 2023-02-28T02:34:53.0729523Z at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) 2023-02-28T02:34:53.0729966Z at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ``` ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Fixed unittests. Closes #40209 from HyukjinKwon/SPARK-42427-3. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 36b64652eb7aa017206d975ca9da58e77bf5e2e4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 February 2023, 08:20:36 UTC
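A hedged illustration of the behaviour these tests rely on (the query and values are illustrative, not taken from the suite): with ANSI mode disabled, an overflowing `conv()` call completes instead of raising `ARITHMETIC_OVERFLOW`.

```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "false")
# With ANSI off, the overflowing conversion completes without raising;
# with spark.sql.ansi.enabled=true the same query fails with
# [ARITHMETIC_OVERFLOW] Overflow in function conv().
spark.sql("SELECT conv(repeat('9', 65), 10, 16)").show(truncate=False)
```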
a2d5c9c [SPARK-42548][SQL] Add ReferenceAllColumns to skip rewriting attributes ### What changes were proposed in this pull request? Add a new trait `ReferenceAllColumns` that overrides `references` using the children's output. Then we can skip it when rewriting attributes in `transformUpWithNewOutput`. ### Why are the changes needed? There are two reasons for this new trait: 1. it's dangerous to call `references` on an unresolved plan whose references all come from its children 2. it's unnecessary to rewrite attributes when all of the references come from children ### Does this PR introduce _any_ user-facing change? It prevents a potential bug. ### How was this patch tested? Added a test and passed CI. Closes #40154 from ulysses-you/references. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit db0e8224e1e4c928fa2f7046ae13b6aad2b8cad6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 February 2023, 07:53:11 UTC
816774a [SPARK-42610][CONNECT] Add encoders to SQLImplicits ### What changes were proposed in this pull request? Add implicit encoder resolution to `SQLImplicits` class. ### Why are the changes needed? API parity. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added test to `SQLImplicitsTestSuite`. Closes #40205 from hvanhovell/SPARK-42610. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 968f280fd0d488372b0b09738ff9728b45499bef) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 February 2023, 05:32:55 UTC
72b4067 [SPARK-42406] Terminate Protobuf recursive fields by dropping the field ### What changes were proposed in this pull request? The Protobuf deserializer (the `from_protobuf()` function) optionally supports recursive fields up to a certain depth. Currently it uses `NullType` to terminate the recursion. But an `ArrayType` containing `NullType` is not really useful and it does not work with Delta. This PR fixes this by removing the field to terminate recursion rather than using `NullType`. The following example illustrates the difference. E.g. Consider a recursive Protobuf like this: ``` message Node { int value = 1; repeated Node children = 2 // recursive array } message Tree { Node root = 1 } ``` The Catalyst schema with `from_protobuf()` of `Tree` with max recursive depth set to 2 would be: - **Before**: _STRUCT<root: STRUCT<value: int, children: array<STRUCT<value: int, **children: array< void >**>>>>_ - **After**: _STRUCT<root: STRUCT<value: int, children: array<STRUCT<value: int>>>>_ Notice that at the second level, the `children` array is dropped, rather than being defined as `array<void>`. ### Why are the changes needed? - This improves how the Protobuf connector handles recursive fields. It avoids using `void` fields, which are problematic in many scenarios and do not add any information. ### Does this PR introduce _any_ user-facing change? - This changes the schema in a subtle manner when recursive support is enabled. Since this only removes an optional field, it is backward compatible. ### How was this patch tested? - Added multiple unit tests and updated existing ones. Most of the changes for this PR are in the tests. Closes #40141 from rangadi/recursive-fields. Authored-by: Raghu Angadi <raghu.angadi@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit e397a1585d3b089f155ccb4359d5525cd012d5da) Signed-off-by: Gengliang Wang <gengliang@apache.org> 28 February 2023, 04:44:43 UTC
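A hedged sketch of reading such recursive messages with a bounded depth; the option key `recursive.fields.max.depth`, the descriptor path, and the input path are assumptions to verify against the spark-protobuf documentation.

```py
from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: binary files whose `content` column holds serialized Tree messages.
df = spark.read.format("binaryFile").load("/tmp/tree-messages")
parsed = df.select(
    from_protobuf(
        df.content,
        "Tree",
        descFilePath="/tmp/tree.desc",                    # assumed descriptor file
        options={"recursive.fields.max.depth": "2"},      # assumed option key
    ).alias("tree")
)
# With depth 2 and this change, the innermost `children` field is dropped from
# the schema instead of being typed as array<void>.
parsed.printSchema()
```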
26009d4 Revert "[SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client" This reverts commit 27ad5830f9aee616d25301b19aa7059d394fb942. 28 February 2023, 04:05:24 UTC
be88832 [SPARK-42612][CONNECT][PYTHON][TESTS] Enable more parity tests related to functions ### What changes were proposed in this pull request? Enables more parity tests related to `functions`. ### Why are the changes needed? There are still some more tests we should enable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Modified/enabled related tests. Closes #40203 from ueshin/issues/SPARK-42612/tests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a9f20c12f81e8832123ea8ee87213e12846a69f9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 February 2023, 04:01:24 UTC
1f3d9a9 [SPARK-40776][SQL][PROTOBUF][DOCS] Spark-Protobuf docs ### What changes were proposed in this pull request? The goal of this PR is to document spark-protobuf usage. ### Why are the changes needed? Adds a new sql-data-sources-protobuf.md file. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? cc: rangadi Closes #39039 from SandishKumarHN/DOCS-SPARK-40776. Authored-by: SandishKumarHN <sanysandish@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5f685d01a7117511e8a386cc2e0295a3e0f86471) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 February 2023, 03:59:34 UTC
0bb719d [SPARK-42515][BUILD][CONNECT][TESTS] Make `write table` in `ClientE2ETestSuite` sbt local test pass This pr use `LocalProject("assembly") / Compile / Keys.package` instead of `buildTestDeps` to ensure `${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars` is available for testing for `connect-client-jvm` module. On the other hand, this pr also add similar options support for `testOnly` Make `write table` in `ClientE2ETestSuite` sbt local test pass No - Pass GitHub Actions - Manual test: run `test` ``` build/sbt clean "connect-client-jvm/test" ``` **Before** ``` [info] - write table *** FAILED *** (34 milliseconds) [info] io.grpc.StatusRuntimeException: UNKNOWN: org/apache/parquet/hadoop/api/ReadSupport [info] at io.grpc.Status.asRuntimeException(Status.java:535) [info] at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) [info] at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) [info] at scala.collection.Iterator.foreach(Iterator.scala:943) [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) [info] at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:169) [info] at org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:255) [info] at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:338) [info] at org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$13(ClientE2ETestSuite.scala:145) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1520) [info] at org.apache.spark.sql.connect.client.util.RemoteSparkSession.withTable(RemoteSparkSession.scala:169) [info] at org.apache.spark.sql.connect.client.util.RemoteSparkSession.withTable$(RemoteSparkSession.scala:167) [info] at org.apache.spark.sql.ClientE2ETestSuite.withTable(ClientE2ETestSuite.scala:33) [info] at org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$12(ClientE2ETestSuite.scala:143) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) [info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) [info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) [info] at org.scalatest.Suite.run(Suite.scala:1114) [info] at org.scalatest.Suite.run$(Suite.scala:1096) [info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) [info] at org.apache.spark.sql.ClientE2ETestSuite.org$scalatest$BeforeAndAfterAll$$super$run(ClientE2ETestSuite.scala:33) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.sql.ClientE2ETestSuite.run(ClientE2ETestSuite.scala:33) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [info] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [info] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [info] at java.base/java.lang.Thread.run(Thread.java:833) Warning: Unable to serialize throwable of type io.grpc.StatusRuntimeException for TestFailed(Ordinal(0, 15),UNKNOWN: org/apache/parquet/hadoop/api/ReadSupport,ClientE2ETestSuite,org.apache.spark.sql.ClientE2ETestSuite,Some(org.apache.spark.sql.ClientE2ETestSuite),write table,write table,Vector(),Vector(),Some(io.grpc.StatusRuntimeException: UNKNOWN: org/apache/parquet/hadoop/api/ReadSupport),Some(34),Some(IndentedText(- write table,write table,0)),Some(SeeStackDepthException),Some(org.apache.spark.sql.ClientE2ETestSuite),None,pool-1-thread-1-ScalaTest-running-ClientE2ETestSuite,1677123932064), setting it as NotSerializableWrapperException. Warning: Unable to read from client, please check on client for futher details of the problem. 
[info] - writeTo with create and using *** FAILED *** (27 milliseconds) [info] io.grpc.StatusRuntimeException: UNKNOWN: org/apache/parquet/hadoop/api/ReadSupport [info] at io.grpc.Status.asRuntimeException(Status.java:535) [info] at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) [info] at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) [info] at scala.collection.Iterator.foreach(Iterator.scala:943) [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) [info] at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:169) [info] at org.apache.spark.sql.DataFrameWriterV2.executeWriteOperation(DataFrameWriterV2.scala:160) [info] at org.apache.spark.sql.DataFrameWriterV2.create(DataFrameWriterV2.scala:81) [info] at org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$15(ClientE2ETestSuite.scala:162) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1520) [info] at org.apache.spark.sql.connect.client.util.RemoteSparkSession.withTable(RemoteSparkSession.scala:169) [info] at org.apache.spark.sql.connect.client.util.RemoteSparkSession.withTable$(RemoteSparkSession.scala:167) [info] at org.apache.spark.sql.ClientE2ETestSuite.withTable(ClientE2ETestSuite.scala:33) [info] at org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$14(ClientE2ETestSuite.scala:161) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) [info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) Warning: Unable to serialize throwable of type io.grpc.StatusRuntimeException for TestFailed(Ordinal(0, 17),UNKNOWN: org/apache/parquet/hadoop/api/ReadSupport,ClientE2ETestSuite,org.apache.spark.sql.ClientE2ETestSuite,Some(org.apache.spark.sql.ClientE2ETestSuite),writeTo with create and using,writeTo with create and using,Vector(),Vector(),Some(io.grpc.StatusRuntimeException: UNKNOWN: org/apache/parquet/hadoop/api/ReadSupport),Some(27),Some(IndentedText(- writeTo with create and using,writeTo with create and using,0)),Some(SeeStackDepthException),Some(org.apache.spark.sql.ClientE2ETestSuite),None,pool-1-thread-1-ScalaTest-running-ClientE2ETestSuite,1677123932096), setting it as NotSerializableWrapperException. 
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) [info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) [info] at org.scalatest.Suite.run(Suite.scala:1114) [info] at org.scalatest.Suite.run$(Suite.scala:1096) [info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) [info] at org.apache.spark.sql.ClientE2ETestSuite.org$scalatest$BeforeAndAfterAll$$super$run(ClientE2ETestSuite.scala:33) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) Warning: Unable to read from client, please check on client for futher details of the problem. [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.sql.ClientE2ETestSuite.run(ClientE2ETestSuite.scala:33) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [info] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [info] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [info] at java.base/java.lang.Thread.run(Thread.java:833) [info] - writeTo with create and append *** FAILED *** (20 milliseconds) [info] io.grpc.StatusRuntimeException: UNKNOWN: org/apache/parquet/hadoop/api/ReadSupport [info] at io.grpc.Status.asRuntimeException(Status.java:535) [info] at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) [info] at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) [info] at scala.collection.Iterator.foreach(Iterator.scala:943) [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) [info] at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:169) [info] at org.apache.spark.sql.DataFrameWriterV2.executeWriteOperation(DataFrameWriterV2.scala:160) [info] at org.apache.spark.sql.DataFrameWriterV2.create(DataFrameWriterV2.scala:81) [info] at org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$17(ClientE2ETestSuite.scala:175) [info] at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1520) [info] at org.apache.spark.sql.connect.client.util.RemoteSparkSession.withTable(RemoteSparkSession.scala:169) [info] at org.apache.spark.sql.connect.client.util.RemoteSparkSession.withTable$(RemoteSparkSession.scala:167) [info] at org.apache.spark.sql.ClientE2ETestSuite.withTable(ClientE2ETestSuite.scala:33) [info] at org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$16(ClientE2ETestSuite.scala:174) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) [info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) [info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) Warning: Unable to read from client, please check on client for futher details of the problem. [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) Warning: Unable to serialize throwable of type io.grpc.StatusRuntimeException for TestFailed(Ordinal(0, 19),UNKNOWN: org/apache/parquet/hadoop/api/ReadSupport,ClientE2ETestSuite,org.apache.spark.sql.ClientE2ETestSuite,Some(org.apache.spark.sql.ClientE2ETestSuite),writeTo with create and append,writeTo with create and append,Vector(),Vector(),Some(io.grpc.StatusRuntimeException: UNKNOWN: org/apache/parquet/hadoop/api/ReadSupport),Some(20),Some(IndentedText(- writeTo with create and append,writeTo with create and append,0)),Some(SeeStackDepthException),Some(org.apache.spark.sql.ClientE2ETestSuite),None,pool-1-thread-1-ScalaTest-running-ClientE2ETestSuite,1677123932118), setting it as NotSerializableWrapperException. [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) Fatal: Existing as unable to continue running tests, after 3 failing attempts to read event from server socket. 
[info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) [info] at org.scalatest.Suite.run(Suite.scala:1114) [info] at org.scalatest.Suite.run$(Suite.scala:1096) [info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) [info] at org.apache.spark.sql.ClientE2ETestSuite.org$scalatest$BeforeAndAfterAll$$super$run(ClientE2ETestSuite.scala:33) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.sql.ClientE2ETestSuite.run(ClientE2ETestSuite.scala:33) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [info] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [info] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [info] at java.base/java.lang.Thread.run(Thread.java:833) ``` **After** ``` [info] Run completed in 12 seconds, 629 milliseconds. [info] Total number of tests run: 505 [info] Suites: completed 9, aborted 0 [info] Tests: succeeded 505, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. 
``` run `testOnly` ``` build/sbt clean "connect-client-jvm/testOnly *ClientE2ETestSuite" build/sbt clean "connect-client-jvm/testOnly *CompatibilitySuite" ``` **Before** ``` [info] org.apache.spark.sql.ClientE2ETestSuite *** ABORTED *** (27 milliseconds) [info] java.lang.AssertionError: assertion failed: Failed to find the jar inside folder: /spark/connector/connect/server/target [info] at scala.Predef$.assert(Predef.scala:223) [info] at org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67) [info] at org.apache.spark.sql.connect.client.util.SparkConnectServerUtils$.sparkConnect$lzycompute(RemoteSparkSession.scala:64) [info] at org.apache.spark.sql.connect.client.util.SparkConnectServerUtils$.sparkConnect(RemoteSparkSession.scala:59) [info] at org.apache.spark.sql.connect.client.util.SparkConnectServerUtils$.start(RemoteSparkSession.scala:90) [info] at org.apache.spark.sql.connect.client.util.RemoteSparkSession.beforeAll(RemoteSparkSession.scala:120) [info] at org.apache.spark.sql.connect.client.util.RemoteSparkSession.beforeAll$(RemoteSparkSession.scala:118) [info] at org.apache.spark.sql.ClientE2ETestSuite.beforeAll(ClientE2ETestSuite.scala:33) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.sql.ClientE2ETestSuite.run(ClientE2ETestSuite.scala:33) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:750) [info] - compatibility MiMa tests *** FAILED *** (27 milliseconds) [info] java.lang.AssertionError: assertion failed: Failed to find the jar inside folder: /spark/connector/connect/client/jvm/target [info] at scala.Predef$.assert(Predef.scala:223) [info] at org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67) [info] at org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57) [info] at org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53) [info] at org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) [info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [info] at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) [info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) ``` **After** ``` [info] Run completed in 13 seconds, 572 milliseconds. [info] Total number of tests run: 17 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 17, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed [info] Run completed in 1 second, 578 milliseconds. [info] Total number of tests run: 2 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` Closes #40136 from LuciferYang/SPARK-42515. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8c90342e71f04a3019f70e43d38d938f09e1b356) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 February 2023, 03:58:56 UTC
da353e5 [SPARK-42592][SS][DOCS][FOLLOWUP] Add missed commit on reflecting review comment ### What changes were proposed in this pull request? This PR addresses a missed commit reflecting a review comment (https://github.com/apache/spark/pull/40188#discussion_r1119055652) from the previous PR #40188. ### Why are the changes needed? The previously pushed commit didn't reflect all review comments. ### Does this PR introduce _any_ user-facing change? Yes, documentation. ### How was this patch tested? N/A. Closes #40208 from HeartSaVioR/SPARK-42592-follow-up. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit c3347aee9765b4eb4f12e69d21dda8132744366f) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 28 February 2023, 03:54:48 UTC
0b7385a [SPARK-42367][CONNECT][PYTHON] `DataFrame.drop` should handle duplicated columns properly ### What changes were proposed in this pull request? Match https://github.com/apache/spark/pull/40135 ### Why are the changes needed? `DataFrame.drop` should handle duplicated columns properly. We cannot always convert column names to columns when there are multiple columns with the same name. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Enabled tests Closes #40013 from zhengruifeng/connect_drop_duplicate. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 3d900b70c5593326ddc96f094d9abe796308b0e4) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 28 February 2023, 03:26:13 UTC
334e49b [SPARK-42596][CORE][YARN] OMP_NUM_THREADS not set to number of executor cores by default ### What changes were proposed in this pull request? The PR fixes a mistake in SPARK-41188 that removed the PythonRunner code setting OMP_NUM_THREADS to the number of executor cores by default. The author and reviewers thought it was a duplicate. ### Why are the changes needed? SPARK-41188 stopped setting OMP_NUM_THREADS to the number of executor cores by default when running Python UDFs on YARN. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual testing Closes #40199 from jzhuge/SPARK-42596. Authored-by: John Zhuge <jzhuge@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 43b15b31d26bbf1e539728e6c64aab4eda7ade62) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 February 2023, 02:19:30 UTC
9b6b66d [SPARK-42600][SQL] CatalogImpl.currentDatabase shall use NamespaceHelper instead of MultipartIdentifierHelper ### What changes were proposed in this pull request? v2 catalog default namespace may be empty ```java Exception in thread "main" org.apache.spark.sql.AnalysisException: Multi-part identifier cannot be empty. at org.apache.spark.sql.errors.QueryCompilationErrors$.emptyMultipartIdentifierError(QueryCompilationErrors.scala:1887) at org.apache.spark.sql.connector.catalog.CatalogV2Implicits$MultipartIdentifierHelper.<init>(CatalogV2Implicits.scala:152) at org.apache.spark.sql.connector.catalog.CatalogV2Implicits$.MultipartIdentifierHelper(CatalogV2Implicits.scala:150) at org.apache.spark.sql.internal.CatalogImpl.currentDatabase(CatalogImpl.scala:65) ``` ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? locally verified Closes #40192 from yaooqinn/SPARK-42600. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 53c4158386aaa6278a2e428f6ab11eaf71e740d2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 February 2023, 01:58:16 UTC
f3d834d [SPARK-39859][SQL][FOLLOWUP] Only get ColStats when isExtended is true in Describe Column ### What changes were proposed in this pull request? get ColStats in `DescribeColumnExec` when `isExtended` is true ### Why are the changes needed? To make code cleaner ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing test Closes #40139 from huaxingao/describe_followup. Authored-by: huaxingao <huaxin_gao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit cf591580b08889384633c093972e45c289bce979) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 February 2023, 01:35:24 UTC
e60ce3e [SPARK-42121][SQL] Add built-in table-valued functions posexplode, posexplode_outer, json_tuple and stack ### What changes were proposed in this pull request? This PR adds new built-in table-valued functions `posexplode`, `posexplode_outer`, `json_tuple` and `stack`. ### Why are the changes needed? To improve the usability of table-valued generator functions. Now all generator functions can be used as table-valued functions. ### Does this PR introduce _any_ user-facing change? Yes. After this PR, 4 new table-valued generator functions can be used in the FROM clause of a query. ### How was this patch tested? New SQL query tests Closes #40151 from allisonwang-db/spark-42121-posexplode. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a7e61f9cbd17c8eb3d3281c2ca09dba602ee86af) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 February 2023, 00:39:30 UTC
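For illustration, a minimal sketch of the new table-valued usage; the local session setup and the literal values are assumptions, not part of the change itself:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: generator functions used as table-valued functions in the FROM clause.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("SELECT * FROM posexplode(array(10, 20, 30))").show()
spark.sql("SELECT * FROM json_tuple('{\"a\": 1, \"b\": 2}', 'a', 'b')").show()
spark.sql("SELECT * FROM stack(2, 1, 'a', 2, 'b')").show()
```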
f921df5 [SPARK-42510][CONNECT][PYTHON][TEST] Enable more `DataFrame.mapInPandas` parity tests ### What changes were proposed in this pull request? Enables more `DataFrame.mapInPandas` parity tests. ### Why are the changes needed? Now that we have `SparkSession.conf`, we can enable some more parity tests for `DataFrame.mapInPandas` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enabled related tests. Closes #40201 from ueshin/issues/SPARK-42510/tests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 611a0f6adf17cd894557c4fa2687023f946737ac) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 February 2023, 00:16:41 UTC
2ce9fc3 [SPARK-42592][SS][DOCS] Document how to perform chained time window aggregations ### What changes were proposed in this pull request? This PR proposes to document how to perform chained time window aggregations. Although it is introduced as a way to perform chained time window aggregations, it can also be used "generally" to apply operations which require a timestamp column against the time window data. ### Why are the changes needed? We didn't document the new functionality in the guide doc in SPARK-40925. There was a doc change SPARK-42105, but it only mentioned that the limitations were lifted. ### Does this PR introduce _any_ user-facing change? Yes, documentation change. ### How was this patch tested? Created a page via `SKIP_API=1 bundle exec jekyll serve --watch` and confirmed. Screenshot: <img width="611" alt="Screenshot 2023-02-28 8:32 AM" src="https://user-images.githubusercontent.com/1317309/221713232-3ea906ce-23f6-4293-82c0-de1e69ea1ee8.png"> Closes #40188 from HeartSaVioR/SPARK-42592. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 27 February 2023, 23:38:49 UTC
40a4019 [SPARK-42539][SQL][HIVE] Eliminate separate classloader when using 'builtin' Hive version for metadata client ### What changes were proposed in this pull request? When using the 'builtin' Hive version for the Hive metadata client, do not create a separate classloader, and rather continue to use the overall user/application classloader (regardless of Java version). This standardizes the behavior for all Java versions with that of Java 9+. See SPARK-42539 for more details on why this approach was chosen. ### Why are the changes needed? Please see a much more detailed description in SPARK-42539. The tl;dr is that user-provided JARs (such as `hive-exec-2.3.8.jar`) take precedence over Spark/system JARs when constructing the classloader used by `IsolatedClientLoader` on Java 8 in 'builtin' mode, which can cause unexpected behavior and/or breakages. This violates the expectation that, unless user-first classloader mode is used, Spark JARs should be prioritized over user JARs. It also seems that this separate classloader was unnecessary from the start, since the intent of 'builtin' mode is to use the JARs already existing on the regular classloader (as alluded to [here](https://github.com/apache/spark/pull/24057#discussion_r265142878)). The isolated clientloader was originally added in #5876 in 2015. This bit in the PR description is the only mention of the behavior for "builtin": > attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive. I can't follow the logic here; the user classloader clearly has all of the necessary Hive JARs, since that's where we're getting the JAR URLs from, so we could just use that directly instead of grabbing the URLs. When this was initially added, it only used the JARs from the user classloader, not any of its parents, which I suspect was the motivating factor (to try to avoid more Spark classes being duplicated inside of the isolated classloader, I guess). But that was changed a month later anyway in #6435 / #6459, so I think this may have basically been deadcode from the start. It has also caused at least one issue over the years, e.g. SPARK-21428, which disables the new-classloader behavior in the case of running inside of a CLI session. ### Does this PR introduce _any_ user-facing change? No, except to protect Spark itself from potentially being broken by bad user JARs. ### How was this patch tested? This includes a new unit test in `HiveUtilsSuite` which demonstrates the issue and shows that this approach resolves it. It has also been tested on a live cluster running Java 8 and Hive communication functionality continues to work as expected. Closes #40144 from xkrogen/xkrogen/SPARK-42539/hive-isolatedclientloader-builtin-user-jar-conflict-fix/java9strategy. Authored-by: Erik Krogen <xkrogen@apache.org> Signed-off-by: Chao Sun <sunchao@apple.com> 27 February 2023, 22:58:42 UTC
2c5a9cf [SPARK-42542][CONNECT] Support Pivot without providing pivot column values ### What changes were proposed in this pull request? Add the `Pivot` API when pivot column values are not provided. The decision here is to push everything to the server, so the client sides (both Scala and Python) do not do max-value validation for the pivot column for now. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40200 from amaliujia/pivot_2. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit fdb36df3f6005f2cdc3d76016a656117d62e1efc) Signed-off-by: Herman van Hovell <herman@databricks.com> 27 February 2023, 22:40:06 UTC
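A hedged sketch of both pivot forms (this change and SPARK-42541), shown against a plain local SparkSession since the Connect Scala client mirrors the same Dataset API; the sample data is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("dotNET", 2012, 10000), ("Java", 2012, 20000), ("dotNET", 2013, 5000))
  .toDF("course", "year", "earnings")

// Pivot column values inferred on the server side (this change)
sales.groupBy("year").pivot("course").sum("earnings").show()

// Pivot with explicitly provided values (SPARK-42541)
sales.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings").show()
```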
103eca6 [SPARK-42605][CONNECT] Add TypedColumn ### What changes were proposed in this pull request? This PR adds TypedColumn to the Spark Connect Scala Client. We also add one of the typed select methods for Dataset, and a typed count function. ### Why are the changes needed? API Parity. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added tests to PlanGenerationTestSuite and ClientE2ETestSuite. Closes #40197 from hvanhovell/SPARK-42605. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 7f64ec302420652932ff515c325ba37938f0b175) Signed-off-by: Herman van Hovell <herman@databricks.com> 27 February 2023, 19:13:44 UTC
79092f0 [SPARK-42580][CONNECT] Scala client add client side typed APIs ### What changes were proposed in this pull request? This PR adds the client side typed API to the Spark Connect Scala Client. ### Why are the changes needed? We want to reach API parity with the existing APIs. ### Does this PR introduce _any_ user-facing change? Yes, it adds user API. ### How was this patch tested? Added tests to `ClientE2ETestSuite`, and updated existing tests. Closes #40175 from hvanhovell/SPARK-42580. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 5243d0be2c15e3af36e981a9487ea600ab86a808) Signed-off-by: Herman van Hovell <herman@databricks.com> 27 February 2023, 13:15:15 UTC
a1d9db3 [SPARK-42581][CONNECT] Add SQLImplicits ### What changes were proposed in this pull request? This PR adds the `SQLImplicits` class to Spark Connect. This makes it easier for end users to work with Connect Datasets. The current implementation only contains the column conversions, we will add the encoder implicits in a follow-up. ### Why are the changes needed? API Parity. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? I added a new test suite: `SQLImplicitTestSuite.` Closes #40186 from hvanhovell/SPARK-42581. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit f0aef011e35a4f561764fdc1da9f32658cc14977) Signed-off-by: Herman van Hovell <herman@databricks.com> 27 February 2023, 13:11:55 UTC
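A minimal sketch of the column conversions this brings; the local session is an assumption, and the Connect client's implicits are expected to mirror the classic `$`-interpolator conversion:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._  // brings in the StringToColumn conversion, among others

// $"col" is converted to a Column, so expressions read naturally.
spark.range(5).select($"id", ($"id" + 1).as("next")).show()
```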
05fd991 [SPARK-42586][CONNECT] Add RuntimeConfig for Scala Client ### What changes were proposed in this pull request? This PR adds the RuntimeConfig class for the Spark Connect Scala Client. ### Why are the changes needed? API Parity. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added tests to the ClientE2ETestSuite. Closes #40185 from hvanhovell/SPARK-42586. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit a6f28ca7eab25d6cc1e6bcea1dedc70d36c30a61) Signed-off-by: Herman van Hovell <herman@databricks.com> 27 February 2023, 13:05:10 UTC
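A small sketch of the conf surface this adds, assuming an active session `spark`; the config key used here is just an example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Set, read, and unset a runtime configuration value.
spark.conf.set("spark.sql.shuffle.partitions", "10")
println(spark.conf.get("spark.sql.shuffle.partitions")) // 10
spark.conf.unset("spark.sql.shuffle.partitions")
```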
eb8fa5b [SPARK-42478] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory ### What changes were proposed in this pull request? Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory ### Why are the changes needed? [SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) made MR job IDs consistent in FileBatchWriter and FileFormatWriter, but it introduced a serialization issue: JobID is not serializable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes #40064 from Yikf/write-job-id. Authored-by: Yikf <yikaifei@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d46b15d2b23f13b65d781bb364ccde3be6679b99) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 February 2023, 08:56:27 UTC
68547ee [SPARK-41290][SQL] Support GENERATED ALWAYS AS expressions for columns in create/replace table statements ### What changes were proposed in this pull request? Enables creating generated columns in CREATE/REPLACE TABLE statements by specifying a generation expression for a column with GENERATED ALWAYS AS expr. For example the following will be supported: ```sql CREATE TABLE default.example ( time TIMESTAMP, date DATE GENERATED ALWAYS AS (CAST(time AS DATE)) ) ``` To be more specific this PR 1. Adds parser support for `GENERATED ALWAYS AS expr` in create/replace table statements. Generation expressions are temporarily stored in the field's metadata, and then are parsed/verified in `DataSourceV2Strategy` and used to instantiate v2 [Column](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Column.java). 4. Adds `TableCatalog::capabilities()` and `TableCatalogCapability.SUPPORTS_CREATE_TABLE_WITH_GENERATED_COLUMNS` This will be used to determine whether to allow specifying generation expressions or whether to throw an exception. ### Why are the changes needed? `GENERATED ALWAYS AS` is SQL standard. These changes will allow defining generated columns in create/replace table statements in Spark SQL. ### Does this PR introduce _any_ user-facing change? Using `GENERATED ALWAYS AS expr` in CREATE/REPLACE table statements will no longer throw a parsing error. When used with a supporting table catalog the query should progress, when used with a nonsupporting catalog there will be an analysis exception. Previous behavior: ``` spark-sql> CREATE TABLE default.example ( > time TIMESTAMP, > date DATE GENERATED ALWAYS AS (CAST(time AS DATE)) > ) > ; Error in query: Syntax error at or near 'GENERATED'(line 3, pos 14) == SQL == CREATE TABLE default.example ( time TIMESTAMP, date DATE GENERATED ALWAYS AS (CAST(time AS DATE)) --------------^^^ ) ``` New behavior: ``` AnalysisException: [UNSUPPORTED_FEATURE.TABLE_OPERATION] The feature is not supported: Table `my_tab` does not support creating generated columns with GENERATED ALWAYS AS expressions. Please check the current catalog and namespace to make sure the qualified table name is expected, and also check the catalog implementation which is configured by "spark.sql.catalog". ``` ### How was this patch tested? Adds unit tests Closes #38823 from allisonport-db/parser-support. Authored-by: Allison Portis <allison.portis@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit bd6b751a5b1c0b9b81039ca554d00e5ef841205d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 February 2023, 08:41:40 UTC
840a436 [SPARK-42587][CONNECT][TESTS][FOLLOWUP] Fix `scalafmt` failure ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/40180. ### Why are the changes needed? At previous PR, `Scalastyle` is checked but `scalafmt` was missed at the last commit. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass all CI linter jobs. Closes #40183 from dongjoon-hyun/SPARK-42587-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 379cb71330c585a2c93e8d513bb98dd12d7d5b4e) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 February 2023, 03:50:39 UTC
ef15432 [SPARK-42560][CONNECT] Add ColumnName class ### What changes were proposed in this pull request? This PR adds the ColumnName class for the Spark Connect Scala Client. This is a stepping stone to implementing the SQLImplicits. ### Why are the changes needed? API parity with the current API. ### Does this PR introduce _any_ user-facing change? Yes. It adds a new API. ### How was this patch tested? Added tests to the existing `ColumnTestSuite`. Closes #40179 from hvanhovell/SPARK-42560. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 24f0c45dc11eb7ac1ee43ebd630cdb325da30326) Signed-off-by: Herman van Hovell <herman@databricks.com> 27 February 2023, 02:42:00 UTC
6875449 [SPARK-42587][CONNECT][TESTS] Use wrapper versions for SBT and Maven in `connect` module tests ### What changes were proposed in this pull request? This PR aims to use `wrapper versions` for SBT and Maven in `connect` test module's exceptions and comments. ### Why are the changes needed? To clarify the versions we used. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #40180 from dongjoon-hyun/SPARK-42587. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a6a90feb4be891375cefbd7bbc75078e297ed008) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 February 2023, 02:34:58 UTC
d1c3902 [SPARK-42564][CONNECT] Implement SparkSession.version and SparkSession.time ### What changes were proposed in this pull request? The PR aims to implement SparkSession.version and SparkSession.time. ### Why are the changes needed? API coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a new UT. Closes #40176 from panbingkun/SPARK-42564. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 9c7aa16c9a5ede3a712a49f84118bfff89273f60) Signed-off-by: Herman van Hovell <herman@databricks.com> 27 February 2023, 00:39:24 UTC
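A quick sketch of the two methods; the local session is an assumption and the printed version is only an example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

println(spark.version)                             // e.g. "3.4.0"
val n = spark.time(spark.range(1000000).count())   // prints the elapsed time, returns the count
println(n)
```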
a9220f7 [SPARK-42419][CONNECT][PYTHON] Migrate into error framework for Spark Connect Column API ### What changes were proposed in this pull request? This PR proposes to migrate `TypeError` into error framework for Spark Connect Column API. ### Why are the changes needed? To improve errors by leveraging the PySpark error framework for Spark Connect Column APIs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Fixed & added UTs. Closes #39991 from itholic/SPARK-42419. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 86d3db9fc1372a377625c67c2966187ebdf2848e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 February 2023, 00:31:05 UTC
f6a5834 [SPARK-42569][CONNECT][FOLLOW-UP] Throw unsupported exceptions for persist ### What changes were proposed in this pull request? Follow up to https://github.com/apache/spark/pull/40164 to also throw an unsupported-operation exception for `persist`. Right now it is OK to depend on the `StorageLevel` in the core module, but in the future that shall be refactored and moved to a common module. ### Why are the changes needed? A better way to indicate a non-supported API. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? N/A Closes #40172 from amaliujia/unsupported_op_2. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 08675f2922e4e018e25083760a0ac7413229bc43) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 February 2023, 00:28:51 UTC
d4376e0 [SPARK-42574][CONNECT][PYTHON] Fix toPandas to handle duplicated column names ### What changes were proposed in this pull request? Fixes `DataFrame.toPandas` to handle duplicated column names. ### Why are the changes needed? Currently ```py spark.sql("select 1 v, 1 v").toPandas() ``` fails with the error: ```py Traceback (most recent call last): ... File ".../python/pyspark/sql/connect/dataframe.py", line 1335, in toPandas return self._session.client.to_pandas(query) File ".../python/pyspark/sql/connect/client.py", line 548, in to_pandas pdf = table.to_pandas() File "pyarrow/array.pxi", line 830, in pyarrow.lib._PandasConvertible.to_pandas File "pyarrow/table.pxi", line 3908, in pyarrow.lib.Table._to_pandas File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 819, in table_to_blockmanager columns = _deserialize_column_index(table, all_columns, column_indexes) File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 938, in _deserialize_column_index columns = _flatten_single_level_multiindex(columns) File "/.../lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 1186, in _flatten_single_level_multiindex raise ValueError('Found non-unique column index') ValueError: Found non-unique column index ``` Similar to #28210. ### Does this PR introduce _any_ user-facing change? Duplicated column names will be available when calling `toPandas()`. ### How was this patch tested? Enabled related tests. Closes #40170 from ueshin/issues/SPARK-42574/toPandas. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 89cf490f12937eaac0bb04f6cf227294776557b4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 February 2023, 00:23:32 UTC
8123f15 [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset ### What changes were proposed in this pull request? Add `groupBy(col1: String, cols: String*)` to Scala client Dataset API. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40173 from amaliujia/2nd_groupby. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit a4c12fc63beb64c15a9ac3d4f22ff132d90f610f) Signed-off-by: Herman van Hovell <herman@databricks.com> 26 February 2023, 02:35:54 UTC
8d2a1c4 [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source ### What changes were proposed in this pull request? Fixes `DataFrameReader` to use the default source. ### Why are the changes needed? ```py spark.read.load(path) ``` should work and use the default source without specifying the format. ### Does this PR introduce _any_ user-facing change? The `format` doesn't need to be specified. ### How was this patch tested? Enabled related tests. Closes #40166 from ueshin/issues/SPARK-42570/reader. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit ad35f35f12f715c276d216d621be583a6a44111a) Signed-off-by: Herman van Hovell <herman@databricks.com> 25 February 2023, 18:14:16 UTC
a2b01b2 [SPARK-42538][CONNECT] Make `sql.functions#lit` function support more types ### What changes were proposed in this pull request? This PR aims to add support for more types in the `sql.functions#lit` function, including: - Decimal - Instant - Timestamp - LocalDateTime - Date - Duration - Period - CalendarInterval ### Why are the changes needed? Make the `sql.functions#lit` function support more types ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Add new test - Manually checked the new cases with Scala 2.13 Closes #40143 from LuciferYang/functions-lit. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 2a4aab7cc9cdd19e0889dc9577f16033991fda3e) Signed-off-by: Herman van Hovell <herman@databricks.com> 25 February 2023, 18:09:07 UTC
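A hedged sketch exercising several of the newly supported literal types; it is shown against a plain local SparkSession (an assumption), since classic `lit` already accepts these values and the Connect client is catching up to it:

```scala
import java.time.{Duration, Instant, LocalDate, LocalDateTime, Period}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.range(1).select(
  lit(BigDecimal("12.34")).as("decimal"),
  lit(Instant.parse("2023-01-01T00:00:00Z")).as("instant"),
  lit(LocalDateTime.of(2023, 1, 1, 0, 0)).as("local_datetime"),
  lit(LocalDate.of(2023, 1, 1)).as("date"),
  lit(Duration.ofDays(1)).as("duration"),
  lit(Period.ofMonths(3)).as("period")
).show(truncate = false)
```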
4853985 [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API ### What changes were proposed in this pull request? Match https://github.com/apache/spark/blob/6a2433070e60ad02c69ae45706a49cdd0b88a082/python/pyspark/sql/connect/dataframe.py#L1500 to throw unsupported exceptions in the Scala client. ### Why are the changes needed? Better indicates that an API is not supported yet. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? N/A Closes #40164 from amaliujia/unsupported_op. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 2f9e5d5cde07de7b7f386a9af10643eb66f4df84) Signed-off-by: Herman van Hovell <herman@databricks.com> 25 February 2023, 18:07:22 UTC
99db66a [SPARK-42561][CONNECT] Add temp view API to Dataset ### What changes were proposed in this pull request? Add temp view API to Dataset ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40167 from amaliujia/add_temp_view_api. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 710e7de08723a67eac86f8f802fcfcf70ef5039c) Signed-off-by: Herman van Hovell <herman@databricks.com> 25 February 2023, 18:04:36 UTC
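A minimal sketch of the temp view API; the local session and the view name are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.range(5).createOrReplaceTempView("nums")          // hypothetical view name
spark.sql("SELECT id * 2 AS doubled FROM nums").show()
```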
4a6f0bf [SPARK-42575][CONNECT][SCALA] Make all client tests to extend from ConnectFunSuite ### What changes were proposed in this pull request? Make all client tests extend from ConnectFunSuite to avoid `// scalastyle:ignore funsuite` when extending directly from `AnyFunSuite`. ### Why are the changes needed? Simple dev work. ### Does this PR introduce _any_ user-facing change? No, test only. ### How was this patch tested? n/a Closes #40169 from zhenlineo/life-savor. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit c2633904bde9c89501a3ad3be284860a08f33fe7) Signed-off-by: Herman van Hovell <herman@databricks.com> 25 February 2023, 18:03:00 UTC
1c4bcf3 [SPARK-42573][CONNECT][SCALA] Enable binary compatibility tests on all major client APIs ### What changes were proposed in this pull request? Make binary compatibility check for SparkSession/Dataset/Column/functions etc. ### Why are the changes needed? Help us to have a good understanding of the current API coverage of the Scala client. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #40168 from zhenlineo/comp-it. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 2470b753171a3765e778ae699b4b1675ba7a023e) Signed-off-by: Herman van Hovell <herman@databricks.com> 25 February 2023, 17:59:32 UTC
8e589f3 [SPARK-42541][CONNECT] Support Pivot with provided pivot column values ### What changes were proposed in this pull request? Support Pivot with provided pivot column values. Pivot without provided column values is not supported yet, because that requires a max-value check, which depends on the implementation of Spark configuration in Spark Connect. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40145 from amaliujia/rw-pivot. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 34a2d95dadfca2ee643eb937d50f12e3b8b148eb) Signed-off-by: Herman van Hovell <herman@databricks.com> 25 February 2023, 02:30:32 UTC
40e6b9d [SPARK-42568][CONNECT] Fix SparkConnectStreamHandler to handle configs properly while planning ### What changes were proposed in this pull request? Fixes `SparkConnectStreamHandler` to handle configs properly while planning. The whole process should be done in `session.withActive` to take the proper `SQLConf` into account. ### Why are the changes needed? Some components for planning need to check configs in `SQLConf.get` while building the plan, but currently that is unavailable. For example, `spark.sql.legacy.allowNegativeScaleOfDecimal` needs to be checked when constructing `DecimalType`, but it is not set while planning, which causes an error when trying to cast to `DecimalType(1, -1)` with the config set to `"true"`: ``` [INTERNAL_ERROR] Negative scale is not allowed: -1. Set the config "spark.sql.legacy.allowNegativeScaleOfDecimal" to "true" to allow it. ``` ### Does this PR introduce _any_ user-facing change? The configs will take effect while planning. ### How was this patch tested? Enabled a related test. Closes #40165 from ueshin/issues/SPARK-42568/withActive. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit b99435328f76febd36dadd93649d1b3c10fe03da) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 February 2023, 01:39:01 UTC
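For reference, a sketch of the kind of query that exercises this: the legacy config must be visible to the planner for the negative-scale cast to succeed. The local session and the cast value are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Without the config visible at planning time, this cast fails with the
// INTERNAL_ERROR quoted above; with it, the plan builds fine.
spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")
spark.sql("SELECT CAST(5 AS DECIMAL(1, -1))").show()
```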
d928966 [SPARK-41834][CONNECT] Implement SparkSession.conf Implements `SparkSession.conf`. Took #39995 over. `SparkSession.conf` is a missing feature. Yes, `SparkSession.conf` will be available. Added/enabled related tests. Closes #40150 from ueshin/issues/SPARK-41834/conf. Lead-authored-by: Takuya UESHIN <ueshin@databricks.com> Co-authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 47951c9ab98523665530b291218073c885183184) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 February 2023, 01:33:18 UTC
e54a5e6 [MINOR][CONNECT] Typo fixes & update comment ### What changes were proposed in this pull request? The PR aims to: > 1. Fix the typo 'WriteOperaton -> WriteOperation' for `SaveModeConverter`. > 2. Update a comment. ### Why are the changes needed? Fix typos. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #40158 from panbingkun/MINOR_CONNECT_TYPO. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit a8e10efc89c9ca716fdbaa23b6124a7863172616) Signed-off-by: Herman van Hovell <herman@databricks.com> 25 February 2023, 00:55:14 UTC
000895d [SPARK-42510][CONNECT][PYTHON] Implement `DataFrame.mapInPandas` ### What changes were proposed in this pull request? Implement `DataFrame.mapInPandas` and enable parity tests against vanilla PySpark. A proto message `FrameMap` is introduced for `mapInPandas` and `mapInArrow` (to implement next). ### Why are the changes needed? To reach parity with vanilla PySpark. ### Does this PR introduce _any_ user-facing change? Yes. `DataFrame.mapInPandas` is supported. An example is shown below. ```py >>> df = spark.range(2) >>> def filter_func(iterator): ... for pdf in iterator: ... yield pdf[pdf.id == 1] ... >>> df.mapInPandas(filter_func, df.schema) DataFrame[id: bigint] >>> df.mapInPandas(filter_func, df.schema).show() +---+ | id| +---+ | 1| +---+ ``` ### How was this patch tested? Unit tests. Closes #40104 from xinrong-meng/mapInPandas. Lead-authored-by: Xinrong Meng <xinrong@apache.org> Co-authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Xinrong Meng <xinrong@apache.org> (cherry picked from commit 9abccad1d93a243d7e47e53dcbc85568a460c529) Signed-off-by: Xinrong Meng <xinrong@apache.org> 25 February 2023, 00:00:52 UTC
e7a5730 [SPARK-41823][CONNECT] Scala Client resolve ambiguous columns in Join ### What changes were proposed in this pull request? This is the scala version of https://github.com/apache/spark/pull/39925. We introduce a plan_id that is both used for each plan created by the scala client, and by the columns created when calling `Dataframe.col(..)` and `Dataframe.apply(..)`. This way we can later properly resolve the columns created for a specific Dataframe. ### Why are the changes needed? Joining columns created using Dataframe.apply(...) does not work when the column names are ambiguous. We should be able to figure out where a column comes from when they are created like this. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated golden files. Added test case to ClientE2ETestSuite. Closes #40156 from hvanhovell/SPARK-41823. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 6a2433070e60ad02c69ae45706a49cdd0b88a082) Signed-off-by: Herman van Hovell <herman@databricks.com> 24 February 2023, 17:05:22 UTC
bd8c9bd [SPARK-42533][CONNECT][SCALA] Add ssl for Scala client ### What changes were proposed in this pull request? Adding SSL encryption and access token support for Scala client ### Why are the changes needed? To support basic client side encryption to protect data sent over the network. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests. Manual tests. Closes #40133 from zhenlineo/ssl. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit c0d301ea3c3f6e3d1b10373823e0aeeb997e8daf) Signed-off-by: Herman van Hovell <herman@databricks.com> 24 February 2023, 12:44:35 UTC
7959067 [SPARK-42534][SQL][3.4] Fix DB2Dialect Limit clause ### What changes were proposed in this pull request? The PR fixes DB2 Limit clause syntax. Although DB2 supports LIMIT keyword, it seems that this support varies across databases and versions and the recommended way is to use `FETCH FIRST x ROWS ONLY`. In fact, some versions don't support LIMIT at all. Doc: https://www.ibm.com/docs/en/db2/11.5?topic=subselect-fetch-clause, usage example: https://www.mullinsconsulting.com/dbu_0502.htm. ### Why are the changes needed? Fixes the incorrect Limit clause which could cause errors when using against DB2 versions that don't support LIMIT. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a unit test and an integration test to cover this functionality. Closes #40155 from sadikovi/db2-limit-fix-3.4. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 February 2023, 12:43:49 UTC
149458c [SPARK-42049][SQL][FOLLOWUP] Always filter away invalid ordering/partitioning ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/37525 . When the project list has aliases, we go to the `projectExpression` branch which filters away invalid partitioning/ordering that reference non-existing attributes in the current plan node. However, this filtering is missing when the project list has no alias, where we directly return the child partitioning/ordering. This PR fixes it. ### Why are the changes needed? to make sure we always return valid output partitioning/ordering. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #40137 from cloud-fan/alias. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 72922adc8a78e8d31f03205a148b89291a9a4d19) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 February 2023, 05:54:06 UTC
8b7a073 [SPARK-42547][PYTHON] Make PySpark working with Python 3.7 ### What changes were proposed in this pull request? This PR proposes to avoid new Python typing syntax that causes the test failure in lower Python version. ### Why are the changes needed? Python 3.7 support is broken: ``` + ./python/run-tests --python-executables=python3 Running PySpark tests. Output is in /home/ec2-user/spark/python/unit-tests.log Will test against the following Python executables: ['python3'] Will test the following Python modules: ['pyspark-connect', 'pyspark-core', 'pyspark-errors', 'pyspark-ml', 'pyspark-mllib', 'pyspark-pandas', 'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming'] python3 python_implementation is CPython python3 version is: Python 3.7.16 Starting test(python3): pyspark.ml.tests.test_feature (temp output: /home/ec2-user/spark/python/target/8ca9ab1a-05cc-4845-bf89-30d9001510bc/python3__pyspark.ml.tests.test_feature__kg6sseie.log) Starting test(python3): pyspark.ml.tests.test_base (temp output: /home/ec2-user/spark/python/target/f2264f3b-6b26-4e61-9452-8d6ddd7eb002/python3__pyspark.ml.tests.test_base__0902zf9_.log) Starting test(python3): pyspark.ml.tests.test_algorithms (temp output: /home/ec2-user/spark/python/target/d1dc4e07-e58c-4c03-abe5-09d8fab22e6a/python3__pyspark.ml.tests.test_algorithms__lh3wb2u8.log) Starting test(python3): pyspark.ml.tests.test_evaluation (temp output: /home/ec2-user/spark/python/target/3f42dc79-c945-4cf2-a1eb-83e72b40a9ee/python3__pyspark.ml.tests.test_evaluation__89idc7fa.log) Finished test(python3): pyspark.ml.tests.test_base (16s) Starting test(python3): pyspark.ml.tests.test_functions (temp output: /home/ec2-user/spark/python/target/5a3b90f0-216b-4edd-9d15-6619d3e03300/python3__pyspark.ml.tests.test_functions__g5u1290s.log) Traceback (most recent call last): File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main "__main__", mod_spec) File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code exec(code, run_globals) File "/home/ec2-user/spark/python/pyspark/ml/tests/test_functions.py", line 21, in <module> from pyspark.ml.functions import predict_batch_udf File "/home/ec2-user/spark/python/pyspark/ml/functions.py", line 38, in <module> from typing import Any, Callable, Iterator, List, Mapping, Protocol, TYPE_CHECKING, Tuple, Union ImportError: cannot import name 'Protocol' from 'typing' (/usr/lib64/python3.7/typing.py) Had test failures in pyspark.ml.tests.test_functions with python3; see logs. ``` ### Does this PR introduce _any_ user-facing change? This change has not been released out yet so no user-facing change. But this is a release blocker. ### How was this patch tested? Manually tested via: ```bash ./python/run-tests --python-executables=python3.7 ``` Closes #40153 from HyukjinKwon/SPARK-42547. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 936a58fb27f5b38c627561ef80106c4978416c31) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 February 2023, 05:15:39 UTC
12d94bd [SPARK-42475][CONNECT][DOCS] Getting Started: Live Notebook for Spark Connect ### What changes were proposed in this pull request? This PR proposes to add "Live Notebook: DataFrame with Spark Connect" to the [Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/index.html) documents as below: <img width="794" alt="Screen Shot 2023-02-23 at 1 15 41 PM" src="https://user-images.githubusercontent.com/44108233/220820191-ca0e5705-1694-4eaa-8658-67d522af1bf8.png"> Basically, the notebook copied the contents of 'Live Notebook: DataFrame', and updated the contents related to Spark Connect. The notebook looks like the below: <img width="814" alt="Screen Shot 2023-02-23 at 1 15 54 PM" src="https://user-images.githubusercontent.com/44108233/220820218-bbfb6a58-7009-4327-aea4-72ed6496d77c.png"> ### Why are the changes needed? To help those who are new to Spark Connect quickly start using DataFrame with Spark Connect. ### Does this PR introduce _any_ user-facing change? No, it's documentation. ### How was this patch tested? Manually built the docs, and ran the CI. Closes #40092 from itholic/SPARK-42475. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 3830285d56d8971c6948763aeb3d99c5bf5eca91) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 February 2023, 02:33:00 UTC
4fb30f7 [SPARK-42545][K8S][DOCS] Remove `experimental` from `Volcano` docs ### What changes were proposed in this pull request? This PR aims to remove `experimental` notes from `Volcano` docs. ### Why are the changes needed? Apache Spark 3.3.0 added `Volcano` as an experimental module. Now, we can remove it from Apache Spark 3.4.0 because we don't expect breaking future behavior changes. ### Does this PR introduce _any_ user-facing change? No, this is a documentation only change. ### How was this patch tested? Manual review. Closes #40152 from dongjoon-hyun/SPARK-42545. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit be745f7f3b0b970cf385e7d8cc2be42c340fd1d6) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 February 2023, 02:04:11 UTC
6752ad6 [SPARK-42544][CONNECT] Spark Connect Scala Client: support parameterized SQL ### What changes were proposed in this pull request? Support the parameterized SQL API in the Scala client. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40148 from amaliujia/parameterized_sql. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit ee22a0bf3c91a6b26d608b5fc28e9472eaca6b40) Signed-off-by: Herman van Hovell <herman@databricks.com> 24 February 2023, 01:53:04 UTC
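A sketch of how the parameterized overload can be used, assuming the same Map-based `sql(query, args)` signature and `:name` parameter markers as the server-side support added in SPARK-41271 (argument values are parsed as SQL literal expressions in 3.4); the local session and the query are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// :minId is a named parameter marker bound via the args map.
val df = spark.sql(
  "SELECT * FROM range(10) WHERE id > :minId",
  Map("minId" -> "5"))
df.show()
```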
8dd00d9 [SPARK-42444][PYTHON] `DataFrame.drop` should handle duplicated columns properly ### What changes were proposed in this pull request? The existing implementation always converts inputs (either a column or a column name) to columns; this causes an `AMBIGUOUS_REFERENCE` issue since there may be several columns with the same name. On the JVM side, the logic of drop(column: Column) and drop(columnName: String) is different; we cannot simply always convert a column name to a column via the col() method. When there are multiple columns with the same name (e.g. `name`), users can: 1. `drop('name')` --- drop all the columns; 2. `drop(df1.name)` --- drop the column from the specific dataframe `df1`. But if users call `drop(col('name'))`, it will fail due to the ambiguity. In PySpark it is a bit more complex, since the user can pass both column names and columns. This PR drops the columns first, and then the column names. ### Why are the changes needed? Bug fix ``` >>> from pyspark.sql import Row >>> df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"]) >>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")]) >>> df3 = df1.join(df2, df1.name == df2.name, 'inner') >>> df3.show() +---+----+------+----+ |age|name|height|name| +---+----+------+----+ | 16| Bob| 85| Bob| | 14| Tom| 80| Tom| +---+----+------+----+ ``` BEFORE ``` >>> df3.drop("name", "age").columns Traceback (most recent call last): ... pyspark.errors.exceptions.captured.AnalysisException: [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could be: [`name`, `name`]. ``` AFTER ``` >>> df3.drop("name", "age").columns ['height'] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added tests Closes #40135 from zhengruifeng/py_fix_drop. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 0b9ed26e48248aa58642b3626a02dd8c89a01afb) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 24 February 2023, 00:03:34 UTC
52d4f10 [SPARK-25050][SQL] Avro: writing complex unions ### What changes were proposed in this pull request? Spark was already able to read complex unions but not write them. Now it is possible to also write them. If you have a schema with a complex union, the following code now works: ```scala spark .read.format("avro").option("avroSchema", avroSchema).load(path) .write.format("avro").option("avroSchema", avroSchema).save("/tmp/b") ``` Before this patch it would throw `Unsupported Avro UNION type` when writing. This adds the capability to write complex unions, in addition to reading them. Complex unions map to struct types where field names are member0, member1, etc. This is consistent with the behavior in SchemaConverters for reading them and when converting between Avro and Parquet. ### Why are the changes needed? Fixes SPARK-25050 and lines up read and write compatibility. ### Does this PR introduce _any_ user-facing change? The behaviour improved, of course; as far as I could see this does not impact any customer-facing APIs or documentation. ### How was this patch tested? - Added extra unit tests. - Updated existing unit tests for improved behaviour. - Validated manually with an internal corpus of Avro files that they could now be read and written without problems, which was not the case before this patch. Closes #36506 from steven-aerts/spark-25050. Authored-by: Steven Aerts <steven.aerts@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 0ba6ba9a3829cf63f8917367ea3c066e422ad04f) Signed-off-by: Gengliang Wang <gengliang@apache.org> 23 February 2023, 21:23:39 UTC
0e6df2e [SPARK-42531][CONNECT] Scala Client Add Collections Functions ### What changes were proposed in this pull request? This PR adds all the collection functions to `functions.scala` for the Scala client. This is the last large functions PR; there are a few functions missing, and these will be added later. ### Why are the changes needed? We want the Scala client to have API parity with the existing API. ### Does this PR introduce _any_ user-facing change? Yes, it adds functions to the Spark Connect Scala Client. ### How was this patch tested? Added tests to `PlanGenerationTestSuite` and to `ProtoToPlanTestSuite`. I have added a few tests to `ClientE2ETestSuite` for lambda functions (to make sure name scoping works) and the array shuffle function (non-deterministic, hard to test with golden files). Closes #40130 from hvanhovell/SPARK-42531. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 87e3d5625e76bb734b8dd753bfb25002822c8585) Signed-off-by: Herman van Hovell <herman@databricks.com> 23 February 2023, 15:46:07 UTC
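A short sketch covering an ordinary collection function, a lambda-based one, and the non-deterministic shuffle mentioned above; the data and local session are assumptions, shown against a plain SparkSession since the Connect client mirrors the same `functions` API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, col, shuffle, transform}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(Seq(1, 2, 3)).toDF("xs")
df.select(
  array_contains(col("xs"), 2).as("has_2"),
  transform(col("xs"), x => x + 1).as("incremented"),  // lambda function, where name scoping matters
  shuffle(col("xs")).as("shuffled")                    // non-deterministic
).show()
```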
0a440e9 [SPARK-41793][SQL] Incorrect result for window frames defined by a range clause on large decimals ### What changes were proposed in this pull request? Use `DecimalAddNoOverflowCheck` instead of `Add` to create the bound ordering for the window range frame ### Why are the changes needed? Before 3.4, the `Add` did not check overflow. Instead, we always wrapped `Add` with a `CheckOverflow`. After https://github.com/apache/spark/pull/36698, we make `Add` check overflow by itself. However, the bound ordering of the window range frame uses `Add` to calculate the boundary that is used to determine which input row lies within the frame boundaries of an output row. So the behavior changed, adding an extra overflow check. Technically, we could allow an overflowing value if it is just an intermediate result. So this PR uses `DecimalAddNoOverflowCheck` to replace the `Add` and restore the previous behavior. ### Does this PR introduce _any_ user-facing change? Yes, it restores the previous (before 3.4) behavior. ### How was this patch tested? Added a test Closes #40138 from ulysses-you/SPARK-41793. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fec4f7f9aedf55709bcb40e5b504298ff4f2ccc7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 February 2023, 12:36:36 UTC
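For context, a sketch of the affected query shape: a RANGE frame ordered by a decimal column, where the frame-boundary arithmetic is only an intermediate result. The local session and values are arbitrary assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("""
  SELECT value,
         sum(value) OVER (ORDER BY value
                          RANGE BETWEEN 10 PRECEDING AND CURRENT ROW) AS running_sum
  FROM VALUES (CAST(1 AS DECIMAL(38, 0))), (CAST(20 AS DECIMAL(38, 0))) AS t(value)
""").show()
```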
2d9a963 [SPARK-42448][SQL] Fix spark sql shell prompt for current db ### What changes were proposed in this pull request? The CliSessionState does not contain the current database info; we shall use Spark's `catalog.currentDatabase` instead. ### Why are the changes needed? Bug fix ### Does this PR introduce _any_ user-facing change? Yes. When users use spark-sql and switch databases, the prompt now shows the correct one instead of `default`. ### How was this patch tested? Locally tested: ```textmate spark-sql (default)> create database abc; Time taken: 0.24 seconds spark-sql (default)> use abc; Time taken: 0.027 seconds spark-sql (ABC)> ``` Closes #40036 from yaooqinn/SPARK-42448. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 2478342f834152ab33aa283816e6a0f346b64e44) Signed-off-by: Kent Yao <yao@apache.org> 23 February 2023, 08:46:52 UTC
314e25f [SPARK-42529][CONNECT] Support Cube and Rollup in Scala client ### What changes were proposed in this pull request? Support Cube and Rollup in Scala client. ### Why are the changes needed? API Coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #40129 from amaliujia/support_cube_rollup_pivot. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 21767d29b36c3c8d812bb3ea8946a21a8ef6e65c) Signed-off-by: Herman van Hovell <herman@databricks.com> 23 February 2023, 03:56:50 UTC
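A minimal sketch of both operations; the data and local session are assumptions, shown against a plain SparkSession since the Connect client mirrors the same API:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("US", "A", 10), ("US", "B", 20), ("EU", "A", 30)).toDF("region", "product", "amount")

sales.cube("region", "product").sum("amount").show()    // all grouping combinations
sales.rollup("region", "product").sum("amount").show()  // hierarchical subtotals
```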
3477d14 [SPARK-42530][PYSPARK][DOCS] Remove Hadoop 2 from PySpark installation guide ### What changes were proposed in this pull request? This PR aims to remove `Hadoop 2` from PySpark installation guide. ### Why are the changes needed? From Apache Spark 3.4.0, we don't provide Hadoop 2 binaries. ### Does this PR introduce _any_ user-facing change? This is a documentation fix to be consistent with the new availability. ### How was this patch tested? Manual review. Closes #40127 from dongjoon-hyun/SPARK-42530. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 295617c5d8913fc1afc78fa9647d2f99b925ceaf) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 February 2023, 01:09:24 UTC
50f2ff2 [SPARK-42532][K8S][DOCS] Update YuniKorn docs with v1.2 ### What changes were proposed in this pull request? This PR aims to update `YuniKorn` documentation with the latest v1.2.0 and fix codify issues in doc. ### Why are the changes needed? - https://yunikorn.apache.org/release-announce/1.2.0 ### Does this PR introduce _any_ user-facing change? This is a documentation-only change. **BEFORE** - https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-docs/_site/running-on-kubernetes.html#using-apache-yunikorn-as-customized-scheduler-for-spark-on-kubernetes **AFTER** <img width="927" alt="Screenshot 2023-02-22 at 2 27 50 PM" src="https://user-images.githubusercontent.com/9700541/220775386-90268ecb-facf-4701-bcb7-4f6b3e847e70.png"> ### How was this patch tested? Manually test with YuniKorn v1.2.0. ``` $ helm list -n yunikorn NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION yunikorn yunikorn 1 2023-02-22 14:01:11.728926 -0800 PST deployed yunikorn-1.2.0 ``` ``` $ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/test" -Dtest.exclude.tags=minikube,local,decom -Dtest.default.exclude.tags='' [info] KubernetesSuite: [info] - SPARK-42190: Run SparkPi with local[*] (10 seconds, 832 milliseconds) [info] - Run SparkPi with no resources (12 seconds, 421 milliseconds) [info] - Run SparkPi with no resources & statefulset allocation (17 seconds, 861 milliseconds) [info] - Run SparkPi with a very long application name. (12 seconds, 531 milliseconds) [info] - Use SparkLauncher.NO_RESOURCE (17 seconds, 697 milliseconds) [info] - Run SparkPi with a master URL without a scheme. (12 seconds, 499 milliseconds) [info] - Run SparkPi with an argument. (18 seconds, 734 milliseconds) [info] - Run SparkPi with custom labels, annotations, and environment variables. (12 seconds, 520 milliseconds) [info] - All pods have the same service account by default (17 seconds, 504 milliseconds) [info] - Run extraJVMOptions check on driver (9 seconds, 402 milliseconds) [info] - SPARK-42474: Run extraJVMOptions JVM GC option check - G1GC (9 seconds, 389 milliseconds) [info] - SPARK-42474: Run extraJVMOptions JVM GC option check - Other GC (9 seconds, 330 milliseconds) [info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (17 seconds, 710 milliseconds) [info] - Run SparkPi with env and mount secrets. (19 seconds, 797 milliseconds) [info] - Run PySpark on simple pi.py example (18 seconds, 568 milliseconds) [info] - Run PySpark to test a pyfiles example (15 seconds, 622 milliseconds) [info] - Run PySpark with memory customization (18 seconds, 507 milliseconds) [info] - Run in client mode. (6 seconds, 185 milliseconds) [info] - Start pod creation from template (17 seconds, 696 milliseconds) [info] - SPARK-38398: Schedule pod creation from template (12 seconds, 585 milliseconds) [info] - Run SparkR on simple dataframe.R example (19 seconds, 639 milliseconds) [info] YuniKornSuite: [info] - SPARK-42190: Run SparkPi with local[*] (12 seconds, 421 milliseconds) [info] - Run SparkPi with no resources (20 seconds, 465 milliseconds) [info] - Run SparkPi with no resources & statefulset allocation (15 seconds, 516 milliseconds) [info] - Run SparkPi with a very long application name. (20 seconds, 532 milliseconds) [info] - Use SparkLauncher.NO_RESOURCE (15 seconds, 545 milliseconds) [info] - Run SparkPi with a master URL without a scheme. 
(20 seconds, 575 milliseconds) [info] - Run SparkPi with an argument. (16 seconds, 462 milliseconds) [info] - Run SparkPi with custom labels, annotations, and environment variables. (20 seconds, 568 milliseconds) [info] - All pods have the same service account by default (15 seconds, 630 milliseconds) [info] - Run extraJVMOptions check on driver (12 seconds, 483 milliseconds) [info] - SPARK-42474: Run extraJVMOptions JVM GC option check - G1GC (12 seconds, 665 milliseconds) [info] - SPARK-42474: Run extraJVMOptions JVM GC option check - Other GC (11 seconds, 615 milliseconds) [info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (20 seconds, 810 milliseconds) [info] - Run SparkPi with env and mount secrets. (24 seconds, 622 milliseconds) [info] - Run PySpark on simple pi.py example (16 seconds, 650 milliseconds) [info] - Run PySpark to test a pyfiles example (23 seconds, 662 milliseconds) [info] - Run PySpark with memory customization (15 seconds, 450 milliseconds) [info] - Run in client mode. (5 seconds, 121 milliseconds) [info] - Start pod creation from template (20 seconds, 552 milliseconds) [info] - SPARK-38398: Schedule pod creation from template (15 seconds, 847 milliseconds) [info] - Run SparkR on simple dataframe.R example (22 seconds, 739 milliseconds) [info] Run completed in 15 minutes, 41 seconds. [info] Total number of tests run: 42 [info] Suites: completed 2, aborted 0 [info] Tests: succeeded 42, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 1306 s (21:46), completed Feb 22, 2023, 2:28:18 PM ``` Closes #40132 from dongjoon-hyun/SPARK-42532. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5b1c45eedaed0138afb260019db800b637c3b135) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 February 2023, 23:02:28 UTC
6b5885b [SPARK-42150][K8S][DOCS][FOLLOWUP] Use v1.7.0 in docs ### What changes were proposed in this pull request? This is a follow-up of #39690. ### Why are the changes needed? To be consistent across multiple docs. ### Does this PR introduce _any_ user-facing change? No, this is a doc-only change. ### How was this patch tested? Manual review. Closes #40131 from dongjoon-hyun/SPARK-42150-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 70e69892b55d09064c760b92ba941289f9def005) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 February 2023, 22:14:29 UTC
6998f97 [SPARK-42468][CONNECT][FOLLOW-UP] Add .agg variants in Dataset ### What changes were proposed in this pull request? Add the `.agg` variants to Dataset in the Scala client. ### Why are the changes needed? API coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs. Closes #40125 from amaliujia/rw_add_agg_dataset. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 1232309e44d8ed65528c2b29ee4087e4173a3e06) Signed-off-by: Herman van Hovell <herman@databricks.com> 22 February 2023, 20:55:11 UTC
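A minimal sketch of how the added `.agg` variants are used, assuming a SparkSession named `spark` is already in scope; the column names and data are made up for illustration:
```scala
import org.apache.spark.sql.functions._

// Hypothetical data; any Dataset works the same way.
val df = spark.range(10)
  .withColumn("dept", col("id") % 2)
  .withColumn("salary", col("id") * 1000)

// .agg on a grouped Dataset with Column-based aggregates.
df.groupBy("dept").agg(max("salary"), avg("salary"))

// .agg directly on the Dataset aggregates over all rows; the Map variant
// pairs column names with aggregate function names.
df.agg(Map("salary" -> "max", "id" -> "count"))
```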
51e8275 [SPARK-42522][CONNECT] Fix DataFrameWriterV2 to find the default source ### What changes were proposed in this pull request? Fixes `DataFrameWriterV2` to find the default source. ### Why are the changes needed? Currently, `DataFrameWriterV2` in Spark Connect does not work when no provider is specified and fails with a confusing error. For example: ```py df.writeTo("test_table").create() ``` ``` pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkClassNotFoundException) [DATA_SOURCE_NOT_FOUND] Failed to find the data source: . Please find packages at `https://spark.apache.org/third-party-projects.html`. ``` ### Does this PR introduce _any_ user-facing change? Users will be able to use `DataFrameWriterV2` without specifying a provider, the same as in PySpark. ### How was this patch tested? Added some tests. Closes #40109 from ueshin/issues/SPARK-42522/writer_v2. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit dbe23c8e88d1a2968ae1c17ec9ee3029ef7a348a) Signed-off-by: Herman van Hovell <herman@databricks.com> 22 February 2023, 20:53:18 UTC
6dd52fe [SPARK-42518][CONNECT] Scala Client DataFrameWriterV2 ### What changes were proposed in this pull request? Adding DataFrameWriterV2. This allows users to use the Dataset#writeTo API. ### Why are the changes needed? Implements Dataset#writeTo. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E tests. This is based on https://github.com/apache/spark/pull/40061 Closes #40075 from zhenlineo/write-v2. Authored-by: Zhen Li <zhenlineo@users.noreply.github.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 0c4645eb6bb4740b92281d124053d4610090da34) Signed-off-by: Herman van Hovell <herman@databricks.com> 22 February 2023, 20:51:35 UTC
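For orientation, a hedged sketch of the `Dataset#writeTo` (DataFrameWriterV2) API surface this adds; the catalog, table, and column names are hypothetical:
```scala
import org.apache.spark.sql.functions.col

// Assumes a SparkSession `spark` and a catalog that can create the target table.
val df = spark.range(100).withColumn("part", col("id") % 4)

// Create (or replace) a table through the v2 writer.
df.writeTo("my_catalog.db.events")
  .using("parquet")
  .partitionedBy(col("part"))
  .createOrReplace()

// Append into the existing table on subsequent writes.
df.writeTo("my_catalog.db.events").append()
```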
fcbb19a [SPARK-42272][CONNECT][TESTS][FOLLOW-UP] Do not cache local port in SparkConnectService ### What changes were proposed in this pull request? This PR proposes not to cache the local port. ### Why are the changes needed? When the Spark Context is stopped and started again, the Spark Connect server shuts down and starts up again too (while the JVM itself stays alive). So we should not cache the local port but pick up the new local port instead. For example, in https://github.com/apache/spark/pull/40109, the Spark Connect server at `ReadwriterTestsMixin` stops after the tests. Then `ReadwriterV2TestsMixin` starts a new Spark Connect server, and the stale cached port causes failures on any actual protobuf message exchanges with the server. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? I tested it on top of https://github.com/apache/spark/pull/40109. That PR should validate it. Closes #40123 from HyukjinKwon/SPARK-42272-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 8cc8dd6f67a31cb46a228a7fd06ab0c18a462c07) Signed-off-by: Herman van Hovell <herman@databricks.com> 22 February 2023, 15:02:46 UTC
921a633 [SPARK-42527][CONNECT] Scala Client add Window functions ### What changes were proposed in this pull request? This PR aims to add the window functions to the Scala Spark Connect client. ### Why are the changes needed? Provide the same APIs in the Scala Spark Connect client as in the original Dataset API. ### Does this PR introduce _any_ user-facing change? Yes, it adds new functions to the Spark Connect Scala client. ### How was this patch tested? - Added a new test - Manually checked `connect-client-jvm` and `connect` with Scala 2.13 Closes #40120 from LuciferYang/window-functions. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit e2f65b5316ed1473518e2d79e89c9bed756029e9) Signed-off-by: Herman van Hovell <herman@databricks.com> 22 February 2023, 14:53:31 UTC
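A short, hedged example of the kind of window functions covered; the data and column names are assumptions for illustration:
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumes a SparkSession `spark`; toDF needs its implicits in scope.
import spark.implicits._
val df = Seq(("a", "sales", 100), ("b", "sales", 200), ("c", "eng", 150))
  .toDF("name", "dept", "salary")

val w = Window.partitionBy(col("dept")).orderBy(col("salary").desc)

df.select(
  col("name"),
  rank().over(w).as("rank"),
  lag(col("salary"), 1).over(w).as("prev_salary"))
```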
006babc [SPARK-41151][FOLLOW-UP][SQL] Improve the doc of the `_metadata` generated columns' nullability implementation ### What changes were proposed in this pull request? Add a doc describing how `_metadata` nullability is implemented for generated metadata columns. ### Why are the changes needed? Improves readability. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #40035 from Yaohua628/spark-41151-doc-follow-up. Lead-authored-by: yaohua <yaohua.zhao@databricks.com> Co-authored-by: Yaohua Zhao <79476540+Yaohua628@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 100056ad1b33e134d71239ec729e609e3a68f2c9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 February 2023, 11:54:36 UTC
78d0b54 [SPARK-42427][SQL][TESTS][FOLLOW-UP] Disable ANSI for several conv test cases in MathFunctionsSuite ### What changes were proposed in this pull request? This PR proposes to disable ANSI for several conv test cases in `MathFunctionsSuite`. They are intentionally testing the behaviours when ANSI is disabled. Exception cases are already handled in https://github.com/apache/spark/commit/cb463fb40e8f663b7e3019c8d8560a3490c241d0 I believe. ### Why are the changes needed? To make the ANSI tests pass. It currently fails (https://github.com/apache/spark/actions/runs/4228390267/jobs/7343793692): ``` 2023-02-21T03:03:20.3799795Z [info] - SPARK-33428 conv function should trim input string (177 milliseconds) 2023-02-21T03:03:20.4252604Z 03:03:20.424 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 138.0 (TID 256) 2023-02-21T03:03:20.4253602Z org.apache.spark.SparkArithmeticException: [ARITHMETIC_OVERFLOW] Overflow in function conv(). If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. 2023-02-21T03:03:20.4254440Z at org.apache.spark.sql.errors.QueryExecutionErrors$.arithmeticOverflowError(QueryExecutionErrors.scala:643) 2023-02-21T03:03:20.4255265Z at org.apache.spark.sql.errors.QueryExecutionErrors$.overflowInConvError(QueryExecutionErrors.scala:315) 2023-02-21T03:03:20.4256001Z at org.apache.spark.sql.catalyst.util.NumberConverter$.encode(NumberConverter.scala:68) 2023-02-21T03:03:20.4256888Z at org.apache.spark.sql.catalyst.util.NumberConverter$.convert(NumberConverter.scala:158) 2023-02-21T03:03:20.4257450Z at org.apache.spark.sql.catalyst.util.NumberConverter.convert(NumberConverter.scala) 2023-02-21T03:03:20.4258084Z at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:38) 2023-02-21T03:03:20.4258720Z at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 2023-02-21T03:03:20.4259293Z at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) 2023-02-21T03:03:20.4259769Z at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) 2023-02-21T03:03:20.4260157Z at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) 2023-02-21T03:03:20.4260535Z at org.apache.spark.util.Iterators$.size(Iterators.scala:29) 2023-02-21T03:03:20.4260918Z at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1944) 2023-02-21T03:03:20.4261283Z at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1266) 2023-02-21T03:03:20.4261649Z at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1266) 2023-02-21T03:03:20.4262050Z at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2303) 2023-02-21T03:03:20.4262726Z at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) 2023-02-21T03:03:20.4263206Z at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) 2023-02-21T03:03:20.4263628Z at org.apache.spark.scheduler.Task.run(Task.scala:139) 2023-02-21T03:03:20.4264227Z at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) 2023-02-21T03:03:20.4265048Z at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1520) 2023-02-21T03:03:20.4266209Z at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) 2023-02-21T03:03:20.4266805Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 2023-02-21T03:03:20.4267369Z at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 2023-02-21T03:03:20.4267799Z at java.lang.Thread.run(Thread.java:750) ``` ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Fixed unit tests. Closes #40117 from HyukjinKwon/SPARK-42427-followup2. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 50006b9715c17be7c9ea5809195945dd78418baa) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 February 2023, 11:38:14 UTC
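For context, a hedged illustration (not code from the patch) of the behaviour these test cases rely on, namely that the `conv` overflow error is only raised under ANSI mode:
```scala
// With ANSI mode enabled, an overflowing conversion raises ARITHMETIC_OVERFLOW;
// disabling it bypasses the error, which is what these conv test cases exercise.
spark.conf.set("spark.sql.ansi.enabled", "false")   // assumes a SparkSession `spark`
spark.sql("SELECT conv(repeat('f', 64), 16, 10)").show(false)
```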
033eb37 [SPARK-42516][SQL] Always capture the session time zone config while creating views ### What changes were proposed in this pull request? In this PR, I propose to capture the session time zone config (`spark.sql.session.timeZone`) as a view property, and to use it while re-parsing/analysing the view. If the SQL config is not set while creating a view, the default value of the config is used. ### Why are the changes needed? To improve user experience with Spark SQL. The current behaviour might confuse users because query results depend on whether or not the session time zone was set explicitly while creating a view. ### Does this PR introduce _any_ user-facing change? Yes. Before the changes, the current value of the session time zone was used in view analysis; that behaviour can be restored via another SQL config, `spark.sql.legacy.useCurrentConfigsForView`. ### How was this patch tested? By running the new test via: ``` $ build/sbt "test:testOnly *.PersistedViewTestSuite" ``` Closes #40103 from MaxGekk/view-tz-conf. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 00e56905f77955f67e3809d724b33aebcc79cb5e) Signed-off-by: Max Gekk <max.gekk@gmail.com> 22 February 2023, 11:03:31 UTC
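A small sketch of the user-visible effect, assuming a working SparkSession `spark`; the view name and literal are made up:
```scala
// The session time zone active at CREATE VIEW time is captured as a view property.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("CREATE OR REPLACE VIEW v AS SELECT CAST('2023-01-01 00:00:00' AS TIMESTAMP) AS ts")

// Re-analysing the view uses the captured time zone, so changing the session
// time zone afterwards no longer changes how the view's cast is resolved.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT unix_timestamp(ts) FROM v").show()
```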
c082251 [SPARK-42526][ML] Add Classifier.getNumClasses back ### What changes were proposed in this pull request? Add Classifier.getNumClasses back. ### Why are the changes needed? Some widely used libraries such as `xgboost` depend on this method even though it is not a public API, so adding it back makes the xgboost integration smoother. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated the MiMa rules. Closes #40119 from zhengruifeng/ml_add_classifier_get_num_class. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit a6098beade01eac5cf92727e69b3537fcac31b2d) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 22 February 2023, 11:02:25 UTC
c123c85 [SPARK-42520][CONNECT] Support basic Window API in Scala client ### What changes were proposed in this pull request? Support Window `orderBy`, `partitionBy`, and `rowsBetween`/`rangeBetween`. ### Why are the changes needed? API coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #40107 from amaliujia/rw-window-2. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 054522b67626aa1515b8f3f164ba7c063c38e5b8) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 February 2023, 07:19:27 UTC
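A hedged sketch of the frame-specification side of that API; the data and column names are assumptions:
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumes a SparkSession `spark`; toDF needs its implicits in scope.
import spark.implicits._
val df = Seq(("u1", 1L, 10.0), ("u1", 2L, 20.0), ("u2", 1L, 5.0))
  .toDF("user", "ts", "amount")

// Row-based frame: the current row plus the two preceding rows per user.
val byRows = Window.partitionBy(col("user")).orderBy(col("ts"))
  .rowsBetween(-2, Window.currentRow)

// Range-based frame (shown for comparison): bounded by the ordering value
// rather than by physical row positions.
val byRange = Window.orderBy(col("amount"))
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("running_sum", sum(col("amount")).over(byRows))
```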
6e6993a [SPARK-41933][FOLLOWUP][CONNECT] Correct an error message ### What changes were proposed in this pull request? This PR follows up on https://github.com/apache/spark/pull/39441 to fix a wrong error message. ### Why are the changes needed? Error message correction. ### Does this PR introduce _any_ user-facing change? No; it only changes an error message. ### How was this patch tested? The existing CI should pass. Closes #40112 from itholic/SPARK-41933-followup. Lead-authored-by: itholic <haejoon.lee@databricks.com> Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 58efc4b469c229cd649fe28e5f201824cc3cfc07) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 February 2023, 06:36:22 UTC
d89481a [SPARK-42524][BUILD] Upgrade numpy and pandas in the release Dockerfile ### What changes were proposed in this pull request? Upgrade pandas from 1.1.5 to 1.5.3, numpy from 1.19.4 to 1.20.3 in the Dockerfile used for Spark releases. They are also what we use to cut `v3.4.0-rc1`. ### Why are the changes needed? Otherwise, errors are raised as shown below when building release docs. ``` ImportError: Warning: Latest version of pandas (1.5.3) is required to generate the documentation; however, your version was 1.1.5 ImportError: this version of pandas is incompatible with numpy < 1.20.3 your numpy version is 1.19.4. Please upgrade numpy to >= 1.20.3 to use this pandas version ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual tests. Closes #40111 from xinrong-meng/docker_lib. Authored-by: Xinrong Meng <xinrong@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit d09742b955782fc9717aaa0a76f067ccdf241010) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 February 2023, 02:38:30 UTC
df800d0 [SPARK-41775][PYTHON][FOLLOW-UP] Updating docs for readability ### What changes were proposed in this pull request? Added minor UI fixes. <img width="732" alt="image" src="https://user-images.githubusercontent.com/81988348/220488925-eda62d80-d54d-41e9-a9ec-53d02b6fb94d.png"> <img width="725" alt="image" src="https://user-images.githubusercontent.com/81988348/220488948-929b1c35-4da7-4317-9883-078c2a57896a.png"> <img width="693" alt="image" src="https://user-images.githubusercontent.com/81988348/220488975-fdc34ae5-a539-4557-993c-d740232b29b5.png"> ### Why are the changes needed? To make the documentation easier to read. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #40110 from rithwik-db/docs-update-2. Authored-by: Rithwik Ediga Lakhamsani <rithwik.ediga@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ba24dcec42bcd45caee5a4866137bc352cba02ef) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 February 2023, 02:30:52 UTC
6559928 [SPARK-42406][SQL] Fix check for missing required fields of to_protobuf ### What changes were proposed in this pull request? The Protobuf serializer (used in `to_protobuf()`) should error if non-nullable fields (i.e. protobuf `required` fields) are missing from the schema of the catalyst record being converted to a protobuf. But the `isNullable()` method used for this check returns the opposite (see PR comment in the diff). As a result, the serializer incorrectly requires fields that are optional. This PR fixes this check (see PR comment in the diff). This also requires a corresponding fix for a couple of unit tests. In order to use a Protobuf message with a `required` field, a Protobuf version 2 file `proto2_messages.proto` is added. Two tests are updated to verify that missing required fields result in an error. ### Why are the changes needed? This is needed to fix a bug where we were incorrectly enforcing a schema check on optional fields rather than on required fields. ### Does this PR introduce _any_ user-facing change? It fixes a bug and gives more flexibility for user queries. ### How was this patch tested? - Updated unit tests Closes #40080 from rangadi/fix-required-field-check. Authored-by: Raghu Angadi <raghu.angadi@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit fb5647732fa2f49838f803f67ea11b20fc14282b) Signed-off-by: Gengliang Wang <gengliang@apache.org> 22 February 2023, 01:34:27 UTC
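For orientation, a hedged sketch of a `to_protobuf` call affected by this check; the message name, descriptor-file path, and input columns are hypothetical:
```scala
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.protobuf.functions.to_protobuf

// Assumes a DataFrame `df` whose `id` and `name` columns match the (hypothetical)
// `Person` message. With this fix, the serializer errors only when a proto2
// `required` field is missing from the input struct, and no longer rejects
// structs that merely omit optional fields.
val out = df.select(
  to_protobuf(struct(col("id"), col("name")), "Person", "/path/to/proto2_messages.desc")
    .as("proto"))
```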
5c8ddb4 [SPARK-42514][CONNECT] Scala Client add partition transform functions ### What changes were proposed in this pull request? This PR aims to add the partition transform functions to the Scala Spark Connect client. ### Why are the changes needed? Provide the same APIs in the Scala Spark Connect client as in the original Dataset API. ### Does this PR introduce _any_ user-facing change? Yes, it adds new functions to the Spark Connect Scala client. ### How was this patch tested? - Added a new test - Manually checked `connect-client-jvm` and `connect` with Scala 2.13 Closes #40105 from LuciferYang/partition-transforms-functions. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 4f3afdcd7561db38f3c0427d31db4f27fa94a83c) Signed-off-by: Herman van Hovell <herman@databricks.com> 22 February 2023, 00:47:56 UTC
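A hedged sketch of the partition transform functions in the place they are typically used, a v2 `partitionedBy` clause; the table and column names are hypothetical:
```scala
import org.apache.spark.sql.functions.{bucket, col, years}

// Assumes a DataFrame `events` with `id` and `ts` columns, plus a v2 catalog
// that supports transform-based partitioning.
events.writeTo("catalog.db.events_by_year")
  .partitionedBy(years(col("ts")), bucket(8, col("id")))
  .create()
```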
9a7881a [SPARK-42002][CONNECT][FOLLOW-UP] Add Required/Optional notions to writer v2 proto ### What changes were proposed in this pull request? Following the existing proto style guide, we should always add `Required`/`Optional` to the proto documentation. ### Why are the changes needed? Improve documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #40106 from amaliujia/rw-fix-proto. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5097c669ffae23997db00b8f2eec89abb4f33cfc) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 February 2023, 23:40:03 UTC
85fab6b [SPARK-42495][CONNECT] Scala Client add Misc, String, and Date/Time functions ### What changes were proposed in this pull request? This PR adds the following functions to the Scala client: - Misc functions. - String functions. - Date/Time functions. ### Why are the changes needed? We want to provide the same APIs in the Scala client as in the original Dataset API. ### Does this PR introduce _any_ user-facing change? Yes, it adds new functions to the Spark Connect Scala client. ### How was this patch tested? Added tests to `PlanGenerationTestSuite` (and indirectly to `ProtoToPlanTestSuite`). Overloads are tested in `FunctionTestSuite`. Closes #40089 from hvanhovell/functions-2. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit b36d1484c1a090a33d9add056730128b9ba5729f) Signed-off-by: Herman van Hovell <herman@databricks.com> 21 February 2023, 14:00:39 UTC
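A short, hedged sample drawing one function from each of the newly covered groups; the column names are assumptions:
```scala
import org.apache.spark.sql.functions._

// Assumes a DataFrame `df` with `payload`, `name` and `day` string columns.
df.select(
  md5(col("payload")).as("checksum"),                              // misc
  upper(trim(col("name"))).as("name_clean"),                       // string
  date_add(to_date(col("day"), "yyyy-MM-dd"), 7).as("next_week"))  // date/time
```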
6fda64a [SPARK-41765][SQL] Pull out v1 write metrics to WriteFiles ### What changes were proposed in this pull request? Pull out the metrics that are in `InsertIntoHiveTable` and `InsertIntoHadoopFsRelationCommand` to `WriteFiles`. ### Why are the changes needed? Move metrics to the right place. ### Does this PR introduce _any_ user-facing change? Yes, the SQL UI metrics move from `V1WriteCommand` to `WriteFiles`, shown here with `spark.sql.optimizer.plannedWrite.enabled` disabled and enabled: // disabled <img width="314" alt="image" src="https://user-images.githubusercontent.com/12025282/220267296-62d2deef-f8d8-4e71-adc6-23e416f0777c.png"> // enabled <img width="319" alt="image" src="https://user-images.githubusercontent.com/12025282/220267151-dd1e7ed9-eb92-44a5-abdc-75f204ecf97e.png"> ### How was this patch tested? Fixed and improved tests. Closes #39428 from ulysses-you/SPARK-41765. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a111a02de1a814c5f335e0bcac4cffb0515557dc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 February 2023, 13:48:17 UTC
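For reference, a hedged sketch of exercising the planned-write path whose `WriteFiles` node now carries these metrics; the output path is hypothetical:
```scala
// When planned write is enabled, v1 file writes go through a WriteFiles node,
// which is where the write metrics now show up in the SQL UI.
spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", "true")  // assumes a SparkSession `spark`
spark.range(1000).write.mode("overwrite").parquet("/tmp/planned_write_demo")
```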