https://github.com/apache/spark

7c465bc Preparing Spark release v3.3.1-rc3 06 October 2022, 05:15:03 UTC
5fe895a [SPARK-40660][SQL][3.3] Switch to XORShiftRandom to distribute elements ### What changes were proposed in this pull request? Cherry-picked from #38106 and reverted changes in RDD.scala: https://github.com/apache/spark/blob/d2952b671a3579759ad9ce326ed8389f5270fd9f/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L507 ### Why are the changes needed? The number of output files has changed since SPARK-40407. [Some downstream projects](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java#L578-L579) use repartition to determine the number of output files in the test. ``` bin/spark-shell --master "local[2]" spark.range(10).repartition(10).write.mode("overwrite").parquet("/tmp/spark/repartition") ``` Before this PR and after SPARK-40407, the number of output files is 8. After this PR or before SPARK-40407, the number of output files is 10. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #38110 from wangyum/branch-3.3-SPARK-40660. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> 06 October 2022, 05:02:22 UTC
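A quick way to double-check the output-file count from the `spark-shell` example above (the path is taken from that example; the snippet is only an illustrative check, not part of the patch):

```scala
// Count the part files written by the repartition example above; with this fix the
// count should be back to 10, whereas after SPARK-40407 alone it was 8.
val outDir = new java.io.File("/tmp/spark/repartition")
val partFiles = outDir.listFiles().count(_.getName.startsWith("part-"))
println(s"part files: $partFiles")
```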
5dc9ba0 [SPARK-40669][SQL][TESTS] Parameterize `rowsNum` in `InMemoryColumnarBenchmark` This PR aims to parameterize `InMemoryColumnarBenchmark` to accept `rowsNum`. This enables us to benchmark more flexibly. ``` build/sbt "sql/test:runMain org.apache.spark.sql.execution.columnar.InMemoryColumnarBenchmark 1000000" ... [info] Running benchmark: Int In-Memory scan [info] Running case: columnar deserialization + columnar-to-row [info] Stopped after 3 iterations, 444 ms [info] Running case: row-based deserialization [info] Stopped after 3 iterations, 462 ms [info] OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Mac OS X 12.6 [info] Apple M1 Max [info] Int In-Memory scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] -------------------------------------------------------------------------------------------------------------------------- [info] columnar deserialization + columnar-to-row 119 148 26 8.4 118.5 1.0X [info] row-based deserialization 119 154 32 8.4 119.5 1.0X ``` ``` $ build/sbt "sql/test:runMain org.apache.spark.sql.execution.columnar.InMemoryColumnarBenchmark 10000000" ... [info] Running benchmark: Int In-Memory scan [info] Running case: columnar deserialization + columnar-to-row [info] Stopped after 3 iterations, 3855 ms [info] Running case: row-based deserialization [info] Stopped after 3 iterations, 4250 ms [info] OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Mac OS X 12.6 [info] Apple M1 Max [info] Int In-Memory scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] -------------------------------------------------------------------------------------------------------------------------- [info] columnar deserialization + columnar-to-row 1082 1285 199 9.2 108.2 1.0X [info] row-based deserialization 1057 1417 335 9.5 105.7 1.0X ``` ``` $ build/sbt "sql/test:runMain org.apache.spark.sql.execution.columnar.InMemoryColumnarBenchmark 20000000" [info] Running benchmark: Int In-Memory scan [info] Running case: columnar deserialization + columnar-to-row [info] Stopped after 3 iterations, 8482 ms [info] Running case: row-based deserialization [info] Stopped after 3 iterations, 7534 ms [info] OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Mac OS X 12.6 [info] Apple M1 Max [info] Int In-Memory scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] -------------------------------------------------------------------------------------------------------------------------- [info] columnar deserialization + columnar-to-row 2261 2828 555 8.8 113.1 1.0X [info] row-based deserialization 1788 2511 1187 11.2 89.4 1.3X ``` No. This is a benchmark test code. Manually. Closes #38114 from dongjoon-hyun/SPARK-40669. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 95cfdc694d3e0b68979cd06b78b52e107aa58a9f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 October 2022, 01:08:15 UTC
5483607 [SPARK-39725][BUILD][3.3] Upgrade `jetty` to 9.4.48.v20220622 ### What changes were proposed in this pull request? Upgrade jetty from 9.4.46.v20220331 to 9.4.48.v20220622. [Release notes](https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.48.v20220622) ### Why are the changes needed? [CVE-2022-2047](https://nvd.nist.gov/vuln/detail/CVE-2022-2047) and [CVE-2022-2048](https://nvd.nist.gov/vuln/detail/CVE-2022-2048). This is rated 7.5 HIGH. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GitHub Actions. Closes #38098 from bjornjorgensen/patch-1. Lead-authored-by: Bjørn Jørgensen <47577197+bjornjorgensen@users.noreply.github.com> Co-authored-by: Bjørn <bjornjorgensen@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 05 October 2022, 00:50:57 UTC
d7b78a3 [SPARK-40648][YARN][TESTS][3.3] Add @ExtendedLevelDBTest to LevelDB relevant tests in the yarn module ### What changes were proposed in this pull request? SPARK-40490 makes the test cases related to `YarnShuffleIntegrationSuite` verify the `registeredExecFile` reload test scenario again, so this PR adds `ExtendedLevelDBTest` to the LevelDB-relevant tests in the `yarn` module so that MacOS/Apple Silicon can skip them through `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest`. ### Why are the changes needed? According to convention, add `ExtendedLevelDBTest` to LevelDB-relevant tests so that the `yarn` module can skip these tests through `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` on MacOS/Apple Silicon. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual test on MacOS/Apple Silicon ``` build/sbt clean "yarn/testOnly *YarnShuffleIntegrationSuite*" -Pyarn -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest build/sbt clean "yarn/testOnly *YarnShuffleAuthSuite*" -Pyarn -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest build/sbt clean "yarn/testOnly *YarnShuffleAlternateNameConfigSuite*" -Pyarn -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest ``` **Before** All 3 cases aborted as follows: ``` [info] YarnShuffleIntegrationSuite: [info] org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite *** ABORTED *** (1 second, 144 milliseconds) [info] java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, /Users/yangjie01/SourceCode/git/spark-source/target/tmp/libleveldbjni-64-1-7065283280142546801.8: dlopen(/Users/yangjie01/SourceCode/git/spark-source/target/tmp/libleveldbjni-64-1-7065283280142546801.8, 1): no suitable image found.
Did find: [info] /Users/yangjie01/SourceCode/git/spark-source/target/tmp/libleveldbjni-64-1-7065283280142546801.8: no matching architecture in universal wrapper [info] /Users/yangjie01/SourceCode/git/spark-source/target/tmp/libleveldbjni-64-1-7065283280142546801.8: no matching architecture in universal wrapper] [info] at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182) [info] at org.fusesource.hawtjni.runtime.Library.load(Library.java:140) [info] at org.fusesource.leveldbjni.JniDBFactory.<clinit>(JniDBFactory.java:48) [info] at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:48) [info] at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:126) [info] at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:99) [info] at org.apache.spark.network.shuffle.ExternalBlockHandler.<init>(ExternalBlockHandler.java:90) [info] at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:247) [info] at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) [info] at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:475) [info] at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:758) [info] at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) [info] at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) [info] at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327) [info] at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) [info] at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) [info] at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494) [info] at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) [info] at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597) [info] at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) [info] at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) [info] at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327) [info] at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) [info] at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:111) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:64) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:750) ``` **After** All 3 case as follows: ``` [info] YarnShuffleAlternateNameConfigSuite: [info] 
Run completed in 1 second, 288 milliseconds. [info] Total number of tests run: 0 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 0, failed 0, ``` Closes #38096 from LuciferYang/SPARK-40648-33. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 04 October 2022, 20:40:37 UTC
2adee97 [SPARK-40636][CORE] Fix wrong remained shuffles log in BlockManagerDecommissioner ### What changes were proposed in this pull request? Fix the wrong "remained" shuffles log in BlockManagerDecommissioner. ### Why are the changes needed? BlockManagerDecommissioner should log the correct number of remaining shuffles. The current log uses the total number of shuffles as the remaining count. ``` 4 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:15.035 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:45.069 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested Closes #38078 from warrenzhu25/deco-log. Authored-by: Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b39f2d6acf25726d99bf2c2fa84ba6a227d0d909) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 04 October 2022, 20:38:27 UTC
90a2775 [SPARK-40617] Fix race condition at the handling of ExecutorMetricsPoller's stageTCMP entries ### What changes were proposed in this pull request? Fix a race condition in ExecutorMetricsPoller between the `getExecutorUpdates()` and `onTaskStart()` methods by avoiding the removal of entries when another stage has not started yet. ### Why are the changes needed? Spurious failures are reported because of the following assert: ``` 22/09/29 09:46:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 3063.0 in stage 1997.0 (TID 677249),5,main] java.lang.AssertionError: assertion failed: task count shouldn't below 0 at scala.Predef$.assert(Predef.scala:223) at org.apache.spark.executor.ExecutorMetricsPoller.decrementCount$1(ExecutorMetricsPoller.scala:130) at org.apache.spark.executor.ExecutorMetricsPoller.$anonfun$onTaskCompletion$3(ExecutorMetricsPoller.scala:135) at java.base/java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1822) at org.apache.spark.executor.ExecutorMetricsPoller.onTaskCompletion(ExecutorMetricsPoller.scala:135) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:737) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) 22/09/29 09:46:24 INFO MemoryStore: MemoryStore cleared 22/09/29 09:46:24 INFO BlockManager: BlockManager stopped 22/09/29 09:46:24 INFO ShutdownHookManager: Shutdown hook called 22/09/29 09:46:24 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1664443624160_0001/spark-93efc2d4-84de-494b-a3b7-2cb1c3a45426 ``` I have checked the code, and the basic assumption of having at least as many `onTaskStart()` calls as `onTaskCompletion()` calls for the same `stageId` & `stageAttemptId` pair is correct. But there is a race condition between `getExecutorUpdates()` and `onTaskStart()`. First of all, we have two different threads: - task runner: executes the task and informs `ExecutorMetricsPoller` about task start and completion - heartbeater: uses the `ExecutorMetricsPoller` to get the metrics To show the race condition, assume a task that was running on its own (no other tasks were running) has just finished. This decreases the `count` from 1 to 0. On the task runner thread, let's say a new task starts, so the execution is in the `onTaskStart()` method: assume `countAndPeaks` has already been computed and the counter is 0, but the counter has not been incremented yet. So we are in between the following two lines: ```scala val countAndPeaks = stageTCMP.computeIfAbsent((stageId, stageAttemptId), _ => TCMP(new AtomicLong(0), new AtomicLongArray(ExecutorMetricType.numMetrics))) val stageCount = countAndPeaks.count.incrementAndGet() ``` Now let's look at the other thread (heartbeater), where `getExecutorUpdates()` is running and has reached the `removeIfInactive()` method: ```scala def removeIfInactive(k: StageKey, v: TCMP): TCMP = { if (v.count.get == 0) { logDebug(s"removing (${k._1}, ${k._2}) from stageTCMP") null } else { v } } ``` Here the entry is removed from `stageTCMP` because the count is 0. Back on the task runner thread, we increase the counter to 1, but that value is lost because there is no longer an entry in `stageTCMP` for this stage and attempt.
So if a new task comes, we will have 1 instead of 2 in `stageTCMP`, and when those two tasks finish, the second one will decrease the counter from 0 to -1. This is when the assert is raised. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. I managed to reproduce the issue with a temporary test: ```scala test("reproduce assert failure") { val testMemoryManager = new TestMemoryManager(new SparkConf()) val taskId = new AtomicLong(0) val runFlag = new AtomicBoolean(true) val poller = new ExecutorMetricsPoller(testMemoryManager, 1000, None) val callUpdates = new Thread("getExecutorOpdates") { override def run() { while (runFlag.get()) { poller.getExecutorUpdates.size } } } val taskStartRunner1 = new Thread("taskRunner1") { override def run() { while (runFlag.get) { var l = taskId.incrementAndGet() poller.onTaskStart(l, 0, 0) poller.onTaskCompletion(l, 0, 0) } } } val taskStartRunner2 = new Thread("taskRunner2") { override def run() { while (runFlag.get) { var l = taskId.incrementAndGet() poller.onTaskStart(l, 0, 0) poller.onTaskCompletion(l, 0, 0) } } } val taskStartRunner3 = new Thread("taskRunner3") { override def run() { while (runFlag.get) { var l = taskId.incrementAndGet() var m = taskId.incrementAndGet() poller.onTaskStart(l, 0, 0) poller.onTaskStart(m, 0, 0) poller.onTaskCompletion(l, 0, 0) poller.onTaskCompletion(m, 0, 0) } } } callUpdates.start() taskStartRunner1.start() taskStartRunner2.start() taskStartRunner3.start() Thread.sleep(1000 * 20) runFlag.set(false) callUpdates.join() taskStartRunner1.join() taskStartRunner2.join() taskStartRunner3.join() } ``` The assert that was raised is: ``` Exception in thread "taskRunner3" java.lang.AssertionError: assertion failed: task count shouldn't below 0 at scala.Predef$.assert(Predef.scala:223) at org.apache.spark.executor.ExecutorMetricsPoller.decrementCount$1(ExecutorMetricsPoller.scala:130) at org.apache.spark.executor.ExecutorMetricsPoller.$anonfun$onTaskCompletion$3(ExecutorMetricsPoller.scala:135) at java.base/java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1828) at org.apache.spark.executor.ExecutorMetricsPoller.onTaskCompletion(ExecutorMetricsPoller.scala:135) at org.apache.spark.executor.ExecutorMetricsPollerSuite$$anon$4.run(ExecutorMetricsPollerSuite.scala:64) ``` But when I switched off `removeIfInactive()` with the following code: ```scala if (false && v.count.get == 0) { logDebug(s"removing (${k._1}, ${k._2}) from stageTCMP") null } else { v } ``` then no assert was raised. Closes #38056 from attilapiros/SPARK-40617. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com> (cherry picked from commit 564a51b64e71f7402c2674de073b3b18001df56f) Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com> 03 October 2022, 13:27:51 UTC
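The lost-update interleaving described above can be shown outside Spark with a small, self-contained sketch (illustrative only, not Spark code): one thread obtains the counter via `computeIfAbsent`, another thread drops the entry while its count is still 0, and the first thread's increment then lands on an object that is no longer in the map.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

object LostUpdateDemo {
  def main(args: Array[String]): Unit = {
    val stageTCMP = new ConcurrentHashMap[String, AtomicLong]()

    // Task runner thread: the entry is created (or found) with count 0 ...
    val counter = stageTCMP.computeIfAbsent("stage-1997-0", (_: String) => new AtomicLong(0))

    // ... heartbeater thread interleaves here and removes entries whose count is 0.
    stageTCMP.computeIfPresent("stage-1997-0", (_: String, v: AtomicLong) => if (v.get == 0) null else v)

    // Task runner thread resumes: the increment is applied to an orphaned object.
    counter.incrementAndGet()
    println(stageTCMP.containsKey("stage-1997-0")) // false: the +1 is lost
  }
}
```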
7cf579f [SPARK-40612][CORE] Fixing the principal used for delegation token renewal on non-YARN resource managers ### What changes were proposed in this pull request? When the delegation token is fetched for the first time (see the `fetchDelegationTokens()` call in `HadoopFSDelegationTokenProvider#getTokenRenewalInterval()`), the principal is the current user, but the subsequent token renewals (see `obtainDelegationTokens()`, where `getTokenRenewer()` is used to identify the principal) use a MapReduce/YARN-specific principal even on resource managers other than YARN. This PR fixes `getTokenRenewer()` to use the current user instead of `org.apache.hadoop.mapred.Master.getMasterPrincipal(hadoopConf)` when the resource manager is not YARN. The condition `(master != null && master.contains("yarn"))` is the very same one we already have in `hadoopFSsToAccess()`. I would like to thank squito, who did the investigation of the problem that led to this PR. ### Why are the changes needed? To avoid `org.apache.hadoop.security.AccessControlException: Permission denied.` for long-running applications. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. Closes #38048 from attilapiros/SPARK-40612. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6484992535767ae8dc93df1c79efc66420728155) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 September 2022, 21:52:49 UTC
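A minimal sketch of the logic described above (method name and shape are illustrative, not the exact Spark source; it only relies on the Hadoop `Master.getMasterPrincipal` and `UserGroupInformation` APIs mentioned in the PR text):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.SparkConf

// Pick the delegation-token renewer: only fall back to the MapReduce/YARN master
// principal when the cluster manager really is YARN; otherwise renew as the current user.
def tokenRenewer(sparkConf: SparkConf, hadoopConf: Configuration): String = {
  val master = sparkConf.get("spark.master", null)
  if (master != null && master.contains("yarn")) {
    org.apache.hadoop.mapred.Master.getMasterPrincipal(hadoopConf)
  } else {
    UserGroupInformation.getCurrentUser().getUserName()
  }
}
```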
367d3e5 [SPARK-40574][DOCS] Enhance DROP TABLE documentation ### What changes were proposed in this pull request? This PR adds `PURGE` in `DROP TABLE` documentation. Related documentation and code: 1. Hive `DROP TABLE` documentation: https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl <img width="877" alt="image" src="https://user-images.githubusercontent.com/5399861/192425153-63ac5373-dd34-48b3-864c-324cf5ba5db9.png"> 2. Hive code: https://github.com/apache/hive/blob/rel/release-2.3.9/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L1185-L1209 3. Spark code: https://github.com/apache/spark/blob/v3.3.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1317-L1327 ### Why are the changes needed? Enhance documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? manual test. Closes #38011 from wangyum/SPARK-40574. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 11eefc81e5c1f3ec7db6df8ba068a7155f7abda3) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 September 2022, 22:32:44 UTC
b53106e [SPARK-40583][DOCS] Fixing artifactId name in `cloud-integration.md` ### What changes were proposed in this pull request? I am changing the name of the artifactId that enables the integration with several cloud infrastructures. ### Why are the changes needed? The name of the package is wrong and it does not exist. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It is not needed. Closes #38021 from danitico/fix/SPARK-40583. Authored-by: Daniel Ranchal Parrado <daniel.ranchal@vlex.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit dac58f82d1c10fb91f85fd9670f88d88dbe2feea) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 September 2022, 22:25:18 UTC
0eb8721 [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy` ### What changes were proposed in this pull request? This PR aims to add a new legacy configuration to keep `grouping__id` value like the released Apache Spark 3.2 and 3.3. Please note that this syntax is non-SQL standard and even Hive doesn't support it. ```SQL hive> SELECT version(); OK 3.1.3 r4df4d75bf1e16fe0af75aad0b4179c34c07fc975 Time taken: 0.111 seconds, Fetched: 1 row(s) hive> SELECT count(*), grouping__id from t GROUP BY a GROUPING SETS (b); FAILED: SemanticException 1:63 [Error 10213]: Grouping sets expression is not in GROUP BY key. Error encountered near token 'b' ``` ### Why are the changes needed? SPARK-40218 fixed a bug caused by SPARK-34932 (at Apache Spark 3.2.0). As a side effect, `grouping__id` values are changed. - Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.3.0. ```scala scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show() +--------+------------+ |count(1)|grouping__id| +--------+------------+ | 1| 1| | 1| 1| +--------+------------+ ``` - After SPARK-40218, Apache Spark 3.4.0, 3.3.1, 3.2.3 ```scala scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show() +--------+------------+ |count(1)|grouping__id| +--------+------------+ | 1| 2| | 1| 2| +--------+------------+ ``` - This PR (Apache Spark 3.4.0, 3.3.1, 3.2.3) ```scala scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show() +--------+------------+ |count(1)|grouping__id| +--------+------------+ | 1| 2| | 1| 2| +--------+------------+ scala> sql("set spark.sql.legacy.groupingIdWithAppendedUserGroupBy=true") res1: org.apache.spark.sql.DataFrame = [key: string, value: string]scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show() +--------+------------+ |count(1)|grouping__id| +--------+------------+ | 1| 1| | 1| 1| +--------+------------+ ``` ### Does this PR introduce _any_ user-facing change? No, this simply added back the previous behavior by the legacy configuration. ### How was this patch tested? Pass the CIs. Closes #38001 from dongjoon-hyun/SPARK-40562. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5c0ebf3d97ae49b6e2bd2096c2d590abf4d725bd) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 September 2022, 09:14:32 UTC
ec5c2d0 [SPARK-38717][SQL][3.3] Handle Hive's bucket spec case preserving behaviour ### What changes were proposed in this pull request? When converting a native table metadata representation `CatalogTable` to `HiveTable` make sure bucket spec uses an existing column. ### Does this PR introduce _any_ user-facing change? Hive metastore seems to be not case preserving with columns but case preserving with bucket spec, which means the following table creation: ``` CREATE TABLE t( c STRING, B_C STRING ) PARTITIONED BY (p_c STRING) CLUSTERED BY (B_C) INTO 4 BUCKETS STORED AS PARQUET ``` followed by a query: ``` SELECT * FROM t ``` fails with: ``` Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns B_C is not part of the table columns ([FieldSchema(name:c, type:string, comment:null), FieldSchema(name:b_c, type:string, comment:null)] ``` ### Why are the changes needed? Bug fix. ### How was this patch tested? Added new UT. Closes #37982 from peter-toth/SPARK-38717-handle-upper-case-bucket-spec-3.3. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 September 2022, 05:07:48 UTC
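A minimal sketch of the idea (the helper name is illustrative, not the Spark source): resolve the bucket columns against the table's data columns case-insensitively before handing the spec to Hive, so the bucket spec refers to a column Hive actually knows about.

```scala
// Resolve each bucket column to the matching data column, ignoring case.
def alignBucketColumns(bucketCols: Seq[String], dataCols: Seq[String]): Seq[String] =
  bucketCols.map(c => dataCols.find(_.equalsIgnoreCase(c)).getOrElse(c))

// With the example table above, Hive stores the columns lower-cased,
// so the bucket column "B_C" should resolve to "b_c":
alignBucketColumns(Seq("B_C"), Seq("c", "b_c")) // Seq("b_c")
```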
76b7ea2 [SPARK-40322][DOCS][3.3] Fix all dead links in the docs This PR backports https://github.com/apache/spark/pull/37981 to branch-3.3. The original PR description: ### What changes were proposed in this pull request? This PR fixes any dead links in the documentation. ### Why are the changes needed? Correct the document. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? manual test. Closes #37984 from wangyum/branch-3.3-SPARK-40322. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 September 2022, 08:12:16 UTC
0f99027 [SPARK-40547][DOCS] Fix dead links in sparkr-vignettes.Rmd ### What changes were proposed in this pull request? This PR fix all dead links in sparkr-vignettes.Rmd. ### Why are the changes needed? binary-release-hadoop3.log logs: ``` yumwangLM-SHC-16508156 output % tail -n 30 binary-release-hadoop3.log * this is package ‘SparkR’ version ‘3.3.1’ * package encoding: UTF-8 * checking CRAN incoming feasibility ... NOTE Maintainer: ‘The Apache Software Foundation <devspark.apache.org>’ New submission Package was archived on CRAN CRAN repository db overrides: X-CRAN-Comment: Archived on 2021-06-28 as issues were not corrected in time. Should use tools::R_user_dir(). Found the following (possibly) invalid URLs: URL: https://spark.apache.org/docs/latest/api/R/column_aggregate_functions.html From: inst/doc/sparkr-vignettes.html Status: 404 Message: Not Found URL: https://spark.apache.org/docs/latest/api/R/read.df.html From: inst/doc/sparkr-vignettes.html Status: 404 Message: Not Found URL: https://spark.apache.org/docs/latest/api/R/sparkR.session.html From: inst/doc/sparkr-vignettes.html Status: 404 Message: Not Found * checking package namespace information ... OK * checking package dependencies ...% ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? manual test. Closes #37983 from wangyum/fix-links. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7d13d467a88521c81dfbd9453edda444a13e8855) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 September 2022, 15:58:09 UTC
431e90d [SPARK-39200][CORE] Make Fallback Storage readFully on content ### What changes were proposed in this pull request? As described in the bug report, the fallback storage doesn't read fully and then causes `org.apache.spark.shuffle.FetchFailedException: Decompression error: Corrupted block detected`. This is an attempt to fix this by reading the underlying stream fully. ### Why are the changes needed? Fix a bug documented in SPARK-39200. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Wrote a unit test. Closes #37960 from ukby1234/SPARK-39200. Authored-by: Frank Yin <franky@ziprecruiter.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 07061f1a07a96f59ae42c9df6110eb784d2f3dab) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 September 2022, 11:23:35 UTC
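The gist of the fix, as a hedged, self-contained sketch (the helper below is illustrative; the actual patch changes the fallback storage read path): a single `InputStream.read` may return fewer bytes than requested, so the buffer must be filled in a loop before decompression.

```scala
import java.io.{EOFException, InputStream}

// Read exactly `len` bytes into `buf` starting at `offset`, looping because
// InputStream.read may legally return a short read.
def readFully(in: InputStream, buf: Array[Byte], offset: Int, len: Int): Unit = {
  var read = 0
  while (read < len) {
    val n = in.read(buf, offset + read, len - read)
    if (n < 0) throw new EOFException(s"Expected $len bytes but only got $read")
    read += n
  }
}
```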
fb42c3e [SPARK-40535][SQL] Fix bug the buffer of AggregatingAccumulator will not be created if the input rows is empty ### What changes were proposed in this pull request? When `AggregatingAccumulator` serialize aggregate buffer, may throwing NPE. There is one test case could repeat this error. ``` val namedObservation = Observation("named") val df = spark.range(1, 10, 1, 10) val observed_df = df.observe( namedObservation, percentile_approx($"id", lit(0.5), lit(100)).as("percentile_approx_val")) observed_df.collect() ``` throws exception as follows: ``` 13:45:10.976 ERROR org.apache.spark.util.Utils: Exception encountered java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:641) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.getBufferObject(interfaces.scala:602) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:624) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:205) at org.apache.spark.sql.execution.AggregatingAccumulator.withBufferSerialized(AggregatingAccumulator.scala:33) at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:186) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1245) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1456) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:663) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` ### Why are the changes needed? Fix a bug. 
After my investigation, the root cause is that the buffer of `AggregatingAccumulator` is not created if the input is empty. ### Does this PR introduce _any_ user-facing change? Yes. Users will see the correct results. ### How was this patch tested? New test case. Closes #37977 from beliefer/SPARK-37203_followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 7bbd975f165ec73c17e4604050f0828e3e5b9c0e) Signed-off-by: Max Gekk <max.gekk@gmail.com> 23 September 2022, 11:22:23 UTC
81c2b0c Preparing development version 3.3.2-SNAPSHOT 23 September 2022, 08:29:34 UTC
1d3b8f7 Preparing Spark release v3.3.1-rc2 23 September 2022, 08:29:17 UTC
2bae604 [SPARK-40407][SQL] Fix the potential data skew caused by df.repartition ### What changes were proposed in this pull request? ``` scala val df = spark.range(0, 100, 1, 50).repartition(4) val v = df.rdd.mapPartitions { iter => { Iterator.single(iter.length) }.collect() println(v.mkString(",")) ``` The above simple code outputs `50,0,0,50`, which means there is no data in partition 1 and partition 2. The RoundRobin seems to ensure to distribute the records evenly *in the same partition*, and not guarantee it between partitions. Below is the code to generate the key ``` scala case RoundRobinPartitioning(numPartitions) => // Distributes elements evenly across output partitions, starting from a random partition. var position = new Random(TaskContext.get().partitionId()).nextInt(numPartitions) (row: InternalRow) => { // The HashPartitioner will handle the `mod` by the number of partitions position += 1 position } ``` In this case, There are 50 partitions, each partition will only compute 2 elements. The issue for RoundRobin here is it always starts with position=2 to do the Roundrobin. See the output of Random ``` scala scala> (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(4) + " ")) // the position is always 2. 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ``` Similarly, the below Random code also outputs the same value, ``` scala (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(2) + " ")) (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(4) + " ")) (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(8) + " ")) (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(16) + " ")) (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(32) + " ")) ``` Consider partition 0, the total elements are [0, 1], so when shuffle writes, for element 0, the key will be (position + 1) = 2 + 1 = 3%4=3, the element 1, the key will be (position + 1)=(3+1)=4%4 = 0 consider partition 1, the total elements are [2, 3], so when shuffle writes, for element 2, the key will be (position + 1) = 2 + 1 = 3%4=3, the element 3, the key will be (position + 1)=(3+1)=4%4 = 0 The calculation is also applied for other left partitions since the starting position is always 2 for this case. So, as you can see, each partition will write its elements to Partition [0, 3], which results in Partition [1, 2] without any data. This PR changes the starting position of RoundRobin. The default position calculated by `new Random(partitionId).nextInt(numPartitions)` may always be the same for different partitions, which means each partition will output the data into the same keys when shuffle writes, and some keys may not have any data in some special cases. ### Why are the changes needed? The PR can fix the data skew issue for the special cases. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Will add some tests and watch CI pass Closes #37855 from wbo4958/roundrobin-data-skew. 
Authored-by: Bobby Wang <wbo4958@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f6c4e58b85d7486c70cd6d58aae208f037e657fa) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 September 2022, 12:59:27 UTC
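The starting-position problem above can be reproduced outside Spark; conversely, scrambling the seed gives each partition a different starting position. The sketch below uses `MurmurHash3` purely for illustration — the related follow-up (SPARK-40660, earlier in this log) switches to Spark's `XORShiftRandom`, which hashes the seed internally.

```scala
import scala.util.hashing.MurmurHash3

// With the raw partition id as seed, the first nextInt(4) is always 2 (see the output above).
(1 to 20).foreach(pid => print(new java.util.Random(pid).nextInt(4) + " "))
println()
// Scrambling the seed first spreads the starting positions across partitions (values vary).
(1 to 20).foreach(pid => print(new java.util.Random(MurmurHash3.stringHash(pid.toString)).nextInt(4) + " "))
println()
```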
b608ba3 [SPARK-40490][YARN][TESTS][3.3] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios ### What changes were proposed in this pull request? After SPARK-17321, `YarnShuffleService` will persist data to local shuffle state db/reload data from local shuffle state db only when Yarn NodeManager start with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. `YarnShuffleIntegrationSuite` not set `YarnConfiguration#NM_RECOVERY_ENABLED` and the default value of the configuration is false, so `YarnShuffleIntegrationSuite` will neither trigger data persistence to the db nor verify the reload of data. This pr aims to let `YarnShuffleIntegrationSuite` restart the verification of registeredExecFile reload scenarios, to achieve this goal, this pr make the following changes: 1. Add a new un-document configuration `spark.yarn.shuffle.testing` to `YarnShuffleService`, and Initialize `_recoveryPath` when `_recoveryPath == null && spark.yarn.shuffle.testing == true`. 2. Only set `spark.yarn.shuffle.testing = true` in `YarnShuffleIntegrationSuite`, and add assertions to check `registeredExecFile` is not null to ensure that registeredExecFile reload scenarios will be verified. ### Why are the changes needed? Fix registeredExecFile reload test scenarios. Why not test by configuring `YarnConfiguration#NM_RECOVERY_ENABLED` as true? This configuration has been tried **Hadoop 3.3.2** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite -Phadoop-3 ``` ``` YarnShuffleIntegrationSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/org/iq80/leveldb/DBException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:111) at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) ... 
Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.iq80.leveldb.DBException at java.net.URLClassLoader.findClass(URLClassLoader.java:387) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) ``` **Hadoop 2.7.4** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite -Phadoop-2 ``` ``` YarnShuffleIntegrationSuite: org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite *** ABORTED *** java.lang.IllegalArgumentException: Cannot support recovery with an ephemeral server port. Check the setting of yarn.nodemanager.address at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... Run completed in 3 seconds, 992 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 *** 1 SUITE ABORTED *** ``` From the above test, we need to use a fixed port to enable Yarn NodeManager recovery, but this is difficult to be guaranteed in UT, so this pr try a workaround way. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes #37962 from LuciferYang/SPARK-40490-33. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 September 2022, 09:34:20 UTC
50f0ced [SPARK-40385][SQL] Fix interpreted path for companion object constructor ### What changes were proposed in this pull request? Fixes encoding of classes that use companion object constructors in the interpreted path. Without this change, the spec that is added in this change would fail with ``` ... Cause: java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: Couldn't find a valid constructor on interface org.apache.spark.sql.catalyst.ScroogeLikeExample newInstance(interface org.apache.spark.sql.catalyst.ScroogeLikeExample) at org.apache.spark.sql.errors.QueryExecutionErrors$.expressionDecodingError(QueryExecutionErrors.scala:1199) ... ``` As far as I can tell this bug has existed since the initial implementation in SPARK-8288 https://github.com/apache/spark/pull/23062 The existing spec that tested this part of the code incorrectly provided an outerPointer, which hid the bug from that test. ### Why are the changes needed? Fixes a bug; the new spec in ExpressionEncoderSuite shows that this is in fact a bug. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug. ### How was this patch tested? New and existing specs in ExpressionEncoderSuite and ObjectExpressionsSuite. Closes #37837 from eejbyfeldt/spark-40385. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 73e3c36ec89242e4c586a5328e813310f82de728) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 September 2022, 02:46:45 UTC
1f2b5d1 [SPARK-40508][SQL][3.3] Treat unknown partitioning as UnknownPartitioning ### What changes were proposed in this pull request? When running a Spark application against Spark 3.3, I see the following: ``` java.lang.IllegalArgumentException: Unsupported data source V2 partitioning type: CustomPartitioning at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:46) at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioning$$anonfun$apply$1.applyOrElse(V2ScanPartitioning.scala:34) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) ``` The CustomPartitioning works fine with Spark 3.2.1. This PR proposes to relax the code and treat all unknown partitionings the same way as UnknownPartitioning. ### Why are the changes needed? 3.3.0 doesn't seem to warrant such a behavioral change (from that of the 3.2.1 release). ### Does this PR introduce _any_ user-facing change? This would allow users' custom partitioning to continue to work with 3.3.x releases. ### How was this patch tested? Existing test suite. I have run the test using the Cassandra Spark connector and modified Spark (with this patch), which passes. Closes #37957 from tedyu/unk-33. Authored-by: Ted Yu <yuzhihong@gmail.com> Signed-off-by: Chao Sun <sunchao@apple.com> 21 September 2022, 21:50:09 UTC
5b81c0f [MINOR][DOCS][PYTHON] Document datetime.timedelta <> DayTimeIntervalType ### What changes were proposed in this pull request? This PR proposes to document datetime.timedelta support in PySpark in SQL DataType reference page. This support was added in SPARK-37275 ### Why are the changes needed? To show the support of datetime.timedelta. ### Does this PR introduce _any_ user-facing change? Yes, this fixes the documentation. ### How was this patch tested? CI in this PR should validate the build. Closes #37939 from HyukjinKwon/minor-daytimeinterval. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 2aeb8d74c45aea358c0887573f0d549f6111f119) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 September 2022, 06:27:32 UTC
896ecd8 [SPARK-38803][K8S][TESTS] Lower minio cpu to 250m (0.25) from 1 in K8s IT ### What changes were proposed in this pull request? This PR aims to set the minio request cpu to `250m` (0.25). - This value is also recommended in [link](https://docs.gitlab.com/charts/charts/minio/#installation-command-line-options). - There is [no cpu request limitation](https://github.com/minio/minio/blob/a3e317773a2b90a433136e1ff2a8394bc5017c75/helm/minio/values.yaml#L251) on current minio. ### Why are the changes needed? In some cases (such as resource-limited environments), we want to reduce the request cpu of minio. See also: https://github.com/apache/spark/pull/35830#pullrequestreview-929597027 ### Does this PR introduce _any_ user-facing change? No, test only ### How was this patch tested? IT passed Closes #36096 from Yikun/minioRequestCores. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5ea2b386eb866e20540660cdb6ed43792cb29969) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 September 2022, 05:02:58 UTC
883a481 [SPARK-38802][K8S][TESTS] Add Support for `spark.kubernetes.test.(driver|executor)RequestCores` ### What changes were proposed in this pull request? This patch adds support for `spark.kubernetes.test.(driver|executor)RequestCores`, which helps devs set specific cores info for (driver|executor)RequestCores. ### Why are the changes needed? In some cases (such as resource-limited environments), we want to set specific `RequestCores`. See also: https://github.com/apache/spark/pull/35830#pullrequestreview-929597027 ### Does this PR introduce _any_ user-facing change? No, test only ### How was this patch tested? IT passed Closes #36087 from Yikun/SPARK-38802. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 83963828b54bffe99527a004057272bc584cbc26) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 September 2022, 05:01:24 UTC
fca6ab9 [SPARK-40460][SS][3.3] Fix streaming metrics when selecting _metadata ### What changes were proposed in this pull request? Cherry-picked from #37905 Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when selecting `_metadata` column. Because the logical plan from the batch and the actual planned logical plan are mismatched. So, [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348) we cannot find the plan and collect metrics correctly. This PR fixes this by replacing the initial `LogicalPlan` with the `LogicalPlan` containing the metadata column ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing + New UTs Closes #37932 from Yaohua628/spark-40460-3-3. Authored-by: yaohua <yaohua.zhao@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 September 2022, 03:46:58 UTC
d616da7 [SPARK-40468][SQL] Fix column pruning in CSV when _corrupt_record is selected ### What changes were proposed in this pull request? The PR fixes an issue when depending on the name of the `_corrupt_record` field, column pruning would behave differently for a record that has no parsing errors. For example, with a CSV file like this (c1 and c2 columns): ``` 1,a ``` Before the patch, the following query would return: ```scala val df = spark.read .schema("c1 int, c2 string, x string, _corrupt_record string") .csv("file:/tmp/file.csv") .withColumn("x", lit("A")) Result: +---+---+---+---------------+ |c1 |c2 |x |_corrupt_record| +---+---+---+---------------+ |1 |a |A |1,a | +---+---+---+---------------+ ``` However, if you rename the corrupt record column, the result is different (the original, arguably correct, behaviour before https://github.com/apache/spark/commit/959694271e30879c944d7fd5de2740571012460a): ```scala val df = spark.read .option("columnNameCorruptRecord", "corrupt_record") .schema("c1 int, c2 string, x string, corrupt_record string") .csv("file:/tmp/file.csv") .withColumn("x", lit("A")) +---+---+---+--------------+ |c1 |c2 |x |corrupt_record| +---+---+---+--------------+ |1 |a |A |null | +---+---+---+--------------+ ``` This patch fixes the former so both results would return `null` for corrupt record as there are no parsing issues. Note that https://issues.apache.org/jira/browse/SPARK-38523 is still fixed and works correctly. ### Why are the changes needed? Fixes a bug where corrupt record would be non-null even though the record has no parsing errors. ### Does this PR introduce _any_ user-facing change? Yes, fixes the output of corrupt record with additional columns provided by user. Everything should be unchanged outside of that scenario. ### How was this patch tested? I added a unit test that reproduces the issue. Closes #37909 from sadikovi/SPARK-40468. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 0776f9e7bcb10612eb977ed4884e9848aea86c33) Signed-off-by: Max Gekk <max.gekk@gmail.com> 17 September 2022, 07:59:45 UTC
ba6d172 [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema ### What changes were proposed in this pull request? Currently in Parquet V1 read path, Spark will pushdown data filters even if they have no reference in the Parquet read schema. This can cause correctness issues as described in [SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833). The root cause, it seems, is because in the V1 path, we first use `AttributeReference` equality to filter out data columns without partition columns, and then use `AttributeSet` equality to filter out filters with only references to data columns. There's inconsistency in the two steps, when case sensitive check is false. Take the following scenario as example: - data column: `[COL, a]` - partition column: `[col]` - filter: `col > 10` With `AttributeReference` equality, `COL` is not considered equal to `col` (because their names are different), and thus the filtered out data column set is still `[COL, a]`. However, when calculating filters with only reference to data columns, `COL` is **considered equal** to `col`. Consequently, the filter `col > 10`, when checking with `[COL, a]`, is considered to have reference to data columns, and thus will be pushed down to Parquet as data filter. On the Parquet side, since `col` doesn't exist in the file schema (it only has `COL`), when column index enabled, it will incorrectly return wrong number of rows. See [PARQUET-2170](https://issues.apache.org/jira/browse/PARQUET-2170) for more detail. In general, where data columns overlap with partition columns and case sensitivity is false, partition filters will not be filter out before we calculate filters with only reference to data columns, which is incorrect. ### Why are the changes needed? This fixes the correctness bug described in [SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? There are existing test cases for this issue from [SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833). This also modified them to test the scenarios when case sensitivity is on or off. Closes #37881 from sunchao/SPARK-40169. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> 16 September 2022, 17:48:11 UTC
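The consistency point above can be illustrated with a tiny, self-contained sketch (column names taken from the example; no Spark APIs involved): when both steps use the same case-insensitive resolution, the partition filter `col > 10` no longer looks like a data filter and is not pushed down to Parquet.

```scala
val dataCols = Seq("COL", "a")  // data schema columns
val partitionCols = Seq("col")  // partition columns
val filterRefs = Seq("col")     // columns referenced by the filter `col > 10`

// Use the same case-insensitive resolver in both steps.
val resolve: (String, String) => Boolean = _.equalsIgnoreCase(_)
val dataOnlyCols = dataCols.filterNot(c => partitionCols.exists(resolve(_, c)))
val isDataFilter = filterRefs.forall(r => dataOnlyCols.exists(resolve(r, _)))
println(dataOnlyCols)  // List(a)
println(isDataFilter)  // false: `col > 10` is treated as a partition filter, not pushed down
```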
b9a514e [SPARK-40470][SQL] Handle GetArrayStructFields and GetMapValue in "arrays_zip" function ### What changes were proposed in this pull request? This is a follow-up for https://github.com/apache/spark/pull/37833. The PR fixes column names in `arrays_zip` function for the cases when `GetArrayStructFields` and `GetMapValue` expressions are used (see unit tests for more details). Before the patch, the column names would be indexes or an AnalysisException would be thrown in the case of `GetArrayStructFields` example. ### Why are the changes needed? Fixes an inconsistency issue in Spark 3.2 and onwards where the fields would be labeled as indexes instead of column names. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added unit tests that reproduce the issue and confirmed that the patch fixes them. Closes #37911 from sadikovi/SPARK-40470. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9b0f979141ba2c4124d96bc5da69ea5cac51df0d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 September 2022, 13:05:12 UTC
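A small, hedged `spark-shell` example of the `GetArrayStructFields` case described above (the data and names are made up for illustration): after the fix, the zipped element field should keep the extracted field's name rather than a positional index.

```scala
import org.apache.spark.sql.functions._

// Build an array-of-structs column and zip a field extracted from it.
val df = spark.sql("SELECT array(named_struct('a', 1, 'b', 2)) AS arr")
df.select(arrays_zip(col("arr.b")).as("zipped")).printSchema()
// With the fix, the element struct field should be named "b" instead of an index.
```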
507708d [SPARK-40461][INFRA] Set upperbound for pyzmq 24.0.0 for Python linter ### What changes were proposed in this pull request? This PR sets the upperbound for `pyzmq` as `<24.0.0` in our CI Python linter job. The new release seems having a problem (https://github.com/zeromq/pyzmq/commit/2d3327d2e50c2510d45db2fc51488578a737b79b). ### Why are the changes needed? To fix the linter build failure. See https://github.com/apache/spark/actions/runs/3063515551/jobs/4947782771 ``` /tmp/timer_created_0ftep6.c: In function ‘main’: /tmp/timer_created_0ftep6.c:2:5: warning: implicit declaration of function ‘timer_create’ [-Wimplicit-function-declaration] 2 | timer_create(); | ^~~~~~~~~~~~ x86_64-linux-gnu-gcc -pthread tmp/timer_created_0ftep6.o -L/usr/lib/x86_64-linux-gnu -o a.out /usr/bin/ld: tmp/timer_created_0ftep6.o: in function `main': /tmp/timer_created_0ftep6.c:2: undefined reference to `timer_create' collect2: error: ld returned 1 exit status no timer_create, linking librt ************************************************ building 'zmq.libzmq' extension creating build/temp.linux-x86_64-cpython-39/buildutils creating build/temp.linux-x86_64-cpython-39/bundled creating build/temp.linux-x86_64-cpython-39/bundled/zeromq creating build/temp.linux-x86_64-cpython-39/bundled/zeromq/src x86_64-linux-gnu-g++ -pthread -std=c++11 -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DZMQ_HAVE_CURVE=1 -DZMQ_USE_TWEETNACL=1 -DZMQ_USE_EPOLL=1 -DZMQ_IOTHREADS_USE_EPOLL=1 -DZMQ_POLL_BASED_ON_POLL=1 -Ibundled/zeromq/include -Ibundled -I/usr/include/python3.9 -c buildutils/initlibzmq.cpp -o build/temp.linux-x86_64-cpython-39/buildutils/initlibzmq.o buildutils/initlibzmq.cpp:10:10: fatal error: Python.h: No such file or directory 10 | #include "Python.h" | ^~~~~~~~~~ compilation terminated. error: command '/usr/bin/x86_64-linux-gnu-g++' failed with exit code 1 [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for pyzmq ERROR: Could not build wheels for pyzmq, which is required to install pyproject.toml-based projects ``` ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? CI in this PRs should validate it. Closes #37904 from HyukjinKwon/fix-linter. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 254bd80278843b3bc13584ca2f04391a770a78c7) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 September 2022, 01:09:14 UTC
c0acd3f [SPARK-40459][K8S] `recoverDiskStore` should not stop by existing recomputed files ### What changes were proposed in this pull request? This PR aims to ignore `FileExistsException` during `recoverDiskStore` processing. ### Why are the changes needed? Although `recoverDiskStore` is already wrapped by `tryLogNonFatalError`, a single file recovery exception should not block the whole `recoverDiskStore` . https://github.com/apache/spark/blob/5938e84e72b81663ccacf0b36c2f8271455de292/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/shuffle/KubernetesLocalDiskShuffleExecutorComponents.scala#L45-L47 ``` org.apache.commons.io.FileExistsException: ... at org.apache.commons.io.FileUtils.requireAbsent(FileUtils.java:2587) at org.apache.commons.io.FileUtils.moveFile(FileUtils.java:2305) at org.apache.commons.io.FileUtils.moveFile(FileUtils.java:2283) at org.apache.spark.storage.DiskStore.moveFileToBlock(DiskStore.scala:150) at org.apache.spark.storage.BlockManager$TempFileBasedBlockStoreUpdater.saveToDiskStore(BlockManager.scala:487) at org.apache.spark.storage.BlockManager$BlockStoreUpdater.$anonfun$save$1(BlockManager.scala:407) at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1445) at org.apache.spark.storage.BlockManager$BlockStoreUpdater.save(BlockManager.scala:380) at org.apache.spark.storage.BlockManager$TempFileBasedBlockStoreUpdater.save(BlockManager.scala:490) at org.apache.spark.shuffle.KubernetesLocalDiskShuffleExecutorComponents$.$anonfun$recoverDiskStore$14(KubernetesLocalDiskShuffleExecutorComponents.scala:95) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at org.apache.spark.shuffle.KubernetesLocalDiskShuffleExecutorComponents$.recoverDiskStore(KubernetesLocalDiskShuffleExecutorComponents.scala:91) ``` ### Does this PR introduce _any_ user-facing change? No, this will improve the recover rate. ### How was this patch tested? Pass the CIs. Closes #37903 from dongjoon-hyun/SPARK-40459. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit f24bb430122eaa311070cfdefbc82d34b0341701) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 September 2022, 01:04:16 UTC
d7483b5 [SPARK-40429][SQL][3.3] Only set KeyGroupedPartitioning when the referenced column is in the output ### What changes were proposed in this pull request? Backport of [PR](https://github.com/apache/spark/pull/37886) to 3.3: only set `KeyGroupedPartitioning` when the referenced column is in the output. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New test Closes #37901 from huaxingao/3.3. Authored-by: huaxingao <huaxin_gao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 September 2022, 18:28:20 UTC
d8e157d [SPARK-38017][FOLLOWUP][3.3] Hide TimestampNTZ in the doc ### What changes were proposed in this pull request? This PR removes `TimestampNTZ` from the doc about `TimeWindow` and `SessionWindow`. ### Why are the changes needed? As we discussed, it's better to hide `TimestampNTZ` from the doc. https://github.com/apache/spark/pull/35313#issuecomment-1185192162 ### Does this PR introduce _any_ user-facing change? The document will be changed, but there is no compatibility problem. ### How was this patch tested? Built the doc with `SKIP_RDOC=1 SKIP_SQLDOC=1 bundle exec jekyll build` in the `doc` directory. Then, confirmed the generated HTML. Closes #37882 from sarutak/fix-window-doc-3.3. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 September 2022, 00:28:06 UTC
ec40006 [SPARK-40423][K8S][TESTS] Add explicit YuniKorn queue submission test coverage ### What changes were proposed in this pull request? This PR aims to add explicit Yunikorn queue submission test coverage instead of implicit assignment by admission controller. ### Why are the changes needed? - To provide a proper test coverage. - To prevent the side effect of YuniKorn admission controller which overrides all Spark's scheduler settings by default (if we do not edit the rule explicitly). This breaks Apache Spark's default scheduler K8s IT test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually run the CI and check the YuniKorn queue UI. ``` $ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/test" -Dtest.exclude.tags=minikube,local,decom -Dtest.default.exclude.tags= ``` <img width="1197" alt="Screen Shot 2022-09-14 at 2 07 38 AM" src="https://user-images.githubusercontent.com/9700541/190112005-5863bdd3-2e43-4ec7-b34b-a286d1a7c95e.png"> Closes #37877 from dongjoon-hyun/SPARK-40423. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 12e48527846d993a78b159fbba3e900a4feb7b55) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 September 2022, 16:28:15 UTC
82351be [SPARK-39915][SQL] Ensure the output partitioning is user-specified in AQE ### What changes were proposed in this pull request? - Support getting the user-specified root repartition through `DeserializeToObjectExec` - Skip the optimize-empty rule for the root repartition when it is user-specified - Add a new rule `AdjustShuffleExchangePosition` to adjust the shuffle we add back, so that we can restore the shuffle safely. ### Why are the changes needed? AQE cannot completely respect the user-specified repartition. The main reasons are: 1. the AQE optimizer will convert an empty relation to a local relation, which does not preserve the partitioning info 2. the AQE `requiredDistribution` mechanism only restores the repartition and does not support restoring it through `DeserializeToObjectExec` After the fix: The partition number of `spark.range(0).repartition(5).rdd.getNumPartitions` should be 5. ### Does this PR introduce _any_ user-facing change? Yes, it ensures the user-specified distribution. ### How was this patch tested? Added tests. Closes #37612 from ulysses-you/output-partition. Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 801ca252f43b20cdd629c01d734ca9049e6eccf4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 September 2022, 00:43:17 UTC
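The user-visible check mentioned in the message, written out as a runnable sketch (assuming a `SparkSession` named `spark` with AQE enabled):

```scala
// With AQE on, a user-specified repartition should still drive the number of
// output partitions even when the relation is empty.
spark.conf.set("spark.sql.adaptive.enabled", "true")

val numParts = spark.range(0).repartition(5).rdd.getNumPartitions
println(numParts)  // expected to be 5 with this fix, not collapsed by the empty-relation optimization
```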
e7c9d1a Preparing development version 3.3.2-SNAPSHOT 14 September 2022, 00:21:03 UTC
ea1a426 Preparing Spark release v3.3.1-rc1 14 September 2022, 00:20:48 UTC
d70f156 [SPARK-40417][K8S][DOCS] Use YuniKorn v1.1+ ### What changes were proposed in this pull request? This PR aims to update K8s document to declare the support of YuniKorn v1.1.+ ### Why are the changes needed? YuniKorn 1.1.0 has 87 JIRAs and is the first version to support multi-arch officially. - https://yunikorn.apache.org/release-announce/1.1.0 ``` $ docker inspect apache/yunikorn:scheduler-1.0.0 | grep Architecture "Architecture": "amd64", $ docker inspect apache/yunikorn:scheduler-1.1.0 | grep Architecture "Architecture": "arm64", ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested with Apache YuniKorn v1.1.0+. ``` $ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests \ -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/test" \ -Dtest.exclude.tags=minikube,local,decom \ -Dtest.default.exclude.tags= ... [info] KubernetesSuite: [info] - Run SparkPi with no resources (11 seconds, 238 milliseconds) [info] - Run SparkPi with no resources & statefulset allocation (11 seconds, 58 milliseconds) [info] - Run SparkPi with a very long application name. (9 seconds, 948 milliseconds) [info] - Use SparkLauncher.NO_RESOURCE (9 seconds, 884 milliseconds) [info] - Run SparkPi with a master URL without a scheme. (9 seconds, 834 milliseconds) [info] - Run SparkPi with an argument. (9 seconds, 870 milliseconds) [info] - Run SparkPi with custom labels, annotations, and environment variables. (9 seconds, 887 milliseconds) [info] - All pods have the same service account by default (9 seconds, 891 milliseconds) [info] - Run extraJVMOptions check on driver (5 seconds, 888 milliseconds) [info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (10 seconds, 261 milliseconds) [info] - Run SparkPi with env and mount secrets. (18 seconds, 702 milliseconds) [info] - Run PySpark on simple pi.py example (10 seconds, 944 milliseconds) [info] - Run PySpark to test a pyfiles example (13 seconds, 934 milliseconds) [info] - Run PySpark with memory customization (10 seconds, 853 milliseconds) [info] - Run in client mode. (11 seconds, 301 milliseconds) [info] - Start pod creation from template (9 seconds, 853 milliseconds) [info] - SPARK-38398: Schedule pod creation from template (9 seconds, 923 milliseconds) [info] - Run SparkR on simple dataframe.R example (13 seconds, 929 milliseconds) [info] YuniKornSuite: [info] - Run SparkPi with no resources (9 seconds, 769 milliseconds) [info] - Run SparkPi with no resources & statefulset allocation (9 seconds, 776 milliseconds) [info] - Run SparkPi with a very long application name. (9 seconds, 856 milliseconds) [info] - Use SparkLauncher.NO_RESOURCE (9 seconds, 803 milliseconds) [info] - Run SparkPi with a master URL without a scheme. (10 seconds, 783 milliseconds) [info] - Run SparkPi with an argument. (10 seconds, 771 milliseconds) [info] - Run SparkPi with custom labels, annotations, and environment variables. (9 seconds, 868 milliseconds) [info] - All pods have the same service account by default (10 seconds, 811 milliseconds) [info] - Run extraJVMOptions check on driver (6 seconds, 858 milliseconds) [info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (11 seconds, 171 milliseconds) [info] - Run SparkPi with env and mount secrets. 
(18 seconds, 221 milliseconds) [info] - Run PySpark on simple pi.py example (11 seconds, 970 milliseconds) [info] - Run PySpark to test a pyfiles example (13 seconds, 990 milliseconds) [info] - Run PySpark with memory customization (11 seconds, 992 milliseconds) [info] - Run in client mode. (11 seconds, 294 milliseconds) [info] - Start pod creation from template (11 seconds, 10 milliseconds) [info] - SPARK-38398: Schedule pod creation from template (9 seconds, 956 milliseconds) [info] - Run SparkR on simple dataframe.R example (12 seconds, 992 milliseconds) [info] Run completed in 10 minutes, 15 seconds. [info] Total number of tests run: 36 [info] Suites: completed 2, aborted 0 [info] Tests: succeeded 36, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 751 s (12:31), completed Sep 13, 2022, 11:47:24 AM ``` Closes #37872 from dongjoon-hyun/SPARK-40417. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a934cabd24afa5c8f6e8e1d2341829166129a5c8) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 September 2022, 19:52:33 UTC
1741aab [SPARK-40362][SQL][3.3] Fix BinaryComparison canonicalization ### What changes were proposed in this pull request? Change canonicalization to a one-pass process and move logic from `Canonicalize.reorderCommutativeOperators` to the respective commutative operators' `canonicalize`. ### Why are the changes needed? https://github.com/apache/spark/pull/34883 improved expression canonicalization performance but introduced a regression when a commutative operator is under a `BinaryComparison`. This is because reordering children by their hashcode can't happen in the `preCanonicalized` phase while the children are not yet "final". ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new UT. Closes #37866 from peter-toth/SPARK-40362-fix-binarycomparison-canonicalization-3.3. Lead-authored-by: Peter Toth <ptoth@cloudera.com> Co-authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 September 2022, 19:38:02 UTC
0a180c0 [SPARK-40292][SQL] Fix column names in "arrays_zip" function when arrays are referenced from nested structs ### What changes were proposed in this pull request? This PR fixes an issue in `arrays_zip` function where a field index was used as a column name in the resulting schema which was a regression from Spark 3.1. With this change, the original behaviour is restored: a corresponding struct field name will be used instead of a field index. Example: ```sql with q as ( select named_struct( 'my_array', array(1, 2, 3), 'my_array2', array(4, 5, 6) ) as my_struct ) select arrays_zip(my_struct.my_array, my_struct.my_array2) from q ``` would return schema: ``` root |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false) | |-- element: struct (containsNull = false) | | |-- 0: integer (nullable = true) | | |-- 1: integer (nullable = true) ``` which is somewhat inaccurate. PR adds handling of `GetStructField` expression to return the struct field names like this: ``` root |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false) | |-- element: struct (containsNull = false) | | |-- my_array: integer (nullable = true) | | |-- my_array2: integer (nullable = true) ``` ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? Yes, `arrays_zip` function returns struct field names now as in Spark 3.1 instead of field indices. Some users might have worked around this issue so this patch would affect them by bringing back the original behaviour. ### How was this patch tested? Existing unit tests. I also added a test case that reproduces the problem. Closes #37833 from sadikovi/SPARK-40292. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 443eea97578c41870c343cdb88cf69bfdf27033a) Signed-off-by: Max Gekk <max.gekk@gmail.com> 12 September 2022, 04:33:50 UTC
052d60c [SPARK-40228][SQL][3.3] Do not simplify multiLike if child is not a cheap expression This PR backports https://github.com/apache/spark/pull/37672 to branch-3.3. The original PR's description: ### What changes were proposed in this pull request? Do not simplify multiLike if the child is not a cheap expression. ### Why are the changes needed? 1. Simplifying multiLike in these cases does not benefit the query because it cannot be pushed down. 2. It reduces the number of evaluations for these expressions. For example: ```sql select * from t1 where substr(name, 1, 5) like any('%a', 'b%', '%c%'); ``` ``` == Physical Plan == *(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) OR Contains(substr(name#0, 1, 5), c)) +- *(1) ColumnarToRow +- FileScan parquet default.t1[name#0] Batched: true, DataFilters: [((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) OR Contains(substr(n..., Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:string> ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #37813 from wangyum/SPARK-40228-branch-3.3. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 September 2022, 23:48:34 UTC
18fc8e8 [SPARK-39915][SQL][3.3] Dataset.repartition(N) may not create N partitions Non-AQE part ### What changes were proposed in this pull request? Backport of https://github.com/apache/spark/pull/37706 for branch-3.3: skip optimizing away the root user-specified repartition in `PropagateEmptyRelation`. ### Why are the changes needed? Spark should preserve the final repartition, which can affect the user-specified final output partitioning. For example: ```scala spark.sql("select * from values(1) where 1 < rand()").repartition(1) // before: == Optimized Logical Plan == LocalTableScan <empty>, [col1#0] // after: == Optimized Logical Plan == Repartition 1, true +- LocalRelation <empty>, [col1#0] ``` ### Does this PR introduce _any_ user-facing change? Yes, the optimized plan for an empty relation may change. ### How was this patch tested? Added a test. Closes #37730 from ulysses-you/empty-3.3. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 September 2022, 21:43:19 UTC
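A runnable sketch of the non-AQE behaviour being preserved (assuming a `SparkSession` named `spark`); the query is the one from the description:

```scala
// Even when the filter makes the input empty, the user-specified repartition
// should survive optimization and drive the output partitioning.
val df = spark.sql("SELECT * FROM VALUES (1) WHERE 1 < rand()").repartition(1)

df.explain(true)                  // the Repartition node is expected to remain in the optimized plan
println(df.rdd.getNumPartitions)  // expected: 1
```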
4f69c98 [SPARK-39830][SQL][TESTS][3.3] Add a test case to read ORC table that requires type promotion ### What changes were proposed in this pull request? Increase ORC test coverage. [ORC-1205](https://issues.apache.org/jira/browse/ORC-1205) Size of batches in some ConvertTreeReaders should be ensured before using ### Why are the changes needed? When spark reads an orc with type promotion, an `ArrayIndexOutOfBoundsException` may be thrown, which has been fixed in version 1.7.6 and 1.8.0. ```java java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? add UT Closes #37808 from cxzl25/SPARK-39830-3.3. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 September 2022, 21:36:39 UTC
aaa8292 [SPARK-40389][SQL][FOLLOWUP][3.3] Fix a test failure in SQLQuerySuite ### What changes were proposed in this pull request? Fix a test failure in SQLQuerySuite on branch-3.3. It's from the backport of https://github.com/apache/spark/pull/37832 since there is no error class "CAST_OVERFLOW" on branch-3.3 ### Why are the changes needed? Fix test failure ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes #37848 from gengliangwang/fixTest. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 September 2022, 18:57:00 UTC
b18d582 [SPARK-40280][SQL][FOLLOWUP][3.3] Fix 'ParquetFilterSuite' issue ### What changes were proposed in this pull request? ### Why are the changes needed? Fix 'ParquetFilterSuite' issue after merging #37747 : The `org.apache.parquet.filter2.predicate.Operators.In` was added in the parquet 1.12.3, but spark branch-3.3 uses the parquet 1.12.2. Use `Operators.And` instead of `Operators.In`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #37847 from zzcclp/SPARK-40280-hotfix-3.3. Authored-by: Zhichao Zhang <zhangzc@apache.org> Signed-off-by: huaxingao <huaxin_gao@apple.com> 09 September 2022, 18:31:16 UTC
cd9f564 [SPARK-40389][SQL] Decimals can't upcast as integral types if the cast can overflow ### What changes were proposed in this pull request? In Spark SQL, the method `canUpCast` returns true iff we can safely up-cast the `from` type to the `to` type without truncation, precision loss, or possible runtime failures. Meanwhile, DecimalType(10, 0) is considered as `canUpCast` to Integer type. This is wrong since casting `9000000000BD` as Integer type will overflow. As a result: * The optimizer rule `SimplifyCasts` relies on the method `canUpCast` and it will mistakenly convert `cast(cast(9000000000BD as int) as long)` to `cast(9000000000BD as long)` * The STRICT store assignment policy relies on this method too. With the policy enabled, inserting `9000000000BD` into integer columns will pass the compile-time check and insert an unexpected value `410065408`. * etc... ### Why are the changes needed? Bug fix on the method `Cast.canUpCast` ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug in the check of whether a decimal can safely be cast to integral types. ### How was this patch tested? New UT Closes #37832 from gengliangwang/SPARK-40389. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 17982519a749bd4ca2aa7eca12fba00ccc1520aa) Signed-off-by: Gengliang Wang <gengliang@apache.org> 08 September 2022, 20:23:52 UTC
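A minimal sketch (assuming a `SparkSession` named `spark` in default, non-ANSI mode) of why DecimalType(10, 0) must not be treated as safely up-castable to Integer:

```scala
// 9000000000 fits in decimal(10, 0) but overflows a 32-bit integer.
val df = spark.sql("SELECT CAST(CAST(9000000000BD AS INT) AS BIGINT) AS v")

df.explain(true)  // before the fix, SimplifyCasts could collapse this into CAST(9000000000BD AS BIGINT)
df.show()         // the inner, overflowing cast must not be optimized away
```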
0cdb081 [SPARK-40280][SQL] Add support for parquet push down for annotated int and long ### What changes were proposed in this pull request? This fixes SPARK-40280 by normalizing a parquet int/long that has optional metadata with it to look like the expected version that does not have the extra metadata. ### Why are the changes needed? This allows predicate push down in parquet to work when reading files that are compliant with the parquet specification, but different from what Spark writes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I added unit tests that cover this use case. I also did some manual testing on some queries to verify that less data is actually read after this change. Closes #37747 from revans2/normalize_int_long_parquet_push. Authored-by: Robert (Bobby) Evans <bobby@apache.org> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit 24b3baf0177fc1446bf59bb34987296aefd4b318) Signed-off-by: Thomas Graves <tgraves@apache.org> 08 September 2022, 13:55:48 UTC
10cd3ac [SPARK-40380][SQL] Fix constant-folding of InvokeLike to avoid non-serializable literal embedded in the plan ### What changes were proposed in this pull request? Block `InvokeLike` expressions with `ObjectType` result from constant-folding, to ensure constant-folded results are trusted to be serializable. This is a conservative fix for ease of backport to Spark 3.3. A separate future change can relax the restriction and support constant-folding to serializable `ObjectType` as well. ### Why are the changes needed? This fixes a regression introduced by https://github.com/apache/spark/pull/35207 . It enabled the constant-folding logic to aggressively fold `InvokeLike` expressions (e.g. `Invoke`, `StaticInvoke`), when all arguments are foldable and the expression itself is deterministic. But it could go overly aggressive and constant-fold to non-serializable results, which would be problematic when that result needs to be serialized and sent over the wire. In the wild, users of sparksql-scalapb have hit this issue. The constant folding logic would fold a chain of `Invoke` / `StaticInvoke` expressions from only holding onto a serializable literal to holding onto a non-serializable literal: ``` Literal(com.example.protos.demo.Person$...).scalaDescriptor.findFieldByNumber.get ``` this expression works fine before constant-folding because the literal that gets sent to the executors is serializable, but when aggressive constant-folding kicks in it ends up as a `Literal(scalapb.descriptors.FieldDescriptor...)` which isn't serializable. The following minimal repro demonstrates this issue: ``` import org.apache.spark.sql.Column import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute import org.apache.spark.sql.catalyst.expressions.Literal import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke} import org.apache.spark.sql.types.{LongType, ObjectType} class NotSerializableBoxedLong(longVal: Long) { def add(other: Long): Long = longVal + other } case class SerializableBoxedLong(longVal: Long) { def toNotSerializable(): NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal) } val litExpr = Literal.fromObject(SerializableBoxedLong(42L), ObjectType(classOf[SerializableBoxedLong])) val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", ObjectType(classOf[NotSerializableBoxedLong])) val addExpr = Invoke(toNotSerializableExpr, "add", LongType, Seq(UnresolvedAttribute.quotedString("id"))) val df = spark.range(2).select(new Column(addExpr)) df.collect ``` would result in an error if aggressive constant-folding kicked in: ``` ... 
Caused by: java.io.NotSerializableException: NotSerializableBoxedLong Serialization stack: - object not serializable (class: NotSerializableBoxedLong, value: NotSerializableBoxedLong71231636) - element of array (index: 1) - array (class [Ljava.lang.Object;, size 2) - element of array (index: 1) - array (class [Ljava.lang.Object;, size 3) - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;) - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=3]) - writeReplace data (class: java.lang.invoke.SerializedLambda) - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389185db22c) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441) ``` ### Does this PR introduce _any_ user-facing change? Yes, a regression in ObjectType expression starting from Spark 3.3.0 is fixed. ### How was this patch tested? The existing test cases in `ConstantFoldingSuite` continues to pass; added a new test case to demonstrate the regression issue. Closes #37823 from rednaxelafx/fix-invokelike-constantfold. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5b96e82ad6a4f5d5e4034d9d7112077159cf5044) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 September 2022, 13:21:15 UTC
81cb08b add back a mistakenly removed test case 07 September 2022, 11:29:39 UTC
433469f [SPARK-40149][SQL] Propagate metadata columns through Project This PR fixes a regression caused by https://github.com/apache/spark/pull/32017 . In https://github.com/apache/spark/pull/32017 , we tried to be more conservative and decided to not propagate metadata columns in certain operators, including `Project`. However, the decision was made only considering SQL API, not DataFrame API. In fact, it's very common to chain `Project` operators in DataFrame, e.g. `df.withColumn(...).withColumn(...)...`, and it's very inconvenient if metadata columns are not propagated through `Project`. This PR makes 2 changes: 1. Project should propagate metadata columns 2. SubqueryAlias should only propagate metadata columns if the child is a leaf node or also a SubqueryAlias The second change is needed to still forbid weird queries like `SELECT m from (SELECT a from t)`, which is the main motivation of https://github.com/apache/spark/pull/32017 . After propagating metadata columns, a problem from https://github.com/apache/spark/pull/31666 is exposed: the natural join metadata columns may confuse the analyzer and lead to wrong analyzed plan. For example, `SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY key`, how shall we resolve `ORDER BY key`? It should be resolved to `t1.key` via the rule `ResolveMissingReferences`, which is in the output of the left join. However, if `Project` can propagate metadata columns, `ORDER BY key` will be resolved to `t2.key`. To solve this problem, this PR only allows qualified access for metadata columns of natural join. This has no breaking change, as people can only do qualified access for natural join metadata columns before, in the `Project` right after `Join`. This actually enables more use cases, as people can now access natural join metadata columns in ORDER BY. I've added a test for it. fix a regression For SQL API, there is no change, as a `SubqueryAlias` always comes with a `Project` or `Aggregate`, so we still don't propagate metadata columns through a SELECT group. For DataFrame API, the behavior becomes more lenient. The only breaking case is an operator that can propagate metadata columns then follows a `SubqueryAlias`, e.g. `df.filter(...).as("t").select("t.metadata_col")`. But this is a weird use case and I don't think we should support it at the first place. new tests Closes #37758 from cloud-fan/metadata. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 99ae1d9a897909990881f14c5ea70a0d1a0bf456) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 September 2022, 11:03:52 UTC
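A sketch of the DataFrame pattern this change unblocks (assuming a `SparkSession` named `spark` and a writable `/tmp` scratch path; the path is made up). The `_metadata` column below is the file-source metadata column:

```scala
import spark.implicits._

val path = "/tmp/spark-metadata-demo"   // assumed scratch location
spark.range(3).write.mode("overwrite").parquet(path)

spark.read.parquet(path)
  .withColumn("doubled", $"id" * 2)                    // each withColumn adds a Project on top of the scan
  .select($"id", $"doubled", $"_metadata.file_path")   // metadata column expected to stay reachable with this fix
  .show(truncate = false)
```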
84c0918 [SPARK-40315][SQL] Add hashCode() for Literal of ArrayBasedMapData ### What changes were proposed in this pull request? There is no explicit `hashCode()` function override for `ArrayBasedMapData`. As a result, there is a non-deterministic error where the `hashCode()` computed for `Literal`s of `ArrayBasedMapData` can be different for two equal objects (`Literal`s of `ArrayBasedMapData` with equal keys and values). In this PR, we add a `hashCode` function so that it works exactly as we expect. ### Why are the changes needed? This is a bug fix for a non-deterministic error. It is also more consistent with the rest of Spark if we implement the `hashCode` method instead of relying on defaults. We can't add the `hashCode` directly to `ArrayBasedMapData` because of SPARK-9415. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? A simple unit test was added. Closes #37807 from c27kwan/SPARK-40315-lit. Authored-by: Carmen Kwan <carmen.kwan@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e85a4ffbdfa063c8da91b23dfbde77e2f9ed62e9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 September 2022, 13:11:59 UTC
1324f7d [SPARK-40297][SQL] CTE outer reference nested in CTE main body cannot be resolved This PR fixes a bug where a CTE reference cannot be resolved if this reference occurs in an inner CTE definition nested in the outer CTE's main body FROM clause. E.g., ``` WITH cte_outer AS ( SELECT 1 ) SELECT * FROM ( WITH cte_inner AS ( SELECT * FROM cte_outer ) SELECT * FROM cte_inner ) ``` This fix is to change the `CTESubstitution`'s traverse order from `resolveOperatorsUpWithPruning` to `resolveOperatorsDownWithPruning` and also to recursively call `traverseAndSubstituteCTE` for CTE main body. Bug fix. Without the fix an `AnalysisException` would be thrown for CTE queries mentioned above. No. Added UTs. Closes #37751 from maryannxue/spark-40297. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 September 2022, 04:41:37 UTC
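The query shape from the description, wrapped as a runnable sketch (assuming a `SparkSession` named `spark`):

```scala
// An inner CTE defined inside the outer CTE's main-body FROM clause references the outer CTE.
spark.sql(
  """WITH cte_outer AS (SELECT 1)
    |SELECT * FROM (
    |  WITH cte_inner AS (SELECT * FROM cte_outer)
    |  SELECT * FROM cte_inner
    |)""".stripMargin).show()
// Before the fix this shape could fail analysis because cte_outer was not visible to cte_inner.
```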
9473840 [SPARK-38404][SQL][3.3] Improve CTE resolution when a nested CTE references an outer CTE ### What changes were proposed in this pull request? Please note that the bug in the [SPARK-38404](https://issues.apache.org/jira/browse/SPARK-38404) is fixed already with https://github.com/apache/spark/pull/34929. This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions. ### Why are the changes needed? Small improvement + additional tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new test case. Closes #37760 from peter-toth/SPARK-38404-nested-cte-references-outer-cte-3.3. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 September 2022, 04:34:39 UTC
b066561 [SPARK-40326][BUILD] Upgrade `fasterxml.jackson.version` to 2.13.4 upgrade `com.fasterxml.jackson.dataformat:jackson-dataformat-yaml` and `fasterxml.jackson.databind.version` from 2.13.3 to 2.13.4 [CVE-2022-25857](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857) [SNYK-JAVA-ORGYAML](https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360) No. Pass GA Closes #37796 from bjornjorgensen/upgrade-fasterxml.jackson-to-2.13.4. Authored-by: Bjørn <bjornjorgensen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit a82a006df80ac3aa6900d8688eb5bf77b804785d) Signed-off-by: Sean Owen <srowen@gmail.com> 06 September 2022, 00:55:21 UTC
284954a Revert "[SPARK-33861][SQL] Simplify conditional in predicate" This reverts commit 32d4a2b and 3aa4e11. Closes #37729 from wangyum/SPARK-33861. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 43cbdc6ec9dbcf9ebe0b48e14852cec4af18b4ec) Signed-off-by: Yuming Wang <yumwang@ebay.com> 03 September 2022, 08:14:43 UTC
399c397 [SPARK-40304][K8S][TESTS] Add `decomTestTag` to K8s Integration Test ### What changes were proposed in this pull request? This PR aims to add a new test tag, `decomTestTag`, to K8s Integration Test. ### Why are the changes needed? Decommission-related tests took over 6 minutes (`363s`). It would be helpful we can run them selectively. ``` [info] - Test basic decommissioning (44 seconds, 51 milliseconds) [info] - Test basic decommissioning with shuffle cleanup (44 seconds, 450 milliseconds) [info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 43 seconds) [info] - Test decommissioning timeouts (44 seconds, 389 milliseconds) [info] - SPARK-37576: Rolling decommissioning (1 minute, 8 seconds) ``` ### Does this PR introduce _any_ user-facing change? No, this is a test-only change. ### How was this patch tested? Pass the CIs and test manually. ``` $ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests \ -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/test" \ -Dtest.exclude.tags=minikube,local,decom ... [info] KubernetesSuite: [info] - Run SparkPi with no resources (12 seconds, 441 milliseconds) [info] - Run SparkPi with no resources & statefulset allocation (11 seconds, 949 milliseconds) [info] - Run SparkPi with a very long application name. (11 seconds, 999 milliseconds) [info] - Use SparkLauncher.NO_RESOURCE (11 seconds, 846 milliseconds) [info] - Run SparkPi with a master URL without a scheme. (11 seconds, 176 milliseconds) [info] - Run SparkPi with an argument. (11 seconds, 868 milliseconds) [info] - Run SparkPi with custom labels, annotations, and environment variables. (11 seconds, 858 milliseconds) [info] - All pods have the same service account by default (11 seconds, 5 milliseconds) [info] - Run extraJVMOptions check on driver (5 seconds, 757 milliseconds) [info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (12 seconds, 467 milliseconds) [info] - Run SparkPi with env and mount secrets. (21 seconds, 119 milliseconds) [info] - Run PySpark on simple pi.py example (13 seconds, 129 milliseconds) [info] - Run PySpark to test a pyfiles example (14 seconds, 937 milliseconds) [info] - Run PySpark with memory customization (12 seconds, 195 milliseconds) [info] - Run in client mode. (11 seconds, 343 milliseconds) [info] - Start pod creation from template (11 seconds, 975 milliseconds) [info] - SPARK-38398: Schedule pod creation from template (11 seconds, 901 milliseconds) [info] - Run SparkR on simple dataframe.R example (14 seconds, 305 milliseconds) ... ``` Closes #37755 from dongjoon-hyun/SPARK-40304. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit fd0498f81df72c196f19a5b26053660f6f3f4d70) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 September 2022, 16:35:06 UTC
7c19df6 [SPARK-40302][K8S][TESTS] Add `YuniKornSuite` This PR aims at the following: 1. Add the `YuniKornSuite` integration test suite, which extends `KubernetesSuite`, for the Apache YuniKorn scheduler. 2. Support the `--default-exclude-tags` option to override `test.default.exclude.tags`. To improve test coverage. No. This is a test suite addition. Since this requires an `Apache YuniKorn` installation, the test suite is disabled by default. So, the CI K8s integration test should pass without running this suite. In order to run the tests, we need to override `test.default.exclude.tags` like the following. **SBT** ``` $ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests \ -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/test" \ -Dtest.exclude.tags=minikube,local \ -Dtest.default.exclude.tags= ``` **MAVEN** ``` $ dev/dev-run-integration-tests.sh --deploy-mode docker-desktop \ --exclude-tag minikube,local \ --default-exclude-tags '' ``` Closes #37753 from dongjoon-hyun/SPARK-40302. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b2e38e16bfc547a62957e0a67085985b3c65d525) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 September 2022, 16:27:33 UTC
03556ca [SPARK-40187][DOCS] Add `Apache YuniKorn` scheduler docs ### What changes were proposed in this pull request? Add a section under [customized-kubernetes-schedulers-for-spark-on-kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#customized-kubernetes-schedulers-for-spark-on-kubernetes) to explain how to run Spark with Apache YuniKorn. This is based on the review comments from #35663. ### Why are the changes needed? Explain how to run Spark with Apache YuniKorn ### Does this PR introduce _any_ user-facing change? No Closes #37622 from yangwwei/SPARK-40187. Authored-by: Weiwei Yang <wwei@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 4b1877398410fb23a285ed0d2c6b34711f52fc43) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 September 2022, 05:26:53 UTC
c04aa36 [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed ### Why are the changes needed? This PR aims to fix the case ```scala sql("create table t1(a decimal(3, 0)) using parquet") sql("insert into t1 values(100), (10), (1)") sql("select * from t1 where a in(100000, 1.00)").show ``` ``` java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken. at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325) ``` 1. the rule `UnwrapCastInBinaryComparison` transforms the expression `In` to Equals ``` CAST(a as decimal(12,2)) IN (100000.00,1.00) OR( CAST(a as decimal(12,2)) = 100000.00, CAST(a as decimal(12,2)) = 1.00 ) ``` 2. using `UnwrapCastInBinaryComparison.unwrapCast()` to optimize each `EqualTo` ``` // Expression1 CAST(a as decimal(12,2)) = 100000.00 => CAST(a as decimal(12,2)) = 100000.00 // Expression2 CAST(a as decimal(12,2)) = 1.00 => a = 1 ``` 3. return the new unwrapped cast expression `In` ``` a IN (100000.00, 1.00) ``` Before this PR: the method `UnwrapCastInBinaryComparison.unwrapCast()` returns the original expression when downcasting to a decimal type fails (the `Expression1`),returns the original expression if the downcast to the decimal type succeeds (the `Expression2`), the two expressions have different data type which would break the structural integrity ``` a IN (100000.00, 1.00) | | decimal(12, 2) | decimal(3, 0) ``` After this PR: the PR transform the downcasting failed expression to `falseIfNotNull(fromExp)` ``` ((isnull(a) AND null) OR a IN (1.00) ``` ### Does this PR introduce _any_ user-facing change? No, only bug fix. ### How was this patch tested? Unit test. Closes #37439 from cfmcgrady/SPARK-39896. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6e62b93f3d1ef7e2d6be0a3bb729ab9b2d55a36d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 August 2022, 05:32:34 UTC
e46d2e2 [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style This PR makes the `compute.max_rows` option work as `None` in `DataFrame.style` as expected, instead of throwing an exception, by collecting it all to a pandas DataFrame. To make the configuration work as expected. Yes. ```python import pyspark.pandas as ps ps.set_option("compute.max_rows", None) ps.get_option("compute.max_rows") ps.range(1).style ``` **Before:** ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style pdf = self.head(max_results + 1)._to_internal_pandas() TypeError: unsupported operand type(s) for +: 'NoneType' and 'int' ``` **After:** ``` <pandas.io.formats.style.Styler object at 0x7fdf78250430> ``` Manually tested, and a unit test was added. Closes #37718 from HyukjinKwon/SPARK-40270. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 0f0e8cc26b6c80cc179368e3009d4d6c88103a64) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 30 August 2022, 07:26:24 UTC
c9710c5 [SPARK-40152][SQL][TESTS][FOLLOW-UP][3.3] Capture a different error message in 3.3 ### What changes were proposed in this pull request? This PR fixes the error message in branch-3.3. Different error message is thrown at the test added in https://github.com/apache/spark/commit/4b0c3bab1ab082565a051990bf45774f15962deb. ### Why are the changes needed? `branch-3.3` is broken because of the different error message being thrown (https://github.com/apache/spark/runs/8065373173?check_suite_focus=true). ``` [info] - elementAt *** FAILED *** (996 milliseconds) [info] (non-codegen mode) Expected error message is `The index 0 is invalid`, but `SQL array indices start at 1` found (ExpressionEvalHelper.scala:176) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563) [info] at org.scalatest.Assertions.fail(Assertions.scala:933) [info] at org.scalatest.Assertions.fail$(Assertions.scala:929) [info] at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1563) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.$anonfun$checkExceptionInExpression$1(ExpressionEvalHelper.scala:176) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) [info] at org.scalatest.Assertions.withClue(Assertions.scala:1065) [info] at org.scalatest.Assertions.withClue$(Assertions.scala:1052) [info] at org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1563) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkException$1(ExpressionEvalHelper.scala:163) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression(ExpressionEvalHelper.scala:183) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression$(ExpressionEvalHelper.scala:156) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkExceptionInExpression(CollectionExpressionsSuite.scala:39) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression(ExpressionEvalHelper.scala:153) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression$(ExpressionEvalHelper.scala:150) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkExceptionInExpression(CollectionExpressionsSuite.scala:39) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.$anonfun$new$365(CollectionExpressionsSuite.scala:1555) [info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) [info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) ``` ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? CI in this PR should test it out. Closes #37708 from HyukjinKwon/SPARK-40152-3.3. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Yuming Wang <yumwang@ebay.com> 29 August 2022, 14:39:29 UTC
60bd91f [SPARK-40247][SQL] Fix BitSet equality check ### What changes were proposed in this pull request? Spark's `BitSet` doesn't implement `equals()` and `hashCode()` but it is used in `FileSourceScanExec` for bucket pruning. ### Why are the changes needed? Without proper equality check reuse issues can occur. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new UT. Closes #37696 from peter-toth/SPARK-40247-fix-bitset-equals. Authored-by: Peter Toth <ptoth@cloudera.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 527ddece8fdbe703dcd239401c97ddb2c6122182) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 August 2022, 07:26:17 UTC
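A standalone sketch of the idea behind the fix, not Spark's internal `BitSet` class: a bit set backed by a `Long` array needs content-based `equals`/`hashCode`, otherwise two sets with identical bits compare unequal and any reuse logic keyed on them breaks. The class and names below are illustrative only:

```scala
class SimpleBitSet(numBits: Int) {
  private val words = new Array[Long]((numBits + 63) / 64)

  def set(i: Int): Unit = words(i >> 6) |= (1L << (i & 63))
  def get(i: Int): Boolean = (words(i >> 6) & (1L << (i & 63))) != 0

  // Content-based equality: compare the underlying words, not object identity.
  override def equals(other: Any): Boolean = other match {
    case that: SimpleBitSet => java.util.Arrays.equals(this.words, that.words)
    case _ => false
  }
  override def hashCode(): Int = java.util.Arrays.hashCode(words)
}

val a = new SimpleBitSet(128); a.set(3); a.set(70)
val b = new SimpleBitSet(128); b.set(3); b.set(70)
assert(a == b && a.hashCode == b.hashCode)  // holds only with the explicit overrides above
```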
e3f6b6d [SPARK-40212][SQL] SparkSQL castPartValue does not properly handle byte, short, or float The `castPartValueToDesiredType` function now returns byte for ByteType and short for ShortType, rather than ints; also floats for FloatType rather than double. Previously, attempting to read back in a file partitioned on one of these column types would result in a ClassCastException at runtime (for Byte, `java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Byte`). I can't think this is anything but a bug, as returning the correct data type prevents the crash. Yes: it changes the observed behavior when reading in a byte/short/float-partitioned file. Added unit test. Without the `castPartValueToDesiredType` updates, the test fails with the stated exception. === I'll note that I'm not familiar enough with the spark repo to know if this will have ripple effects elsewhere, but tests pass on my fork and since the very similar https://github.com/apache/spark/pull/36344/files only needed to touch these two files I expect this change is self-contained as well. Closes #37659 from BrennanStein/spark40212. Authored-by: Brennan Stein <brennan.stein@ekata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 146f187342140635b83bfe775b6c327755edfbe1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 August 2022, 01:57:31 UTC
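A round-trip sketch of the scenario being fixed (assuming a `SparkSession` named `spark` and a writable `/tmp` scratch path; the path and column names are made up):

```scala
import org.apache.spark.sql.types._
import spark.implicits._

val path = "/tmp/spark-byte-partition-demo"   // assumed scratch location
Seq((1.toByte, "a"), (2.toByte, "b"))
  .toDF("p", "v")
  .write.mode("overwrite").partitionBy("p").parquet(path)

// Partition values are parsed back from directory names; before the fix they came back as
// Integer and triggered a ClassCastException when the declared schema said ByteType.
spark.read
  .schema(new StructType().add("v", StringType).add("p", ByteType))
  .parquet(path)
  .show()
```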
694e4e7 [SPARK-40241][DOCS] Correct the link of GenericUDTF ### What changes were proposed in this pull request? Correct the link ### Why are the changes needed? existing link was wrong ### Does this PR introduce _any_ user-facing change? yes, a link was updated ### How was this patch tested? Manually check Closes #37685 from zhengruifeng/doc_fix_udtf. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 8ffcecb68fafd0466e839281588aab50cd046b49) Signed-off-by: Yuming Wang <yumwang@ebay.com> 27 August 2022, 10:39:43 UTC
2de364a [SPARK-40152][SQL][TESTS][FOLLOW-UP] Disable ANSI for out of bound test at ElementAt This PR proposes to fix the test to pass with ANSI mode on. Currently `elementAt` test fails when ANSI mode is on: ``` [info] - elementAt *** FAILED *** (309 milliseconds) [info] Exception evaluating element_at(stringsplitsql(11.12.13, .), 10, Some(), true) (ExpressionEvalHelper.scala:205) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563) [info] at org.scalatest.Assertions.fail(Assertions.scala:949) [info] at org.scalatest.Assertions.fail$(Assertions.scala:945) [info] at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1563) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen(ExpressionEvalHelper.scala:205) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen$(ExpressionEvalHelper.scala:199) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkEvaluationWithoutCodegen(CollectionExpressionsSuite.scala:39) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:87) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkEvaluation(CollectionExpressionsSuite.scala:39) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.$anonfun$new$333(CollectionExpressionsSuite.scala:1546) ``` https://github.com/apache/spark/runs/8046961366?check_suite_focus=true To recover the build with ANSI mode. No, test-only. Unittest fixed. Closes #37684 from HyukjinKwon/SPARK-40152. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4b0c3bab1ab082565a051990bf45774f15962deb) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 August 2022, 06:55:07 UTC
167f3ff [SPARK-40152][SQL][TESTS] Move tests from SplitPart to elementAt Move tests from SplitPart to elementAt in CollectionExpressionsSuite. Simplify test. No. N/A. Closes #37637 from wangyum/SPARK-40152-3. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 06997d6eb73f271aede5b159d86d1db80a73b89f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 August 2022, 06:55:02 UTC
784cb0f [SPARK-40218][SQL] GROUPING SETS should preserve the grouping columns ### What changes were proposed in this pull request? This PR fixes a bug caused by https://github.com/apache/spark/pull/32022 . Although we deprecate `GROUP BY ... GROUPING SETS ...`, it should still work if it worked before. https://github.com/apache/spark/pull/32022 made a mistake that it didn't preserve the order of user-specified group by columns. Usually it's not a problem, as `GROUP BY a, b` is no different from `GROUP BY b, a`. However, the `grouping_id(...)` function requires the input to be exactly the same with the group by columns. This PR fixes the problem by preserve the order of user-specified group by columns. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? Yes, now a query that worked before 3.2 can work again. ### How was this patch tested? new test Closes #37655 from cloud-fan/grouping. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1ed592ef28abdb14aa1d8c8a129d6ac3084ffb0c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 August 2022, 07:24:32 UTC
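A runnable sketch of the deprecated-but-still-supported syntax this change keeps working (assuming a `SparkSession` named `spark`; the inline data is made up). `grouping_id(...)` must line up with the user-specified GROUP BY column order:

```scala
spark.sql(
  """SELECT a, b, grouping_id(a, b) AS gid, count(*) AS cnt
    |FROM VALUES (1, 10), (1, 20), (2, 10) AS t(a, b)
    |GROUP BY a, b GROUPING SETS ((a, b), (a), ())""".stripMargin).show()
// Before the fix, the grouping columns could be reordered internally, making
// grouping_id(a, b) disagree with the GROUP BY clause.
```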
d9dc280 [SPARK-40213][SQL] Support ASCII value conversion for Latin-1 characters ### What changes were proposed in this pull request? This PR proposes to support ASCII value conversion for Latin-1 Supplement characters. ### Why are the changes needed? `ascii()` should be the inverse of `chr()`. But for latin-1 char, we get incorrect ascii value. For example: ```sql select ascii('§') -- output: -62, expect: 167 select chr(167) -- output: '§' ``` ### Does this PR introduce _any_ user-facing change? Yes, fixes the incorrect ASCII conversion for Latin-1 Supplement characters ### How was this patch tested? UT Closes #37651 from linhongliu-db/SPARK-40213. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c07852380471f02955d6d17cddb3150231daa71f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 August 2022, 04:19:00 UTC
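A quick round-trip check of the behaviour described above (a sketch, assuming a `SparkSession` named `spark`):

```scala
// ascii() and chr() should be inverses over the Latin-1 Supplement range after the fix.
spark.sql("SELECT ascii('§') AS a, chr(167) AS c, ascii(chr(167)) AS roundtrip").show()
// expected with the fix: a = 167, c = '§', roundtrip = 167 (previously a came back as -62)
```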
d725d9c [SPARK-40124][SQL][TEST][3.3] Update TPCDS v1.4 q32 for Plan Stability tests ### What changes were proposed in this pull request? This is a port of SPARK-40124 to Spark 3.3. It fixes query 32 for TPCDS v1.4. ### Why are the changes needed? The current q32.sql seems to be wrong: it just selects `1`. Reference for query template: https://github.com/databricks/tpcds-kit/blob/eff5de2c30337b71cc0dc1976147742d2c65d378/query_templates/query32.tpl#L41 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test change only Closes #37615 from mskapilks/change-q32-3.3. Authored-by: Kapil Kumar Singh <kapilsingh@microsoft.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 August 2022, 00:02:36 UTC
6572c66 [SPARK-40172][ML][TESTS] Temporarily disable flaky test cases in ImageFileFormatSuite ### What changes were proposed in this pull request? 3 test cases in ImageFileFormatSuite become flaky in the GitHub action tests: https://github.com/apache/spark/runs/7941765326?check_suite_focus=true https://github.com/gengliangwang/spark/runs/7928658069 Before they are fixed(https://issues.apache.org/jira/browse/SPARK-40171), I suggest disabling them in OSS. ### Why are the changes needed? Disable flaky tests before they are fixed. The test cases keep failing from time to time, while they always pass on local env. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing CI Closes #37605 from gengliangwang/disableFlakyTest. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 50f2f506327b7d51af9fb0ae1316135905d2f87d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 August 2022, 17:55:17 UTC
008b3a3 [SPARK-40152][SQL][TESTS] Add tests for SplitPart ### What changes were proposed in this pull request? Add tests for `SplitPart`. ### Why are the changes needed? Improve test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A. Closes #37626 from wangyum/SPARK-40152-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 4f525eed7d5d461498aee68c4d3e57941f9aae2c) Signed-off-by: Sean Owen <srowen@gmail.com> 23 August 2022, 13:55:34 UTC
db121c3 [SPARK-39184][SQL][FOLLOWUP] Make interpreted and codegen paths for date/timestamp sequences the same ### What changes were proposed in this pull request? Change how the length of the new result array is calculated in `InternalSequenceBase.eval` to match how the same is calculated in the generated code. ### Why are the changes needed? This change brings the interpreted mode code in line with the generated code. Although I am not aware of any case where the current interpreted mode code fails, the generated code is more correct (it handles the case where the result array must grow more than once, whereas the current interpreted mode code does not). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #37542 from bersprockets/date_sequence_array_size_issue_follow_up. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit d718867a16754c62cb8c30a750485f4856481efc) Signed-off-by: Max Gekk <max.gekk@gmail.com> 22 August 2022, 16:17:39 UTC
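A small sketch exercising the date-sequence path touched by this change (assuming a `SparkSession` named `spark`); interpreted and codegen evaluation should agree on the result:

```scala
spark.sql(
  "SELECT sequence(date'2022-01-01', date'2022-06-01', interval 1 month) AS dates"
).show(truncate = false)
```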
e16467c [SPARK-40089][SQL] Fix sorting for some Decimal types ### What changes were proposed in this pull request? This fixes https://issues.apache.org/jira/browse/SPARK-40089 where the prefix can overflow in some cases and the code assumes that the overflow is always on the negative side, not the positive side. ### Why are the changes needed? This adds a check when the overflow does happen to know what is the proper prefix to return. ### Does this PR introduce _any_ user-facing change? No, unless you consider getting the sort order correct a user facing change. ### How was this patch tested? I tested manually with the file in the JIRA and I added a small unit test. Closes #37540 from revans2/fix_dec_sort. Authored-by: Robert (Bobby) Evans <bobby@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 8dfd3dfc115d6e249f00a9a434b866d28e2eae45) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 August 2022, 08:34:01 UTC
233a54d [SPARK-40152][SQL] Fix split_part codegen compilation issue ### What changes were proposed in this pull request? Fix `split_part` codegen compilation issue: ```sql SELECT split_part(str, delimiter, partNum) FROM VALUES ('11.12.13', '.', 3) AS v1(str, delimiter, partNum); ``` ``` org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 1: Expression "project_isNull_0 = false" is not a type ``` ### Why are the changes needed? Fix bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #37589 from wangyum/SPARK-40152. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit cf1a80eeae8bf815270fb39568b1846c2bd8d437) Signed-off-by: Sean Owen <srowen@gmail.com> 21 August 2022, 19:30:17 UTC
7c69614 [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns ### What changes were proposed in this pull request? This PR fixes a correctness issue in Parquet DSv1 FileFormat when the projection does not contain columns referenced in pushed filters. This typically happens when partition columns and data columns overlap. This could result in an empty result when in fact there were records matching the predicate, as can be seen in the provided examples. The problem is especially visible with `count()` and `show()` reporting different results, for example, show() would return 1+ records while count() would return 0. In Parquet, when the predicate is provided and column index is enabled, we would try to filter row ranges to figure out what the count should be. Unfortunately, there is an issue that if the projection is empty or is not in the set of filter columns, any checks on columns would fail and 0 rows are returned (`RowRanges.EMPTY`) even though there is data matching the filter. Note that this is rather a mitigation, a quick fix. The actual fix needs to go into Parquet-MR: https://issues.apache.org/jira/browse/PARQUET-2170. The fix is not required in DSv2 where the overlapping columns are removed in `FileScanBuilder::readDataSchema()`. ### Why are the changes needed? Fixes a correctness issue when projection columns are not referenced by columns in pushed down filters or the schema is empty in Parquet DSv1. Downsides: the Parquet column index would be disabled if it had not been explicitly enabled, which could affect performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a unit test that reproduces this behaviour. The test fails without the fix and passes with the fix. Closes #37419 from sadikovi/SPARK-39833. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit cde71aaf173aadd14dd6393b09e9851b5caad903) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 August 2022, 10:00:00 UTC
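A rough, hedged sanity check for the symptom described above. The path and partition column name are made up; the point is only that a query whose pushed filter references a column missing from the projection could make `count()` and `show()` disagree before the fix.

```python
# Hypothetical dataset partitioned by `part`; the filter only touches the
# partition column, which also appears in the data files (overlapping columns).
df = spark.read.parquet("/tmp/events").where("part = '2022-08-21'")

print(df.count())  # with the bug, this could report 0 ...
df.show(5)         # ... while show() still returned matching rows
```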
88f8ac6 [SPARK-40065][K8S] Mount ConfigMap on executors with non-default profile as well ### What changes were proposed in this pull request? This fixes a bug where the ConfigMap is not mounted on executors if they are under a non-default resource profile. ### Why are the changes needed? When `spark.kubernetes.executor.disableConfigMap` is `false`, the expected behavior is that the ConfigMap is mounted regardless of the executor's resource profile. However, it is not mounted if the resource profile is non-default. ### Does this PR introduce _any_ user-facing change? Executors with a non-default resource profile will now have the ConfigMap mounted (it was missing before) if `spark.kubernetes.executor.disableConfigMap` is `false` or left at its default. If certain users need to keep the old behavior for some reason, they would need to explicitly set `spark.kubernetes.executor.disableConfigMap` to `true`. ### How was this patch tested? A new test case is added just below the existing ConfigMap test case. Closes #37504 from nsuke/SPARK-40065. Authored-by: Aki Sukegawa <nsuke@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 41ca6299eff4155aa3ac28656fe96501a7573fb0) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 August 2022, 19:28:57 UTC
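For context, a minimal sketch of the configuration involved. The key below is the one named in the description; leaving it at its default of `false` is what should now mount the ConfigMap for executors under any resource profile. How the non-default resource profile itself is defined is out of scope here.

```python
from pyspark.sql import SparkSession

# `false` is already the default; with the fix, executors launched for a
# non-default resource profile also get the ConfigMap mounted.
spark = (SparkSession.builder
         .config("spark.kubernetes.executor.disableConfigMap", "false")
         .getOrCreate())
```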
87f957d [SPARK-35542][ML] Fix: Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols cannot be loaded after saving it Signed-off-by: Weichen Xu <weichen.xu@databricks.com> ### What changes were proposed in this pull request? Fix: Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols cannot be loaded after saving it ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #37568 from WeichenXu123/SPARK-35542. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> (cherry picked from commit 876ce6a5df118095de51c3c4789d6db6da95eb23) Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 19 August 2022, 04:26:59 UTC
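A small PySpark ML sketch of the save/load round trip that used to fail for a multi-column Bucketizer; the column names, splits, and output path are made up for illustration.

```python
from pyspark.ml.feature import Bucketizer

bucketizer = Bucketizer(
    splitsArray=[[-float("inf"), 0.5, 1.4, float("inf")],
                 [-float("inf"), 0.1, 1.2, float("inf")]],
    inputCols=["values1", "values2"],
    outputCols=["buckets1", "buckets2"])

# Before the fix, loading a saved multi-column Bucketizer failed.
bucketizer.write().overwrite().save("/tmp/bucketizer-demo")
loaded = Bucketizer.load("/tmp/bucketizer-demo")
```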
bd79706 [SPARK-40134][BUILD] Update ORC to 1.7.6 This PR aims to update ORC to 1.7.6. This will bring the latest changes and bug fixes. https://github.com/apache/orc/releases/tag/v1.7.6 - ORC-1204: ORC MapReduce writer to flush when long arrays - ORC-1205: `nextVector` should invoke `ensureSize` when reusing vectors - ORC-1215: Remove a wrong `NotNull` annotation on `value` of `setAttribute` - ORC-1222: Upgrade `tools.hadoop.version` to 2.10.2 - ORC-1227: Use `Constructor.newInstance` instead of `Class.newInstance` - ORC-1228: Fix `setAttribute` to handle null value No. Pass the CIs. Closes #37563 from williamhyun/ORC-176. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a1a049f01986c15e50a2f76d1fa8538ca3b6307e) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 August 2022, 20:01:03 UTC
e1c5f90 [SPARK-40132][ML] Restore rawPredictionCol to MultilayerPerceptronClassifier.setParams ### What changes were proposed in this pull request? Restore rawPredictionCol to MultilayerPerceptronClassifier.setParams ### Why are the changes needed? This param was inadvertently removed in the refactoring in https://github.com/apache/spark/commit/40cdb6d51c2befcfeac8fb5cf5faf178d1a5ee7b#r81473316 Without it, using this param in the constructor fails. ### Does this PR introduce _any_ user-facing change? Not aside from the bug fix. ### How was this patch tested? Existing tests. Closes #37561 from srowen/SPARK-40132. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 6768d9cc38a320f7e1c6781afcd170577c5c7d0f) Signed-off-by: Sean Owen <srowen@gmail.com> 18 August 2022, 05:24:00 UTC
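A tiny sketch of the constructor call that was affected; with `rawPredictionCol` restored to `setParams`, passing it at construction time should work again (the layer sizes here are arbitrary).

```python
from pyspark.ml.classification import MultilayerPerceptronClassifier

# Passing rawPredictionCol through the constructor previously failed
# because setParams no longer accepted it.
mlp = MultilayerPerceptronClassifier(layers=[4, 5, 3],
                                     rawPredictionCol="rawPrediction")
```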
1a01a49 [SPARK-40121][PYTHON][SQL] Initialize projection used for Python UDF ### What changes were proposed in this pull request? This PR proposes to initialize the projection so non-deterministic expressions can be evaluated with Python UDFs. ### Why are the changes needed? To make the Python UDF working with non-deterministic expressions. ### Does this PR introduce _any_ user-facing change? Yes. ```python from pyspark.sql.functions import udf, rand spark.range(10).select(udf(lambda x: x, "double")(rand())).show() ``` **Before** ``` java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source) at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161) at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176) at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213) ``` **After** ``` +----------------------------------+ |<lambda>rand(-2507211707257730645)| +----------------------------------+ | 0.7691724424045242| | 0.09602244075319044| | 0.3006471278112862| | 0.4182649571961977| | 0.29349096650900974| | 0.7987097908937618| | 0.5324802583101007| | 0.72460930912789| | 0.1367749768412846| | 0.17277322931919348| +----------------------------------+ ``` ### How was this patch tested? Manually tested, and unittest was added. Closes #37552 from HyukjinKwon/SPARK-40121. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 336c9bc535895530cc3983b24e7507229fa9570d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 August 2022, 03:23:14 UTC
9601be9 [SPARK-39887][SQL][FOLLOW-UP] Do not exclude Union's first child attributes when traversing other children in RemoveRedundantAliases ### What changes were proposed in this pull request? Do not exclude `Union`'s first child attributes when traversing other children in `RemoveRedundantAliases`. ### Why are the changes needed? We don't need to exclude those attributes that `Union` inherits from its first child. See discussion here: https://github.com/apache/spark/pull/37496#discussion_r944509115 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #37534 from peter-toth/SPARK-39887-keep-attributes-of-unions-first-child-follow-up. Authored-by: Peter Toth <ptoth@cloudera.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e732232dac420826af269d8cf5efacb52933f59a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 August 2022, 06:57:53 UTC
0db7842 [SPARK-40117][PYTHON][SQL] Convert condition to java in DataFrameWriterV2.overwrite ### What changes were proposed in this pull request? Fix DataFrameWriterV2.overwrite(), which fails to convert the condition parameter to Java. This prevents the function from being called. It is caused by the following commit that deleted the `_to_java_column` call instead of fixing it: https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9 ### Why are the changes needed? DataFrameWriterV2.overwrite() cannot be called. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually checked whether the arguments are sent to the JVM or not. Closes #37547 from looi/fix-overwrite. Authored-by: Wenli Looi <wlooi@ucalgary.ca> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 46379863ab0dd2ee8fcf1e31e76476ff18397f60) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 August 2022, 06:29:04 UTC
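A hedged sketch of the call path that was broken, given some existing DataFrame `df`; the catalog and table names are hypothetical and the target must be a DSv2 table that supports overwrite by filter.

```python
from pyspark.sql import functions as F

# The condition is a Column; converting it to its Java counterpart is the
# step that failed before this fix.
(df.writeTo("my_catalog.db.events")
   .overwrite(F.col("day") == "2022-08-17"))
```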
21acaae [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences ### What changes were proposed in this pull request? Add code to defensively check if the pre-allocated result array is big enough to handle the next element in a date or timestamp sequence. ### Why are the changes needed? `InternalSequenceBase.getSequenceLength` is a fast method for estimating the size of the result array. It uses an estimated step size in micros which is not always entirely accurate for the date/time/time-zone combination. As a result, `getSequenceLength` occasionally overestimates the size of the result array and also occasionally underestimates the size of the result array. `getSequenceLength` sometimes overestimates the size of the result array when the step size is in months (because `InternalSequenceBase` assumes 28 days per month). This case is handled: `InternalSequenceBase` will slice the array, if needed. `getSequenceLength` sometimes underestimates the size of the result array when the sequence crosses a DST "spring forward" without a corresponding "fall backward". This case is not handled (thus, this PR). For example: ``` select sequence( timestamp'2022-03-13 00:00:00', timestamp'2022-03-14 00:00:00', interval 1 day) as x; ``` In the America/Los_Angeles time zone, this results in the following error: ``` java.lang.ArrayIndexOutOfBoundsException: 1 at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:77) ``` This happens because `InternalSequenceBase` calculates an estimated step size of 24 hours. If you add 24 hours to 2022-03-13 00:00:00 in the America/Los_Angeles time zone, you get 2022-03-14 01:00:00 (because 2022-03-13 has only 23 hours due to "spring forward"). Since 2022-03-14 01:00:00 is later than the specified stop value, `getSequenceLength` assumes the stop value is not included in the result. Therefore, `getSequenceLength` estimates an array size of 1. However, when actually creating the sequence, `InternalSequenceBase` does not use a step of 24 hours, but of 1 day. When you add 1 day to 2022-03-13 00:00:00, you get 2022-03-14 00:00:00. Now the stop value *is* included, and we overrun the end of the result array. The new unit test includes examples of problematic date sequences. This PR adds code to handle the underestimation case: it checks if we're about to overrun the array, and if so, gets a new array that's larger by 1. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #37513 from bersprockets/date_sequence_array_size_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 3a1136aa05dd5e16de81c7ec804416b3498ca967) Signed-off-by: Max Gekk <max.gekk@gmail.com> 16 August 2022, 08:53:51 UTC
2ee196d [SPARK-40079] Add Imputer inputCols validation for empty input case Signed-off-by: Weichen Xu <weichen.xu@databricks.com> ### What changes were proposed in this pull request? Add Imputer inputCols validation for the empty input case ### Why are the changes needed? If Imputer inputCols is empty, `fit` works fine, but when saving the model, an error will be raised: > AnalysisException: Datasource does not support writing empty or nested empty schemas. Please make sure the data schema has at least one or more column(s). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #37518 from WeichenXu123/imputer-param-validation. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> (cherry picked from commit 87094f89655b7df09cdecb47c653461ae855b0ac) Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 15 August 2022, 10:04:14 UTC
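For reference, a minimal Imputer setup in PySpark; the column names and the DataFrame `df` are placeholders. With the added validation, passing an empty `inputCols` list is expected to be rejected up front rather than failing later when the fitted model is saved.

```python
from pyspark.ml.feature import Imputer

# Normal usage: non-empty inputCols/outputCols of matching length.
imputer = Imputer(inputCols=["a", "b"],
                  outputCols=["a_imputed", "b_imputed"],
                  strategy="mean")
model = imputer.fit(df)  # df: any DataFrame with numeric columns a and b

# Imputer(inputCols=[], outputCols=[]) should now fail fast at validation time
# instead of only erroring when the fitted model is saved.
```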
a6957d3 Revert "[SPARK-40047][TEST] Exclude unused `xalan` transitive dependency from `htmlunit`" ### What changes were proposed in this pull request? This PR reverts SPARK-40047 because the Maven test still needs this dependency. ### Why are the changes needed? The Maven test still needs the `xalan` dependency although GA passed before this PR. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual test: ``` mvn clean install -DskipTests -pl core -am build/mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.ui.UISeleniumSuite ``` **Before** ``` UISeleniumSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/apache/xml/utils/PrefixResolver at java.lang.Class.getDeclaredFields0(Native Method) at java.lang.Class.privateGetDeclaredFields(Class.java:2583) at java.lang.Class.getField0(Class.java:2975) at java.lang.Class.getField(Class.java:1701) at com.gargoylesoftware.htmlunit.svg.SvgElementFactory.<clinit>(SvgElementFactory.java:64) at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.<clinit>(HtmlUnitNekoHtmlParser.java:77) at com.gargoylesoftware.htmlunit.DefaultPageCreator.<clinit>(DefaultPageCreator.java:93) at com.gargoylesoftware.htmlunit.WebClient.<init>(WebClient.java:191) at com.gargoylesoftware.htmlunit.WebClient.<init>(WebClient.java:273) at com.gargoylesoftware.htmlunit.WebClient.<init>(WebClient.java:263) ... Cause: java.lang.ClassNotFoundException: org.apache.xml.utils.PrefixResolver at java.net.URLClassLoader.findClass(URLClassLoader.java:387) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) at java.lang.Class.getDeclaredFields0(Native Method) at java.lang.Class.privateGetDeclaredFields(Class.java:2583) at java.lang.Class.getField0(Class.java:2975) at java.lang.Class.getField(Class.java:1701) at com.gargoylesoftware.htmlunit.svg.SvgElementFactory.<clinit>(SvgElementFactory.java:64) at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.<clinit>(HtmlUnitNekoHtmlParser.java:77) ... ``` **After** ``` UISeleniumSuite: - all jobs page should be rendered even though we configure the scheduling mode to fair - effects of unpersist() / persist() should be reflected - failed stages should not appear to be active - spark.ui.killEnabled should properly control kill button display - jobs page should not display job group name unless some job was submitted in a job group - job progress bars should handle stage / task failures - job details page should display useful information for stages that haven't started - job progress bars / cells reflect skipped stages / tasks - stages that aren't run appear as 'skipped stages' after a job finishes - jobs with stages that are skipped should show correct link descriptions on all jobs page - attaching and detaching a new tab - kill stage POST/GET response is correct - kill job POST/GET response is correct - stage & job retention - live UI json application list - job stages should have expected dotfile under DAG visualization - stages page should show skipped stages - Staleness of Spark UI should not last minutes or hours - description for empty jobs - Support disable event timeline Run completed in 17 seconds, 986 milliseconds. Total number of tests run: 20 Suites: completed 2, aborted 0 Tests: succeeded 20, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #37508 from LuciferYang/revert-40047. 
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit afd7098c7fb6c95aece39acc32cdad764c984cd2) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 August 2022, 18:42:11 UTC
d7af1d2 [SPARK-39976][SQL] ArrayIntersect should handle null in left expression correctly ### What changes were proposed in this pull request? `ArrayIntersect` misjudges whether null is contained in the right expression's hash set. ``` >>> a = [1, 2, 3] >>> b = [3, None, 5] >>> df = spark.sparkContext.parallelize(data).toDF(["a","b"]) >>> df.show() +---------+------------+ | a| b| +---------+------------+ |[1, 2, 3]|[3, null, 5]| +---------+------------+ >>> df.selectExpr("array_intersect(a,b)").show() +---------------------+ |array_intersect(a, b)| +---------------------+ | [3]| +---------------------+ >>> df.selectExpr("array_intersect(b,a)").show() +---------------------+ |array_intersect(b, a)| +---------------------+ | [3, null]| +---------------------+ ``` In the original codegen code path, when handling `ArrayIntersect`'s array1, it uses the code below ``` def withArray1NullAssignment(body: String) = if (left.dataType.asInstanceOf[ArrayType].containsNull) { if (right.dataType.asInstanceOf[ArrayType].containsNull) { s""" |if ($array1.isNullAt($i)) { | if ($foundNullElement) { | $nullElementIndex = $size; | $foundNullElement = false; | $size++; | $builder.$$plus$$eq($nullValueHolder); | } |} else { | $body |} """.stripMargin } else { s""" |if (!$array1.isNullAt($i)) { | $body |} """.stripMargin } } else { body } ``` We have a flag `foundNullElement` to indicate whether array2 really contains a null value. But when implementing https://issues.apache.org/jira/browse/SPARK-36829, the meaning of `ArrayType.containsNull` was misunderstood, so when implementing `SQLOpenHashSet.withNullCheckCode()` ``` def withNullCheckCode( arrayContainsNull: Boolean, setContainsNull: Boolean, array: String, index: String, hashSet: String, handleNotNull: (String, String) => String, handleNull: String): String = { if (arrayContainsNull) { if (setContainsNull) { s""" |if ($array.isNullAt($index)) { | if (!$hashSet.containsNull()) { | $hashSet.addNull(); | $handleNull | } |} else { | ${handleNotNull(array, index)} |} """.stripMargin } else { s""" |if (!$array.isNullAt($index)) { | ${handleNotNull(array, index)} |} """.stripMargin } } else { handleNotNull(array, index) } } ``` the code path of `if (arrayContainsNull && setContainsNull)` was misinterpreted as meaning that the array's OpenHashSet really has a null value. In this PR we add a new parameter `additionalCondition` to complement the previous implementation of `foundNullElement`. Also refactor the method's parameter names. ### Why are the changes needed? Fix a data correctness issue ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #37436 from AngersZhuuuu/SPARK-39776-FOLLOW_UP. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit dff5c2f2e9ce233e270e0e5cde0a40f682ba9534) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 August 2022, 02:54:09 UTC
21d9db3 [SPARK-39887][SQL][3.3] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique ### What changes were proposed in this pull request? Keep the output attributes of a `Union` node's first child in the `RemoveRedundantAliases` rule to avoid correctness issues. ### Why are the changes needed? To fix the result of the following query: ``` SELECT a, b AS a FROM ( SELECT a, a AS b FROM (SELECT a FROM VALUES (1) AS t(a)) UNION ALL SELECT a, b FROM (SELECT a, b FROM VALUES (1, 2) AS t(a, b)) ) ``` Before this PR the query returns the incorrect result: ``` +---+---+ | a| a| +---+---+ | 1| 1| | 2| 2| +---+---+ ``` After this PR it returns the expected result: ``` +---+---+ | a| a| +---+---+ | 1| 1| | 1| 2| +---+---+ ``` ### Does this PR introduce _any_ user-facing change? Yes, fixes a correctness issue. ### How was this patch tested? Added new UTs. Closes #37472 from peter-toth/SPARK-39887-keep-attributes-of-unions-first-child-3.3. Authored-by: Peter Toth <ptoth@cloudera.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 August 2022, 02:41:37 UTC
221fee8 [SPARK-40047][TEST] Exclude unused `xalan` transitive dependency from `htmlunit` ### What changes were proposed in this pull request? This PR excludes `xalan` from `htmlunit` to clean up the warning for CVE-2022-34169: ``` Provides transitive vulnerable dependency xalan:xalan:2.7.2 CVE-2022-34169 7.5 Integer Coercion Error vulnerability with medium severity found Results powered by Checkmarx(c) ``` `xalan:xalan:2.7.2` is the latest version; the code base has not been updated for 5 years, so this cannot be solved by upgrading `xalan`. ### Why are the changes needed? The vulnerability is described in [CVE-2022-34169](https://github.com/advisories/GHSA-9339-86wc-4qgf); it is better to exclude it although it's just a test dependency for Spark. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GitHub Actions - Manual test: run `mvn dependency:tree -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive | grep xalan` to check that `xalan` is not matched after this PR Closes #37481 from LuciferYang/exclude-xalan. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7f3baa77acbf7747963a95d0f24e3b8868c7b16a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 August 2022, 22:10:50 UTC
248e8b4 [SPARK-40043][PYTHON][SS][DOCS] Document DataStreamWriter.toTable and DataStreamReader.table ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/30835 that adds `DataStreamWriter.toTable` and `DataStreamReader.table` into the PySpark documentation. ### Why are the changes needed? To document both features. ### Does this PR introduce _any_ user-facing change? Yes, both APIs will be shown in the PySpark reference documentation. ### How was this patch tested? Manually built the documentation and checked. Closes #37477 from HyukjinKwon/SPARK-40043. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 447003324d2cf9f2bfa799ef3a1e744a5bc9277d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 11 August 2022, 06:01:25 UTC
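A short usage sketch of the two documented APIs; the table names and checkpoint location are placeholders, and `source_table` must be backed by a data source that supports streaming reads.

```python
# Read a table as a stream, then write the stream back out to another table.
src = spark.readStream.table("source_table")

query = (src.writeStream
            .option("checkpointLocation", "/tmp/checkpoints/sink_table")
            .toTable("sink_table"))
```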
42b30ee [SPARK-40022][YARN][TESTS] Ignore pyspark suites in `YarnClusterSuite` when python3 is unavailable ### What changes were proposed in this pull request? This PR adds `assume(isPythonAvailable)` to the `testPySpark` method in `YarnClusterSuite` to make the `YarnClusterSuite` tests succeed in an environment without Python 3 configured. ### Why are the changes needed? `YarnClusterSuite` should not be `ABORTED` when `python3` is not configured. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GitHub Actions - Manual test: Run ``` mvn clean test -pl resource-managers/yarn -am -Pyarn -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -Dtest=none ``` in an environment without Python 3 configured: **Before** ``` YarnClusterSuite: org.apache.spark.deploy.yarn.YarnClusterSuite *** ABORTED *** java.lang.RuntimeException: Unable to load a Suite class that was discovered in the runpath: org.apache.spark.deploy.yarn.YarnClusterSuite at org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:81) at org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.map(TraversableLike.scala:286) ... Run completed in 833 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 *** 1 SUITE ABORTED *** ``` **After** ``` YarnClusterSuite: - run Spark in yarn-client mode - run Spark in yarn-cluster mode - run Spark in yarn-client mode with unmanaged am - run Spark in yarn-client mode with different configurations, ensuring redaction - run Spark in yarn-cluster mode with different configurations, ensuring redaction - yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) - SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'local' - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' - SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'local' and gateway-replacement path - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' and gateway-replacement path - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' and gateway-replacement path containing an environment variable - SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'file' - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'file' - run Spark in yarn-cluster mode unsuccessfully - run Spark in yarn-cluster mode failure after sc initialized - run Python application in yarn-client mode !!! CANCELED !!! YarnClusterSuite.this.isPythonAvailable was false (YarnClusterSuite.scala:376) - run Python application in yarn-cluster mode !!! CANCELED !!! YarnClusterSuite.this.isPythonAvailable was false (YarnClusterSuite.scala:376) - run Python application in yarn-cluster mode using spark.yarn.appMasterEnv to override local envvar !!! CANCELED !!! 
YarnClusterSuite.this.isPythonAvailable was false (YarnClusterSuite.scala:376) - user class path first in client mode - user class path first in cluster mode - monitor app using launcher library - running Spark in yarn-cluster mode displays driver log links - timeout to get SparkContext in cluster mode triggers failure - executor env overwrite AM env in client mode - executor env overwrite AM env in cluster mode - SPARK-34472: ivySettings file with no scheme or file:// scheme should be localized on driver in cluster mode - SPARK-34472: ivySettings file with no scheme or file:// scheme should retain user provided path in client mode - SPARK-34472: ivySettings file with non-file:// schemes should throw an error Run completed in 7 minutes, 2 seconds. Total number of tests run: 25 Suites: completed 2, aborted 0 Tests: succeeded 25, failed 0, canceled 3, ignored 0, pending 0 All tests passed. ``` Closes #37454 from LuciferYang/yarnclustersuite. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8e472443081342a0e0dc37aa154e30a0a6df39b7) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 10 August 2022, 00:59:49 UTC
9353437 [SPARK-40002][SQL] Don't push down limit through window using ntile ### What changes were proposed in this pull request? Change `LimitPushDownThroughWindow` so that it no longer supports pushing down a limit through a window using ntile. ### Why are the changes needed? In an unpartitioned window, the ntile function is currently applied to the result of the limit. This behavior produces results that conflict with Spark 3.1.3, Hive 2.3.9 and Prestodb 0.268 #### Example Assume this data: ``` create table t1 stored as parquet as select * from range(101); ``` Also assume this query: ``` select id, ntile(10) over (order by id) as nt from t1 limit 10; ``` With Spark 3.2.2, Spark 3.3.0, and master, the limit is applied before the ntile function: ``` +---+---+ |id |nt | +---+---+ |0 |1 | |1 |2 | |2 |3 | |3 |4 | |4 |5 | |5 |6 | |6 |7 | |7 |8 | |8 |9 | |9 |10 | +---+---+ ``` With Spark 3.1.3, and Hive 2.3.9, and Prestodb 0.268, the limit is applied _after_ ntile. Spark 3.1.3: ``` +---+---+ |id |nt | +---+---+ |0 |1 | |1 |1 | |2 |1 | |3 |1 | |4 |1 | |5 |1 | |6 |1 | |7 |1 | |8 |1 | |9 |1 | +---+---+ ``` Hive 2.3.9: ``` +-----+-----+ | id | nt | +-----+-----+ | 0 | 1 | | 1 | 1 | | 2 | 1 | | 3 | 1 | | 4 | 1 | | 5 | 1 | | 6 | 1 | | 7 | 1 | | 8 | 1 | | 9 | 1 | +-----+-----+ 10 rows selected (1.72 seconds) ``` Prestodb 0.268: ``` id | nt ----+---- 0 | 1 1 | 1 2 | 1 3 | 1 4 | 1 5 | 1 6 | 1 7 | 1 8 | 1 9 | 1 (10 rows) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Two new unit tests. Closes #37443 from bersprockets/pushdown_ntile. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c9156e5a3b9cb290c7cdda8db298c9875e67aa5e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 August 2022, 02:40:08 UTC
66faaa5 [SPARK-39965][K8S] Skip PVC cleanup when driver doesn't own PVCs ### What changes were proposed in this pull request? This PR aims to skip the PVC cleanup logic when `spark.kubernetes.driver.ownPersistentVolumeClaim=false`. ### Why are the changes needed? To simplify the Spark termination log by removing an unnecessary log entry containing an exception message when Spark jobs have no PVC permission and, at the same time, `spark.kubernetes.driver.ownPersistentVolumeClaim` is `false`. ### Does this PR introduce _any_ user-facing change? Only in the termination logs of Spark jobs that have no PVC permission. ### How was this patch tested? Manually. Closes #37433 from dongjoon-hyun/SPARK-39965. Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: pralabhkumar <pralabhkumar@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 87b312a9c9273535e22168c3da73834c22e1fbbb) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 August 2022, 16:58:25 UTC
369b014 [SPARK-38034][SQL] Optimize TransposeWindow rule ### What changes were proposed in this pull request? Optimize the TransposeWindow rule to extend the applicable cases and optimize the time complexity. The TransposeWindow rule tries to eliminate unnecessary shuffles, but the function compatiblePartitions only takes the first n elements of the window2 partition sequence, so for some cases it does not take effect, like the case below:  val df = spark.range(10).selectExpr("id AS a", "id AS b", "id AS c", "id AS d") df.selectExpr( "sum(`d`) OVER(PARTITION BY `b`,`a`) as e", "sum(`c`) OVER(PARTITION BY `a`) as f" ).explain Current plan == Physical Plan == *(5) Project [e#10L, f#11L] +- Window [sum(c#4L) windowspecdefinition(a#2L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS f#11L], [a#2L]    +- *(4) Sort [a#2L ASC NULLS FIRST], false, 0       +- Exchange hashpartitioning(a#2L, 200), true, [id=#41]          +- *(3) Project [a#2L, c#4L, e#10L]             +- Window [sum(d#5L) windowspecdefinition(b#3L, a#2L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS e#10L], [b#3L, a#2L]                +- *(2) Sort [b#3L ASC NULLS FIRST, a#2L ASC NULLS FIRST], false, 0                   +- Exchange hashpartitioning(b#3L, a#2L, 200), true, [id=#33]                      +- *(1) Project [id#0L AS d#5L, id#0L AS b#3L, id#0L AS a#2L, id#0L AS c#4L]                         +- *(1) Range (0, 10, step=1, splits=10) Expected plan: == Physical Plan == *(4) Project [e#924L, f#925L] +- Window [sum(d#43L) windowspecdefinition(b#41L, a#40L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS e#924L], [b#41L, a#40L]  +- *(3) Sort [b#41L ASC NULLS FIRST, a#40L ASC NULLS FIRST], false, 0       +- *(3) Project [d#43L, b#41L, a#40L, f#925L]          +- Window [sum(c#42L) windowspecdefinition(a#40L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS f#925L], [a#40L]             +- *(2) Sort [a#40L ASC NULLS FIRST], false, 0                +- Exchange hashpartitioning(a#40L, 200), true, [id=#282]                   +- *(1) Project [id#38L AS d#43L, id#38L AS b#41L, id#38L AS a#40L, id#38L AS c#42L]                      +- *(1) Range (0, 10, step=1, splits=10) Also, the permutations method has O(n!) time complexity, which is very expensive when there are many partition columns, so we could try to optimize it. ### Why are the changes needed? We can apply the rule to more cases, which could improve execution performance by eliminating unnecessary shuffles, and by reducing the time complexity from O(n!) to O(n^2), the performance of the rule itself could improve ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? UT Closes #35334 from constzhou/SPARK-38034_optimize_transpose_window_rule. Authored-by: xzhou <15210830305@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 0cc331dc7e51e53000063052b0c8ace417eb281b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2022, 10:43:16 UTC
26f0d50 [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast This PR is a followup of https://github.com/apache/spark/pull/37283. It missed the `throw` keyword in the interpreted path. The goal is to throw an exception as intended instead of returning the exception itself. Yes, it will throw an exception as expected in the interpreted path. Haven't added a separate test because the change is straightforward. Closes #37414 from HyukjinKwon/SPARK-39981. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e6b9c6166a08ad4dca2550bbbb151fa575b730a8) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 August 2022, 09:32:36 UTC
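A hedged illustration of the behavior this follow-up restores in the interpreted path: under Spark's default ANSI store-assignment policy, an insert whose cast overflows should raise the overflow error rather than returning the exception object. The table name and value are made up.

```python
spark.sql("CREATE TABLE tiny_tbl (c TINYINT) USING parquet")
try:
    # 100000 does not fit in TINYINT, so the cast overflows during the insert.
    spark.sql("INSERT INTO tiny_tbl VALUES (100000)")
except Exception as e:
    print(e)  # expected: a cast overflow error for the table insert
```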
c358ee6 [SPARK-39775][CORE][AVRO] Disable validate default values when parsing Avro schemas ### What changes were proposed in this pull request? This PR disables validating default values when parsing Avro schemas. ### Why are the changes needed? Spark will throw an exception when upgrading to Spark 3.2. We have fixed the Hive serde tables before: SPARK-34512. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #37191 from wangyum/SPARK-39775. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5c1b99f441ec5e178290637a9a9e7902aaa116e1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2022, 03:26:14 UTC
15ebd56 [SPARK-39952][SQL] SaveIntoDataSourceCommand should recache result relation ### What changes were proposed in this pull request? recacheByPlan the result relation inside `SaveIntoDataSourceCommand` ### Why are the changes needed? The behavior of `SaveIntoDataSourceCommand` is similar to `InsertIntoDataSourceCommand`, which supports appending or overwriting data. In order to keep data consistent, we should always recacheByPlan the relation post hoc. ### Does this PR introduce _any_ user-facing change? Yes, bug fix ### How was this patch tested? Added a test Closes #37380 from ulysses-you/refresh. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5fe0b245f7891a05bc4e1e641fd0aa9130118ea4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 August 2022, 17:04:10 UTC
6e9a58f [SPARK-39947][BUILD] Upgrade Jersey to 2.36 ### What changes were proposed in this pull request? This PR upgrades Jersey from 2.35 to 2.36. ### Why are the changes needed? This version adapts to Jackson 2.13.3, which is also currently used by Spark - [Adopt Jackson 2.13](https://github.com/eclipse-ee4j/jersey/pull/4928) - [Update Jackson to 2.13.3](https://github.com/eclipse-ee4j/jersey/pull/5076) The release notes are as follows: - https://github.com/eclipse-ee4j/jersey/releases/tag/2.36 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes #37375 from LuciferYang/jersey-236. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit d1c145b0b0b892fcbf1e1adda7b8ecff75c56f6d) Signed-off-by: Sean Owen <srowen@gmail.com> # Conflicts: # dev/deps/spark-deps-hadoop-2-hive-2.3 # dev/deps/spark-deps-hadoop-3-hive-2.3 # pom.xml 03 August 2022, 13:32:23 UTC
630dc7e [SPARK-39900][SQL] Address partial or negated condition in binary format's predicate pushdown ### What changes were proposed in this pull request? Fix a `BinaryFileFormat` filter pushdown bug. Before this change, when the filter tree is: ```` -Not - - IsNotNull ```` since `IsNotNull` cannot be matched, `IsNotNull` will return a result that is always true (that is, `case _ => (_ => true)`), meaning no filter pushdown is performed. But because there is still a `Not`, after negation it will return a result that is always false, so no results can be returned. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test suite in `BinaryFileFormatSuite`: ``` testCreateFilterFunction( Seq(Not(IsNull(LENGTH))), Seq((t1, true), (t2, true), (t3, true))) ``` Closes #37350 from zzzzming95/SPARK-39900. Lead-authored-by: zzzzming95 <505306252@qq.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a0dc7d9117b66426aaa2257c8d448a2f96882ecd) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2022, 12:23:04 UTC
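A small PySpark sketch of the kind of negated null filter involved; the directory path is hypothetical, and `length` is one of the metadata columns produced by the `binaryFile` source.

```python
from pyspark.sql import functions as F

# `~isnull(...)` produces a Not(IsNull(...)) filter, the shape that previously
# got pushed down as an always-false predicate and returned no rows.
df = (spark.read.format("binaryFile")
      .load("/tmp/blobs")
      .filter(~F.isnull("length")))

df.select("path", "length").show()
```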
3b02351 [SPARK-39867][SQL][3.3] Fix scala style ### What changes were proposed in this pull request? fix scala style ### Why are the changes needed? fix failed test ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass CI Closes #37394 from ulysses-you/style. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2022, 08:48:12 UTC
bd3f36f [SPARK-39962][PYTHON][SQL] Apply projection when group attributes are empty ### What changes were proposed in this pull request? This PR proposes to apply the projection to respect the reordered columns in its child when group attributes are empty. ### Why are the changes needed? To respect the column order in the child. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug as below: ```python import pandas as pd from pyspark.sql import functions as f f.pandas_udf("double") def AVG(x: pd.Series) -> float: return x.mean() abc = spark.createDataFrame([(1.0, 5.0, 17.0)], schema=["a", "b", "c"]) abc.agg(AVG("a"), AVG("c")).show() abc.select("c", "a").agg(AVG("a"), AVG("c")).show() ``` **Before** ``` +------+------+ |AVG(a)|AVG(c)| +------+------+ | 17.0| 1.0| +------+------+ ``` **After** ``` +------+------+ |AVG(a)|AVG(c)| +------+------+ | 1.0| 17.0| +------+------+ ``` ### How was this patch tested? Manually tested, and added an unittest. Closes #37390 from HyukjinKwon/SPARK-39962. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5335c784ae76c9cc0aaa7a4b57b3cd6b3891ad9a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2022, 07:11:33 UTC