https://github.com/apache/spark

6b1ff22 Preparing Spark release v3.4.1-rc1 19 June 2023, 22:17:28 UTC
864b986 [SPARK-44018][SQL] Improve the hashCode and toString for some DS V2 Expressions ### What changes were proposed in this pull request? The `hashCode()` of `UserDefinedScalarFunc` and `GeneralScalarExpression` is not good enough. For example, `GeneralScalarExpression` uses `Objects.hash(name, children)`, which combines the hash code of `name` with the hash code of the `children` array's reference to form the `GeneralScalarExpression`'s hash code. In fact, we should use the hash code of each element in `children`. Because `UserDefinedAggregateFunc` and `GeneralAggregateFunc` are missing `hashCode()`, this PR also adds it for them. This PR also improves the toString for `UserDefinedAggregateFunc` and `GeneralAggregateFunc` by using primitive boolean comparison instead of `Objects.equals`, since primitive boolean comparison performs better than `Objects.equals`. ### Why are the changes needed? Improve the hash code for some DS V2 Expressions. ### Does this PR introduce _any_ user-facing change? 'Yes'. ### How was this patch tested? N/A Closes #41543 from beliefer/SPARK-44018. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 8c84d2c9349d7b607db949c2e114df781f23e438) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 June 2023, 07:55:24 UTC
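To make the hashing issue above concrete, here is a minimal standalone sketch using plain JDK calls (not the actual DS V2 classes): `Objects.hash` folds in an array argument's identity hash code, while hashing the array's elements gives a stable, content-based result.

```scala
import java.util.{Arrays, Objects}

// Two structurally equal "children" arrays, as a DS V2 expression might hold.
val c1 = Array[AnyRef]("col", "1")
val c2 = Array[AnyRef]("col", "1")

// Objects.hash(name, children) folds in the array's identity hashCode,
// so two structurally equal expressions usually get different hash codes.
println(Objects.hash("ADD", c1) == Objects.hash("ADD", c2)) // usually false

// Hashing the element contents is stable across structurally equal arrays.
println(Arrays.hashCode(c1) == Arrays.hashCode(c2))         // true
```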
0c593a4 [MINOR][K8S][DOCS] Fix all dead links for K8s doc ### What changes were proposed in this pull request? This PR fixes all dead links for K8s doc. ### Why are the changes needed? <img width="797" alt="image" src="https://github.com/apache/spark/assets/5399861/3ba3f048-776c-42e6-b455-86e90b6ef22f"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #41635 from wangyum/kubernetes. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 1ff670488c3b402984ceb24e1d6eaf5a16176f1d) Signed-off-by: Yuming Wang <yumwang@ebay.com> 17 June 2023, 06:40:38 UTC
89bfb1f [SPARK-44070][BUILD] Bump snappy-java 1.1.10.1 ### What changes were proposed in this pull request? Bump snappy-java from 1.1.10.0 to 1.1.10.1. ### Why are the changes needed? This is mostly a security release; the notable changes are CVE fixes. - CVE-2023-34453 Integer overflow in shuffle - CVE-2023-34454 Integer overflow in compress - CVE-2023-34455 Unchecked chunk length Full changelog: https://github.com/xerial/snappy-java/releases/tag/v1.1.10.1 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #41616 from pan3793/SPARK-44070. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 0502a42dda4d0822e2572a3d1ae6928d90b792a9) Signed-off-by: Yuming Wang <yumwang@ebay.com> 16 June 2023, 15:49:30 UTC
7edb3d9 [SPARK-44040][SQL] Fix compute stats when AggregateExec node above QueryStageExec ### What changes were proposed in this pull request? This PR fixes compute stats when `BaseAggregateExec` nodes above `QueryStageExec`. For aggregation, when the number of shuffle output rows is 0, the final result may be 1. For example: ```sql SELECT count(*) FROM tbl WHERE false; ``` The number of shuffle output rows is 0, and the final result is 1. Please see the [UI](https://github.com/apache/spark/assets/5399861/9d9ad999-b3a9-433e-9caf-c0b931423891). ### Why are the changes needed? Fix data issue. `OptimizeOneRowPlan` will use stats to remove `Aggregate`: ``` === Applying Rule org.apache.spark.sql.catalyst.optimizer.OptimizeOneRowPlan === !Aggregate [id#5L], [id#5L] Project [id#5L] +- Union false, false +- Union false, false :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)]) :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)]) +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)]) +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)]) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #41576 from wangyum/SPARK-44040. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 55ba63c257b6617ec3d2aca5bc1d0989d4f29de8) Signed-off-by: Yuming Wang <yumwang@ebay.com> 16 June 2023, 03:42:19 UTC
0355a6f [SPARK-44053][BUILD][3.4] Update ORC to 1.8.4 ### What changes were proposed in this pull request? This PR aims to update ORC to 1.8.4. ### Why are the changes needed? This will bring the following bug fixes. - https://issues.apache.org/jira/projects/ORC/versions/12353041 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #41592 from guiyanakuang/ORC-1.8.4. Authored-by: Yiqun Zhang <guiyanakuang@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 June 2023, 16:38:18 UTC
b1a5873 [SPARK-44038][DOCS][K8S] Update YuniKorn docs with v1.3 ### What changes were proposed in this pull request? This PR aims to update `Apache YuniKorn` batch scheduler docs with v1.3.0 to recommend it in Apache Spark 3.4.1 and 3.5.0 users. ### Why are the changes needed? Apache YuniKorn v1.3.0 was released on 2023-06-12 with 160 resolved JIRAs. https://yunikorn.apache.org/release-announce/1.3.0 I installed YuniKorn v1.3.0 and tested manually. ``` $ helm list -n yunikorn NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION yunikorn yunikorn 1 2023-06-13 01:56:32.784863 -0700 PDT deployed yunikorn-1.3.0 ``` ``` $ build/sbt -Pkubernetes -Pkubernetes-integration-tests -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/testOnly *.YuniKornSuite" -Dtest.exclude.tags=minikube,local,decom,r -Dtest.default.exclude.tags= ... [info] YuniKornSuite: [info] - SPARK-42190: Run SparkPi with local[*] (12 seconds, 49 milliseconds) [info] - Run SparkPi with no resources (20 seconds, 378 milliseconds) [info] - Run SparkPi with no resources & statefulset allocation (20 seconds, 583 milliseconds) [info] - Run SparkPi with a very long application name. (20 seconds, 606 milliseconds) [info] - Use SparkLauncher.NO_RESOURCE (19 seconds, 676 milliseconds) [info] - Run SparkPi with a master URL without a scheme. (20 seconds, 631 milliseconds) [info] - Run SparkPi with an argument. (22 seconds, 320 milliseconds) [info] - Run SparkPi with custom labels, annotations, and environment variables. (20 seconds, 469 milliseconds) [info] - All pods have the same service account by default (22 seconds, 537 milliseconds) [info] - Run extraJVMOptions check on driver (12 seconds, 268 milliseconds) ... ``` ``` $ k describe pod spark-test-app-33ec515e453e4301a90f626812db1153-driver -n spark-dbe522106eac40d4a17447bfa2947c45 ... Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduling 82s yunikorn spark-dbe522106eac40d4a17447bfa2947c45/spark-test-app-33ec515e453e4301a90f626812db1153-driver is queued and waiting for allocation Normal Scheduled 82s yunikorn Successfully assigned spark-dbe522106eac40d4a17447bfa2947c45/spark-test-app-33ec515e453e4301a90f626812db1153-driver to node docker-desktop Normal PodBindSuccessful 82s yunikorn Pod spark-dbe522106eac40d4a17447bfa2947c45/spark-test-app-33ec515e453e4301a90f626812db1153-driver is successfully bound to node docker-desktop Normal Pulled 82s kubelet Container image "docker.io/kubespark/spark:dev" already present on machine Normal Created 82s kubelet Created container spark-kubernetes-driver Normal Started 82s kubelet Started container spark-kubernetes-driver ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. Closes #41571 from dongjoon-hyun/SPARK-44038. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 223c196d242f9c79ca76b226bf9489759652203a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 June 2023, 10:37:16 UTC
12d8e80 Revert "[SPARK-44031][BUILD] Upgrade silencer to 1.7.13" This reverts commit b04232c65dc593c22db5fb5e18eab79aebf3a2ca. 13 June 2023, 03:51:16 UTC
b04232c [SPARK-44031][BUILD] Upgrade silencer to 1.7.13 ### What changes were proposed in this pull request? This PR aims to upgrade `silencer` to 1.7.13. ### Why are the changes needed? `silencer` 1.7.13 supports `Scala 2.12.18 & 2.13.11`. - https://github.com/ghik/silencer/releases/tag/v1.7.13 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #41560 from dongjoon-hyun/silencer_1.7.13. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 04d84df87e33111f79525b88ad78c6d1bddab78c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 June 2023, 03:50:45 UTC
e010365 [SPARK-32559][SQL] Fix the trim logic that didn't handle ASCII control characters correctly ### What changes were proposed in this pull request? The trim logic in the Cast expression introduced in https://github.com/apache/spark/pull/29375 trims ASCII control characters unexpectedly. Before this patch ![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7) And Hive ![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe) ### Why are the changes needed? The behavior described above isn't consistent with the behavior of Hive. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added a unit test. Closes #41535 from Kwafoor/trim_bugfix. Lead-authored-by: wangjunbo <wangjunbo@qiyi.com> Co-authored-by: Junbo wang <1042815068@qq.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 80588e4ddfac1d2e2fdf4f9a7783c56be6a97cdd) Signed-off-by: Kent Yao <yao@apache.org> 13 June 2023, 03:44:12 UTC
1431df0 [SPARK-43398][CORE] Executor timeout should be max of idle, shuffle and rdd timeout ### What changes were proposed in this pull request? Executor timeout should be the max of the idle, shuffle and rdd timeouts. ### Why are the changes needed? The timeout value was wrong when combining the idle, shuffle and rdd timeouts. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added test in `ExecutorMonitorSuite` Closes #41082 from warrenzhu25/max-timeout. Authored-by: Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7107742a381cde2e6de9425e3e436282a8c0d27c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 June 2023, 07:40:21 UTC
238da78 [SPARK-43404][SS][3.4] Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error NOTE: This ports back the commit https://github.com/apache/spark/commit/d3b9f4e7b1be7d89466cf80c3e38890c7add7625 (PR https://github.com/apache/spark/pull/41089) to branch-3.4. This is a clean cherry-pick. ### What changes were proposed in this pull request? Skip reusing sst file for same version of RocksDB state store to avoid id mismatch error ### Why are the changes needed? In case of task retry on the same executor, its possible that the original task completed the phase of creating the SST files and uploading them to the object store. In this case, we also might have added an entry to the in-memory map for `versionToRocksDBFiles` for the given version. When the retry task creates the local checkpoint, its possible the file name and size is the same, but the metadata ID embedded within the file may be different. So, when we try to load this version on successful commit, the metadata zip file points to the old SST file which results in a RocksDB mismatch id error. ``` Mismatch in unique ID on table file 24220. Expected: {9692563551998415634,4655083329411385714} Actual: {9692563551998415639,10299185534092933087} in file /local_disk0/spark-f58a741d-576f-400c-9b56-53497745ac01/executor-18e08e59-20e8-4a00-bd7e-94ad4599150b/spark-5d980399-3425-4951-894a-808b943054ea/StateStoreId(opId=2147483648,partId=53,name=default)-d89e082e-4e33-4371-8efd-78d927ad3ba3/workingDir-9928750e-f648-4013-a300-ac96cb6ec139/MANIFEST-024212 ``` This change avoids reusing files for the same version on the same host based on the map entries to reduce the chance of running into the error above. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test RocksDBSuite ``` [info] Run completed in 35 seconds, 995 milliseconds. [info] Total number of tests run: 33 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 33, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` Closes #41530 from HeartSaVioR/SPARK-43404-3.4. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 09 June 2023, 11:49:25 UTC
020eb69 [SPARK-42290][SQL] Fix the OOM error that can't be reported when AQE is on ### What changes were proposed in this pull request? When we use spark-shell to submit a job like this: ```scala $ spark-shell --conf spark.driver.memory=1g val df = spark.range(5000000).withColumn("str", lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf")) val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer") df2.collect ``` This will cause the driver to hang indefinitely. When we disable AQE, the `java.lang.OutOfMemoryError` will be thrown. After checking the code, the reason is a wrong way of using `Throwable::initCause`. It happens when the OOM is thrown at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L184 . Then https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala#L2401 will be executed. It uses `new SparkException(..., cause=oe).initCause(oe.getCause)`. The doc of `Throwable::initCause` says ``` This method can be called at most once. It is generally called from within the constructor, or immediately after creating the throwable. If this throwable was created with Throwable(Throwable) or Throwable(String, Throwable), this method cannot be called even once. ``` So when we call it, an `IllegalStateException` will be thrown. Finally, `promise.tryFailure(ex)` is never called. The driver will be blocked. ### Why are the changes needed? Fix the bug that the OOM error is never reported. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add new test Closes #41517 from Hisoka-X/SPARK-42290_OOM_AQE_On. Authored-by: Jia Fan <fanjiaeminem@qq.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 4168e1ac3c1b44298d2c6eae31e7f6cf948614a3) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 June 2023, 20:12:52 UTC
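The `Throwable::initCause` contract that bites here can be reproduced outside Spark in a few lines (a hypothetical reproduction, not the actual `BroadcastExchangeExec` code):

```scala
val oom = new OutOfMemoryError("Not enough memory to build the broadcast relation")

// An exception constructed with its cause already set...
val wrapped = new RuntimeException("broadcast failure", oom)

try {
  // ...rejects any later initCause call, per the JDK contract quoted above.
  wrapped.initCause(oom.getCause)
} catch {
  case e: IllegalStateException =>
    // A secondary exception like this is what kept promise.tryFailure(ex)
    // from ever running, leaving the driver blocked.
    println(s"initCause failed: $e")
}
```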
8a71a15 [MINOR][SQL][TESTS] Move ResolveDefaultColumnsSuite to 'o.a.s.sql' This PR moves `ResolveDefaultColumnsSuite` from `catalyst/analysis` package to `sql` package. To fix the code layout. No. Pass the CI Closes #41520 from dongjoon-hyun/move. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 07cc04d3e3f37684946e09a9ab1144efaed6f6ec) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 June 2023, 19:50:31 UTC
77e077a [SPARK-43973][SS][UI][TESTS][FOLLOWUP][3.4] Fix compilation by switching QueryTerminatedEvent constructor ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/41468 to fix `branch-3.4`'s compilation issue. ### Why are the changes needed? To recover the compilation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #41484 from dongjoon-hyun/SPARK-43973. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 June 2023, 18:30:32 UTC
778beb3 [SPARK-43976][CORE] Handle the case where modifiedConfigs doesn't exist in event logs ### What changes were proposed in this pull request? This prevents NPE by handling the case where `modifiedConfigs` doesn't exist in event logs. ### Why are the changes needed? Basically, this is the same solution for that case. - https://github.com/apache/spark/pull/34907 The new code was added here, but we missed the corner case. - https://github.com/apache/spark/pull/35972 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #41472 from dongjoon-hyun/SPARK-43976. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b4ab34bf9b22d0f0ca4ab13f9b6106f38ccfaebe) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 June 2023, 16:34:48 UTC
63d5995 [SPARK-43510][YARN] Fix YarnAllocator internal state when adding running executor after processing completed containers ### What changes were proposed in this pull request? Keep track of completed container ids in YarnAllocator and don't update internal state of a container if it's already completed. ### Why are the changes needed? YarnAllocator updates internal state adding running executors after executor launch in a separate thread. That can happen after the containers are already completed (e.g. preempted) and processed by YarnAllocator. Then YarnAllocator mistakenly thinks there are still running executors which are already lost. As a result, application hangs without any running executors. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UT. Closes #41173 from manuzhang/spark-43510. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit 89d44d092af4ae53fec296ca6569e240ad4c2bc5) Signed-off-by: Thomas Graves <tgraves@apache.org> 06 June 2023, 13:29:19 UTC
b705106 [SPARK-43973][SS][UI] Structured Streaming UI should display failed queries correctly ### What changes were proposed in this pull request? Handle the `exception` message from Structured Streaming's `QueryTerminatedEvent` in `StreamingQueryStatusListener`, so that the Structured Streaming UI can display the status and error message correctly for failed queries. ### Why are the changes needed? The original implementation of the Structured Streaming UI had a copy-and-paste bug where it forgot to handle the `exception` from `QueryTerminatedEvent` in `StreamingQueryStatusListener.onQueryTerminated`, so that field is never updated in the UI data, i.e. it's always `None`. In turn, the UI would always show the status of failed queries as `FINISHED` instead of `FAILED`. ### Does this PR introduce _any_ user-facing change? Yes. Failed Structured Streaming queries would show incorrectly as `FINISHED` before the fix: ![Screenshot 2023-06-05 at 3 58 16 PM](https://github.com/apache/spark/assets/107834/32d91148-7a27-4723-9748-ec8a7596ee61) and show correctly as `FAILED` after the fix: ![Screenshot 2023-06-05 at 3 56 07 PM](https://github.com/apache/spark/assets/107834/16b1df32-1804-4f49-aede-d775f8db7666) The example query is: ```scala implicit val ctx = spark.sqlContext import org.apache.spark.sql.execution.streaming.MemoryStream spark.conf.set("spark.sql.ansi.enabled", "true") val inputData = MemoryStream[(Int, Int)] val df = inputData.toDF().selectExpr("_1 / _2 as a") inputData.addData((1, 2), (3, 4), (5, 6), (7, 0)) val testQuery = df.writeStream.format("memory").queryName("kristest").outputMode("append").start testQuery.processAllAvailable() ``` ### How was this patch tested? Added UT to `StreamingQueryStatusListenerSuite` Closes #41468 from rednaxelafx/fix-streaming-ui. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 51a919ea8d65e38f829717ae822f31a1a7f57beb) Signed-off-by: Gengliang Wang <gengliang@apache.org> 06 June 2023, 05:07:48 UTC
7b30403 Revert "[SPARK-43911][SQL] Use toSet to deduplicate the iterator data to prevent the creation of large Array" This reverts commit 93709918affba4846a30cbae8692a6a328b5a448. 06 June 2023, 00:27:35 UTC
9370991 [SPARK-43911][SQL] Use toSet to deduplicate the iterator data to prevent the creation of large Array ### What changes were proposed in this pull request? When SubqueryBroadcastExec reuses the keys of Broadcast HashedRelation for dynamic partition pruning, it will put all the keys in an Array, and then call the distinct of the Array to remove the duplicates. In general, Broadcast HashedRelation may have many rows, and the repetition rate of this key is high. Doing so will cause this Array to occupy a large amount of memory (and this memory is not managed by MemoryManager), which may trigger OOM. The approach here is to directly call the toSet of the iterator to deduplicate, which can prevent the creation of a large array. ### Why are the changes needed? Avoid the occurrence of the following OOM exceptions: ```text Exception in thread "dynamicpruning-0" java.lang.OutOfMemoryError: Java heap space at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:85) at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:49) at scala.collection.generic.Growable.$anonfun$$plus$plus$eq$1(Growable.scala:62) at scala.collection.generic.Growable$$Lambda$7/1514840818.apply(Unknown Source) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) at scala.collection.TraversableOnce.to(TraversableOnce.scala:366) at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364) at scala.collection.AbstractIterator.to(Iterator.scala:1431) at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358) at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431) at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345) at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339) at scala.collection.AbstractIterator.toArray(Iterator.scala:1431) at org.apache.spark.sql.execution.SubqueryBroadcastExec.$anonfun$relationFuture$2(SubqueryBroadcastExec.scala:92) at org.apache.spark.sql.execution.SubqueryBroadcastExec$$Lambda$4212/5099232.apply(Unknown Source) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withExecutionId$1(SQLExecution.scala:140) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Production environment manual verification && Pass existing unit tests Closes #41419 from mcdull-zhang/reduce_memory_usage. Authored-by: mcdull-zhang <work4dong@163.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 595ad30e6259f7e4e4252dfee7704b73fd4760f7) Signed-off-by: Yuming Wang <yumwang@ebay.com> 04 June 2023, 07:08:22 UTC
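A standalone sketch of the memory difference described above (the key iterator here is a stand-in for the broadcast relation's keys, not the real `SubqueryBroadcastExec` code):

```scala
import scala.util.Random

// Many rows, few distinct key values -- the typical pruning-key shape.
def keyIterator: Iterator[Int] = Iterator.fill(1000000)(Random.nextInt(10))

// Before: every key is materialized into an Array before deduplication,
// holding ~1M elements that are not tracked by the MemoryManager.
val viaArray = keyIterator.toArray.distinct

// After: keys stream directly into a Set, so memory stays proportional
// to the number of distinct keys (here, at most 10).
val viaSet = keyIterator.toSet
```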
e140bf7 [SPARK-43956][SQL][3.4] Fix the bug that doesn't display the column's SQL for Percentile[Cont|Disc] ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/41436 to 3.4. ### Why are the changes needed? Fix the bug that doesn't display the column's SQL for Percentile[Cont|Disc]. ### Does this PR introduce _any_ user-facing change? 'Yes'. Users can see the correct SQL information. ### How was this patch tested? Test cases updated. Closes #41445 from beliefer/SPARK-43956_followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 03 June 2023, 19:15:15 UTC
746c906 [SPARK-43949][PYTHON] Upgrade cloudpickle to 2.2.1 This PR proposes to upgrade Cloudpickle from 2.2.0 to 2.2.1. Cloudpickle 2.2.1 has a fix (https://github.com/cloudpipe/cloudpickle/pull/495) for namedtuple issue (https://github.com/cloudpipe/cloudpickle/issues/460). PySpark relies on namedtuple heavily especially for RDD. We should upgrade and fix it. Yes, see https://github.com/cloudpipe/cloudpickle/issues/460. Relies on cloudpickle's unittests. Existing test cases should pass too. Closes #41433 from HyukjinKwon/cloudpickle-upgrade. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 085dfeb2bed61f6d43d9b99b299373e797ac8f17) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 June 2023, 10:39:24 UTC
a2c915d [SPARK-43760][SQL][3.4] Nullability of scalar subquery results ### What changes were proposed in this pull request? Backport of https://github.com/apache/spark/pull/41287. Makes sure that the results of scalar subqueries are declared as nullable. ### Why are the changes needed? This is an existing correctness bug, see https://issues.apache.org/jira/browse/SPARK-43760 ### Does this PR introduce _any_ user-facing change? Fixes a correctness issue, so it is user-facing. ### How was this patch tested? Query tests. Closes #41408 from agubichev/spark-43760-nullability-branch-3.4. Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 June 2023, 07:10:11 UTC
0e1401d [SPARK-43894][PYTHON] Fix bug in df.cache() ### What changes were proposed in this pull request? Previously calling `df.cache()` would result in an invalid plan input exception because we did not invoke `persist()` with the right arguments. This patch simplifies the logic and makes it compatible to the behavior in Spark itself. ### Why are the changes needed? Bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #41399 from grundprinzip/df_cache. Authored-by: Martin Grund <martin.grund@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit d3f76c6ca07a7a11fd228dde770186c0fbc3f03f) Signed-off-by: Herman van Hovell <herman@databricks.com> 31 May 2023, 15:56:22 UTC
efae536 [SPARK-42421][CORE] Use the utils to get the switch for dynamic allocation used in local checkpoint ### What changes were proposed in this pull request? Use the utils to get the switch for dynamic allocation used in local checkpoint. ### Why are the changes needed? In RDD's local checkpoint, the dynamic allocation switch was only retrieved from the configuration, without the checks for local master and testing mode that are unified in Utils. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Used the existing UTs Closes #39998 from jiwq/SPARK-42421. Authored-by: Wanqiang Ji <jiwq@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit c4b880f8d3936c1b7c8ddd9621ab392c70ace1e7) Signed-off-by: Kent Yao <yao@apache.org> 29 May 2023, 02:47:32 UTC
b06f0f1 [SPARK-43802][SQL][3.4] Fix codegen for unhex and unbase64 with failOnError=true ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/41317. Fixes an error with codegen for unhex and unbase64 expression when failOnError is enabled introduced in https://github.com/apache/spark/pull/37483. ### Why are the changes needed? Codegen fails and Spark falls back to interpreted evaluation: ``` Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 1: Unknown variable or type "BASE64" ``` in the code block: ``` /* 107 */ if (!org.apache.spark.sql.catalyst.expressions.UnBase64.isValidBase64(project_value_1)) { /* 108 */ throw QueryExecutionErrors.invalidInputInConversionError( /* 109 */ ((org.apache.spark.sql.types.BinaryType$) references[1] /* to */), /* 110 */ project_value_1, /* 111 */ BASE64, /* 112 */ "try_to_binary"); /* 113 */ } ``` ### Does this PR introduce _any_ user-facing change? Bug fix. ### How was this patch tested? Added to the existing tests so evaluate an expression with failOnError enabled to test that path of the codegen. Closes #41334 from Kimahriman/to-binary-codegen-backport. Authored-by: Adam Binford <adamq43@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 May 2023, 02:30:14 UTC
f8a2498 [SPARK-43751][SQL][DOC] Document `unbase64` behavior change ### What changes were proposed in this pull request? After SPARK-37820, `select unbase64("abcs==")` (malformed input) always throws an exception; this PR does not help in that case, it only improves the error message for `to_binary()`. So, `unbase64()`'s behavior for malformed input changed silently after SPARK-37820: - before: return a best-effort result, because it uses the [LENIENT](https://github.com/apache/commons-codec/blob/rel/commons-codec-1.15/src/main/java/org/apache/commons/codec/binary/Base64InputStream.java#L46) policy: any trailing bits are composed into 8-bit bytes where possible. The remainder are discarded. - after: throw an exception And there is no way to restore the previous behavior. To tolerate the malformed input, the user should migrate `unbase64(<input>)` to `try_to_binary(<input>, 'base64')` to get NULL instead of being interrupted by an exception. ### Why are the changes needed? Add the behavior change to the migration guide. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Manual review. Closes #41280 from pan3793/SPARK-43751. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit af6c1ec7c795584c28e15e4963eed83917e2f06a) Signed-off-by: Kent Yao <yao@apache.org> 26 May 2023, 03:33:56 UTC
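In spark-shell, the migration suggested above looks roughly like this (a sketch of the documented behavior, assuming an active `spark` session; not a test from the PR):

```scala
// Malformed base64 input: after SPARK-37820 this throws an exception
// instead of returning a best-effort result.
spark.sql("SELECT unbase64('abcs==')").show()

// Tolerant replacement: returns NULL for malformed input.
spark.sql("SELECT try_to_binary('abcs==', 'base64')").show()
```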
7480f37 [SPARK-43759][SQL][PYTHON] Expose TimestampNTZType in pyspark.sql.types ### What changes were proposed in this pull request? Exposes `TimestampNTZType` in `pyspark.sql.types`. ```py >>> from pyspark.sql.types import * >>> >>> TimestampNTZType() TimestampNTZType() ``` ### Why are the changes needed? `TimestampNTZType` is missing in `_all_` list in `pyspark.sql.types`. ```py >>> from pyspark.sql.types import * >>> >>> TimestampNTZType() Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'TimestampNTZType' is not defined ``` ### Does this PR introduce _any_ user-facing change? Users won't need to explicitly import `TimestampNTZType` but wildcard will work. ### How was this patch tested? Existing tests. Closes #41286 from ueshin/issues/SPARK-43759/timestamp_ntz. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 1f04bb6320dd11a81a43f2623670c8dea10d72ae) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 24 May 2023, 04:59:20 UTC
1863e47 [SPARK-43758][BUILD][FOLLOWUP][3.4] Update Hadoop 2 dependency manifest ### What changes were proposed in this pull request? This Is a follow-up of https://github.com/apache/spark/pull/41285. ### Why are the changes needed? When merging to branch-3.4, `hadoop-2` dependency manifest is missed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #41291 from dongjoon-hyun/SPARK-43758. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 May 2023, 04:44:45 UTC
7bf0b7a [SPARK-43758][BUILD] Upgrade snappy-java to 1.1.10.0 This PR upgrades `snappy-java` version to 1.1.10.0 from 1.1.9.1. The new `snappy-java` version fixes a potential issue for Graviton support when used with old GLIBC versions. See https://github.com/xerial/snappy-java/issues/417. No Existing tests. Closes #41285 from sunchao/snappy-java. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 1e17c86f77893b02a1b304800cd17f814f979dd2) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 May 2023, 02:51:26 UTC
a855e33 [MINOR][PS][TESTS] Fix `SeriesDateTimeTests.test_quarter` to work properly ### What changes were proposed in this pull request? This PR proposes to fix `SeriesDateTimeTests.test_quarter` to work properly. ### Why are the changes needed? The test has not been testing properly. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested, and the existing CI should pass. Closes #41274 from itholic/minor_quarter_test. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a5c53384def22b01b8ef28bee6f2d10648bce1a1) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2023, 07:04:04 UTC
05039f3 [SPARK-43719][WEBUI] Handle `missing row.excludedInStages` field ### What changes were proposed in this pull request? This PR aims to handle a corner case when `row.excludedInStages` field is missing. ### Why are the changes needed? To fix the following type error when Spark loads some very old 2.4.x or 3.0.x logs. ![missing](https://github.com/apache/spark/assets/9700541/f402df10-bf92-4c9f-8bd1-ec9b98a67966) We have two places and this PR protects both places. ``` $ git grep row.excludedInStages core/src/main/resources/org/apache/spark/ui/static/executorspage.js: if (typeof row.excludedInStages === "undefined" || row.excludedInStages.length == 0) { core/src/main/resources/org/apache/spark/ui/static/executorspage.js: return "Active (Excluded in Stages: [" + row.excludedInStages.join(", ") + "])"; ``` ### Does this PR introduce _any_ user-facing change? No, this will remove the error case only. ### How was this patch tested? Manual review. Closes #41266 from dongjoon-hyun/SPARK-43719. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit eeab2e701330f7bc24e9b09ce48925c2c3265aa8) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2023, 05:17:22 UTC
4cae2ad [SPARK-43718][SQL] Set nullable correctly for keys in USING joins ### What changes were proposed in this pull request? In `Anaylzer#commonNaturalJoinProcessing`, set nullable correctly when adding the join keys to the Project's hidden_output tag. ### Why are the changes needed? The value of the hidden_output tag will be used for resolution of attributes in parent operators, so incorrect nullabilty can cause problems. For example, assume this data: ``` create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); ``` The following query produces incorrect results: ``` spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> ``` Similar issues occur with right outer join and left outer join. `t1.c1` and `t2.c1` have the wrong nullability at the time the array is resolved, so the array's `containsNull` value is incorrect. `UpdateNullability` will update the nullability of `t1.c1` and `t2.c1` in the `CreateArray` arguments, but will not update `containsNull` in the function's data type. Queries that don't use arrays also can get wrong results. Assume this data: ``` create or replace temp view t1 as values (0), (1), (2) as (c1); create or replace temp view t2 as values (1), (2), (3) as (c1); create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b); ``` The following query produces incorrect results: ``` select t1.c1 as t1_c1, t2.c1 as t2_c1, b from t1 full outer join t2 using (c1), lateral ( select b from t3 where a = coalesce(t2.c1, 1) ) lt3; 1 1 2 NULL 3 4 Time taken: 2.395 seconds, Fetched 2 row(s) spark-sql (default)> ``` The result should be the following: ``` 0 NULL 2 1 1 2 NULL 3 4 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. Closes #41267 from bersprockets/using_anomaly. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 217b30a4f7ca18ade19a9552c2b87dd4caeabe57) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2023, 04:47:17 UTC
1db2f5c [SPARK-43589][SQL] Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString` This PR aims to fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString` instead of shift operations. To avoid user confusion by giving more accurate values. For example, `maxBroadcastTableBytes` is 1GB and `dataSize` is `2GB - 1 byte`. **BEFORE** ``` Cannot broadcast the table that is larger than 1GB: 1 GB. ``` **AFTER** ``` Cannot broadcast the table that is larger than 1024.0 MiB: 2048.0 MiB. ``` Yes, but only error message. Pass the CIs with newly added test case. Closes #41232 from dongjoon-hyun/SPARK-43589. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 04759570395c99bc17961742733d06e19a917abd) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 May 2023, 09:21:37 UTC
4448c76 [SPARK-43587][CORE][TESTS] Run `HealthTrackerIntegrationSuite` in a dedicated JVM ### What changes were proposed in this pull request? This PR aims to run `HealthTrackerIntegrationSuite` in a dedicated JVM to mitigate a flaky tests. ### Why are the changes needed? `HealthTrackerIntegrationSuite` has been flaky and SPARK-25400 and SPARK-37384 increased the timeout `from 1s to 10s` and `10s to 20s`, respectively. The usual suspect of this flakiness is some unknown side-effect like GCs. In this PR, we aims to run this in a separate JVM instead of increasing the timeout more. https://github.com/apache/spark/blob/abc140263303c409f8d4b9632645c5c6cbc11d20/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala#L56-L58 This is the recent failure. - https://github.com/apache/spark/actions/runs/5020505360/jobs/9002039817 ``` [info] HealthTrackerIntegrationSuite: [info] - If preferred node is bad, without excludeOnFailure job will fail (92 milliseconds) [info] - With default settings, job can succeed despite multiple bad executors on node (3 seconds, 163 milliseconds) [info] - Bad node with multiple executors, job will still succeed with the right confs *** FAILED *** (20 seconds, 43 milliseconds) [info] java.util.concurrent.TimeoutException: Futures timed out after [20 seconds] [info] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259) [info] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187) [info] at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:355) [info] at org.apache.spark.scheduler.SchedulerIntegrationSuite.awaitJobTermination(SchedulerIntegrationSuite.scala:276) [info] at org.apache.spark.scheduler.HealthTrackerIntegrationSuite.$anonfun$new$9(HealthTrackerIntegrationSuite.scala:92) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #41229 from dongjoon-hyun/SPARK-43587. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit eb2456bce2522779bf6b866a5fbb728472d35097) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 May 2023, 06:31:29 UTC
da8f5a6 [SPARK-43541][SQL][3.4] Propagate all `Project` tags in resolving of expressions and missing columns ### What changes were proposed in this pull request? In the PR, I propose to propagate all tags in a `Project` while resolving of expressions and missing columns in `ColumnResolutionHelper.resolveExprsAndAddMissingAttrs()`. This is a backport of https://github.com/apache/spark/pull/41204. ### Why are the changes needed? To fix the bug reproduced by the query below: ```sql spark-sql (default)> WITH > t1 AS (select key from values ('a') t(key)), > t2 AS (select key from values ('a') t(key)) > SELECT t1.key > FROM t1 FULL OUTER JOIN t2 USING (key) > WHERE t1.key NOT LIKE 'bb.%'; [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `t1`.`key` cannot be resolved. Did you mean one of the following? [`key`].; line 4 pos 7; ``` ### Does this PR introduce _any_ user-facing change? No. It fixes a bug, and outputs the expected result: `a`. ### How was this patch tested? By new test added to `using-join.sql`: ``` $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z using-join.sql" ``` and the related test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.HiveContextCompatibilitySuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 09d5742a8679839d0846f50e708df98663a6d64c) Closes #41220 from MaxGekk/fix-using-join-3.4. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 May 2023, 22:10:39 UTC
079594a Revert "[SPARK-43313][SQL] Adding missing column DEFAULT values for MERGE INSERT actions" This reverts commit 3a0e6bde2aaa11e1165f4fde040ff02e1743795e. 18 May 2023, 10:39:38 UTC
5937a97 [SPARK-43450][SQL][TESTS] Add more `_metadata` filter test cases Add additional tests for metadata filters following up after f0a901058aaec8496ec9426c5811d9dea66c195d To cover more edge cases in testing No - Closes #39608 from olaky/additional-tests-for-metadata-filtering. Authored-by: Ole Sasse <ole.sasse@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit bc801c1777994c5cb14ce21533c4053281df5aa8) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 May 2023, 07:09:35 UTC
a18d71a [SPARK-43522][SQL] Fix creating struct column name with index of array ### What changes were proposed in this pull request? When creating a struct column in a DataFrame, the code that ran without problems in version 3.3.1 does not work in version 3.4.0. In 3.3.1 ```scala val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, ",")).withColumn("map_entry", transform(col("key_value"), x => struct(split(x, "=").getItem(0), split(x, "=").getItem(1) ) )) testDF.show() +-----------+---------------+--------------------+ | value| key_value| map_entry| +-----------+---------------+--------------------+ |a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...| +-----------+---------------+--------------------+ ``` In 3.4.0 ``` org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot resolve "struct(split(namedlambdavariable(), =, -1)[0], split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only foldable `STRING` expressions are allowed to appear at odd position, but they are ["0", "1"].; 'Project [value#41, key_value#45, transform(key_value#45, lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48] +- Project [value#41, split(value#41, ,, -1) AS key_value#45] +- LocalRelation [value#41] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) .... ``` The reason is that `CreateNamedStruct` uses the last expr of the value `Expression` as the column name and checks that it is a `String`. But an array `Expression`'s last expr is an `Integer`, so the check fails. We can skip the match with `UnresolvedExtractValue` when the last expr is not a `String`; it will then fall back to the default name. ### Why are the changes needed? Fix the bug when creating a struct column name with an index of an array. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add new test Closes #41187 from Hisoka-X/SPARK-43522_struct_name_array. 
Authored-by: Jia Fan <fanjiaeminem@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f2a29176de6e0b1628de6ca962cbf5036b145e0a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 May 2023, 06:29:46 UTC
5cc4c5d [SPARK-43157][SQL] Clone InMemoryRelation cached plan to prevent cloned plan from referencing same objects ### What changes were proposed in this pull request? This is the most narrow fix for the issue observed in SPARK-43157. It does not attempt to identify or solve all potential correctness and concurrency issues from TreeNode.tags being modified in multiple places. It solves the issue described in SPARK-43157 by cloning the cached plan when populating `InMemoryRelation.innerChildren`. I chose to do the clone at this point to limit the scope to tree traversal used for building up the string representation of the plan, which is where we see the issue. I do not see any other uses for `TreeNode.innerChildren`. I did not clone any earlier because the caching objects have mutable state that I wanted to avoid touching to be extra safe. Another solution I tried was to modify `InMemoryRelation.clone` to create a new `CachedRDDBuilder` and pass in a cloned `cachedPlan`. I opted not to go with this approach because `CachedRDDBuilder` has mutable state that needs to be moved to the new object and I didn't want to add that complexity if not needed. ### Why are the changes needed? When caching is used the cached part of the SparkPlan is leaked to new clones of the plan. This leakage is an issue because if the TreeNode.tags are modified in one plan, it impacts the other plan. This is a correctness issue and a concurrency issue if the TreeNode.tags are set in different threads for the cloned plans. See the description of [SPARK-43157](https://issues.apache.org/jira/browse/SPARK-43157) for an example of the concurrency issue. ### Does this PR introduce _any_ user-facing change? Yes. It fixes a driver hanging issue the user can observe. ### How was this patch tested? Unit test added and I manually verified `Dataset.explain("formatted")` still had the expected output. ```scala spark.range(10).cache.filter($"id" > 5).explain("formatted") == Physical Plan == * Filter (4) +- InMemoryTableScan (1) +- InMemoryRelation (2) +- * Range (3) (1) InMemoryTableScan Output [1]: [id#0L] Arguments: [id#0L], [(id#0L > 5)] (2) InMemoryRelation Arguments: [id#0L], CachedRDDBuilder(org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer418b946b,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) Range (0, 10, step=1, splits=16) ,None), [id#0L ASC NULLS FIRST] (3) Range [codegen id : 1] Output [1]: [id#0L] Arguments: Range (0, 10, step=1, splits=Some(16)) (4) Filter [codegen id : 1] Input [1]: [id#0L] Condition : (id#0L > 5) ``` I also verified that the `InMemory.innerChildren` is cloned when the entire plan is cloned. 
```scala import org.apache.spark.sql.execution.SparkPlan import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec import spark.implicits._ def findCacheOperator(plan: SparkPlan): Option[InMemoryTableScanExec] = { if (plan.isInstanceOf[InMemoryTableScanExec]) { Some(plan.asInstanceOf[InMemoryTableScanExec]) } else if (plan.children.isEmpty && plan.subqueries.isEmpty) { None } else { (plan.subqueries.flatMap(p => findCacheOperator(p)) ++ plan.children.flatMap(findCacheOperator)).headOption } } val df = spark.range(10).filter($"id" < 100).cache() val df1 = df.limit(1) val df2 = df.limit(1) // Get the cache operator (InMemoryTableScanExec) in each plan val plan1 = findCacheOperator(df1.queryExecution.executedPlan).get val plan2 = findCacheOperator(df2.queryExecution.executedPlan).get // Check if InMemoryTableScanExec references point to the same object println(plan1.eq(plan2)) // returns false// Check if InMemoryRelation references point to the same object println(plan1.relation.eq(plan2.relation)) // returns false // Check if the cached SparkPlan references point to the same object println(plan1.relation.innerChildren.head.eq(plan2.relation.innerChildren.head)) // returns false // This shows the issue is fixed ``` Closes #40812 from robreeves/roreeves/explain_util. Authored-by: Rob Reeves <roreeves@linkedin.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5e5999538899732bf3cdd04b974f1abeb949ccd0) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 May 2023, 06:25:20 UTC
b0f5a7a [SPARK-43547][3.4][PS][DOCS] Update "Supported Pandas API" page to point out the proper pandas docs ### What changes were proposed in this pull request? This PR proposes to fix the [Supported pandas API](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html#supported-pandas-api) page to point out the proper pandas version. ### Why are the changes needed? Currently we support pandas 1.5, but the page says we support the latest pandas, which is 2.0.1. ### Does this PR introduce _any_ user-facing change? No, it's a documentation change. ### How was this patch tested? The existing CI should pass. Closes #41208 from itholic/update_supported_api. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 May 2023, 03:04:51 UTC
f89e520 [SPARK-42826][3.4][FOLLOWUP][PS][DOCS] Update migration notes for pandas API on Spark ### What changes were proposed in this pull request? This is a follow-up for https://github.com/apache/spark/pull/40459 to fix incorrect information and to elaborate on the changes in more detail. - We do not fully support pandas 2.0.0, so the statement "Pandas API on Spark follows for the pandas 2.0" is not correct. - We should list all the APIs that no longer support the `inplace` parameter. ### Why are the changes needed? Correctness of the migration notes. ### Does this PR introduce _any_ user-facing change? No, only updating migration notes. ### How was this patch tested? The existing CI should pass. Closes #41207 from itholic/migration_guide_followup. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 May 2023, 03:02:26 UTC
7b148f0 [SPARK-43527][PYTHON] Fix `catalog.listCatalogs` in PySpark ### What changes were proposed in this pull request? Fix `catalog.listCatalogs` in PySpark ### Why are the changes needed? existing implementation outputs incorrect results ### Does this PR introduce _any_ user-facing change? yes before this PR: ``` In [1]: spark.catalog.listCatalogs() Out[1]: [CatalogMetadata(name=<py4j.java_gateway.JavaMember object at 0x1031f08b0>, description=<py4j.java_gateway.JavaMember object at 0x1049ac2e0>)] ``` after this PR: ``` In [1]: spark.catalog.listCatalogs() Out[1]: [CatalogMetadata(name='spark_catalog', description=None)] ``` ### How was this patch tested? added doctest Closes #41186 from zhengruifeng/py_list_catalog. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a232083f50ddfdc81f2027fd3ffa89dfaa3ba199) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 May 2023, 23:31:36 UTC
f68ece9 [SPARK-43043][CORE] Improve the performance of MapOutputTracker.updateMapOutput ### What changes were proposed in this pull request? The PR changes the implementation of MapOutputTracker.updateMapOutput() to search for the MapStatus under the help of a mapping from mapId to mapIndex, previously it was performing a linear search, which would become performance bottleneck if a large proportion of all blocks in the map are migrated. ### Why are the changes needed? To avoid performance bottleneck when block decommission is enabled and a lot of blocks are migrated within a short time window. ### Does this PR introduce _any_ user-facing change? No, it's pure performance improvement. ### How was this patch tested? Manually test. Closes #40690 from jiangxb1987/SPARK-43043. Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Co-authored-by: Jiang Xingbo <jiangxb1987@gmail.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> (cherry picked from commit 66a2eb8f8957c22c69519b39be59beaaf931822b) Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> 16 May 2023, 18:34:57 UTC
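The shape of the change is a classic index-map lookup; here is a schematic sketch with made-up field names (the real `MapOutputTracker` state is more involved):

```scala
// Simplified stand-in for a shuffle's map statuses.
final case class Status(mapId: Long, location: String)

val statuses: Array[Status] =
  Array.tabulate(200000)(i => Status(i.toLong, s"host-${i % 64}"))

// Before: a linear scan per updated block -- O(n) each time, which becomes
// a bottleneck when decommissioning migrates a large fraction of the blocks.
def findLinear(mapId: Long): Option[Status] =
  statuses.find(_.mapId == mapId)

// After: build a mapId -> index map once, then each lookup is O(1).
val mapIdToIndex: Map[Long, Int] =
  statuses.iterator.zipWithIndex.map { case (s, i) => s.mapId -> i }.toMap

def findIndexed(mapId: Long): Option[Status] =
  mapIdToIndex.get(mapId).map(i => statuses(i))
```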
792680e [SPARK-43517][PYTHON][DOCS] Add a migration guide for namedtuple monkey patch ### What changes were proposed in this pull request? This PR proposes to add a migration guide for https://github.com/apache/spark/pull/38700. ### Why are the changes needed? To guide users about the workaround of bringing the namedtuple patch back. ### Does this PR introduce _any_ user-facing change? Yes, it adds the migration guides for end-users. ### How was this patch tested? CI in this PR will test it out. Closes #41177 from HyukjinKwon/update-migration-namedtuple. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 23c072d2a0ef046f45893d9a13f5788e6ec09ea5) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 May 2023, 02:16:36 UTC
ee12c81 [SPARK-43281][SQL] Fix concurrent writer does not update file metrics ### What changes were proposed in this pull request? `DynamicPartitionDataConcurrentWriter` uses the temp file path to get the file status after the commit task. However, the temp file has already been moved to a new path during the commit task. This PR calls `closeFile` before the commit task. ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? Yes, after this PR the metrics are correct. ### How was this patch tested? Added a test. Closes #40952 from ulysses-you/SPARK-43281. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 592e92262246a6345096655270e2ca114934d0eb) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 May 2023, 01:42:37 UTC
cacaed8 [SPARK-43483][SQL][DOCS] Adds SQL references for OFFSET clause ### What changes were proposed in this pull request? Spark 3.4.0 released the new `OFFSET` clause syntax, but the SQL reference is missing a description for it. ### Why are the changes needed? Adds a SQL reference for the `OFFSET` clause. ### Does this PR introduce _any_ user-facing change? 'Yes'. Users can find the SQL reference for the `OFFSET` clause. ### How was this patch tested? Manual verification. ![image](https://github.com/apache/spark/assets/8486025/55398194-5193-45eb-ac04-10f5f0793f7f) ![image](https://github.com/apache/spark/assets/8486025/fef0abc1-7dfa-44e2-b2e0-a56fa82a0817) ![image](https://github.com/apache/spark/assets/8486025/5ab9dc39-6812-45b4-a758-85668ab040f1) ![image](https://github.com/apache/spark/assets/8486025/b726abd4-daae-4de4-a78e-45120573e699) Closes #41151 from beliefer/SPARK-43483. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 May 2023, 13:06:57 UTC
6e44365 [SPARK-43471][CORE] Handle missing hadoopProperties and metricsProperties ### What changes were proposed in this pull request? This PR aims to handle a corner case where `hadoopProperties` and `metricsProperties` are null, which means they were not loaded. ### Why are the changes needed? To prevent NPE. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the newly added test suite. Closes #41145 from TQJADE/SPARK-43471. Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: Qi Tan <qi_tan@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 1dba7b803ecffd09c544009b79a6a3219f56d4e0) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 May 2023, 22:30:23 UTC
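A generic sketch of the kind of null guard such a fix needs (field and method names here are illustrative, not the actual listener code):

```scala
// Environment data reconstructed from an event log; old logs may leave
// these collections null rather than empty.
final case class EnvironmentInfo(
    hadoopProperties: Seq[(String, String)],
    metricsProperties: Seq[(String, String)])

// Normalize a possibly-null collection before use to avoid an NPE.
def safeProps(props: Seq[(String, String)]): Seq[(String, String)] =
  Option(props).getOrElse(Seq.empty)

def propertyCount(env: EnvironmentInfo): Int =
  safeProps(env.hadoopProperties).size + safeProps(env.metricsProperties).size
```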
8c4ff5b [SPARK-43441][CORE] `makeDotNode` should not fail when DeterministicLevel is absent ### What changes were proposed in this pull request? This PR aims to make `makeDotNode` method handle the missing `DeterministicLevel`. ### Why are the changes needed? Some old Spark generated data do not have that field. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with newly added test case. Closes #41124 from TQJADE/SPARK-43441. Lead-authored-by: Qi Tan <qi_tan@apple.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit cfd42355ccbdf34ca9bd5a125c4310b9da14100c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 May 2023, 04:20:04 UTC
95af6f6 [SPARK-43425][SQL][3.4] Add `TimestampNTZType` to `ColumnarBatchRow` ### What changes were proposed in this pull request? Noticed this one was missing when implementing `TimestampNTZType` in Iceberg. ### Why are the changes needed? To be able to read `TimestampNTZType` using the `ColumnarBatchRow`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Closes #41103 from Fokko/fd-add-timestampntz. Authored-by: Fokko Driesprong <fokko@tabular.io> Closes #41112 from Fokko/fd-backport. Authored-by: Fokko Driesprong <fokko@tabular.io> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 11 May 2023, 00:44:39 UTC
10fd9b4 [SPARK-43414][TESTS] Fix flakiness in Kafka RDD suites due to port binding configuration issue ### What changes were proposed in this pull request? This PR addresses a test flakiness issue in Kafka connector RDD suites https://github.com/apache/spark/pull/34089#pullrequestreview-872739182 (Spark 3.4.0) upgraded Spark to Kafka 3.1.0, which requires a different configuration key for configuring the broker listening port. That PR updated the `KafkaTestUtils.scala` used in SQL tests, but there's a near-duplicate of that code in a different `KafkaTestUtils.scala` used by RDD API suites which wasn't updated. As a result, the RDD suites began using Kafka's default port 9092 and this results in flakiness as multiple concurrent suites hit port conflicts when trying to bind to that default port. This PR fixes that by simply copying the updated configuration from the SQL copy of `KafkaTestUtils.scala`. ### Why are the changes needed? Fix test flakiness due to port conflicts. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran 20 concurrent copies of `org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite` in my CI environment and confirmed that this PR's changes resolve the test flakiness. Closes #41095 from JoshRosen/update-kafka-test-utils-to-fix-port-binding-flakiness. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 175fcfd2106652baff53fdfb1be84638a4f6a9c9) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 May 2023, 23:06:22 UTC
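The underlying idea of the fix, sketched with plain Kafka broker properties. This is illustrative only; the exact keys and helper code in Spark's `KafkaTestUtils.scala` may differ. Binding the test broker to an ephemeral port instead of the default 9092 means concurrently running suites cannot collide on the same port.

```scala
import java.util.Properties

// Port 0 asks the OS for a free port; Kafka 3.x reads the bind address from the
// `listeners` key rather than the older `port` key, which is the configuration
// style the updated test utils rely on.
val brokerProps = new Properties()
brokerProps.put("broker.id", "0")
brokerProps.put("listeners", "PLAINTEXT://127.0.0.1:0")
```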
6662441 [SPARK-43342][K8S] Revert SPARK-39006 Show a directional error message for executor PVC dynamic allocation failure ### What changes were proposed in this pull request? This reverts commit b065c945fe27dd5869b39bfeaad8e2b23a8835b5. ### Why are the changes needed? To remove the regression from SPARK-39006. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #41057 Closes #41069 from dongjoon-hyun/SPARK-43342. Authored-by: Qian.Sun <qian.sun2020@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3ba1fa3678a4fcc0aaba8abb0d4312e8fb42efba) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 May 2023, 01:26:56 UTC
cd2a6f3 [SPARK-43395][BUILD] Exclude macOS tar extended metadata in make-distribution.sh ### What changes were proposed in this pull request? Add args `--no-mac-metadata --no-xattrs --no-fflags` to `tar` on macOS in `dev/make-distribution.sh` to exclude macOS-specific extended metadata. ### Why are the changes needed? The binary tarball created on macOS includes extended macOS-specific metadata and xattrs, which causes warnings when unarchiving it on Linux. Step to reproduce 1. create tarball on macOS (13.3.1) ``` ➜ apache-spark git:(master) tar --version bsdtar 3.5.3 - libarchive 3.5.3 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.8 ``` ``` ➜ apache-spark git:(master) dev/make-distribution.sh --tgz ``` 2. unarchive the binary tarball on Linux (CentOS-7) ``` ➜ ~ tar --version tar (GNU tar) 1.26 Copyright (C) 2011 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>. This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Written by John Gilmore and Jay Fenlason. ``` ``` ➜ ~ tar -xzf spark-3.5.0-SNAPSHOT-bin-3.3.5.tgz tar: Ignoring unknown extended header keyword `SCHILY.fflags' tar: Ignoring unknown extended header keyword `LIBARCHIVE.xattr.com.apple.FinderInfo' ``` ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Create binary tarball on macOS then unarchive on Linux, warnings disappear after this change. Closes #41074 from pan3793/SPARK-43395. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 2d0240df3c474902e263f67b93fb497ca13da00f) Signed-off-by: Sean Owen <srowen@gmail.com> 06 May 2023, 14:38:07 UTC
73437ce [SPARK-43374][INFRA] Move protobuf-java to BSD 3-clause group and update the license copy ### What changes were proposed in this pull request? protobuf-java is licensed under the BSD 3-clause, not the 2 we claimed. And the copy should be updated via https://github.com/protocolbuffers/protobuf/commit/9e080f7ac007b75dacbd233b214e5c0cb2e48e0f ### Why are the changes needed? fix license issue ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Existing tests. Closes #41043 from yaooqinn/SPARK-43374. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 4351d414ab03643fdb6ec18b8e8a4ac718a9a2b2) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 05 May 2023, 23:07:13 UTC
0cc21ba [SPARK-43340][CORE] Handle missing stack-trace field in eventlogs This PR fixes a regression introduced by #36885 which broke JsonProtocol's ability to handle missing fields from exception field. old eventlogs missing a `Stack Trace` will raise a NPE. As a result, SHS misinterprets failed-jobs/SQLs as `Active/Incomplete` This PR solves this problem by checking the JsonNode for null. If it is null, an empty array of `StackTraceElements` Fix a case which prevents the history server from identifying failed jobs if the stacktrace was not set. Example eventlog ``` { "Event":"SparkListenerJobEnd", "Job ID":31, "Completion Time":1616171909785, "Job Result":{ "Result":"JobFailed", "Exception":{ "Message":"Job aborted" } } } ``` **Original behavior:** The job is marked as `incomplete` Error from the SHS logs: ``` 23/05/01 21:57:16 INFO FsHistoryProvider: Parsing file:/tmp/nds_q86_fail_test to re-build UI... 23/05/01 21:57:17 ERROR ReplayListenerBus: Exception parsing Spark event log: file:/tmp/nds_q86_fail_test java.lang.NullPointerException at org.apache.spark.util.JsonProtocol$JsonNodeImplicits.extractElements(JsonProtocol.scala:1589) at org.apache.spark.util.JsonProtocol$.stackTraceFromJson(JsonProtocol.scala:1558) at org.apache.spark.util.JsonProtocol$.exceptionFromJson(JsonProtocol.scala:1569) at org.apache.spark.util.JsonProtocol$.jobResultFromJson(JsonProtocol.scala:1423) at org.apache.spark.util.JsonProtocol$.jobEndFromJson(JsonProtocol.scala:967) at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:878) at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:865) .... 23/05/01 21:57:17 ERROR ReplayListenerBus: Malformed line #24368: {"Event":"SparkListenerJobEnd","Job ID":31,"Completion Time":1616171909785,"Job Result":{"Result":"JobFailed","Exception": {"Message":"Job aborted"} }} ``` **After the fix:** Job 31 is marked as `failedJob` No. Added new unit test in JsonProtocolSuite. Closes #41050 from amahussein/aspark-43340-b. Authored-by: Ahmed Hussein <ahussein@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit dcd710d3e12f6cc540cea2b8c747bb6b61254504) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 05 May 2023, 23:01:56 UTC
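A null-safety sketch of the idea, not the actual `JsonProtocol` code: `parseFrame` is a hypothetical helper, and the `Stack Trace` field name follows the event log snippet above. An `Exception` object without a `Stack Trace` field decodes to an empty stack trace instead of raising an NPE.

```scala
import com.fasterxml.jackson.databind.JsonNode

def stackTraceFromJson(
    exceptionJson: JsonNode,
    parseFrame: JsonNode => StackTraceElement): Array[StackTraceElement] = {
  // Jackson's get() returns null when the field is absent in old event logs.
  val frames = exceptionJson.get("Stack Trace")
  if (frames == null || frames.isNull) {
    Array.empty[StackTraceElement]
  } else {
    val it = frames.elements()
    val out = Array.newBuilder[StackTraceElement]
    while (it.hasNext) out += parseFrame(it.next())
    out.result()
  }
}
```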
e04025e [SPARK-43337][UI][3.4] Asc/desc arrow icons for sorting column does not get displayed in the table column ### What changes were proposed in this pull request? Remove css `!important` tag for asc/desc arrow icons in jquery.dataTables.1.10.25.min.css ### Why are the changes needed? Upgrading to DataTables 1.10.25 broke asc/desc arrow icons for sorting column. The sorting icon is not displayed when the column is clicked to sort by asc/desc. This is because the new DataTables 1.10.25's jquery.dataTables.1.10.25.min.css file added `!important` rule preventing the override set in webui-dataTables.css ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. ![image](https://user-images.githubusercontent.com/52679095/236394863-e0004e7b-5173-495a-af23-32c1343e0ee6.png) ![image](https://user-images.githubusercontent.com/52679095/236394879-db0e5e0e-f6b3-48c3-9c79-694dd9abcb76.png) Closes #41061 from maytasm/fix-arrow-4. Authored-by: Maytas Monsereenusorn <maytasm@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> 05 May 2023, 15:25:46 UTC
b35e42f [SPARK-43284][SQL][FOLLOWUP] Return URI encoded path, and add a test Returns the URI-encoded path and adds a test, fixing a regression in Spark 3.4. Yes, fixes a regression in `_metadata.file_path`. New explicit test. Closes #41054 from databricks-david-lewis/SPARK-43284-2. Lead-authored-by: David Lewis <david.lewis@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 79d5d908e5dd7d6ab6755c46d4de3ed2fcdf9e6b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 May 2023, 08:57:52 UTC
8c6442f [SPARK-43284] Switch back to url-encoded strings Update `_metadata.file_path` and `_metadata.file_name` to return url-encoded strings, rather than un-encoded strings. This was a regression introduced in Spark 3.4.0. This was an inadvertent behavior change. Yes, fix regression! New test added to validate that the `file_path` and `path_name` are returned as encoded strings. Closes #40947 from databricks-david-lewis/SPARK-43284. Authored-by: David Lewis <david.lewis@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1a777967ed109b0838793588c330a2e404627fb1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 May 2023, 02:56:44 UTC
b51e860 [SPARK-43378][CORE] Properly close stream objects in deserializeFromChunkedBuffer ### What changes were proposed in this pull request? Fixes a bug where SerializerHelper.deserializeFromChunkedBuffer does not call close on the deserialization stream. For some serializers, like Kryo, this creates a performance regression as the Kryo instances are not returned to the pool. ### Why are the changes needed? This causes a performance regression in Spark 3.4.0 for some workloads. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #41049 from eejbyfeldt/SPARK-43378. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit cb26ad88c522070c66e979ab1ab0f040cd1bdbe7) Signed-off-by: Sean Owen <srowen@gmail.com> 05 May 2023, 00:34:22 UTC
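The essence of the fix reduced to a generic, hedged sketch; the helper below is illustrative, not the real `SerializerHelper` API. The point is to close the deserialization stream even on the success path, so pooled resources such as Kryo instances are returned.

```scala
import java.io.Closeable

// Run `read` against a stream and guarantee the stream is closed afterwards,
// returning borrowed serializer instances (e.g. Kryo) to their pool.
def readAndClose[S <: Closeable, T](stream: S)(read: S => T): T =
  try read(stream) finally stream.close()
```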
09fb739 [SPARK-43313][SQL] Adding missing column DEFAULT values for MERGE INSERT actions ### What changes were proposed in this pull request? This PR updates column DEFAULT assignment to add missing values for MERGE INSERT actions. This brings the behavior to parity with non-MERGE INSERT commands. * It also adds a small convenience feature where if the provided default value is a literal of a wider type than the target column, but the literal value fits within the narrower type, just coerce it for convenience. For example, `CREATE TABLE t (col INT DEFAULT 42L)` returns an error before this PR because `42L` has a long integer type which is wider than `col`, but after this PR we just coerce it to `42` since the value fits within the short integer range. * We also add the `SupportsCustomSchemaWrite` interface which tables may implement to exclude certain pseudocolumns from consideration when resolving column DEFAULT values. ### Why are the changes needed? These changes make column DEFAULT values more usable in more types of situations. ### Does this PR introduce _any_ user-facing change? Yes, see above. ### How was this patch tested? This PR adds new unit test coverage. Closes #40996 from dtenedor/merge-actions. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 3a0e6bde2aaa11e1165f4fde040ff02e1743795e) Signed-off-by: Gengliang Wang <gengliang@apache.org> 04 May 2023, 20:18:27 UTC
f4e53fa [SPARK-43336][SQL] Casting between Timestamp and TimestampNTZ requires timezone ### What changes were proposed in this pull request? Casting between Timestamp and TimestampNTZ requires a timezone since the timezone id is used in the evaluation. This PR updates the method `Cast.needsTimeZone` to include the conversion between Timestamp and TimestampNTZ. As a result: * Casting between Timestamp and TimestampNTZ is considered as unresolved unless the timezone is defined * Canonicalizing cast will include the time zone ### Why are the changes needed? Timezone id is used in the evaluation between Timestamp and TimestampNTZ, thus we should mark such conversion as "needsTimeZone" ### Does this PR introduce _any_ user-facing change? No. It is more like a fix for potential bugs that the casting between Timestamp and TimestampNTZ is marked as resolved before resolving the timezone. ### How was this patch tested? New UT Closes #41010 from gengliangwang/fixCast. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit a1c7035294ecdc10f507c58a41cb0ed7137717ca) Signed-off-by: Gengliang Wang <gengliang@apache.org> 02 May 2023, 20:09:57 UTC
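A small illustration of why the zone matters, run against a live `SparkSession` named `spark`; the chosen zones and literal are arbitrary. Because the session time zone participates in the conversion, the cast cannot be considered resolved until a zone id is attached.

```scala
// The same TIMESTAMP instant renders as different local (NTZ) values
// depending on the session time zone used to evaluate the cast.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT CAST(TIMESTAMP'2023-01-01 00:00:00Z' AS TIMESTAMP_NTZ)").show()

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT CAST(TIMESTAMP'2023-01-01 00:00:00Z' AS TIMESTAMP_NTZ)").show()
```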
47c674d [SPARK-43156][SQL][3.4] Fix `COUNT(*) is null` bug in correlated scalar subquery ### What changes were proposed in this pull request? Cherry pick fix COUNT(*) is null bug in correlated scalar subquery cherry pick from #40865 and #40946 ### Why are the changes needed? Fix COUNT(*) is null bug in correlated scalar subquery in branch 3.4 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? add new test. Closes #40977 from Hisoka-X/count_bug. Lead-authored-by: Hisoka <fanjiaeminem@qq.com> Co-authored-by: Jack Chen <jack.chen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 02 May 2023, 15:46:46 UTC
3804844 [SPARK-43293][SQL] `__qualified_access_only` should be ignored in normal columns This is a followup of https://github.com/apache/spark/pull/39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns. It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag. No new analysis test Closes #40961 from cloud-fan/col. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 021f02e02fb88bbbccd810ae000e14e0c854e2e6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 April 2023, 02:53:57 UTC
1c28ec7 [SPARK-43249][CONNECT] Fix missing stats for SQL Command ### What changes were proposed in this pull request? This patch fixes a minor issue in the code where for SQL Commands the plan metrics are not sent to the client. In addition, it renames a method to make clear that the method does not actually send anything but only creates the response object. ### Why are the changes needed? Clarity ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #40899 from grundprinzip/fix_sql_stats. Authored-by: Martin Grund <martin.grund@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 9d050539bed10e5089c3c125887a9995693733c6) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 24 April 2023, 08:33:52 UTC
b70407e [SPARK-43113][SQL][FOLLOWUP] Add comment about copying stream-side variables ### What changes were proposed in this pull request? Add a comment explaining a tricky situation involving the evaluation of stream-side variables. This is a follow-up to #40766. ### Why are the changes needed? Makes the code clearer. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #40881 from bersprockets/SPARK-43113_followup. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6d4ed13c465ae7f35753a1f2b67f78c690881a9e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 April 2023, 02:25:20 UTC
ac9594d [MINOR][CONNECT][PYTHON][DOCS] Fix the doc of parameter `num` in `DataFrame.offset` ### What changes were proposed in this pull request? fix the incorrect doc ### Why are the changes needed? the description of parameter `num` is incorrect, it actually describes the `num` in `DataFrame.{limit, tail}` ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? existing UT Closes #40872 from zhengruifeng/fix_doc_offset. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a28d2d7e900e8e2c65953fddbf32708802c4a700) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 April 2023, 01:31:04 UTC
a0804e5 Revert [SPARK-39203][SQL] Rewrite table location to absolute URI based on database URI ### What changes were proposed in this pull request? This reverts https://github.com/apache/spark/pull/36625 and its followup https://github.com/apache/spark/pull/38321 . ### Why are the changes needed? External table location can be arbitrary and has no connection with the database location. It can be wrong to qualify the external table location based on the database location. If a table written by old Spark versions does not have a qualified location, there is no way to restore it as the information is already lost. People can manually fix the table locations assuming they are under the same HDFS cluster with the database location, by themselves. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #40871 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit afd9e2cc0a73069514eef5c5eb7a3ebed8e4b8cf) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 April 2023, 01:28:30 UTC
217f174 [SPARK-37829][SQL] Dataframe.joinWith outer-join should return a null value for unmatched row ### What changes were proposed in this pull request? When doing an outer join with joinWith on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not a expected behavior, and it's a regression introduced in [this commit](https://github.com/apache/spark/commit/cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59). This pull request aims to fix the regression, note this is not a full rollback of the commit, do not add back "schema" variable. ``` case class ClassData(a: String, b: Int) val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF left.joinWith(right, left("b") === right("b"), "left_outer").collect ``` ``` Wrong results (current behavior): Array(([a,1],[null,null]), ([b,2],[x,2])) Correct results: Array(([a,1],null), ([b,2],[x,2])) ``` ### Why are the changes needed? We need to address the regression mentioned above. It results in unexpected behavior changes in the Dataframe joinWith API between versions 2.4.8 and 3.0.0+. This could potentially cause data correctness issues for users who expect the old behavior when using Spark 3.0.0+. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test (use the same test in previous [closed pull request](https://github.com/apache/spark/pull/35140), credit to Clément de Groc) Run sql-core and sql-catalyst submodules locally with ./build/mvn clean package -pl sql/core,sql/catalyst Closes #40755 from kings129/encoder_bug_fix. Authored-by: --global <xuqiang129@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 74ce620901a958a1ddd76360e2faed7d3a111d4e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 April 2023, 14:05:21 UTC
b6be707 [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause ### What changes were proposed in this pull request? Fix a correctness bug for scalar subqueries with COUNT and a GROUP BY clause, for example: ``` create view t1(c1, c2) as values (0, 1), (1, 2); create view t2(c1, c2) as values (0, 2), (0, 3); select c1, c2, (select count(*) from t2 where t1.c1 = t2.c1 group by c1) from t1; -- Correct answer: [(0, 1, 2), (1, 2, null)] +---+---+------------------+ |c1 |c2 |scalarsubquery(c1)| +---+---+------------------+ |0 |1 |2 | |1 |2 |0 | +---+---+------------------+ ``` This is due to a bug in our "COUNT bug" handling for scalar subqueries. For a subquery with COUNT aggregate but no GROUP BY clause, 0 is the correct output on empty inputs, and we use the COUNT bug handling to construct the plan that yields 0 when there were no matched rows. But when there is a GROUP BY clause then NULL is the correct output (i.e. there is no COUNT bug), but we still incorrectly use the COUNT bug handling and therefore incorrectly output 0. Instead, we need to only apply the COUNT bug handling when the scalar subquery had no GROUP BY clause. To fix this, we need to track whether the scalar subquery has a GROUP BY, i.e. a non-empty groupingExpressions for the Aggregate node. This need to be checked before subquery decorrelation, because that adds the correlated outer refs to the group-by list so after that the group-by is always non-empty. We save it in a boolean in the ScalarSubquery node until later when we rewrite the subquery into a join in constructLeftJoins. This is a long-standing bug. This bug affected both the current DecorrelateInnerQuery framework and the old code (with spark.sql.optimizer.decorrelateInnerQuery.enabled = false), and this PR fixes both. ### Why are the changes needed? Fix a correctness bug. ### Does this PR introduce _any_ user-facing change? Yes, fix incorrect query results. ### How was this patch tested? Add SQL tests and unit tests. (Note that there were 2 existing unit tests for queries of this shape, which had the incorrect results as golden results.) Closes #40811 from jchen5/count-bug. Authored-by: Jack Chen <jack.chen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c1a02e7e304934a7e671c0ae2f5a1fcedd6806c0) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 April 2023, 01:42:26 UTC
55e152a [SPARK-43113][SQL] Evaluate stream-side variables when generating code for a bound condition ### What changes were proposed in this pull request? In `JoinCodegenSupport#getJoinCondition`, evaluate any referenced stream-side variables before using them in the generated code. This patch doesn't evaluate the passed stream-side variables directly, but instead evaluates a copy (`streamVars2`). This is because `SortMergeJoin#codegenFullOuter` will want to evaluate the stream-side vars within a different scope than the condition check, so we mustn't delete the initialization code from the original `ExprCode` instances. ### Why are the changes needed? When a bound condition of a full outer join references the same stream-side column more than once, wholestage codegen generates bad code. For example, the following query fails with a compilation error: ``` create or replace temp view v1 as select * from values (1, 1), (2, 2), (3, 1) as v1(key, value); create or replace temp view v2 as select * from values (1, 22, 22), (3, -1, -1), (7, null, null) as v2(a, b, c); select * from v1 full outer join v2 on key = a and value > b and value > c; ``` The error is: ``` org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 277, Column 9: Redefinition of local variable "smj_isNull_7" ``` The same error occurs with code generated from ShuffleHashJoinExec: ``` select /*+ SHUFFLE_HASH(v2) */ * from v1 full outer join v2 on key = a and value > b and value > c; ``` In this case, the error is: ``` org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 174, Column 5: Redefinition of local variable "shj_value_1" ``` Neither `SortMergeJoin#codegenFullOuter` nor `ShuffledHashJoinExec#doProduce` evaluate the stream-side variables before calling `consumeFullOuterJoinRow#getJoinCondition`. As a result, `getJoinCondition` generates definition/initialization code for each referenced stream-side variable at the point of use. If a stream-side variable is used more than once in the bound condition, the definition/initialization code is generated more than once, resulting in the "Redefinition of local variable" error. In the end, the query succeeds, since Spark disables wholestage codegen and tries again. (In the case other join-type/strategy pairs, either the implementations don't call `JoinCodegenSupport#getJoinCondition`, or the stream-side variables are pre-evaluated before the call is made, so no error happens in those cases). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. Closes #40766 from bersprockets/full_join_codegen_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 119ec5b2ea86b73afaeabcb1d52136029326cac7) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 April 2023, 04:09:49 UTC
bbb21a3 [SPARK-42078][PYTHON][FOLLOWUP] Add `CapturedException` to utils ### What changes were proposed in this pull request? This is follow-up for https://github.com/apache/spark/pull/39591 based on the suggestions from comment https://github.com/apache/spark/commit/1de83500f4621d4af91f25a339dd5057c59cfc1e#r109023188. ### Why are the changes needed? For backward compatibility. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The existing CI should pass Closes #40816 from itholic/42078-followup. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 045721967f31a39e9227a951774d7250a6be14dc) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 April 2023, 06:09:47 UTC
d906f73 [SPARK-43141][BUILD] Ignore generated Java files in checkstyle This PR excludes Java files in `core/target` when running checkstyle with SBT. Files such as .../spark/core/target/scala-2.12/src_managed/main/org/apache/spark/status/protobuf/StoreTypes.java are checked in checkstyle. We shouldn't check them in the linter. No, dev-only. Manually ran: ```bash ./dev/sbt-checkstyle ``` Closes #40792 from HyukjinKwon/SPARK-43141. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ca2ddf3c2079dda93053e64070ebda1610aa1968) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 April 2023, 02:57:20 UTC
bf45fad Revert "[SPARK-42475][DOCS][FOLLOW-UP] Fix the version string with dev0 to work in Binder integration" This reverts commit 70e86bafe58a0d0dac418867534e5debd01f80fa. 17 April 2023, 02:50:06 UTC
70e86ba [SPARK-42475][DOCS][FOLLOW-UP] Fix the version string with dev0 to work in Binder integration ### What changes were proposed in this pull request? This PR fixes Binder integration version strings in case `dev0` is specified. It should work in master branch too (when users manually build the docs and test) ### Why are the changes needed? For end users to run quickstarts. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the user-facing quick start. ### How was this patch tested? Manually tested at https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-42475-followup-2?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb Closes #40815 from HyukjinKwon/SPARK-42475-followup-2. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c408360615928535cdca905feccf09caa2c27bd2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 April 2023, 02:39:48 UTC
23a10cb [SPARK-43158][DOCS] Set upperbound of pandas version for Binder integration ### What changes were proposed in this pull request? This PR proposes to set the upperbound for pandas in Binder integration. We don't currently support pandas 2.0.0 properly, see also https://issues.apache.org/jira/browse/SPARK-42618 ### Why are the changes needed? To make the quickstarts working. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the quickstart. ### How was this patch tested? Tested in: - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb Closes #40814 from HyukjinKwon/set-lower-bound-pandas. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a2592e6d46caf82c642f4a01a3fd5c282bbe174e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 April 2023, 01:12:02 UTC
d77214e [SPARK-42475][DOCS][FOLLOW-UP] Fix PySpark connect Quickstart binder link ### What changes were proposed in this pull request? This PR fixes PySpark Connect quick start working (https://mybinder.org/v2/gh/apache/spark/87a5442f7e?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb) ### Why are the changes needed? For end users to try them out easily. Currently it fails. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the quickstart. ### How was this patch tested? Manually tested at https://mybinder.org/v2/gh/HyukjinKwon/spark/quickstart-connect-working?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb Closes #40813 from HyukjinKwon/quickstart-connect-working. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 701273229285a96f3a5057bc5cc6e0ee819c168a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 April 2023, 00:49:50 UTC
180c7bb [SPARK-43139][SQL][DOCS] Fix incorrect column names in sql-ref-syntax-dml-insert-table.md ### What changes were proposed in this pull request? This PR fixes incorrect column names in [sql-ref-syntax-dml-insert-table.md](https://spark.apache.org/docs/3.4.0/sql-ref-syntax-dml-insert-table.html). ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A. Closes #40807 from wangyum/SPARK-43139. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 4db8099ac543cc659a19c231b442d891dcc06a1d) Signed-off-by: Yuming Wang <yumwang@ebay.com> 16 April 2023, 06:58:55 UTC
b89b927 [SPARK-43050][SQL] Fix construct aggregate expressions by replacing grouping functions ### What changes were proposed in this pull request? This PR fixes construct aggregate expressions by replacing grouping functions if a expression is part of aggregation. In the following example, the second `b` should also be replaced: <img width="545" alt="image" src="https://user-images.githubusercontent.com/5399861/230415618-84cd6334-690e-4b0b-867b-ccc4056226a8.png"> ### Why are the changes needed? Fix bug: ``` spark-sql (default)> SELECT CASE WHEN a IS NULL THEN count(b) WHEN b IS NULL THEN count(c) END > FROM grouping > GROUP BY GROUPING SETS (a, b, c); [MISSING_AGGREGATION] The non-aggregating expression "b" is based on columns which are not participating in the GROUP BY clause. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #40685 from wangyum/SPARK-43050. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 45b84cd37add1b9ce274273ad5e519e6bc1d8013) Signed-off-by: Yuming Wang <yumwang@ebay.com> # Conflicts: # sql/core/src/test/resources/sql-tests/analyzer-results/grouping_set.sql.out 15 April 2023, 01:54:19 UTC
97b1463 [SPARK-43125][CONNECT] Fix Connect Server Can't Handle Exception With Null Message ### What changes were proposed in this pull request? Fix the bug where the Connect Server throws an Exception without a message. ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unnecessary Closes #40780 from Hisoka-X/SPARK-43125_Exception_NPE. Authored-by: Hisoka <fanjiaeminem@qq.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ea49637a1136759c724d40133a2efe75c8f42416) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 April 2023, 03:46:08 UTC
1a0c6e2 [SPARK-43126][SQL] Mark two Hive UDF expressions as stateful ### What changes were proposed in this pull request? Due to the recent refactor, I realized that the two Hive UDF expressions are stateful as they both keep an array to store the input arguments. This PR fixes it. ### Why are the changes needed? To avoid issues in a multi-threaded environment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Too hard to write unit tests, and the fix itself is very obvious. Closes #40781 from cloud-fan/hive. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Chao Sun <sunchao@apple.com> (cherry picked from commit e117f624d51b4bdd85921281814603a3a5a506a9) Signed-off-by: Chao Sun <sunchao@apple.com> 14 April 2023, 00:47:00 UTC
9235fb4 [SPARK-43085][SQL] Support column DEFAULT assignment for multi-part table names ### What changes were proposed in this pull request? This PR adds support for column DEFAULT assignment for multi-part table names. ### Why are the changes needed? Spark SQL workloads that refer to tables with multi-part names may want to use column DEFAULT functionality. This PR enables this. ### Does this PR introduce _any_ user-facing change? Yes, column DEFAULT assignment now works with multi-part table names. ### How was this patch tested? This PR adds unit tests. Closes #40732 from dtenedor/fix-three-part-name-bug. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit fa4978a66014889dc9894e122ad712dc1e2c4328) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 April 2023, 21:58:19 UTC
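An illustrative usage; the catalog, namespace, and table names are made up, and the target catalog must support column DEFAULT values. The DEFAULT resolution that already worked for single-part names now also applies when the table is referenced with a multi-part name.

```scala
spark.sql("CREATE TABLE mycat.ns.t (i INT, s STRING DEFAULT 'abc') USING parquet")

// Omitting `s` (or writing DEFAULT explicitly) now fills in 'abc' even though
// the table is addressed by a multi-part name.
spark.sql("INSERT INTO mycat.ns.t (i) VALUES (1)")
spark.sql("INSERT INTO mycat.ns.t VALUES (2, DEFAULT)")
spark.sql("SELECT * FROM mycat.ns.t").show()
```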
76da13b [SPARK-43083][SQL][TESTS] Mark `*StateStoreSuite` as `ExtendedSQLTest` ### What changes were proposed in this pull request? This PR aims to mark `StateStoreSuite` and `RocksDBStateStoreSuite` as `ExtendedSQLTest`. ### Why are the changes needed? To balance GitHub Action jobs by offloading heavy tests. - `sql - other tests` took [2 hour 55 minutes](https://github.com/apache/spark/actions/runs/4641961434/jobs/8215437737) - `sql - slow tests` took [1 hour 46 minutes](https://github.com/apache/spark/actions/runs/4641961434/jobs/8215437616) ``` - maintenance (2 seconds, 4 milliseconds) - SPARK-40492: maintenance before unload (2 minutes) - snapshotting (1 second, 96 milliseconds) - SPARK-21145: Restarted queries create new provider instances (1 second, 261 milliseconds) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #40727 from dongjoon-hyun/SPARK-43083. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 086e25bbc049e6e2def81845b5e4c957607f2e69) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 April 2023, 23:24:03 UTC
ec03e44 [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation ### What changes were proposed in this pull request? This PR extends column default support to allow the ORDER BY, LIMIT, and OFFSET clauses at the end of a SELECT query in the INSERT source relation. For example: ``` create table t1(i boolean, s bigint default 42) using parquet; insert into t1 values (true, 41), (false, default); create table t2(i boolean default true, s bigint default 42, t string default 'abc') using parquet; insert into t2 (i, s) select default, s from t1 order by s limit 1; select * from t2; > true, 41L, "abc" ``` ### Why are the changes needed? This improves usability and helps prevent confusing error messages. ### Does this PR introduce _any_ user-facing change? Yes, SQL queries that previously failed will now succeed. ### How was this patch tested? This PR adds new unit test coverage. Closes #40710 from dtenedor/column-default-more-patterns. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 54e84fead76fb58a65307c84f8539ae37c60563b) Signed-off-by: Gengliang Wang <gengliang@apache.org> 10 April 2023, 23:09:33 UTC
dc44932 [SPARK-43072][DOC] Include TIMESTAMP_NTZ type in ANSI Compliance doc ### What changes were proposed in this pull request? There are important syntax rules about Cast/Store assignment/Type precedent list in the [ANSI Compliance doc](https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html) As we are going to release [timestamp_ntz](https://issues.apache.org/jira/browse/SPARK-35662) type in Spark 3.4.0, we should update the doc page as well. ### Why are the changes needed? Better documentation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual build and verify <img width="1183" alt="image" src="https://user-images.githubusercontent.com/1097932/230692965-02ec5a6e-8b8a-48dc-8049-9a87d26b2ce5.png"> <img width="1068" alt="image" src="https://user-images.githubusercontent.com/1097932/230692988-bd35508c-0577-44c5-8448-f8d3b0aef2ea.png"> <img width="764" alt="image" src="https://user-images.githubusercontent.com/1097932/230693005-cb61a760-ea11-4e6d-bdcb-2738c7c507c6.png"> Closes #40711 from gengliangwang/ntz_in_ansi_doc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c5db391b5d8447fe1384f7f957e4f200801c6823) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 10 April 2023, 02:10:11 UTC
49301f3 [MINOR][SQL][TESTS] Tests in `SubquerySuite` should not drop view created in `beforeAll` ### What changes were proposed in this pull request? Change tests to avoid replacing and dropping a temporary view that is created in `SubquerySuite#beforeAll`. ### Why are the changes needed? When I added a test for SPARK-42937, it tried to use the view `t`, which is created in `beforeAll`. But because other tests would replace and drop this view, the new test would fail. As a result, that new test had to re-create `t` from scratch. This change will allow `t` to be used by new tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #40717 from bersprockets/bad_drop_view. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit b55bf3cbbe50b8386a974bd98625e4dceb1d457f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 10 April 2023, 00:42:41 UTC
c7a3cab [SPARK-43075][CONNECT] Change `gRPC` to `grpcio` when it is not installed. ### What changes were proposed in this pull request? Change `gRPC` to `grpcio` This is ONLY in the printing, for users that haven't install `gRPC` ### Why are the changes needed? Users that don't have install `gRPC` will get this error when starting connect. ModuleNotFoundError Traceback (most recent call last) File /opt/spark/python/pyspark/sql/connect/utils.py:45, in require_minimum_grpc_version() 44 try: ---> 45 import grpc 46 except ImportError as error: ModuleNotFoundError: No module named 'grpc' The above exception was the direct cause of the following exception: ImportError Traceback (most recent call last) Cell In[1], line 11 9 import pyarrow 10 from pyspark import SparkConf, SparkContext ---> 11 from pyspark import pandas as ps 12 from pyspark.sql import SparkSession 13 from pyspark.sql.functions import col, concat, concat_ws, expr, lit, trim File /opt/spark/python/pyspark/pandas/__init__.py:59 50 warnings.warn( 51 "'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to " 52 "set this environment variable to '1' in both driver and executor sides if you use " (...) 55 "already launched." 56 ) 57 os.environ["PYARROW_IGNORE_TIMEZONE"] = "1" ---> 59 from pyspark.pandas.frame import DataFrame 60 from pyspark.pandas.indexes.base import Index 61 from pyspark.pandas.indexes.category import CategoricalIndex File /opt/spark/python/pyspark/pandas/frame.py:88 85 from pyspark.sql.window import Window 87 from pyspark import pandas as ps # For running doctests and reference resolution in PyCharm. ---> 88 from pyspark.pandas._typing import ( 89 Axis, 90 DataFrameOrSeries, 91 Dtype, 92 Label, 93 Name, 94 Scalar, 95 T, 96 GenericColumn, 97 ) 98 from pyspark.pandas.accessors import PandasOnSparkFrameMethods 99 from pyspark.pandas.config import option_context, get_option File /opt/spark/python/pyspark/pandas/_typing.py:25 22 from pandas.api.extensions import ExtensionDtype 24 from pyspark.sql.column import Column as PySparkColumn ---> 25 from pyspark.sql.connect.column import Column as ConnectColumn 26 from pyspark.sql.dataframe import DataFrame as PySparkDataFrame 27 from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame File /opt/spark/python/pyspark/sql/connect/column.py:19 1 # 2 # Licensed to the Apache Software Foundation (ASF) under one or more 3 # contributor license agreements. See the NOTICE file distributed with (...) 15 # limitations under the License. 16 # 17 from pyspark.sql.connect.utils import check_dependencies ---> 19 check_dependencies(__name__) 21 import datetime 22 import decimal File /opt/spark/python/pyspark/sql/connect/utils.py:35, in check_dependencies(mod_name) 33 require_minimum_pandas_version() 34 require_minimum_pyarrow_version() ---> 35 require_minimum_grpc_version() File /opt/spark/python/pyspark/sql/connect/utils.py:47, in require_minimum_grpc_version() 45 import grpc 46 except ImportError as error: ---> 47 raise ImportError( 48 "grpc >= %s must be installed; however, " "it was not found." % minimum_grpc_version 49 ) from error 50 if LooseVersion(grpc.__version__) < LooseVersion(minimum_grpc_version): 51 raise ImportError( 52 "gRPC >= %s must be installed; however, " 53 "your version was %s." % (minimum_grpc_version, grpc.__version__) 54 ) ImportError: grpc >= 1.48.1 must be installed; however, it was not found. The last line tells that there is a module named `grpc` that's missing. 
`pip install grpc` Collecting grpc Downloading grpc-1.0.0.tar.gz (5.2 kB) Preparing metadata (setup.py) ... error error: subprocess-exited-with-error × python setup.py egg_info did not run successfully. │ exit code: 1 ╰─> [6 lines of output] Traceback (most recent call last): File "<string>", line 2, in <module> File "<pip-setuptools-caller>", line 34, in <module> File "/tmp/pip-install-vp4d8s4c/grpc_c0f1992ad8f7456b8ac09ecbaeb81750/setup.py", line 33, in <module> raise RuntimeError(HINT) RuntimeError: Please install the official package with: pip install grpcio [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. error: metadata-generation-failed × Encountered error while generating package metadata. ╰─> See above for output. note: This is an issue with the package mentioned above, not pip. hint: See above for details. Note: you may need to restart the kernel to use updated packages. [The right way to install this is](https://grpc.io/docs/languages/python/quickstart/) `pip install grpcio` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #40716 from bjornjorgensen/grpc-->-grpcio. Authored-by: bjornjorgensen <bjornjorgensen@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 07918fe2f15453814cbf1cfacb75166babe5b4b6) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 April 2023, 05:51:07 UTC
66b1147 [SPARK-43067][SS] Correct the location of error class resource file in Kafka connector ### What changes were proposed in this pull request? This PR moves the error class resource file in Kafka connector from test to src, so that error class works without test artifacts. ### Why are the changes needed? Refer to the `How was this patch tested?`. ### Does this PR introduce _any_ user-facing change? Yes, but the possibility of encountering this is small enough. ### How was this patch tested? Ran spark-shell with Kafka connector artifacts (without test artifacts) and triggered KafkaExceptions to confirm that exception is properly raised. ``` scala> import org.apache.spark.sql.kafka010.KafkaExceptions import org.apache.spark.sql.kafka010.KafkaExceptions scala> import org.apache.kafka.common.TopicPartition import org.apache.kafka.common.TopicPartition scala> KafkaExceptions.mismatchedTopicPartitionsBetweenEndOffsetAndPrefetched(Set[TopicPartition](), Set[TopicPartition]()) res1: org.apache.spark.SparkException = org.apache.spark.SparkException: Kafka data source in Trigger.AvailableNow should provide the same topic partitions in pre-fetched offset to end offset for each microbatch. The error could be transient - restart your query, and report if you still see the same issue. topic-partitions for pre-fetched offset: Set(), topic-partitions for end offset: Set(). ``` Without the fix, triggering KafkaExceptions failed to load error class resource file and led unexpected exception. ``` scala> KafkaExceptions.mismatchedTopicPartitionsBetweenEndOffsetAndPrefetched(Set[TopicPartition](), Set[TopicPartition]()) java.lang.IllegalArgumentException: argument "src" is null at com.fasterxml.jackson.databind.ObjectMapper._assertNotNull(ObjectMapper.java:4885) at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3618) at org.apache.spark.ErrorClassesJsonReader$.org$apache$spark$ErrorClassesJsonReader$$readAsMap(ErrorClassesJSONReader.scala:95) at org.apache.spark.ErrorClassesJsonReader.$anonfun$errorInfoMap$1(ErrorClassesJSONReader.scala:44) at scala.collection.immutable.List.map(List.scala:293) at org.apache.spark.ErrorClassesJsonReader.<init>(ErrorClassesJSONReader.scala:44) at org.apache.spark.sql.kafka010.KafkaExceptions$.<init>(KafkaExceptions.scala:27) at org.apache.spark.sql.kafka010.KafkaExceptions$.<clinit>(KafkaExceptions.scala) ... 47 elided ``` Closes #40705 from HeartSaVioR/SPARK-43067. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 7434702a120360593a10ef6b3eb9ba0fd7e37de1) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 08 April 2023, 21:38:59 UTC
e137d89 [SPARK-43069][BUILD] Use `sbt-eclipse` instead of `sbteclipse-plugin` ### What changes were proposed in this pull request? This PR aims to use `sbt-eclipse` instead of `sbteclipse-plugin`. ### Why are the changes needed? Thanks to SPARK-34959, Apache Spark 3.2+ uses SBT 1.5.0 and we can use `set-eclipse` instead of old `sbteclipse-plugin`. - https://github.com/sbt/sbt-eclipse/releases/tag/6.0.0 ### Does this PR introduce _any_ user-facing change? No, this is a dev-only plugin. ### How was this patch tested? Pass the CIs and manual tests. ``` $ build/sbt eclipse Using /Users/dongjoon/.jenv/versions/1.8 as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. Using SPARK_LOCAL_IP=localhost Attempting to fetch sbt Launching sbt from build/sbt-launch-1.8.2.jar [info] welcome to sbt 1.8.2 (AppleJDK-8.0.302.8.1 Java 1.8.0_302) [info] loading settings for project spark-merge-build from plugins.sbt ... [info] loading project definition from /Users/dongjoon/APACHE/spark-merge/project [info] Updating https://repo1.maven.org/maven2/com/github/sbt/sbt-eclipse_2.12_1.0/6.0.0/sbt-eclipse-6.0.0.pom 100.0% [##########] 2.5 KiB (4.5 KiB / s) ... ``` Closes #40708 from dongjoon-hyun/SPARK-43069. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 9cba5529d1fc3faf6b743a632df751d84ec86a07) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 April 2023, 19:54:16 UTC
e4eea55 Preparing development version 3.4.1-SNAPSHOT 07 April 2023, 01:28:49 UTC
87a5442 Preparing Spark release v3.4.0-rc7 07 April 2023, 01:28:44 UTC
b2ff4c4 [SPARK-39696][CORE] Fix data race in access to TaskMetrics.externalAccums ### What changes were proposed in this pull request? This PR fixes a data race around concurrent access to `TaskMetrics.externalAccums`. The race occurs between the `executor-heartbeater` thread and the thread executing the task. This data race is not known to cause issues on 2.12 but in 2.13 ~due this change https://github.com/scala/scala/pull/9258~ (LuciferYang bisected this to first cause failures in scala 2.13.7 one possible reason could be https://github.com/scala/scala/pull/9786) leads to an uncaught exception in the `executor-heartbeater` thread, which means that the executor will eventually be terminated due to missing hearbeats. This fix of using of using `CopyOnWriteArrayList` is cherry picked from https://github.com/apache/spark/pull/37206 where is was suggested as a fix by LuciferYang since `TaskMetrics.externalAccums` is also accessed from outside the class `TaskMetrics`. The old PR was closed because at that point there was no clear understanding of the race condition. JoshRosen commented here https://github.com/apache/spark/pull/37206#issuecomment-1189930626 saying that there should be no such race based on because all accumulators should be deserialized as part of task deserialization here: https://github.com/apache/spark/blob/0cc96f76d8a4858aee09e1fa32658da3ae76d384/core/src/main/scala/org/apache/spark/executor/Executor.scala#L507-L508 and therefore no writes should occur while the hearbeat thread will read the accumulators. But my understanding is that is incorrect as accumulators will also be deserialized as part of the taskBinary here: https://github.com/apache/spark/blob/169f828b1efe10d7f21e4b71a77f68cdd1d706d6/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala#L87-L88 which will happen while the heartbeater thread is potentially reading the accumulators. This can both due to user code using accumulators (see the new test case) but also when using the Dataframe/Dataset API as sql metrics will also be `externalAccums`. One way metrics will be sent as part of the taskBinary is when the dep is a `ShuffleDependency`: https://github.com/apache/spark/blob/fbbcf9434ac070dd4ced4fb9efe32899c6db12a9/core/src/main/scala/org/apache/spark/Dependency.scala#L85 with a ShuffleWriteProcessor that comes from https://github.com/apache/spark/blob/fbbcf9434ac070dd4ced4fb9efe32899c6db12a9/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L411-L422 ### Why are the changes needed? The current code has a data race. ### Does this PR introduce _any_ user-facing change? It will fix an uncaught exception in the `executor-hearbeater` thread when using scala 2.13. ### How was this patch tested? This patch adds a new test case, that before the fix was applied consistently produces the uncaught exception in the heartbeater thread when using scala 2.13. Closes #40663 from eejbyfeldt/SPARK-39696. Lead-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Co-authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6ce0822f76e11447487d5f6b3cce94a894f2ceef) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 07 April 2023, 01:14:14 UTC
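A reduced sketch of the concurrency pattern the fix relies on; `String` stands in for `AccumulatorV2`, and the class below is illustrative rather than the real `TaskMetrics`. A `CopyOnWriteArrayList` gives the heartbeat thread a snapshot iterator that stays valid while the task thread keeps registering accumulators during deserialization.

```scala
import java.util.concurrent.CopyOnWriteArrayList

final class ExternalAccumsSketch {
  private val externalAccums = new CopyOnWriteArrayList[String]()

  // Called from the task thread, possibly while deserializing the task binary.
  def register(acc: String): Unit = externalAccums.add(acc)

  // Called from the executor-heartbeater thread; the iterator is a snapshot over
  // the backing array, so concurrent adds cannot corrupt or invalidate it.
  def heartbeatSnapshot(): java.util.Iterator[String] = externalAccums.iterator()
}
```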
1d974a7 Preparing development version 3.4.1-SNAPSHOT 06 April 2023, 16:38:33 UTC
28d0723 Preparing Spark release v3.4.0-rc6 06 April 2023, 16:38:28 UTC
9037642 [SPARK-43041][SQL] Restore constructors of exceptions for compatibility in connector API ### What changes were proposed in this pull request? This PR adds back old constructors for exceptions used in the public connector API based on Spark 3.3. ### Why are the changes needed? These changes are needed to avoid breaking connectors when consuming Spark 3.4. Here is a list of exceptions used in the connector API (`org.apache.spark.sql.connector`): ``` NoSuchNamespaceException NoSuchTableException NoSuchViewException NoSuchPartitionException NoSuchPartitionsException (not referenced by public Catalog API but I assume it may be related to the exception above, which is referenced) NoSuchFunctionException NoSuchIndexException NamespaceAlreadyExistsException TableAlreadyExistsException ViewAlreadyExistsException PartitionAlreadyExistsException (not referenced by public Catalog API but I assume it may be related to the exception below, which is referenced) PartitionsAlreadyExistException IndexAlreadyExistsException ``` ### Does this PR introduce _any_ user-facing change? Adds back previously released constructors. ### How was this patch tested? Existing tests. Closes #40679 from aokolnychyi/spark-43041. Authored-by: aokolnychyi <aokolnychyi@apple.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 6546988ead06af8de33108ad0eb3f25af839eadc) Signed-off-by: Gengliang Wang <gengliang@apache.org> 06 April 2023, 05:10:57 UTC
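The compatibility pattern involved, as a generic hedged sketch; the class name and parameters are invented for illustration and do not reflect the actual Spark exception hierarchy. The pre-3.4 constructor signature is kept as an overload that delegates to the new one, so connectors compiled against Spark 3.3 keep linking.

```scala
// Illustrative only: a newer primary constructor plus a restored legacy overload.
class NoSuchThingException(message: String, cause: Option[Throwable])
    extends Exception(message, cause.orNull) {

  // Restored constructor matching the older public signature.
  def this(message: String) = this(message, None)
}
```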
f2900f8 [SPARK-43018][SQL] Fix bug for INSERT commands with timestamp literals ### What changes were proposed in this pull request? This PR fixes a correctness bug for INSERT commands with timestamp literals. The bug manifests when: * An INSERT command includes a user-specified column list of fewer columns than the target table. * The provided values include timestamp literals. The bug was that the long integer values stored in the rows to represent these timestamp literals were getting assigned back to `UnresolvedInlineTable` rows without the timestamp type. Then the analyzer inserted an implicit cast from `LongType` to `TimestampType` later, which incorrectly caused the value to change during execution. This PR fixes the bug by propagating the timestamp type directly to the output table instead. ### Why are the changes needed? This PR fixes a correctness bug. ### Does this PR introduce _any_ user-facing change? Yes, this PR fixes a correctness bug. ### How was this patch tested? This PR adds a new unit test suite. Closes #40652 from dtenedor/assign-correct-insert-types. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 9f0bf51a3a7f6175de075198e00a55bfdc491f15) Signed-off-by: Gengliang Wang <gengliang@apache.org> 06 April 2023, 04:17:01 UTC
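An illustrative reproduction of the shape of statement the fix targets; the table name is made up, and the `DEFAULT` on `id` simply keeps the user-specified column list shorter than the table. Before the fix, the timestamp literal's internal long could pick up an implicit `LongType`-to-`TimestampType` cast and change value; now the timestamp type is propagated directly.

```scala
spark.sql("CREATE TABLE tsdemo (id INT DEFAULT 0, ts TIMESTAMP) USING parquet")

// Column list with fewer columns than the table, plus a timestamp literal.
spark.sql("INSERT INTO tsdemo (ts) VALUES (TIMESTAMP'2023-01-01 12:00:00')")

spark.sql("SELECT * FROM tsdemo").show()  // ts should read back as 2023-01-01 12:00:00
```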
34c7c3b [MINOR][CONNECT][DOCS] Clarify Spark Connect option in Spark scripts ### What changes were proposed in this pull request? This PR clarifies Spark Connect option to be consistent with other sections. ### Why are the changes needed? To be consistent with other configuration docs, and to be clear about Spark Connect option. ### Does this PR introduce _any_ user-facing change? Yes, it changes the user-facing doc in Spark scripts. ### How was this patch tested? Manually tested. Closes #40671 from HyukjinKwon/another-minor-docs. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a7a0f7a92d0517e553de46193592e1e6d1c3add2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 April 2023, 06:00:57 UTC
c79fc94 [SPARK-42983][CONNECT][PYTHON] Fix createDataFrame to handle 0-dim numpy array properly ### What changes were proposed in this pull request? Fix `createDataFrame` to handle 0-dim numpy array properly. ### Why are the changes needed? When 0-dim numpy array is passed to `createDataFrame`, it raises an unexpected error: ```py >>> import numpy as np >>> spark.createDataFrame(np.array(0)) Traceback (most recent call last): ... TypeError: len() of unsized object ``` The error message should be: ```py ValueError: NumPy array input should be of 1 or 2 dimensions. ``` ### Does this PR introduce _any_ user-facing change? It will show a proper error message. ### How was this patch tested? Enabled/updated the related test. Closes #40669 from ueshin/issues/SPARK-42983/zero_dim_nparray. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6fc9f0221fd273715b9d1047c951e00cec078965) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 April 2023, 03:04:18 UTC
d373661 [MINOR][PYTHON][CONNECT][DOCS] Deduplicate versionchanged directive in Catalog ### What changes were proposed in this pull request? This PR proposes to deduplicate the versionchanged directive in Catalog. ### Why are the changes needed? All the APIs are implemented, so we don't need to mark individual methods. ### Does this PR introduce _any_ user-facing change? Yes, it changes the documentation. ### How was this patch tested? Manually checked with a documentation build. CI in this PR should test it out too. ![Screen Shot 2023-04-05 at 10 30 50 AM](https://user-images.githubusercontent.com/6477701/229958119-44b253b0-7856-44a9-8718-4be4a857fd59.png) Closes #40670 from HyukjinKwon/minor-doc-catalog. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit b86d90d1eb0a5c085965c3e558d2c873a8157830) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 April 2023, 01:48:28 UTC
532d446 [SPARK-43009][SQL][3.4] Parameterized `sql()` with `Any` constants ### What changes were proposed in this pull request? In the PR, I propose to change the API of parameterized SQL, and replace the type of argument values from `string` to `Any` in Scala/Java/Python and `Expression.Literal` in the protobuf API. The language APIs can accept `Any` objects from which it is possible to construct literal expressions. This is a backport of https://github.com/apache/spark/pull/40623 #### Scala/Java: ```scala def sql(sqlText: String, args: Map[String, Any]): DataFrame ``` Values of the `args` map are wrapped by the `lit()` function, which leaves `Column` as is and creates a literal from other Java/Scala objects (for more details see the `Scala` tab at https://spark.apache.org/docs/latest/sql-ref-datatypes.html). #### Python: ```python def sql(self, sqlQuery: str, args: Optional[Dict[str, Any]] = None, **kwargs: Any) -> DataFrame: ``` Similarly to Scala/Java `sql`, Python's `sql()` accepts Python objects as values of the `args` dictionary (see more details about acceptable Python objects at https://spark.apache.org/docs/latest/sql-ref-datatypes.html). `sql()` converts dictionary values to `Column` literal expressions by `lit()`. #### Protobuf: ```proto message SqlCommand { // (Required) SQL Query. string sql = 1; // (Optional) A map of parameter names to literal expressions. map<string, Expression.Literal> args = 2; } ``` For example: ```scala scala> val sqlText = """SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name""" sqlText: String = SELECT s FROM VALUES ('Jeff /*__*/ Green'), ('E\'Twaun Moore') AS t(s) WHERE s = :player_name scala> sql(sqlText, args = Map("player_name" -> lit("E'Twaun Moore"))).show(false) +-------------+ |s | +-------------+ |E'Twaun Moore| +-------------+ ``` ### Why are the changes needed? The current implementation of the parameterized `sql()` requires arguments as string values that are parsed to SQL literal expressions, which causes the following issues: 1. SQL comments are skipped while parsing, so some fragments of input might be skipped. For example, `'Europe -- Amsterdam'`. In this case, `-- Amsterdam` is excluded from the input. 2. Special chars in string values must be escaped, for instance `'E\'Twaun Moore'`. ### Does this PR introduce _any_ user-facing change? No, since the parameterized SQL feature https://github.com/apache/spark/pull/38864 hasn't been released yet. ### How was this patch tested? By running the affected tests: ``` $ build/sbt "test:testOnly *ParametersSuite" $ python/run-tests --parallelism=1 --testnames 'pyspark.sql.tests.connect.test_connect_basic SparkConnectBasicTests.test_sql_with_args' $ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session SparkSession.sql' ``` Authored-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 156a12ec0abba8362658a58e00179a0b80f663f2) Closes #40666 from MaxGekk/parameterized-sql-any-3.4-2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 April 2023, 00:12:08 UTC
444053f [SPARK-42655][SQL] Incorrect ambiguous column reference error ### What changes were proposed in this pull request? The result of attribute resolution should consider only unique values for the reference. If it has duplicate values, it will incorrectly result in an ambiguous reference error. ### Why are the changes needed? The query below incorrectly fails with an ambiguous reference error: ``` val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID") val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*) df3.select("id").show() org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id. ``` The physical plan shows that `id` is projected twice from the same underlying attribute: ``` df3.explain() == Physical Plan == *(1) Project [_1#6 AS id#17, _2#7 AS col2#18, _3#8 AS col3#19, _4#9 AS col4#20, _5#10 AS col5#21, _1#6 AS ID#17] ``` Before the fix, the attributes matched were: attributes: Vector(id#17, id#17). Thus, it throws an ambiguous reference error. But if we consider only unique matches, it returns the correct result: unique attributes: Vector(id#17). ### Does this PR introduce _any_ user-facing change? Yes. Users migrating from Spark 2.3 to 3.x will face this error, as the scenario used to work fine in Spark 2.3 but fails in Spark 3.2. After the fix, it will work correctly as it did in Spark 2.3. ### How was this patch tested? Added unit test. Closes #40258 from shrprasa/col_ambiguous_issue. Authored-by: Shrikant Prasad <shrprasa@visa.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b283c6a0e47c3292dbf00400392b3a3f629dd965) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 April 2023, 13:16:19 UTC
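A conceptual sketch of the fix described above (names and structure are illustrative only; this is not the actual resolver code): deduplicate the matched attributes before deciding that a reference is ambiguous.
```scala
// Illustrative only: resolve a reference from its candidate matches, treating
// duplicate matches (the same attribute picked up twice) as a single match.
def resolveUnique[A](candidates: Seq[A]): Option[A] =
  candidates.distinct match {
    case Seq(unique) => Some(unique) // exactly one unique candidate: resolve to it
    case Seq()       => None         // no candidate: leave unresolved
    case _           => sys.error("ambiguous reference") // genuinely ambiguous
  }
```
Here `distinct` collapses the duplicate `id#17` entries from the example above, so only a genuinely mixed set of candidates triggers the ambiguity error.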