https://github.com/apache/spark

4e3599b Preparing Spark release v3.3.0-rc4 03 June 2022, 09:20:31 UTC
61d22b6 [SPARK-39371][DOCS][CORE] Review and fix issues in Scala/Java API docs of Core module ### What changes were proposed in this pull request? Compare the 3.3.0 API doc with the latest release version 3.2.1. Fix the following issues: * Add missing Since annotation for new APIs * Remove the leaking class/object in API doc ### Why are the changes needed? Improve API docs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #36757 from xuanyuanking/doc. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1fbb1d46feb992c3441f2a4f2c5d5179da465d4b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 June 2022, 08:50:09 UTC
4a0f0ff [SPARK-39259][SQL][3.3] Evaluate timestamps consistently in subqueries ### What changes were proposed in this pull request? Apply the optimizer rule ComputeCurrentTime consistently across subqueries. This is a backport of https://github.com/apache/spark/pull/36654. ### Why are the changes needed? At the moment timestamp functions like now() can return different values within a query if subqueries are involved ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? A new unit test was added Closes #36752 from olaky/SPARK-39259-spark_3_3. Authored-by: Ole Sasse <ole.sasse@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 03 June 2022, 06:12:26 UTC
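To illustrate the consistency issue fixed above, a minimal spark-shell check (not part of the original PR; it assumes an active SparkSession named `spark`) could compare `now()` evaluated in the outer query and in a scalar subquery:

```scala
// Minimal sketch: before SPARK-39259, now() in the outer query and in the
// subquery could be evaluated at different times within the same statement.
spark.sql("SELECT now() AS outer_ts, (SELECT now()) AS subquery_ts").show(false)
// After the fix, outer_ts and subquery_ts should show the same instant.
```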
8f599ba [SPARK-39367][DOCS][SQL] Review and fix issues in Scala/Java API docs of SQL module ### What changes were proposed in this pull request? Compare the 3.3.0 API doc with the latest release version 3.2.1. Fix the following issues: * Add missing Since annotation for new APIs * Remove the leaking class/object in API doc ### Why are the changes needed? Improve API docs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #36754 from gengliangwang/apiDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4c7888dd9159dc203628b0d84f0ee2f90ab4bf13) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 June 2022, 00:42:54 UTC
bc4aab5 [SPARK-39295][DOCS][PYTHON][3.3] Improve documentation of pandas API supported list ### What changes were proposed in this pull request? The description provided in the supported pandas API list document or the code comment needs improvement. ### Why are the changes needed? To improve document readability for users. ### Does this PR introduce _any_ user-facing change? Yes, the "Supported pandas APIs" page has changed as below. <img width="1001" alt="Screen Shot 2022-06-02 at 5 10 39 AM" src="https://user-images.githubusercontent.com/7010554/171596895-67426326-4ce5-4b82-8f14-228316a367e0.png"> ### How was this patch tested? Manually check the links in the documents & the existing doc build should be passed. Closes #36749 from beobest2/SPARK-39295_backport. Authored-by: beobest2 <cleanby@naver.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 02 June 2022, 18:48:54 UTC
4da8f3a [SPARK-39361] Don't use Log4J2's extended throwable conversion pattern in default logging configurations ### What changes were proposed in this pull request? This PR addresses a performance problem in Log4J 2 related to exception logging: in certain scenarios I observed that Log4J2's default exception stacktrace logging can be ~10x slower than Log4J 1. The problem stems from a new log pattern format in Log4J2 called ["extended exception"](https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternExtendedException), which enriches the regular stacktrace string with information on the name of the JAR files that contained the classes in each stack frame. Log4J queries the classloader to determine the source JAR for each class. This isn't cheap, but this information is cached and reused in future exception logging calls. In certain scenarios involving runtime-generated classes, this lookup will fail and the failed lookup result will _not_ be cached. As a result, expensive classloading operations will be performed every time such an exception is logged. In addition to being very slow, these operations take out a lock on the classloader and thus can cause severe lock contention if multiple threads are logging errors. This issue is described in more detail in [a comment on a Log4J2 JIRA](https://issues.apache.org/jira/browse/LOG4J2-2391?focusedCommentId=16667140&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16667140) and in a linked blogpost. Spark frequently uses generated classes and lambdas and thus Spark executor logs will almost always trigger this edge-case and suffer from poor performance. By default, if you do not specify an explicit exception format in your logging pattern then Log4J2 will add this "extended exception" pattern (see PatternLayout's alwaysWriteExceptions flag in Log4J's documentation, plus [the code implementing that flag](https://github.com/apache/logging-log4j2/blob/d6c8ab0863c551cdf0f8a5b1966ab45e3cddf572/log4j-core/src/main/java/org/apache/logging/log4j/core/pattern/PatternParser.java#L206-L209) in Log4J2). In this PR, I have updated Spark's default Log4J2 configurations so that each pattern layout includes an explicit %ex so that it uses the normal (non-extended) exception logging format. This is the workaround that is currently recommended on the Log4J JIRA. ### Why are the changes needed? Avoid performance regressions in Spark programs which use Spark's default Log4J 2 configuration and log many exceptions. Although it's true that any program logging exceptions at a high rate should probably just fix the source of the exceptions, I think it's still a good idea for us to try to fix this out-of-the-box performance difference so that users' existing workloads do not regress when upgrading to 3.3.0. ### Does this PR introduce _any_ user-facing change? Yes: it changes the default exception logging format so that it matches Log4J 1's default rather than Log4J 2's. The new format is consistent with behavior in previous Spark versions, but is different than the behavior in the current Spark 3.3.0-rc3. ### How was this patch tested? Existing tests. Closes #36747 from JoshRosen/disable-log4j2-extended-exception-pattern. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Josh Rosen <joshrosen@databricks.com> (cherry picked from commit fd45c3656be6add7cf483ddfb7016b12f77d7c8e) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 02 June 2022, 16:29:14 UTC
7ed3044 [SPARK-38807][CORE] Fix the startup error of spark shell on Windows ### What changes were proposed in this pull request? On Windows, the File.getCanonicalPath method returns a path that includes the drive letter. The RpcEnvFileServer.validateDirectoryUri method uses File.getCanonicalPath to process the baseUri, which causes the baseUri to violate the URI validation rules. For example, `/classes` is processed into `F:\classes`. This causes the SparkContext to fail to start on Windows. This PR modifies the RpcEnvFileServer.validateDirectoryUri method and replaces `new File(baseUri).getCanonicalPath` with `new URI(baseUri).normalize().getPath`. This method works normally on Windows. ### Why are the changes needed? Fix the startup error of spark shell on Windows. [SPARK-35691](https://issues.apache.org/jira/browse/SPARK-35691) introduced this regression. ### Does this PR introduce any user-facing change? No ### How was this patch tested? CI Closes #36447 from 1104056452/master. Lead-authored-by: Ming Li <1104056452@qq.com> Co-authored-by: ming li <1104056452@qq.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit a760975083ea0696e8fd834ecfe3fb877b7f7449) Signed-off-by: Sean Owen <srowen@gmail.com> 02 June 2022, 12:44:27 UTC
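A hedged sketch of the difference between the two path-handling approaches mentioned above (illustrative only; the drive letter shown is an example, not actual Spark output):

```scala
import java.io.File
import java.net.URI

val baseUri = "/classes"
// On Windows, getCanonicalPath resolves against the current drive, e.g. "F:\classes",
// which is not a valid path component for a URI.
val asFile = new File(baseUri).getCanonicalPath
// normalize().getPath keeps the URI path untouched: "/classes" on any OS.
val asUri = new URI(baseUri).normalize().getPath
```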
ef521d3 [SPARK-39354][SQL] Ensure show `Table or view not found` even if there are `dataTypeMismatchError` related to `Filter` at the same time ### What changes were proposed in this pull request? After SPARK-38118, `dataTypeMismatchError` related to `Filter` will be checked and throw in `RemoveTempResolvedColumn`, this will cause compatibility issue with exception message presentation. For example, the following case: ``` spark.sql("create table t1(user_id int, auct_end_dt date) using parquet;") spark.sql("select * from t1 join t2 on t1.user_id = t2.user_id where t1.auct_end_dt >= Date_sub('2020-12-27', 90)").show ``` The expected message is ``` Table or view not found: t2 ``` But the actual message is ``` org.apache.spark.sql.AnalysisException: cannot resolve 'date_sub('2020-12-27', 90)' due to data type mismatch: argument 1 requires date type, however, ''2020-12-27'' is of string type.; line 1 pos 76 ``` For forward compatibility, this pr change to only records `DATA_TYPE_MISMATCH_ERROR_MESSAGE` in the `RemoveTempResolvedColumn` check process , and move `failAnalysis` to `CheckAnalysis#checkAnalysis` ### Why are the changes needed? Fix analysis exception message compatibility. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions and add a new test case Closes #36746 from LuciferYang/SPARK-39354. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 89fdb8a6fb6a669c458891b3abeba236e64b1e89) Signed-off-by: Max Gekk <max.gekk@gmail.com> 02 June 2022, 10:06:26 UTC
fef5695 [SPARK-39346][SQL][3.3] Convert asserts/illegal state exception to internal errors on each phase ### What changes were proposed in this pull request? In the PR, I propose to catch asserts/illegal state exception on each phase of query execution: ANALYSIS, OPTIMIZATION, PLANNING, and convert them to a SparkException w/ the `INTERNAL_ERROR` error class. This is a backport of https://github.com/apache/spark/pull/36704. ### Why are the changes needed? To improve user experience with Spark SQL and unify representation of user-facing errors. ### Does this PR introduce _any_ user-facing change? No. The changes might affect users in corner cases only. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *KafkaMicroBatchV1SourceSuite" $ build/sbt "test:testOnly *KafkaMicroBatchV2SourceSuite" ``` Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 8894e785edae42a642351ad91e539324c39da8e4) Signed-off-by: Max Gekk <max.gekk@gmail.com> Closes #36742 from MaxGekk/wrapby-INTERNAL_ERROR-every-phase-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 02 June 2022, 09:24:51 UTC
4bbaf37 [SPARK-38675][CORE] Fix race during unlock in BlockInfoManager ### What changes were proposed in this pull request? This PR fixes a race in the `BlockInfoManager` between `unlock` and `releaseAllLocksForTask`, resulting in a negative reader count for a block (which trips an assert). This happens when the following events take place: 1. [THREAD 1] calls `releaseAllLocksForTask`. This starts by collecting all the blocks to be unlocked for this task. 2. [THREAD 2] calls `unlock` for a read lock for the same task (this means the block is also in the list collected in step 1). It then proceeds to unlock the block by decrementing the reader count. 3. [THREAD 1] now starts to release the collected locks, it does this by decrementing the readers counts for blocks by the number of acquired read locks. The problem is that step 2 made the lock counts for blocks incorrect, and we decrement by one (or a few) too many. This triggers a negative reader count assert. We fix this by adding a check to `unlock` that makes sure we are not in the process of unlocking. We do this by checking if there is a multiset associated with the task that contains the read locks. ### Why are the changes needed? It is a bug. Not fixing this can cause negative reader counts for blocks, and this causes task failures. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a regression test in BlockInfoManager suite. Closes #35991 from hvanhovell/SPARK-38675. Authored-by: Herman van Hovell <herman@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 078b505d2f0a0a4958dec7da816a7d672820b637) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 June 2022, 08:48:26 UTC
2268665 [SPARK-39360][K8S] Remove deprecation of `spark.kubernetes.memoryOverheadFactor` and recover doc ### What changes were proposed in this pull request? This PR aims to avoid the deprecation of `spark.kubernetes.memoryOverheadFactor` from Apache Spark 3.3. In addition, also recovers the documentation which is removed mistakenly at the `deprecation`. `Deprecation` is not a removal. ### Why are the changes needed? - Apache Spark 3.3.0 RC complains always about `spark.kubernetes.memoryOverheadFactor` because the configuration has the default value (which is not given by the users). There is no way to remove the warnings which means the directional message is not helpful and makes the users confused in a wrong way. In other words, we still get warnings even we use only new configurations or no configuration. ``` 22/06/01 23:53:49 WARN SparkConf: The configuration key 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 and may be removed in the future. Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor 22/06/01 23:53:49 WARN SparkConf: The configuration key 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 and may be removed in the future. Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor 22/06/01 23:53:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22/06/01 23:53:50 WARN SparkConf: The configuration key 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 and may be removed in the future. Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor ``` - The minimum constraint is slightly different because `spark.kubernetes.memoryOverheadFactor` allowed 0 since Apache Spark 2.4 while new configurations disallow `0`. - This documentation removal might be too early because the deprecation is not the removal of configuration. This PR recoveres the removed doc and added the following. ``` This will be overridden by the value set by <code>spark.driver.memoryOverheadFactor</code> and <code>spark.executor.memoryOverheadFactor</code> explicitly. ``` ### Does this PR introduce _any_ user-facing change? No. This is a consistent with the existing behavior. ### How was this patch tested? Pass the CIs. Closes #36744 from dongjoon-hyun/SPARK-39360. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6d43556089a21b26d1a7590fbe1e25bd1ca7cedd) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 02 June 2022, 03:10:08 UTC
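A hedged configuration sketch of the override relationship described above (the property names come from the commit message; the factor values are made up for illustration):

```scala
import org.apache.spark.SparkConf

// The K8s-specific factor stays supported (deprecation is not removal), but the
// newer per-role settings take precedence when they are set explicitly.
val conf = new SparkConf()
  .set("spark.kubernetes.memoryOverheadFactor", "0.1")
  .set("spark.driver.memoryOverheadFactor", "0.2")   // overrides the K8s factor for the driver
  .set("spark.executor.memoryOverheadFactor", "0.2") // overrides the K8s factor for executors
```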
37aa079 [SPARK-39040][SQL][FOLLOWUP] Use a unique table name in conditional-functions.sql ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/36376, to use a unique table name in the test. `t` is a quite common table name and may make test environment unstable. ### Why are the changes needed? make tests more stable ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #36739 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4f672db5719549c522a24cffe7b4d0c1e0cb859b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 June 2022, 01:32:02 UTC
000270a [SPARK-39313][SQL] `toCatalystOrdering` should fail if V2Expression can not be translated ### What changes were proposed in this pull request? After reading the code changes in #35657, I guess the original intention of changing the return type of `V2ExpressionUtils.toCatalyst` from `Expression` to `Option[Expression]` is: for reading, Spark can ignore unrecognized distribution and ordering, but for writing, it should always be strict. Specifically, `V2ExpressionUtils.toCatalystOrdering` should fail if a V2Expression can not be translated, instead of returning an empty Seq. ### Why are the changes needed? `V2ExpressionUtils.toCatalystOrdering` is used by `DistributionAndOrderingUtils`; the current behavior will break the semantics of `RequiresDistributionAndOrdering#requiredOrdering` in some cases (see UT). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT. Closes #36697 from pan3793/SPARK-39313. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Chao Sun <sunchao@apple.com> 01 June 2022, 16:53:17 UTC
1ad1c18 [SPARK-39283][CORE] Fix deadlock between TaskMemoryManager and UnsafeExternalSorter.SpillableIterator ### What changes were proposed in this pull request? This PR fixes a deadlock between TaskMemoryManager and UnsafeExternalSorter.SpillableIterator. ### Why are the changes needed? We are facing a deadlock issue between TaskMemoryManager and UnsafeExternalSorter.SpillableIterator during joins. It turns out that the `UnsafeExternalSorter.SpillableIterator#spill()` function takes locks on the `SpillableIterator` and on `UnsafeExternalSorter`, and calls `freePage` to free all allocated pages except the last one, which takes the lock on TaskMemoryManager. At the same time, another `MemoryConsumer` using `UnsafeExternalSorter` as part of sorting can try to `allocatePage`, which needs the lock on `TaskMemoryManager`; that can trigger a spill, which requires the lock on `UnsafeExternalSorter` again, causing a deadlock. There is a similar fix here as well: https://issues.apache.org/jira/browse/SPARK-27338 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #36680 from sandeepvinayak/SPARK-39283. Authored-by: sandeepvinayak <sandeep.pal@outlook.com> Signed-off-by: Josh Rosen <joshrosen@databricks.com> (cherry picked from commit 8d0c035f102b005c2e85f03253f1c0c24f0a539f) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 31 May 2022, 22:29:05 UTC
b8904c3 [SPARK-39341][K8S] KubernetesExecutorBackend should allow IPv6 pod IP ### What changes were proposed in this pull request? This PR aims to make KubernetesExecutorBackend allow IPv6 pod IP. ### Why are the changes needed? The `hostname` comes from `SPARK_EXECUTOR_POD_IP`. ``` resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh: --hostname $SPARK_EXECUTOR_POD_IP ``` `SPARK_EXECUTOR_POD_IP` comes from `status.podIP` where it does not have `[]` in case of IPv6. https://github.com/apache/spark/blob/1a54a2bd69e35ab5f0cbd83df673c6f1452df418/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L140-L145 - https://kubernetes.io/docs/concepts/services-networking/dual-stack/ - https://en.wikipedia.org/wiki/IPv6_address ### Does this PR introduce _any_ user-facing change? No, this PR removes only the `[]` constraint from `checkHost`. ### How was this patch tested? Pass the CIs. Closes #36728 from williamhyun/IPv6. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7bb009888eec416eef36587546d4c0ab0077bcf5) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 May 2022, 20:07:22 UTC
16788ee [SPARK-39322][CORE][FOLLOWUP] Revise log messages for dynamic allocation and shuffle decommission ### What changes were proposed in this pull request? This PR is a follow-up for #36705 to revise the missed log message change. ### Why are the changes needed? Like the documentation, this PR updates the log message correspondingly. - Lower log level to `INFO` from `WARN` - Provide a specific message according to the configurations. ### Does this PR introduce _any_ user-facing change? No. This is a log-message-only change. ### How was this patch tested? Pass the CIs. Closes #36725 from dongjoon-hyun/SPARK-39322-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit be2c8c0115861e6975b658a7b0455bae828b7553) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 May 2022, 10:54:50 UTC
bb03292 [SPARK-39334][BUILD] Exclude `slf4j-reload4j` from `hadoop-minikdc` test dependency ### What changes were proposed in this pull request? [HADOOP-18088 Replace log4j 1.x with reload4j](https://issues.apache.org/jira/browse/HADOOP-18088) , this pr adds the exclusion of `slf4j-reload4j` for `hadoop-minikdc` to clean up waring message about `Class path contains multiple SLF4J bindings` when run UTs. ### Why are the changes needed? Cleanup `Class path contains multiple SLF4J bindings` waring when run UTs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GA - Manual test for example, run `mvn clean test -pl core` **Before** ``` [INFO] Running test.org.apache.spark.Java8RDDAPISuite SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/Users/xxx/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.17.2/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/Users/xxx/.m2/repository/org/slf4j/slf4j-reload4j/1.7.36/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] ``` **After** no above warnings Closes #36721 from LuciferYang/SPARK-39334. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 93b8cc05d582fe4be1a3cd9452708f18e728f0bb) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 May 2022, 08:42:23 UTC
109904e [SPARK-39327][K8S] ExecutorRollPolicy.ID should consider ID as a numerical value ### What changes were proposed in this pull request? This PR aims to make `ExecutorRollPolicy.ID` consider the ID as a numerical value. Currently, the ExecutorRollPolicy chooses the smallest ID from string sorting. ### Does this PR introduce _any_ user-facing change? No, 3.3.0 is not released yet. ### How was this patch tested? Pass the CIs. Closes #36715 from williamhyun/SPARK-39327. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 97f4b0cc1b20ca641d0e968e0b0fb45557085115) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 29 May 2022, 07:29:44 UTC
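A small illustration of the string-versus-numeric ordering problem behind this change (plain Scala, not the Spark code itself):

```scala
// Executor IDs compared as strings pick "10" as the smallest; compared as
// numbers, 2 is the smallest.
val ids = Seq("9", "10", "2")
val smallestAsString = ids.min              // "10" (lexicographic)
val smallestAsNumber = ids.map(_.toInt).min // 2
```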
ad8c867 [SPARK-39322][DOCS] Remove `Experimental` from `spark.dynamicAllocation.shuffleTracking.enabled` ### What changes were proposed in this pull request? This PR aims to remove `Experimental` from `spark.dynamicAllocation.shuffleTracking.enabled`. ### Why are the changes needed? `spark.dynamicAllocation.shuffleTracking.enabled` was added at Apache Spark 3.0.0 and has been used with K8s resource manager. ### Does this PR introduce _any_ user-facing change? No, this is a documentation only change. ### How was this patch tested? Manual. Closes #36705 from dongjoon-hyun/SPARK-39322. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 May 2022, 00:58:46 UTC
4a43b4d [SPARK-36681][CORE][TESTS][FOLLOW-UP] Handle LinkageError when Snappy native library is not available in low Hadoop versions ### What changes were proposed in this pull request? This is a follow-up to https://github.com/apache/spark/pull/36136 to fix `LinkageError` handling in `FileSuite` to avoid test suite abort when Snappy native library is not available in low Hadoop versions: ``` 23:16:22 FileSuite: 23:16:22 org.apache.spark.FileSuite *** ABORTED *** 23:16:22 java.lang.RuntimeException: Unable to load a Suite class that was discovered in the runpath: org.apache.spark.FileSuite 23:16:22 at org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:81) 23:16:22 at org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38) 23:16:22 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 23:16:22 at scala.collection.Iterator.foreach(Iterator.scala:941) 23:16:22 at scala.collection.Iterator.foreach$(Iterator.scala:941) 23:16:22 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) 23:16:22 at scala.collection.IterableLike.foreach(IterableLike.scala:74) 23:16:22 at scala.collection.IterableLike.foreach$(IterableLike.scala:73) 23:16:22 at scala.collection.AbstractIterable.foreach(Iterable.scala:56) 23:16:22 at scala.collection.TraversableLike.map(TraversableLike.scala:238) 23:16:22 ... 23:16:22 Cause: java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z 23:16:22 at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method) 23:16:22 at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63) 23:16:22 at org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:136) 23:16:22 at org.apache.spark.FileSuite.$anonfun$new$12(FileSuite.scala:145) 23:16:22 at scala.util.Try$.apply(Try.scala:213) 23:16:22 at org.apache.spark.FileSuite.<init>(FileSuite.scala:141) 23:16:22 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) 23:16:22 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) 23:16:22 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 23:16:22 at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ``` Scala's `Try` can handle only `NonFatal` throwables. ### Why are the changes needed? To make the tests robust. ### Does this PR introduce _any_ user-facing change? Nope, this is test-only. ### How was this patch tested? Manual test. Closes #36687 from peter-toth/SPARK-36681-handle-linkageerror. Authored-by: Peter Toth <ptoth@cloudera.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit dbde77856d2e51ff502a7fc1dba7f10316c2211b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 May 2022, 17:35:13 UTC
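As the commit notes, Scala's `Try` only catches `NonFatal` throwables, so a `LinkageError` escapes it and aborts the suite. A minimal sketch of handling the fatal error explicitly (an assumption about the shape of the fix, not the exact patch):

```scala
import scala.util.control.NonFatal

// Probe whether the native Snappy codec can be loaded without aborting the suite.
val snappyAvailable: Boolean =
  try {
    // placeholder for the actual codec probe
    true
  } catch {
    case _: LinkageError => false // e.g. UnsatisfiedLinkError from missing native libs
    case NonFatal(_)     => false
  }
```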
2faaf8a [SPARK-39272][SQL][3.3] Increase the start position of query context by 1 ### What changes were proposed in this pull request? Increase the start position of query context by 1 ### Why are the changes needed? Currently, the line number starts from 1, while the start position starts from 0. Thus it's better to increase the start position by 1 for consistency. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #36684 from gengliangwang/portSPARK-39234. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 26 May 2022, 13:05:29 UTC
e24859b [SPARK-39253][DOCS][PYTHON][3.3] Improve PySpark API reference to be more readable ### What changes were proposed in this pull request? Hotfix https://github.com/apache/spark/pull/36647 for branch-3.3. ### Why are the changes needed? The improvement of document readability will also improve the usability for PySpark. ### Does this PR introduce _any_ user-facing change? Yes, now the documentation is categorized by its class or their own purpose more clearly as below: <img width="270" alt="Screen Shot 2022-05-24 at 1 50 23 PM" src="https://user-images.githubusercontent.com/44108233/169951517-f8b9cb72-7408-46d6-8cd7-15ae890a7a7f.png"> ### How was this patch tested? The existing test should cover. Closes #36685 from itholic/SPARK-39253-3.3. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 26 May 2022, 10:35:35 UTC
997e7f0 [SPARK-39234][SQL][3.3] Code clean up in SparkThrowableHelper.getMessage ### What changes were proposed in this pull request? 1. Remove the starting "\n" in `Origin.context`. The "\n" will be appended in the method `SparkThrowableHelper.getMessage` instead. 2. Clean up the method SparkThrowableHelper.getMessage to eliminate redundant code. ### Why are the changes needed? Code clean up to eliminate redundant code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #36669 from gengliangwang/portSPARK-39234. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 26 May 2022, 07:25:40 UTC
92e82fd [SPARK-39293][SQL] Fix the accumulator of ArrayAggregate to handle complex types properly ### What changes were proposed in this pull request? Fix the accumulator of `ArrayAggregate` to handle complex types properly. The accumulator of `ArrayAggregate` should copy the intermediate result if it is a string, struct, array, or map. ### Why are the changes needed? If the intermediate data of `ArrayAggregate` holds reusable data, the result will be duplicated. ```scala import org.apache.spark.sql.functions._ val reverse = udf((s: String) => s.reverse) val df = Seq(Array("abc", "def")).toDF("array") val testArray = df.withColumn( "agg", aggregate( col("array"), array().cast("array<string>"), (acc, s) => concat(acc, array(reverse(s))))) testArray.show(truncate=false) ``` should be: ``` +----------+----------+ |array |agg | +----------+----------+ |[abc, def]|[cba, fed]| +----------+----------+ ``` but: ``` +----------+----------+ |array |agg | +----------+----------+ |[abc, def]|[fed, fed]| +----------+----------+ ``` ### Does this PR introduce _any_ user-facing change? Yes, this fixes the correctness issue. ### How was this patch tested? Added a test. Closes #36674 from ueshin/issues/SPARK-39293/array_aggregate. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit d6a11cb4b411c8136eb241aac167bc96990f5421) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 26 May 2022, 01:36:12 UTC
6c4e07d [SPARK-39255][SQL][3.3] Improve error messages ### What changes were proposed in this pull request? In the PR, I propose to improve errors of the following error classes: 1. NON_PARTITION_COLUMN - `a non-partition column name` -> `the non-partition column` 2. UNSUPPORTED_SAVE_MODE - `a not existent path` -> `a non existent path`. 3. INVALID_FIELD_NAME. Quote ids to follow the rules https://github.com/apache/spark/pull/36621. 4. FAILED_SET_ORIGINAL_PERMISSION_BACK. It is renamed to RESET_PERMISSION_TO_ORIGINAL. 5. NON_LITERAL_PIVOT_VALUES - Wrap error's expression by double quotes. The PR adds new helper method `toSQLExpr()` for that. 6. CAST_INVALID_INPUT - Add the recommendation: `... Correct the syntax for the value before casting it, or change the type to one appropriate for the value.` This is a backport of https://github.com/apache/spark/pull/36635. ### Why are the changes needed? To improve user experience with Spark SQL by making error message more clear. ### Does this PR introduce _any_ user-facing change? Yes, it changes user-facing error messages. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" $ build/sbt "sql/testOnly *QueryCompilationErrorsDSv2Suite" $ build/sbt "sql/testOnly *QueryCompilationErrorsSuite" $ build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite" $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite" $ build/sbt "sql/testOnly *QueryParsingErrorsSuite*" ``` Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 625afb4e1aefda59191d79b31f8c94941aedde1e) Signed-off-by: Max Gekk <max.gekk@gmail.com> Closes #36655 from MaxGekk/error-class-improve-msg-3-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 25 May 2022, 16:36:55 UTC
37a2416 [SPARK-39252][PYSPARK][TESTS] Remove flaky test_df_is_empty ### What changes were proposed in this pull request? ### Why are the changes needed? This PR removes flaky `test_df_is_empty` as reported in https://issues.apache.org/jira/browse/SPARK-39252. I will open a follow-up PR to reintroduce the test and fix the flakiness (or see if it was a regression). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #36656 from sadikovi/SPARK-39252. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9823bb385cd6dca7c4fb5a6315721420ad42f80a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 May 2022, 02:40:02 UTC
a33d697 [SPARK-39273][PS][TESTS] Make PandasOnSparkTestCase inherit ReusedSQLTestCase ### What changes were proposed in this pull request? This PR proposes to make `PandasOnSparkTestCase` inherit `ReusedSQLTestCase`. ### Why are the changes needed? We don't need this: ```python @classmethod def tearDownClass(cls): # We don't stop Spark session to reuse across all tests. # The Spark session will be started and stopped at PyTest session level. # Please see pyspark/pandas/conftest.py. pass ``` anymore in Apache Spark. This existed to speed up the tests when the code was in the Koalas repository, where the tests run sequentially in a single process. In Apache Spark, we run in multiple processes, and we don't need this anymore. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Existing CI should test it out. Closes #36652 from HyukjinKwon/SPARK-39273. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a6dd6076d708713d11585bf7f3401d522ea48822) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 May 2022, 00:56:42 UTC
d491e39 Preparing development version 3.3.1-SNAPSHOT 24 May 2022, 10:15:35 UTC
a725927 Preparing Spark release v3.3.0-rc3 24 May 2022, 10:15:29 UTC
459c4b0 [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up ### What changes were proposed in this pull request? When we have nested subquery expressions, there is a chance that relation deduplication could replace an attribute with a wrong one. This is because the attribute replacement is done top down rather than bottom up. This could happen if the subplan gets relation deduplication first (thus two instances of the same relation with different attribute IDs), and then a more complex plan built on top of the subplan (e.g. a UNION of queries with nested subquery expressions) triggers this wrong attribute replacement. For a concrete example please see the added unit test. ### Why are the changes needed? This is a bug that we can fix. Without this PR, we could hit an error that an outer attribute reference does not exist in the outer relation in certain scenarios. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #36503 from amaliujia/testnestedsubqueryexpression. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d9fd36eb76fcfec95763cc4dc594eb7856b0fad2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 May 2022, 05:06:21 UTC
505248d [SPARK-39258][TESTS] Fix `Hide credentials in show create table` ### What changes were proposed in this pull request? [SPARK-35378-FOLLOWUP](https://github.com/apache/spark/pull/36632) changes the return value of `CommandResultExec.executeCollect()` from `InternalRow` to `UnsafeRow`, this change causes the result of `r.tostring` in the following code: https://github.com/apache/spark/blob/de73753bb2e5fd947f237e731ff05aa9f2711677/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L1143-L1148 change from ``` [CREATE TABLE tab1 ( NAME STRING, THEID INT) USING org.apache.spark.sql.jdbc OPTIONS ( 'dbtable' = 'TEST.PEOPLE', 'password' = '*********(redacted)', 'url' = '*********(redacted)', 'user' = 'testUser') ] ``` to ``` [0,10000000d5,5420455441455243,62617420454c4241,414e20200a282031,4e4952545320454d,45485420200a2c47,a29544e49204449,726f20474e495355,6568636170612e67,732e6b726170732e,a6362646a2e6c71,20534e4f4954504f,7462642720200a28,203d2027656c6261,45502e5453455427,200a2c27454c504f,6f77737361702720,2a27203d20276472,2a2a2a2a2a2a2a2a,6574636164657228,2720200a2c272964,27203d20276c7275,2a2a2a2a2a2a2a2a,746361646572282a,20200a2c27296465,3d20277265737527,7355747365742720,a29277265] ``` and the UT `JDBCSuite$Hide credentials in show create table` failed in master branch. This pr is change to use `executeCollectPublic()` instead of `executeCollect()` to fix this UT. ### Why are the changes needed? Fix UT failed in mater branch after [SPARK-35378-FOLLOWUP](https://github.com/apache/spark/pull/36632) ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? - GitHub Action pass - Manual test Run `mvn clean install -DskipTests -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.JDBCSuite` **Before** ``` - Hide credentials in show create table *** FAILED *** "[0,10000000d5,5420455441455243,62617420454c4241,414e20200a282031,4e4952545320454d,45485420200a2c47,a29544e49204449,726f20474e495355,6568636170612e67,732e6b726170732e,a6362646a2e6c71,20534e4f4954504f,7462642720200a28,203d2027656c6261,45502e5453455427,200a2c27454c504f,6f77737361702720,2a27203d20276472,2a2a2a2a2a2a2a2a,6574636164657228,2720200a2c272964,27203d20276c7275,2a2a2a2a2a2a2a2a,746361646572282a,20200a2c27296465,3d20277265737527,7355747365742720,a29277265]" did not contain "TEST.PEOPLE" (JDBCSuite.scala:1146) ``` **After** ``` Run completed in 24 seconds, 868 milliseconds. Total number of tests run: 93 Suites: completed 2, aborted 0 Tests: succeeded 93, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #36637 from LuciferYang/SPARK-39258. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6eb15d12ae6bd77412dbfbf46eb8dbeec1eab466) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2022, 21:28:19 UTC
2a31bf5 [MINOR][ML][DOCS] Fix sql data types link in the ml-pipeline page ### What changes were proposed in this pull request? <img width="939" alt="image" src="https://user-images.githubusercontent.com/8326978/169767919-6c48554c-87ff-4d40-a47d-ec4da0c993f7.png"> [Spark SQL datatype reference](https://spark.apache.org/docs/latest/sql-reference.html#data-types) - `https://spark.apache.org/docs/latest/sql-reference.html#data-types` is invalid and it shall be [Spark SQL datatype reference](https://spark.apache.org/docs/latest/sql-ref-datatypes.html) - `https://spark.apache.org/docs/latest/sql-ref-datatypes.html` https://spark.apache.org/docs/latest/ml-pipeline.html#dataframe ### Why are the changes needed? doc fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? `bundle exec jekyll serve` Closes #36633 from yaooqinn/minor. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit de73753bb2e5fd947f237e731ff05aa9f2711677) Signed-off-by: huaxingao <huaxin_gao@apple.com> 23 May 2022, 14:46:24 UTC
0f13606 [SPARK-35378][SQL][FOLLOW-UP] Fix incorrect return type in CommandResultExec.executeCollect() ### What changes were proposed in this pull request? This PR is a follow-up for https://github.com/apache/spark/pull/32513 and fixes an issue introduced by that patch. CommandResultExec is supposed to return `UnsafeRow` records in all of the `executeXYZ` methods but `executeCollect` was left out which causes issues like this one: ``` Error in SQL statement: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow ``` We need to return `unsafeRows` instead of `rows` in `executeCollect` similar to other methods in the class. ### Why are the changes needed? Fixes a bug in CommandResultExec. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a unit test to check the return type of all commands. Closes #36632 from sadikovi/fix-command-exec. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a0decfc7db68c464e3ba2c2fb0b79a8b0c464684) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 May 2022, 08:58:56 UTC
047c108 [SPARK-39250][BUILD] Upgrade Jackson to 2.13.3 ### What changes were proposed in this pull request? This PR aims to upgrade Jackson to 2.13.3. ### Why are the changes needed? Although Spark is not affected, Jackson 2.13.0~2.13.2 has the following regression which affects the user apps. - https://github.com/FasterXML/jackson-databind/issues/3446 Here is a full release note. - https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.13.3 ### Does this PR introduce _any_ user-facing change? No. The previous version is not released yet. ### How was this patch tested? Pass the CIs. Closes #36627 from dongjoon-hyun/SPARK-39250. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 73438c048fc646f944415ba2e99cb08cc57d856b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 May 2022, 21:13:30 UTC
fa400c6 [SPARK-39243][SQL][DOCS] Rules of quoting elements in error messages ### What changes were proposed in this pull request? In the PR, I propose to describe the rules of quoting elements in error messages introduced by the PRs: - https://github.com/apache/spark/pull/36210 - https://github.com/apache/spark/pull/36233 - https://github.com/apache/spark/pull/36259 - https://github.com/apache/spark/pull/36324 - https://github.com/apache/spark/pull/36335 - https://github.com/apache/spark/pull/36359 - https://github.com/apache/spark/pull/36579 ### Why are the changes needed? To improve code maintenance, and the process of code review. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By existing GAs. Closes #36621 from MaxGekk/update-error-class-guide. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 2a4d8a4ea709339175257027e31a75bdeed5daec) Signed-off-by: Max Gekk <max.gekk@gmail.com> 22 May 2022, 15:58:36 UTC
3f77be2 [SPARK-39240][INFRA][BUILD] Source and binary releases using different tool to generate hashes for integrity ### What changes were proposed in this pull request? unify the hash generator for release files. ### Why are the changes needed? Currently, we use `shasum` for source but `gpg` for binary, since https://github.com/apache/spark/pull/30123 this confuses me when validating the integrities of spark 3.3.0 RC https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/ ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test script manually Closes #36619 from yaooqinn/SPARK-39240. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 3e783375097d14f1c28eb9b0e08075f1f8daa4a2) Signed-off-by: Sean Owen <srowen@gmail.com> 20 May 2022, 15:54:59 UTC
ab057c7 [SPARK-39237][DOCS] Update the ANSI SQL mode documentation ### What changes were proposed in this pull request? 1. Remove the Experimental notation in the ANSI SQL compliance doc 2. Update the description of `spark.sql.ansi.enabled`, since the ANSI reserved keyword enforcement is disabled by default now ### Why are the changes needed? 1. The ANSI SQL dialect was GAed in the Spark 3.2 release: https://spark.apache.org/releases/spark-release-3-2-0.html We should not mark it as "Experimental" in the doc. 2. The ANSI reserved keyword enforcement is disabled by default now ### Does this PR introduce _any_ user-facing change? No, just a doc change ### How was this patch tested? Doc preview: <img width="700" alt="image" src="https://user-images.githubusercontent.com/1097932/169444094-de9c33c2-1b01-4fc3-b583-b752c71e16d8.png"> <img width="1435" alt="image" src="https://user-images.githubusercontent.com/1097932/169472239-1edf218f-1f7b-48ec-bf2a-5d043600f1bc.png"> Closes #36614 from gengliangwang/updateAnsiDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 86a351c13d62644d596cc5249fc1c45d318a0bbf) Signed-off-by: Gengliang Wang <gengliang@apache.org> 20 May 2022, 08:58:36 UTC
e8e330f [SPARK-39218][SS][PYTHON] Make foreachBatch streaming query stop gracefully ### What changes were proposed in this pull request? This PR proposes to make the `foreachBatch` streaming query stop gracefully by handling the interrupted exceptions at `StreamExecution.isInterruptionException`. Because there is no straightforward way to access to the original JVM exception, here we rely on string pattern match for now (see also "Why are the changes needed?" below). There is only one place from Py4J https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/protocol.py#L326-L328 so the approach would work at least. ### Why are the changes needed? In `foreachBatch`, the Python user-defined function in the microbatch runs till the end even when `StreamingQuery.stop` is invoked. However, when any Py4J access is attempted within the user-defined function: - With the pinned thread mode disabled, the interrupt exception is not blocked, and the Python function is executed till the end in a different thread. - With the pinned thread mode enabled, the interrupt exception is raised in the same thread, and the Python thread raises a Py4J exception in the same thread. The latter case is a problem because the interrupt exception is first thrown from JVM side (`java.lang. InterruptedException`) -> Python callback server (`py4j.protocol.Py4JJavaError`) -> JVM (`py4j.Py4JException`), and `py4j.Py4JException` is not listed in `StreamExecution.isInterruptionException` which doesn't gracefully stop the query. Therefore, we should handle this exception at `StreamExecution.isInterruptionException`. ### Does this PR introduce _any_ user-facing change? Yes, it will make the query gracefully stop. ### How was this patch tested? Manually tested with: ```python import time def func(batch_df, batch_id): time.sleep(10) print(batch_df.count()) q = spark.readStream.format("rate").load().writeStream.foreachBatch(func).start() time.sleep(5) q.stop() ``` Closes #36589 from HyukjinKwon/SPARK-39218. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 499de87b77944157828a6d905d9b9df37b7c9a67) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 May 2022, 04:02:37 UTC
c3a171d [SPARK-38681][SQL] Support nested generic case classes ### What changes were proposed in this pull request? Master and branch-3.3 will fail to derive a schema for case classes with generic parameters if the parameter was not used directly as a field, but instead passed on as a generic parameter to another type, e.g. ``` case class NestedGeneric[T]( generic: GenericData[T]) ``` This is a regression from the latest release 3.2.1, where this works as expected. ### Why are the changes needed? Support more general case classes that users might have. ### Does this PR introduce _any_ user-facing change? Better support for generic case classes. ### How was this patch tested? New specs in ScalaReflectionSuite and ExpressionEncoderSuite. All the new test cases that do not use value classes pass if added to the 3.2 branch. Closes #36004 from eejbyfeldt/SPARK-38681-nested-generic. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 49c68020e702f9258f3c693f446669bea66b12f4) Signed-off-by: Sean Owen <srowen@gmail.com> 20 May 2022, 00:12:47 UTC
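A sketch of the case-class shape that previously failed encoder/schema derivation (`GenericData` is the placeholder name from the PR description; the commented usage line assumes `spark.implicits._` is in scope):

```scala
// A generic parameter that is only passed on to another generic type:
case class GenericData[T](value: T)
case class NestedGeneric[T](generic: GenericData[T])

// With the fix, deriving a Dataset works again, e.g.:
// val ds = Seq(NestedGeneric(GenericData(1))).toDS()
```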
2977791 [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators ### What changes were proposed in this pull request? 1. Explicitly return in GeneratorNestedColumnAliasing when the generator is not Explode. 2. Add extensive comment to GeneratorNestedColumnAliasing. 3. An off-hand code refactor to make the code clearer. ### Why are the changes needed? GeneratorNestedColumnAliasing does not handle other generators correctly. We only try to rewrite the generator for Explode but try to rewrite all ExtractValue expressions. This can cause inconsistency for non-Explode generators. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests. Closes #35850 from minyyy/gnca_non_explode. Authored-by: minyyy <min.yang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 026102489b8edce827a05a1dba3b0ef8687f134f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 May 2022, 14:52:30 UTC
24a3fa9 [SPARK-39233][SQL] Remove the check for TimestampNTZ output in Analyzer ### What changes were proposed in this pull request? In [#36094](https://github.com/apache/spark/pull/36094), a check for failing TimestampNTZ type output(since we are disabling TimestampNTZ in 3.3) is added: ``` case operator: LogicalPlan if !Utils.isTesting && operator.output.exists(attr => attr.resolved && attr.dataType.isInstanceOf[TimestampNTZType]) => operator.failAnalysis("TimestampNTZ type is not supported in Spark 3.3.") ``` However, the check can cause misleading error messages. In 3.3: ``` > sql( "select date '2018-11-17' > 1").show() org.apache.spark.sql.AnalysisException: Invalid call to toAttribute on unresolved object; 'Project [unresolvedalias((2018-11-17 > 1), None)] +- OneRowRelation at org.apache.spark.sql.catalyst.analysis.UnresolvedAlias.toAttribute(unresolved.scala:510) at org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:70) ``` In master or 3.2 ``` > sql( "select date '2018-11-17' > 1").show() org.apache.spark.sql.AnalysisException: cannot resolve '(DATE '2018-11-17' > 1)' due to data type mismatch: differing types in '(DATE '2018-11-17' > 1)' (date and int).; line 1 pos 7; 'Project [unresolvedalias((2018-11-17 > 1), None)] +- OneRowRelation at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ``` We should just remove the check to avoid such regression. It's not necessary for disabling TimestampNTZ anyway. ### Why are the changes needed? Fix regression in the error output of analysis check. ### Does this PR introduce _any_ user-facing change? No, it is not released yet. ### How was this patch tested? Build and try on `spark-shell` Closes #36609 from gengliangwang/fixRegression. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 19 May 2022, 12:34:50 UTC
88c076d [SPARK-39229][SQL][3.3] Separate query contexts from error-classes.json ### What changes were proposed in this pull request? Separate query contexts for runtime errors from error-classes.json. ### Why are the changes needed? The message in the JSON file should only contain the parameters that are explicitly thrown. It is more elegant to separate query contexts from error-classes.json. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #36607 from gengliangwang/SPARK-39229-3.3. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 19 May 2022, 12:02:44 UTC
e088c82 [SPARK-39212][SQL][3.3] Use double quotes for values of SQL configs/DS options in error messages ### What changes were proposed in this pull request? Wrap values of SQL configs and datasource options in error messages by double quotes. Added the `toDSOption()` method to `QueryErrorsBase` to quote DS options. This is a backport of https://github.com/apache/spark/pull/36579. ### Why are the changes needed? 1. To highlight SQL config/DS option values and make them more visible for users. 2. To be able to easily parse values from error text. 3. To be consistent to other outputs of identifiers, sql statement and etc. where Spark uses quotes or ticks. ### Does this PR introduce _any_ user-facing change? Yes, it changes user-facing error messages. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "testOnly *QueryCompilationErrorsSuite" $ build/sbt "testOnly *QueryExecutionAnsiErrorsSuite" $ build/sbt "testOnly *QueryExecutionErrorsSuite" ``` Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 96f4b7dbc1facd1a38be296263606aa312861c95) Signed-off-by: Max Gekk <max.gekk@gmail.com> Closes #36600 from MaxGekk/move-ise-from-query-errors-3.3-2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 19 May 2022, 08:51:20 UTC
669fc1b [SPARK-39216][SQL] Do not collapse projects in CombineUnions if it hasCorrelatedSubquery ### What changes were proposed in this pull request? Makes `CombineUnions` not collapse projects if they contain a correlated subquery. For example: ```sql SELECT (SELECT IF(x, 1, 0)) AS a FROM (SELECT true) t(x) UNION SELECT 1 AS a ``` It will throw an exception: ``` java.lang.IllegalStateException: Couldn't find x#4 in [] ``` ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #36595 from wangyum/SPARK-39216. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 85bb7bf008d0346feaedc2aab55857d8f1b19908) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 May 2022, 06:37:33 UTC
77b1313 [SPARK-39226][SQL] Fix the precision of the return type of round-like functions ### What changes were proposed in this pull request? Currently, the precision of the return type of round-like functions (round, bround, ceil, floor) with decimal inputs has a few problems: 1. It does not reserve one more digit in the integral part, in the case of rounding. As a result, `CEIL(CAST(99 AS DECIMAL(2, 0)), -1)` fails. 2. It should return a more accurate precision, to account for the scale loss. For example, `CEIL(1.23456, 1)` does not need to keep the precision as 7 in the result. 3. `round`/`bround` with a negative scale fails if the input is of decimal type. This is not a bug but a little weird. This PR fixes these issues by correcting the formula for calculating the returned decimal type. ### Why are the changes needed? Fix bugs ### Does this PR introduce _any_ user-facing change? Yes. The new functions in 3.3, `ceil`/`floor` with a scale parameter, can report a more accurate precision in the result type, and can run certain queries which failed before. The old functions `round` and `bround` now support a negative scale parameter with decimal inputs. ### How was this patch tested? New tests Closes #36598 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit ee0aecca05af9b0cb256fd81a78430958a09d19f) Signed-off-by: Gengliang Wang <gengliang@apache.org> 19 May 2022, 05:40:33 UTC
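Two spark-shell checks of the behavior described above (assuming an active SparkSession named `spark`; the first query is taken from the commit message, the second is an illustrative assumption):

```scala
// Reserving one extra integral digit lets this succeed instead of overflowing:
spark.sql("SELECT CEIL(CAST(99 AS DECIMAL(2, 0)), -1)").show()      // 100
// round/bround with a negative scale on a decimal input now also works:
spark.sql("SELECT ROUND(CAST(123.45 AS DECIMAL(5, 2)), -1)").show() // 120
```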
69b7e1a [SPARK-39219][DOC] Promote Structured Streaming over DStream ### What changes were proposed in this pull request? This PR proposes to add NOTE section for DStream guide doc to promote Structured Streaming. Screenshot: <img width="992" alt="screenshot-spark-streaming-programming-guide-change" src="https://user-images.githubusercontent.com/1317309/168977732-4c32db9a-0fb1-4a82-a542-bf385e5f3683.png"> ### Why are the changes needed? We see efforts of community are more focused on Structured Streaming (based on Spark SQL) than Spark Streaming (DStream). We would like to encourage end users to use Structured Streaming than Spark Streaming whenever possible for their workloads. ### Does this PR introduce _any_ user-facing change? Yes, doc change. ### How was this patch tested? N/A Closes #36590 from HeartSaVioR/SPARK-39219. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 7d153392db2f61104da0af1cb175f4ee7c7fbc38) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 19 May 2022, 02:50:17 UTC
47c47b6 [SPARK-39214][SQL][3.3] Improve errors related to CAST ### What changes were proposed in this pull request? In the PR, I propose to rename the error classes: 1. INVALID_SYNTAX_FOR_CAST -> CAST_INVALID_INPUT 2. CAST_CAUSES_OVERFLOW -> CAST_OVERFLOW and change error messages: CAST_INVALID_INPUT: `The value <value> of the type <sourceType> cannot be cast to <targetType> because it is malformed. ...` CAST_OVERFLOW: `The value <value> of the type <sourceType> cannot be cast to <targetType> due to an overflow....` Also quote the SQL config `"spark.sql.ansi.enabled"` and a function name. This is a backport of https://github.com/apache/spark/pull/36553. ### Why are the changes needed? To improve user experience with Spark SQL by making errors/error classes related to CAST more clear and **unified**. ### Does this PR introduce _any_ user-facing change? Yes, the PR changes user-facing error messages. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "testOnly *CastSuite" $ build/sbt "test:testOnly *DateFormatterSuite" $ build/sbt "test:testOnly *TimestampFormatterSuite" $ build/sbt "testOnly *DSV2SQLInsertTestSuite" $ build/sbt "test:testOnly *InsertSuite" $ build/sbt "testOnly *AnsiCastSuiteWithAnsiModeOff" ``` Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 66648e96e4283223003c80a1a997325bbd27f940) Signed-off-by: Max Gekk <max.gekk@gmail.com> Closes #36591 from MaxGekk/error-class-improve-msg-2-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 18 May 2022, 18:39:53 UTC
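A hedged way to reproduce the renamed CAST_INVALID_INPUT error in spark-shell (assuming an active SparkSession named `spark`; the error text in the comment is paraphrased from the commit message, not copied from actual output):

```scala
spark.conf.set("spark.sql.ansi.enabled", true)
// Under ANSI mode, a malformed cast now fails with the CAST_INVALID_INPUT class.
spark.sql("SELECT CAST('a' AS INT)").show()
// Paraphrased message: the value 'a' of the type STRING cannot be cast to INT
// because it is malformed ...
```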
b5ce32f [SPARK-39162][SQL][3.3] Jdbc dialect should decide which function could be pushed down ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/36521 to 3.3. ### Why are the changes needed? Make function push-down more flexible. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? Existing tests. Closes #36556 from beliefer/SPARK-39162_3.3. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 May 2022, 12:29:01 UTC
ec6fc74 [SPARK-39210][SQL] Provide query context of Decimal overflow in AVG when WSCG is off ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/36525, this PR provides runtime error query context for the Average expression when WSCG is off. ### Why are the changes needed? Enhance the runtime error query context of Average function. After changes, it works when the whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT Closes #36582 from gengliangwang/fixAvgContext. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 8b5b3e95f8761af97255cbcba35c3d836a419dba) Signed-off-by: Gengliang Wang <gengliang@apache.org> 18 May 2022, 10:52:27 UTC
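A hedged sketch of the targeted scenario (table name and values are illustrative): with ANSI mode on and whole-stage codegen disabled, an overflowing decimal average should now report the failing query fragment.

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.conf.set("spark.sql.codegen.wholeStage", "false")   // force the non-WSCG path
spark.sql("CREATE TABLE avg_overflow(d DECIMAL(38, 18)) USING parquet")
spark.sql("INSERT INTO avg_overflow VALUES " +
  "(99999999999999999999.999999999999999999), (99999999999999999999.999999999999999999)")
// The decimal overflow raised while averaging should now carry the query context for AVG(d).
spark.sql("SELECT AVG(d) FROM avg_overflow").show()
```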
72eb58a [SPARK-39215][PYTHON] Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred ### What changes were proposed in this pull request? This PR proposes to reduce the number of Py4J calls at `pyspark.sql.utils.is_timestamp_ntz_preferred` by having a single method to check. ### Why are the changes needed? For better performance, and simplicity in the code. ### Does this PR introduce _any_ user-facing change? Yes, the number of Py4J calls will be reduced, and the driver side access will become faster. ### How was this patch tested? Existing tests should cover. Closes #36587 from HyukjinKwon/SPARK-39215. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 28e7764bbe6949b2a68ef1466e210ca6418a3018) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 May 2022, 09:21:24 UTC
0e998d3 [SPARK-39193][SQL] Speed up Timestamp type inference of JSON/CSV data sources ### What changes were proposed in this pull request? When reading JSON/CSV files with inferring timestamp types (`.option("inferTimestamp", true)`), the Timestamp conversion will throw and catch exceptions. As we are putting decent error messages in the exception: ``` def cannotCastToDateTimeError( value: Any, from: DataType, to: DataType, errorContext: String): Throwable = { val valueString = toSQLValue(value, from) new SparkDateTimeException("INVALID_SYNTAX_FOR_CAST", Array(toSQLType(to), valueString, SQLConf.ANSI_ENABLED.key, errorContext)) } ``` Throwing and catching the timestamp parsing exceptions is actually not cheap. It consumes more than 90% of the type inference time. This PR improves the default timestamp parsing by returning optional results instead of throwing/catching the exceptions. With this PR, the schema inference time is reduced by 90% in a local benchmark. Note this PR is for the default timestamp parser. It doesn't cover the scenarios where * users provide a customized timestamp format via an option * users enable the legacy timestamp formatter We can have follow-ups for those. ### Why are the changes needed? Performance improvement ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT. Also manually tested the runtime of inferring a 624MB JSON file with timestamp inference enabled: ``` spark.read.option("inferTimestamp", true).json(file) ``` Before the change, it takes 166 seconds. After the change, it takes only 16 seconds. Closes #36562 from gengliangwang/improveInferTS. Lead-authored-by: Gengliang Wang <gengliang@apache.org> Co-authored-by: Gengliang Wang <ltnwgl@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 888bf1b2ef44a27c3d4be716a72175bbaa8c6453) Signed-off-by: Gengliang Wang <gengliang@apache.org> 18 May 2022, 03:00:07 UTC
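An illustrative sketch of the pattern using plain java.time (not the actual Spark formatter code): report parse failure through a `ParsePosition` instead of an exception, so the hot inference loop never pays for throw/catch.

```scala
import java.text.ParsePosition
import java.time.format.DateTimeFormatter

val formatter = DateTimeFormatter.ISO_OFFSET_DATE_TIME

// Returns false on malformed input without ever constructing or catching an exception.
def looksLikeTimestamp(s: String): Boolean = {
  val pos = new ParsePosition(0)
  val parsed = formatter.parseUnresolved(s, pos)
  parsed != null && pos.getErrorIndex < 0 && pos.getIndex == s.length
}

looksLikeTimestamp("2022-05-18T03:00:07+00:00")   // true
looksLikeTimestamp("not a timestamp")             // false
```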
e52b048 [SPARK-39104][SQL] InMemoryRelation#isCachedColumnBuffersLoaded should be thread-safe ### What changes were proposed in this pull request? Add `synchronized` on method `isCachedColumnBuffersLoaded` ### Why are the changes needed? `isCachedColumnBuffersLoaded` should has `synchronized` wrapped, otherwise may cause NPE when modify `_cachedColumnBuffers` concurrently. ``` def isCachedColumnBuffersLoaded: Boolean = { _cachedColumnBuffers != null && isCachedRDDLoaded } def isCachedRDDLoaded: Boolean = { _cachedColumnBuffersAreLoaded || { val bmMaster = SparkEnv.get.blockManager.master val rddLoaded = _cachedColumnBuffers.partitions.forall { partition => bmMaster.getBlockStatus(RDDBlockId(_cachedColumnBuffers.id, partition.index), false) .exists { case(_, blockStatus) => blockStatus.isCached } } if (rddLoaded) { _cachedColumnBuffersAreLoaded = rddLoaded } rddLoaded } } ``` ``` java.lang.NullPointerException at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedRDDLoaded(InMemoryRelation.scala:247) at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedColumnBuffersLoaded(InMemoryRelation.scala:241) at org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8(CacheManager.scala:189) at org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8$adapted(CacheManager.scala:176) at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303) at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297) at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) at scala.collection.TraversableLike.filter(TraversableLike.scala:395) at scala.collection.TraversableLike.filter$(TraversableLike.scala:395) at scala.collection.AbstractTraversable.filter(Traversable.scala:108) at org.apache.spark.sql.execution.CacheManager.recacheByCondition(CacheManager.scala:219) at org.apache.spark.sql.execution.CacheManager.uncacheQuery(CacheManager.scala:176) at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3220) at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3231) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT. Closes #36496 from pan3793/SPARK-39104. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 3c8d8d7a864281fbe080316ad8de9b8eac80fa71) Signed-off-by: Sean Owen <srowen@gmail.com> 17 May 2022, 23:27:03 UTC
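Based on the description, the fix amounts to performing the quoted check while holding the builder's lock; a sketch (a fragment of the class quoted above, not standalone code):

```scala
// Guarding the read with the same lock used by writers prevents the NPE when another thread
// nulls out _cachedColumnBuffers between the null check and the partition scan.
def isCachedColumnBuffersLoaded: Boolean = synchronized {
  _cachedColumnBuffers != null && isCachedRDDLoaded
}
```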
4fb7fe2 [SPARK-39208][SQL] Fix query context bugs in decimal overflow under codegen mode ### What changes were proposed in this pull request? 1. Fix logical bugs in adding query contexts as references under codegen mode. https://github.com/apache/spark/pull/36040/files#diff-4a70d2f3a4b99f58796b87192143f9838f4c4cf469f3313eb30af79c4e07153aR145 The code ``` val errorContextCode = if (nullOnOverflow) { ctx.addReferenceObj("errCtx", queryContext) } else { "\"\"" } ``` should be ``` val errorContextCode = if (nullOnOverflow) { "\"\"" } else { ctx.addReferenceObj("errCtx", queryContext) } ``` 2. Similar to https://github.com/apache/spark/pull/36557, make `CheckOverflowInSum` support query context when WSCG is not available. ### Why are the changes needed? Bugfix and enhancement in the query context of decimal expressions. ### Does this PR introduce _any_ user-facing change? No, the query context is not released yet. ### How was this patch tested? New UT Closes #36577 from gengliangwang/fixDecimalSumOverflow. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 191e535b975e5813719d3143797c9fcf86321368) Signed-off-by: Gengliang Wang <gengliang@apache.org> 17 May 2022, 14:31:50 UTC
c07f65c [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession ### What changes were proposed in this pull request? This PR proposes: 1. Use the function registry in the Spark Session being used 2. Move function registration into `beforeAll` ### Why are the changes needed? Registration of the function without `beforeAll` at `builtin` can affect other tests. See also https://lists.apache.org/thread/jp0ccqv10ht716g9xldm2ohdv3mpmmz1. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Unittests fixed. Closes #36576 from HyukjinKwon/SPARK-32268-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c5351f85dec628a5c806893aa66777cbd77a4d65) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 May 2022, 14:05:56 UTC
c25624b [SPARK-36718][SQL][FOLLOWUP] Improve the extract-only check in CollapseProject ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/36510 , to fix a corner case: if the `CreateStruct` is only referenced once in non-extract expressions, we should still allow collapsing the projects. ### Why are the changes needed? completely fix the perf regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? a new test Closes #36572 from cloud-fan/regression. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 98fad57221d4dffc6f1fe28d9aca1093172ecf72) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 May 2022, 07:57:05 UTC
30bb19e [SPARK-39183][BUILD] Upgrade Apache Xerces Java to 2.12.2 ### What changes were proposed in this pull request? Upgrade Apache Xerces Java to 2.12.2 [Release notes](https://xerces.apache.org/xerces2-j/releases.html) ### Why are the changes needed? [Infinite Loop in Apache Xerces Java](https://github.com/advisories/GHSA-h65f-jvqw-m9fj) There's a vulnerability within the Apache Xerces Java (XercesJ) XML parser when handling specially crafted XML document payloads. This causes, the XercesJ XML parser to wait in an infinite loop, which may sometimes consume system resources for prolonged duration. This vulnerability is present within XercesJ version 2.12.1 and the previous versions. References https://nvd.nist.gov/vuln/detail/CVE-2022-23437 https://lists.apache.org/thread/6pjwm10bb69kq955fzr1n0nflnjd27dl http://www.openwall.com/lists/oss-security/2022/01/24/3 https://www.oracle.com/security-alerts/cpuapr2022.html ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #36544 from bjornjorgensen/Upgrade-xerces-to-2.12.2. Authored-by: bjornjorgensen <bjornjorgensen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 181436bd990d3bdf178a33fa6489ad416f3e7f94) Signed-off-by: Sean Owen <srowen@gmail.com> 16 May 2022, 23:10:17 UTC
e2ce088 [SPARK-39190][SQL] Provide query context for decimal precision overflow error when WSCG is off ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/36525, this PR provides query context for decimal precision overflow error when WSCG is off ### Why are the changes needed? Enhance the runtime error query context of checking decimal overflow. After changes, it works when the whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #36557 from gengliangwang/decimalContextWSCG. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 17b85ff97569a43d7fd33863d17bfdaf62d539e0) Signed-off-by: Gengliang Wang <gengliang@apache.org> 16 May 2022, 09:45:07 UTC
1853eb1 [SPARK-39187][SQL][3.3] Remove `SparkIllegalStateException` ### What changes were proposed in this pull request? Remove `SparkIllegalStateException` and replace it by `IllegalStateException` where it was used. This is a backport of https://github.com/apache/spark/pull/36550. ### Why are the changes needed? To improve code maintenance and be consistent to other places where `IllegalStateException` is used in illegal states (for instance, see https://github.com/apache/spark/pull/36524). After the PR https://github.com/apache/spark/pull/36500, the exception is substituted by `SparkException` w/ the `INTERNAL_ERROR` error class. ### Does this PR introduce _any_ user-facing change? No. Users shouldn't face to the exception in regular cases. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "sql/test:testOnly *QueryExecutionErrorsSuite*" $ build/sbt "test:testOnly *ArrowUtilsSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 1a90512f605c490255f7b38215c207e64621475b) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36558 from MaxGekk/remove-SparkIllegalStateException-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 16 May 2022, 08:39:37 UTC
af38fce Preparing development version 3.3.1-SNAPSHOT 16 May 2022, 05:42:35 UTC
c8c657b Preparing Spark release v3.3.0-rc2 16 May 2022, 05:42:28 UTC
386c756 [SPARK-39186][PYTHON] Make pandas-on-Spark's skew consistent with pandas ### What changes were proposed in this pull request? the logics of computing skewness are different between spark sql and pandas: spark sql: [`sqrt(n) * m3 / sqrt(m2 * m2 * m2))`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L304) pandas: [`(count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2**1.5)`](https://github.com/pandas-dev/pandas/blob/main/pandas/core/nanops.py#L1221) ### Why are the changes needed? to make skew consistent with pandas ### Does this PR introduce _any_ user-facing change? yes, the logic to compute skew was changed ### How was this patch tested? added UT Closes #36549 from zhengruifeng/adjust_pandas_skew. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7e4519c9a8ba35958ef6d408be3ca4e97917c965) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 May 2022, 00:31:09 UTC
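A toy Scala transcription of the two quoted formulas, assuming `m2` and `m3` denote the sums of squared and cubed deviations from the mean (as in both linked implementations):

```scala
val xs = Seq(1.0, 2.0, 3.0, 10.0)
val n = xs.size.toDouble
val mean = xs.sum / n
val m2 = xs.map(x => math.pow(x - mean, 2)).sum
val m3 = xs.map(x => math.pow(x - mean, 3)).sum

// Spark SQL's CentralMomentAgg formulation (method-of-moments estimator).
val skewSparkSql = math.sqrt(n) * m3 / math.sqrt(m2 * m2 * m2)
// pandas' formulation (bias-corrected estimator), now matched by pandas-on-Spark.
val skewPandas = (n * math.sqrt(n - 1) / (n - 2)) * (m3 / math.pow(m2, 1.5))
```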
ab1d986 [SPARK-37544][SQL] Correct date arithmetic in sequences ### What changes were proposed in this pull request? Change `InternalSequenceBase` to pass a time-zone aware value to `DateTimeUtils#timestampAddInterval`, rather than a time-zone agnostic value, when performing `Date` arithmetic. ### Why are the changes needed? The following query gets the wrong answer if run in the America/Los_Angeles time zone: ``` spark-sql> select sequence(date '2021-01-01', date '2022-01-01', interval '3' month) x; [2021-01-01,2021-03-31,2021-06-30,2021-09-30,2022-01-01] Time taken: 0.664 seconds, Fetched 1 row(s) spark-sql> ``` The answer should be ``` [2021-01-01,2021-04-01,2021-07-01,2021-10-01,2022-01-01] ``` `InternalSequenceBase` converts the date to micros by multiplying days by micros per day. This converts the date into a time-zone agnostic timestamp. However, `InternalSequenceBase` uses `DateTimeUtils#timestampAddInterval` to perform the arithmetic, and that function assumes a _time-zone aware_ timestamp. One simple fix would be to call `DateTimeUtils#timestampNTZAddInterval` instead for date arithmetic. However, Spark date arithmetic is typically time-zone aware (see the comment in the test added by this PR), so this PR converts the date to a time-zone aware value before calling `DateTimeUtils#timestampAddInterval`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #36546 from bersprockets/date_sequence_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 14ee0d8f04f218ad61688196a0b984f024151468) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 May 2022, 00:26:26 UTC
2672624 [SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors ### What changes were proposed in this pull request? Document PySpark(SQL, pandas API on Spark, and Py4J) common exceptions/errors and respective solutions. ### Why are the changes needed? Make PySpark debugging easier. There are common exceptions/errors in PySpark SQL, pandas API on Spark, and Py4J. Documenting exceptions and respective solutions help users debug PySpark. ### Does this PR introduce _any_ user-facing change? No. Document change only. ### How was this patch tested? Manual test. <img width="1019" alt="image" src="https://user-images.githubusercontent.com/47337188/165145874-b0de33b1-835a-459d-9062-94086e62e254.png"> Please refer to https://github.com/apache/spark/blob/7a1c7599a21cbbe2778707b72643cf98ac601ab1/python/docs/source/development/debugging.rst#common-exceptions--errors for the whole rendered page. Closes #36267 from xinrong-databricks/common_err. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f940d7adfd6d071bc3bdcc406e01263a7f03e955) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 May 2022, 00:25:10 UTC
2db9d78 [SPARK-39174][SQL] Catalogs loading swallows missing classname for ClassNotFoundException ### What changes were proposed in this pull request? this PR captures the actual missing classname when catalog loading meets ClassNotFoundException ### Why are the changes needed? ClassNotFoundException can occur when missing dependencies, we shall not always report the catalog class is missing ### Does this PR introduce _any_ user-facing change? yes, when loading catalogs and ClassNotFoundException occurs, it shows the correct missing class. ### How was this patch tested? new test added Closes #36534 from yaooqinn/SPARK-39174. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1b37f19876298e995596a30edc322c856ea1bbb4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 May 2022, 10:45:34 UTC
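An illustrative sketch of the pattern with a hypothetical helper (not the exact `Catalogs` loading code): inspect the `ClassNotFoundException` to tell a missing dependency apart from a missing plugin class.

```scala
def loadPlugin(className: String): AnyRef =
  try {
    Class.forName(className).getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
  } catch {
    case e: ClassNotFoundException if e.getMessage != className =>
      // A dependency of the plugin is missing, not the configured plugin class itself.
      throw new IllegalArgumentException(
        s"Cannot load catalog plugin $className: missing dependency ${e.getMessage}", e)
    case e: ClassNotFoundException =>
      throw new IllegalArgumentException(s"Cannot find catalog plugin class $className", e)
  }
```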
30834b8 [SPARK-39157][SQL] H2Dialect should override getJDBCType so that the data type is correct ### What changes were proposed in this pull request? Currently, `H2Dialect` does not implement `getJDBCType` of `JdbcDialect`, so DS V2 push-down will throw the exception shown below: ``` Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13) (jiaan-gengdembp executor driver): org.h2.jdbc.JdbcSQLNonTransientException: Unknown data type: "STRING"; SQL statement: SELECT "DEPT","NAME","SALARY","BONUS","IS_MANAGER" FROM "test"."employee" WHERE ("BONUS" IS NOT NULL) AND ("DEPT" IS NOT NULL) AND (CAST("BONUS" AS string) LIKE '%30%') AND (CAST("DEPT" AS byte) > 1) AND (CAST("DEPT" AS short) > 1) AND (CAST("BONUS" AS decimal(20,2)) > 1200.00) [50004-210] ``` H2Dialect should implement `getJDBCType` of `JdbcDialect`. ### Why are the changes needed? Make the H2 data type correct. ### Does this PR introduce _any_ user-facing change? 'Yes'. Fixes a bug in `H2Dialect`. ### How was this patch tested? New tests. Closes #36516 from beliefer/SPARK-39157. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fa3f096e02d408fbeab5f69af451ef8bc8f5b3db) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 May 2022, 14:01:27 UTC
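A hedged sketch of such an override; the concrete type mappings below are illustrative, not necessarily the ones the PR chose:

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

object ExampleH2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")

  // Map Catalyst types to H2 type names instead of emitting Spark names like "STRING".
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("CLOB", Types.CLOB))
    case ByteType   => Some(JdbcType("TINYINT", Types.TINYINT))
    case ShortType  => Some(JdbcType("SMALLINT", Types.SMALLINT))
    case _          => None   // fall back to the common JDBC mappings
  }
}
```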
e743e68 [SPARK-39178][CORE] SparkFatalException should show root cause when print error stack ### What changes were proposed in this pull request? Our user meet an case when running broadcast, throw `SparkFatalException`, but in error stack, it don't show the error case. ### Why are the changes needed? Make exception more clear ### Does this PR introduce _any_ user-facing change? User can got root cause when application throw `SparkFatalException`. ### How was this patch tested? For ut ``` test("xxxx") { throw new SparkFatalException( new OutOfMemoryError("Not enough memory to build and broadcast the table to all " + "worker nodes. As a workaround, you can either disable broadcast by setting " + s"driver memory by setting ${SparkLauncher.DRIVER_MEMORY} to a higher value.") .initCause(null)) } ``` Before this pr: ``` [info] org.apache.spark.util.SparkFatalException: [info] at org.apache.spark.SparkContextSuite.$anonfun$new$1(SparkContextSuite.scala:59) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) [info] at org.scalatest.Suite.run(Suite.scala:1112) ``` After this pr: ``` [info] org.apache.spark.util.SparkFatalException: java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting driver memory by setting spark.driver.memory to a higher value. 
[info] at org.apache.spark.SparkContextSuite.$anonfun$new$1(SparkContextSuite.scala:59) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) [info] at org.scalatest.Suite.run(Suite.scala:1112) [info] at org.scalatest.Suite.run$(Suite.scala:1094) [info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:64) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:64) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:748) 
[info] Cause: java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting driver memory by setting spark.driver.memory to a higher value. [info] at org.apache.spark.SparkContextSuite.$anonfun$new$1(SparkContextSuite.scala:58) ``` Closes #36539 from AngersZhuuuu/SPARK-39178. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit d7317b03e975f8dc1a8c276dd0a931e00c478717) Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 May 2022, 13:47:21 UTC
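A minimal sketch of the idea behind the fix, with an illustrative class name: passing the wrapped throwable to the `Exception` constructor makes it the cause, so printing the wrapper's stack trace also prints the underlying `OutOfMemoryError`.

```scala
// Illustrative stand-in for SparkFatalException; the key point is the constructor argument.
final class ExampleFatalException(val throwable: Throwable) extends Exception(throwable)

val e = new ExampleFatalException(new OutOfMemoryError("Not enough memory ..."))
assert(e.getCause.isInstanceOf[OutOfMemoryError])   // the root cause is now reachable and printed
```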
1a49de6 [SPARK-39177][SQL] Provide query context on map key not exists error when WSCG is off ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/36525, this PR provides query context for "map key not exists" runtime error when WSCG is off. ### Why are the changes needed? Enhance the runtime error query context for "map key not exists" runtime error. After changes, it works when the whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #36538 from gengliangwang/fixMapKeyContext. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 1afddf407436c3b315ec601fab5a4a1b2028e672) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 May 2022, 13:45:20 UTC
1372f31 [SPARK-39164][SQL][3.3] Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in actions ### What changes were proposed in this pull request? In the PR, I propose to catch `java.lang.IllegalStateException` and `java.lang.AssertionError` (raised by asserts), and wrap them by Spark's exception w/ the `INTERNAL_ERROR` error class. The modification affects only actions so far. This PR affects the case of missing bucket file. After the changes, Spark throws `SparkException` w/ `INTERNAL_ERROR` instead of `IllegalStateException`. Since this is not Spark's illegal state, the exception should be replaced by another runtime exception. Created the ticket SPARK-39163 to fix this. This is a backport of https://github.com/apache/spark/pull/36500. ### Why are the changes needed? To improve user experience with Spark SQL and unify representation of internal errors by using error classes like for other errors. Usually, users shouldn't observe asserts and illegal states, but even if such situation happens, they should see errors in the same way as other errors (w/ error class `INTERNAL_ERROR`). ### Does this PR introduce _any_ user-facing change? Yes. At least, in one particular case, see the modified test suites and SPARK-39163. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *.BucketedReadWithoutHiveSupportSuite" $ build/sbt "test:testOnly *.AdaptiveQueryExecSuite" $ build/sbt "test:testOnly *.WholeStageCodegenSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit f5c3f0c228fef7808d1f927e134595ddd4d31723) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36533 from MaxGekk/class-internal-error-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 May 2022, 13:43:53 UTC
c2bd7ba [SPARK-39165][SQL][3.3] Replace `sys.error` by `IllegalStateException` ### What changes were proposed in this pull request? Replace all invokes of `sys.error()` by throwing of `IllegalStateException` in the `sql` namespace. This is a backport of https://github.com/apache/spark/pull/36524. ### Why are the changes needed? In the context of wrapping all internal errors like asserts/illegal state exceptions (see https://github.com/apache/spark/pull/36500), it is impossible to distinguish `RuntimeException` of `sys.error()` from Spark's exceptions like `SparkRuntimeException`. The last one can be propagated to the user space but `sys.error` exceptions shouldn't be visible to users in regular cases. ### Does this PR introduce _any_ user-facing change? No, shouldn't. sys.error shouldn't propagate exception to user space in regular cases. ### How was this patch tested? By running the existing test suites. Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 95c7efd7571464d8adfb76fb22e47a5816cf73fb) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36532 from MaxGekk/sys_error-internal-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 May 2022, 09:47:53 UTC
27c03e5 [SPARK-39175][SQL] Provide runtime error query context for Cast when WSCG is off ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/36525, this PR provides runtime error query context for the Cast expression when WSCG is off. ### Why are the changes needed? Enhance the runtime error query context of Cast expression. After changes, it works when the whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #36535 from gengliangwang/fixCastContext. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit cdd33e83c3919c4475e2e1ef387acb604bea81b9) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 May 2022, 09:46:45 UTC
63f20c5 [SPARK-39166][SQL] Provide runtime error query context for binary arithmetic when WSCG is off ### What changes were proposed in this pull request? Currently, for most of the cases, the project https://issues.apache.org/jira/browse/SPARK-38615 is able to show where the runtime errors happen within the original query. However, after trying on production, I found that the following queries won't show where the divide by 0 error happens ``` create table aggTest(i int, j int, k int, d date) using parquet insert into aggTest values(1, 2, 0, date'2022-01-01') select sum(j)/sum(k),percentile(i, 0.9) from aggTest group by d ``` With `percentile` function in the query, the plan can't execute with whole stage codegen. Thus the child plan of `Project` is serialized to executors for execution, from ProjectExec: ``` protected override def doExecute(): RDD[InternalRow] = { child.execute().mapPartitionsWithIndexInternal { (index, iter) => val project = UnsafeProjection.create(projectList, child.output) project.initialize(index) iter.map(project) } } ``` Note that the `TreeNode.origin` is not serialized to executors since `TreeNode` doesn't extend the trait `Serializable`, which results in an empty query context on errors. For more details, please read https://issues.apache.org/jira/browse/SPARK-39140 A dummy fix is to make `TreeNode` extend the trait `Serializable`. However, it can be performance regression if the query text is long (every `TreeNode` carries it for serialization). A better fix is to introduce a new trait `SupportQueryContext` and materialize the truncated query context for special expressions. This PR targets on binary arithmetic expressions only. I will create follow-ups for the remaining expressions which support runtime error query context. ### Why are the changes needed? Improve the error context framework and make sure it works when whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #36525 from gengliangwang/serializeContext. Lead-authored-by: Gengliang Wang <gengliang@apache.org> Co-authored-by: Gengliang Wang <ltnwgl@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit e336567c8a9704b500efecd276abaf5bd3988679) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 May 2022, 05:50:40 UTC
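A rough, self-contained sketch of the approach with illustrative names (not the actual `SupportQueryContext` trait): the truncated query fragment is captured while the expression is still on the driver and carried as an ordinary serializable field, so executors can embed it in runtime error messages even though the non-serializable `TreeNode.origin` never reaches them.

```scala
class BinaryArithmeticSketch(symbol: String, originFragment: String) extends Serializable {
  // Materialized at construction time on the driver; shipped with the serialized expression.
  val queryContext: String = originFragment.take(128)

  def divideByZeroError(): ArithmeticException =
    new ArithmeticException(s"Division by zero in '$symbol'. Query context: $queryContext")
}
```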
3cc47a1 Revert "[SPARK-36837][BUILD] Upgrade Kafka to 3.1.0" ### What changes were proposed in this pull request? This PR aims to revert commit 973ea0f06e72ab64574cbf00e095922a3415f864 from `branch-3.3` to exclude it from Apache Spark 3.3 scope. ### Why are the changes needed? SPARK-36837 tried to use Apache Kafka 3.1.0 at Apache Spark 3.3.0 and initially wanted to upgrade to Apache Kafka 3.3.1 before the official release. However, we can use the stable Apache Kafka 2.8.1 at Spark 3.3.0 and wait for more proven versions, Apache Kafka 3.2.x or 3.3.x. Apache Kafka 3.2.0 vote is already passed and will arrive. - https://lists.apache.org/thread/9k5sysvchg98lchv2rvvvq6xhpgk99cc Apache Kafka 3.3.0 release discussion is started too. - https://lists.apache.org/thread/cmol5bcf011s1xl91rt4ylb1dgz2vb1r ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #36517 from dongjoon-hyun/SPARK-36837-REVERT. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 May 2022, 17:55:21 UTC
0baa5c7 [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject This PR fixes a perf regression in Spark 3.3 caused by https://github.com/apache/spark/pull/33958 In `CollapseProject`, we want to treat `CreateStruct` and its friends as cheap expressions if they are only referenced by `ExtractValue`, but the check is too conservative, which causes a perf regression. This PR fixes this check. Now "extract-only" means: the attribute only appears as a child of `ExtractValue`, but the consumer expression can be in any shape. Fixes perf regression No new tests Closes #36510 from cloud-fan/bug. Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 547f032d04bd2cf06c54b5a4a2f984f5166beb7d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 May 2022, 04:59:36 UTC
70becf2 [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion ### What changes were proposed in this pull request? Access to JVM through passed-in GatewayClient during type conversion. ### Why are the changes needed? In customized type converters, we may utilize the passed-in GatewayClient to access JVM, rather than rely on the `SparkContext._jvm`. That's [how](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_collections.py#L508) Py4J explicit converters access JVM. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #36504 from xinrong-databricks/gateway_client_jvm. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 92fcf214c107358c1a70566b644cec2d35c096c0) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2022, 03:22:19 UTC
ebe4252 [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode ### What changes were proposed in this pull request? This is a bug of the command legacy mode as it does not fully restore to the legacy behavior. The legacy v1 SHOW DATABASES command does not quote the database names. This PR fixes it. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? no change by default, unless people turn on legacy mode, in which case SHOW DATABASES common won't quote the database names. ### How was this patch tested? new tests Closes #36508 from cloud-fan/regression. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3094e495095635f6c9e83f4646d3321c2a9311f4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2022, 03:18:35 UTC
65dd727 [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index ### What changes were proposed in this pull request? Remove outdated statements on distributed-sequence default index. ### Why are the changes needed? Since distributed-sequence default index is updated to be enforced only while execution, there are stale statements in documents to be removed. ### Does this PR introduce _any_ user-facing change? No. Doc change only. ### How was this patch tested? Manual tests. Closes #36513 from xinrong-databricks/defaultIndexDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f37150a5549a8f3cb4c1877bcfd2d1459fc73cac) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2022, 00:13:34 UTC
e4bb341 Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index" This reverts commit f75c00da3cf01e63d93cedbe480198413af41455. 12 May 2022, 00:12:13 UTC
f75c00d [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index Remove outdated statements on distributed-sequence default index. Since distributed-sequence default index is updated to be enforced only while execution, there are stale statements in documents to be removed. No. Doc change only. Manual tests. Closes #36513 from xinrong-databricks/defaultIndexDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit cec1e7b4e68deac321f409d424a3acdcd4cb91be) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2022, 00:07:49 UTC
8608baa [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/35204 . https://github.com/apache/spark/pull/35204 introduced a potential regression: it removes the "location" table property from `V1Table` if the table is not external. The intention was to avoid putting the LOCATION clause for managed tables in `ShowCreateTableExec`. However, if we use the v2 DESCRIBE TABLE command by default in the future, this will bring a behavior change and v2 DESCRIBE TABLE command won't print the table location for managed tables. This PR fixes this regression by using a different idea to fix the SHOW CREATE TABLE issue: 1. introduce a new reserved table property `is_managed_location`, to indicate that the location is managed by the catalog, not user given. 2. `ShowCreateTableExec` only generates the LOCATION clause if the "location" property is present and is not managed. ### Why are the changes needed? avoid a potential regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests. We can add a test when we use v2 DESCRIBE TABLE command by default. Closes #36498 from cloud-fan/regression. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fa2bda5c4eabb23d5f5b3e14ccd055a2453f579f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 May 2022, 06:50:11 UTC
7d57577 [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost ### What changes were proposed in this pull request? Add a new leaf-like node `LeafNodeWithoutStats` and apply it to the following nodes: - ResolvedDBObjectName - ResolvedNamespace - ResolvedTable - ResolvedView - ResolvedNonPersistentFunc - ResolvedPersistentFunc ### Why are the changes needed? We enable v2 commands by default in the 3.3.0 branch via `spark.sql.legacy.useV1Command`. However, there is a behavior difference between v1 and v2 commands. - v1 command: we resolve the logical plan to a command at the analyzer phase through `ResolveSessionCatalog` - v2 command: we resolve the logical plan to a v2 command at the physical phase through `DataSourceV2Strategy` For the cost explain mode, we call `LogicalPlanStats.stats` on the optimized plan, so there is a gap between v1 and v2 commands. Unfortunately, the logical plan of a v2 command contains a `LeafNode` which does not override `computeStats`. As a result, there is an error when running SQL such as: ```sql set spark.sql.ui.explainMode=cost; show tables; ``` ``` java.lang.UnsupportedOperationException: at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats$(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.analysis.ResolvedNamespace.computeStats(v2ResolutionPlans.scala:155) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:49) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30) ``` ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? Added a test Closes #36488 from ulysses-you/SPARK-39112. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 06fd340daefd67a3e96393539401c9bf4b3cbde9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 May 2022, 06:32:21 UTC
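A hedged sketch of the fix's shape, with an illustrative name and default size (the real trait may choose different statistics): give the command-only leaf nodes a `computeStats` implementation so EXPLAIN COST no longer falls through to `LeafNode.computeStats`, which throws.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}

trait LeafNodeWithoutStatsSketch extends LeafNode {
  // These nodes never feed cost-based decisions, so a fixed placeholder is sufficient.
  override def computeStats(): Statistics = Statistics(sizeInBytes = BigInt(1))
}
```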
1738f48 [MINOR][DOCS][PYTHON] Fixes pandas import statement in code example ### What changes were proposed in this pull request? In 'Applying a Function' section, example code import statement `as pd` was added ### Why are the changes needed? In 'Applying a Function' section, example code import statement needs to have a `as pd` because in function definitions we're using `pd`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Documentation change only. Continues to be markdown compliant. Closes #36502 from snifhex/patch-1. Authored-by: Sachin Tripathi <sachintripathis84@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 60edc5758e82e76a37ce5a5f98e870fac587b656) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 11 May 2022, 00:13:30 UTC
a4b420c [SPARK-39106][SQL] Correct conditional expression constant folding - add a try-catch when we fold children inside a `ConditionalExpression` if it is not foldable - mark `CaseWhen` and `If` as foldable if their children are foldable For a conditional expression, we should add a try-catch to partially fold the constants inside its children, because some branches may not be evaluated at runtime. For example, if c1 or c2 is not null, the last branch should never be hit: ```sql SELECT COALESCE(c1, c2, 1/0); ``` Besides, for CaseWhen and If, we should mark them as foldable if their children are foldable. It is safe since both the non-codegen and codegen code paths already respect the evaluation order. Yes, bug fix. Added more tests in SQL files. Closes #36468 from ulysses-you/SPARK-39106. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 08a4ade8ba881589da0741b3ffacd3304dc1e9b5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2022, 12:43:24 UTC
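A hedged spark-shell illustration of the intended behavior; ANSI mode is assumed so that eagerly folding the unreachable `1/0` branch would otherwise fail the whole query:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// c1 is non-null, so the 1/0 branch is never evaluated; with partial folding the optimizer
// no longer fails while trying to constant-fold that branch, and the query returns 1.
spark.sql("SELECT COALESCE(c1, c2, 1/0) FROM VALUES (1, CAST(NULL AS INT)) AS t(c1, c2)").show()
```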
c6584af [SPARK-39135][SQL] DS V2 aggregate partial push-down should supports group by without aggregate functions ### What changes were proposed in this pull request? Currently, the SQL show below not supported by DS V2 aggregate partial push-down. `select key from tab group by key` ### Why are the changes needed? Make DS V2 aggregate partial push-down supports group by without aggregate functions. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New tests Closes #36492 from beliefer/SPARK-39135. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit decd393e23406d82b47aa75c4d24db04c7d1efd6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2022, 09:37:39 UTC
7838a14 [MINOR][INFRA][3.3] Add ANTLR generated files to .gitignore ### What changes were proposed in this pull request? Add git ignore entries for files created by ANTLR. This is a backport of #35838. ### Why are the changes needed? To avoid developers from accidentally adding those files when working on parser/lexer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By making sure those files are ignored by git status when they exist. Closes #36489 from yutoacts/minor_gitignore_3.3. Authored-by: Yuto Akutsu <rhythmy.dev@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> 10 May 2022, 02:50:44 UTC
c759151 [SPARK-39107][SQL] Account for empty string input in regex replace ### What changes were proposed in this pull request? When trying to perform a regex replace, account for the possibility of having empty strings as input. ### Why are the changes needed? https://github.com/apache/spark/pull/29891 was merged to address https://issues.apache.org/jira/browse/SPARK-30796 and introduced a bug that would not allow regex matching on empty strings, as it would account for position within substring but not consider the case where input string has length 0 (empty string) From https://issues.apache.org/jira/browse/SPARK-39107 there is a change in behavior between spark versions. 3.0.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | <empty>| +---+--------+ ``` 3.1.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | | +---+--------+ ``` The 3.0.2 outcome is the expected and correct one ### Does this PR introduce _any_ user-facing change? Yes compared to spark 3.2.1, as it brings back the correct behavior when trying to regex match empty strings, as shown in the example above. ### How was this patch tested? Added special casing test in `RegexpExpressionsSuite.RegexReplace` with empty string replacement. Closes #36457 from LorenzoMartini/lmartini/fix-empty-string-replace. Authored-by: Lorenzo Martini <lmartini@palantir.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 731aa2cdf8a78835621fbf3de2d3492b27711d1a) Signed-off-by: Sean Owen <srowen@gmail.com> 10 May 2022, 00:44:26 UTC
cf13262 [SPARK-38939][SQL][FOLLOWUP] Replace named parameter with comment in ReplaceColumns ### What changes were proposed in this pull request? This PR aims to replace named parameter with comment in `ReplaceColumns`. ### Why are the changes needed? #36252 changed signature of deleteColumn#**TableChange.java**, but this PR breaks sbt compilation in k8s integration test. ```shell > build/sbt -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=r -Dspark.kubernetes.test.imageRepo=kubespark "kubernetes-integration-tests/test" [error] /Users/IdeaProjects/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala:147:45: not found: value ifExists [error] TableChange.deleteColumn(Array(name), ifExists = false) [error] ^ [error] /Users/IdeaProjects/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala:159:19: value ++ is not a member of Array[Nothing] [error] deleteChanges ++ addChanges [error] ^ [error] two errors found [error] (catalyst / Compile / compileIncremental) Compilation failed ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the GA and k8s integration test. Closes #36487 from dcoliversun/SPARK-38939. Authored-by: Qian.Sun <qian.sun2020@gmail.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit 16b5124d75dc974c37f2fd87c78d231f8a3bf772) Signed-off-by: huaxingao <huaxin_gao@apple.com> 09 May 2022, 17:00:44 UTC
a52a245 [SPARK-38786][SQL][TEST] Bug in StatisticsSuite 'change stats after add/drop partition command' ### What changes were proposed in this pull request? https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979 It should be `partDir2` instead of `partDir1`. Looks like it is a copy paste bug. ### Why are the changes needed? Due to this test bug, the drop command was dropping a wrong (`partDir1`) underlying file in the test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added extra underlying file location check. Closes #36075 from kazuyukitanimura/SPARK-38786. Authored-by: Kazuyuki Tanimura <ktanimura@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> (cherry picked from commit a6b04f007c07fe00637aa8be33a56f247a494110) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 May 2022, 00:21:59 UTC
e9ee2c8 [SPARK-37618][CORE][FOLLOWUP] Support cleaning up shuffle blocks from external shuffle service ### What changes were proposed in this pull request? Fix test failure in build. Depending on the umask of the process running tests (which is typically inherited from the user's default umask), the group writable bit for the files/directories could be set or unset. The test was assuming that by default the umask will be restrictive (and so files/directories wont be group writable). Since this is not a valid assumption, we use jnr to change the umask of the process to be more restrictive - so that the test can validate the behavior change - and reset it back once the test is done. ### Why are the changes needed? Fix test failure in build ### Does this PR introduce _any_ user-facing change? No Adds jnr as a test scoped dependency, which does not bring in any other new dependency (asm is already a dep in spark). ``` [INFO] +- com.github.jnr:jnr-posix:jar:3.0.9:test [INFO] | +- com.github.jnr:jnr-ffi:jar:2.0.1:test [INFO] | | +- com.github.jnr:jffi:jar:1.2.7:test [INFO] | | +- com.github.jnr:jffi:jar:native:1.2.7:test [INFO] | | +- org.ow2.asm:asm:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-commons:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-analysis:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-tree:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-util:jar:5.0.3:test [INFO] | | \- com.github.jnr:jnr-x86asm:jar:1.0.2:test [INFO] | \- com.github.jnr:jnr-constants:jar:0.8.6:test ``` ### How was this patch tested? Modification to existing test. Tested on Linux, skips test when native posix env is not found. Closes #36473 from mridulm/fix-SPARK-37618-test. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 317407171cb36439c371153cfd45c1482bf5e425) Signed-off-by: Sean Owen <srowen@gmail.com> 08 May 2022, 13:11:31 UTC
feba335 [SPARK-39083][CORE] Fix race condition between update and clean app data ### What changes were proposed in this pull request? make `cleanAppData` atomic to prevent race condition between update and clean app data. When the race condition happens, it could lead to a scenario when `cleanAppData` delete the entry of ApplicationInfoWrapper for an application right after it has been updated by `mergeApplicationListing`. So there will be cases when the HS Web UI displays `Application not found` for applications whose logs does exist. #### Error message ``` 22/04/29 17:16:21 DEBUG FsHistoryProvider: New/updated attempts found: 1 ArrayBuffer(viewfs://iu/log/spark3/application_1651119726430_138107_1) 22/04/29 17:16:21 INFO FsHistoryProvider: Parsing viewfs://iu/log/spark3/application_1651119726430_138107_1 for listing data... 22/04/29 17:16:21 INFO FsHistoryProvider: Looking for end event; skipping 10805037 bytes from viewfs://iu/log/spark3/application_1651119726430_138107_1... 22/04/29 17:16:21 INFO FsHistoryProvider: Finished parsing viewfs://iu/log/spark3/application_1651119726430_138107_1 22/04/29 17:16:21 ERROR Utils: Uncaught exception in thread log-replay-executor-7 java.util.NoSuchElementException at org.apache.spark.util.kvstore.InMemoryStore.read(InMemoryStore.java:85) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkAndCleanLog$3(FsHistoryProvider.scala:927) at scala.Option.foreach(Option.scala:407) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkAndCleanLog$1(FsHistoryProvider.scala:926) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:2032) at org.apache.spark.deploy.history.FsHistoryProvider.checkAndCleanLog(FsHistoryProvider.scala:916) at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:712) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$15(FsHistoryProvider.scala:576) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` #### Background Currently, the HS runs the `checkForLogs` to build the application list based on the current contents of the log directory for every 10 seconds by default. - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L296-L299 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L472 In each turn of execution, this method scans the specified logDir and parse the log files to update its KVStore: - detect new updated/added files to process : https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L574-L578 - detect stale data to remove: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600 These 2 operations are executed in different threads as `submitLogProcessTask` uses `replayExecutor` to submit tasks. https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1389-L1401 ### When does the bug happen? 
The `Application not found` error happens in the following scenario: In the first run of `checkForLogs`, it detects a newly-added log `viewfs://iu/log/spark3/AAA_1.inprogress` (the log of an in-progress application named AAA). So it adds 2 entries to the KVStore - one key-value entry whose key is the logPath (`viewfs://iu/log/spark3/AAA_1.inprogress`) and whose value is an instance of LogInfo representing the log - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L495-L505 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L545-L552 - one key-value entry whose key is the applicationId (`AAA`) and whose value is an instance of ApplicationInfoWrapper holding the information of the application. - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L825 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172 In the next run of `checkForLogs`, the AAA application has now finished, the log `viewfs://iu/log/spark3/AAA_1.inprogress` has been deleted, and a new log `viewfs://iu/log/spark3/AAA_1` has been created. So `checkForLogs` does the following 2 things in 2 different threads: - Thread 1: parses the new log `viewfs://iu/log/spark3/AAA_1` and updates data in its KVStore - adds a new entry with key `viewfs://iu/log/spark3/AAA_1` and value an instance of LogInfo representing the log - updates the entry with key=applicationId (`AAA`) with a new instance of ApplicationInfoWrapper as the value (for example, the isCompleted flag now changes from false to true) - Thread 2: data related to `viewfs://iu/log/spark3/AAA_1.inprogress` is now considered stale and must be deleted. - cleans app data for `viewfs://iu/log/spark3/AAA_1.inprogress` https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600 - Inside `cleanAppData`, it first loads the latest `ApplicationInfoWrapper` from the KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L632 Most of the time, when this line is executed, Thread 1 has already finished updating the entry with key=applicationId (`AAA`), so this condition https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L637 evaluates to false and `isStale` is false. However, in some rare cases, when Thread 1 has not finished the update yet, the old ApplicationInfoWrapper data is loaded, so `isStale` is true, which leads to deleting the ApplicationInfoWrapper entry from the KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L656-L662 and, in the worst case, the entry is deleted right after Thread 1 has finished updating it. The ApplicationInfoWrapper entry for applicationId `AAA` is then gone for good, so when users access the Web UI for this application, `Application not found` is shown even though the log for the app does exist. 
So here we make the `cleanAppData` method atomic, just like the `addListing` method https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172 so that - If Thread 1 gets the lock on the `listing` before Thread 2, it updates the entry for the application, so in Thread 2 `isStale` is false and the entry for the application is not removed from the KVStore - If Thread 2 gets the lock on the `listing` before Thread 1, then `isStale` is true and the entry for the application is removed from the KVStore, but after that it is added again by Thread 1. In both cases, the entry for the application is not deleted unexpectedly from the KVStore. ### Why are the changes needed? Fix the bug causing the HS Web UI to display `Application not found` for applications whose logs do exist. ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Manual test. Deployed in our Spark HS; the `java.util.NoSuchElementException` exception no longer occurs and the `Application not found` error no longer appears. Closes #36424 from tanvn/SPARK-39083. Authored-by: tan.vu <tan.vu@linecorp.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 29643265a9f5e8142d20add5350c614a55161451) Signed-off-by: Sean Owen <srowen@gmail.com> 08 May 2022, 13:09:32 UTC
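To make the fix concrete, here is a self-contained sketch of the locking idea with a plain mutable map standing in for the history server's KVStore; the names are illustrative, not the actual FsHistoryProvider code:

```scala
import scala.collection.mutable

// Both the update path (Thread 1) and the cleanup path (Thread 2) synchronize on the same
// `listing` object, so the staleness check and the delete can no longer interleave with an
// update: an entry is only removed while it still points at the old, deleted log.
object ListingSketch {
  private val listing = mutable.Map[String, String]() // appId -> logPath currently listed

  def addListing(appId: String, newLogPath: String): Unit = listing.synchronized {
    listing(appId) = newLogPath                        // Thread 1: merge/update the app entry
  }

  def cleanAppData(appId: String, staleLogPath: String): Unit = listing.synchronized {
    val isStale = listing.get(appId).contains(staleLogPath) // still pointing at the old log?
    if (isStale) listing.remove(appId)                 // Thread 2: drop only truly stale entries
  }
}
```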
3241f20 [SPARK-39093][SQL][FOLLOWUP] Fix Period test ### What changes were proposed in this pull request? Change the Period test to use months rather than days. ### Why are the changes needed? While the Period test as-is confirms that the compilation error is fixed, it doesn't confirm that the newly generated code is correct. Spark ignores any unit less than months in Periods. As a result, we are always testing `0/(num + 3)`, so the test doesn't verify that the code generated for the right-hand operand is correct. ### Does this PR introduce _any_ user-facing change? No, only changes a test. ### How was this patch tested? Unit test. Closes #36481 from bersprockets/SPARK-39093_followup. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 670710e91dfb2ed27c24b139c50a6bcf03132024) Signed-off-by: Gengliang Wang <gengliang@apache.org> 08 May 2022, 10:14:58 UTC
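A small illustration of the reasoning above (the values are made up): only the months component of a `java.time.Period` survives conversion to Spark's year-month interval, so a day-based Period always divides 0.

```scala
import java.time.Period

val byDays   = Period.ofDays(10)    // byDays.toTotalMonths == 0, so the old test always evaluated 0 / (num + 3)
val byMonths = Period.ofMonths(10)  // byMonths.toTotalMonths == 10, so the division result is actually visible
```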
6378365 [SPARK-39121][K8S][DOCS] Fix format error on running-on-kubernetes doc ### What changes were proposed in this pull request? Fix format error on running-on-kubernetes doc ### Why are the changes needed? Fix format syntax error ### Does this PR introduce _any_ user-facing change? No, unreleased doc only ### How was this patch tested? - `SKIP_API=1 bundle exec jekyll serve --watch` - CI passed Closes #36476 from Yikun/SPARK-39121. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 2349f74866ae1b365b5e4e0ec8a58c4f7f06885c) Signed-off-by: Max Gekk <max.gekk@gmail.com> 07 May 2022, 07:20:02 UTC
74ffaec [SPARK-39012][SQL] SparkSQL cast partition value does not support all data types ### What changes were proposed in this pull request? Add binary type support for schema inference in SparkSQL. ### Why are the changes needed? Spark does not support binary and boolean types when casting partition values to a target type. This PR adds support for binary and boolean. ### Does this PR introduce _any_ user-facing change? No. This is more like fixing a bug. ### How was this patch tested? UT Closes #36344 from amaliujia/inferbinary. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 08c07a17717691f931d4d3206dd0385073f5bd08) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 May 2022, 09:39:13 UTC
de3673e [SPARK-39105][SQL] Add ConditionalExpression trait ### What changes were proposed in this pull request? Add a `ConditionalExpression` trait. ### Why are the changes needed? For developers: if a custom conditional-like expression contains a common sub-expression, the evaluation order may change, since Spark pulls out and evaluates the common sub-expressions first during execution. Adding the ConditionalExpression trait is friendly to developers. ### Does this PR introduce _any_ user-facing change? No, adds a new trait ### How was this patch tested? Passes existing tests Closes #36455 from ulysses-you/SPARK-39105. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fa86a078bb7d57d7dbd48095fb06059a9bdd6c2e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 May 2022, 07:42:10 UTC
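As a rough sketch of what such a trait can look like (the member names and doc comments below are assumptions for illustration, not a verbatim copy of the Spark API), a conditional-like expression declares which children are always evaluated so that subexpression elimination does not hoist work out of lazily evaluated branches:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Illustrative sketch only; consult the Spark 3.3 source for the actual trait definition.
trait ConditionalExpression extends Expression {
  /** Children that are evaluated unconditionally, regardless of which branch is taken. */
  def alwaysEvaluatedInputs: Seq[Expression]

  /** Groups of branches; within each group at least one branch is evaluated at runtime. */
  def branchGroups: Seq[Seq[Expression]]
}
```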
6a61f95 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases ### What changes were proposed in this pull request? Add missing dependencies to `dev/create-release/spark-rm/Dockerfile`. ### Why are the changes needed? To be able to build Spark releases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By building the Spark 3.3 release via: ``` $ dev/create-release/do-release-docker.sh -d /home/ubuntu/max/spark-3.3-rc1 ``` Closes #36449 from MaxGekk/deps-Dockerfile. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 4b1c2fb7a27757ebf470416c8ec02bb5c1f7fa49) Signed-off-by: Max Gekk <max.gekk@gmail.com> 05 May 2022, 17:10:17 UTC
0f2e3ec [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/33436 and adds a legacy configuration. It turned out that the original change can break a valid use case (https://github.com/apache/spark/pull/33436/files#r863271189): ```scala import org.apache.spark.sql.types._ val ds = Seq("a,", "a,b").toDS spark.read.schema( StructType( StructField("f1", StringType, nullable = false) :: StructField("f2", StringType, nullable = false) :: Nil) ).option("mode", "DROPMALFORMED").csv(ds).show() ``` **Before:** ``` +---+---+ | f1| f2| +---+---+ | a| b| +---+---+ ``` **After:** ``` +---+----+ | f1| f2| +---+----+ | a|null| | a| b| +---+----+ ``` This PR adds a configuration to restore the **Before** behaviour. ### Why are the changes needed? To avoid breaking valid use cases. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) to respect the nullability in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)` when the user-specified schema is provided. ### How was this patch tested? Unit tests were added. Closes #36435 from HyukjinKwon/SPARK-35912. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6689b97ec76abe5bab27f02869f8f16b32530d1a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 May 2022, 07:23:43 UTC
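For users who relied on the **Before** output, a one-line usage example of the flag introduced here (the flag name and default are taken from the commit message above):

```scala
// Restores the **Before** output in the example above: with the user-specified non-nullable
// schema respected again, DROPMALFORMED drops the "a," row instead of filling f2 with null.
spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", true)
```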
1fa3171 [SPARK-39060][SQL][3.3] Typo in error messages of decimal overflow ### What changes were proposed in this pull request? This PR removes an extra curly bracket from the debug string for the Decimal type in SQL. This is a backport from the master branch. Commit: 165ce4eb7d6d75201beb1bff879efa99fde24f94 ### Why are the changes needed? To fix a typo in the error messages for decimal overflow. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running tests: ``` $ build/sbt "sql/testOnly" ``` Closes #36450 from vli-databricks/SPARK-39060-3.3. Authored-by: Vitalii Li <vitalii.li@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 05 May 2022, 06:08:38 UTC
94d3d6b [SPARK-38891][SQL] Skipping allocating vector for repetition & definition levels when possible ### What changes were proposed in this pull request? This PR adds two optimizations to the vectorized Parquet reader for complex types: - avoid allocating vectors for repetition and definition levels whenever applicable - avoid reading definition levels whenever applicable ### Why are the changes needed? At the moment, Spark will allocate vectors for repetition and definition levels, and also read definition levels, even when it's not necessary, for instance when reading primitive types. This adds extra memory footprint, especially when reading wide tables. Therefore, we should avoid it when possible. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #36202 from sunchao/SPARK-38891. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> 04 May 2022, 16:37:55 UTC
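As a usage note, the optimization applies to the vectorized reader path for nested types; a hedged sketch is below (the config name is an assumption about the flag gating that path in 3.3, and the path/column names are placeholders):

```scala
// Assumed flag for the vectorized Parquet reader on nested/complex types; with it enabled,
// the reader described above can skip repetition/definition-level vectors where they are
// not needed, reducing memory footprint on wide, nested tables.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")
spark.read.parquet("/path/to/wide_table").selectExpr("nested_col.field").show()
```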
fd998c8 [SPARK-39093][SQL] Avoid codegen compilation error when dividing year-month intervals or day-time intervals by an integral ### What changes were proposed in this pull request? In `DivideYMInterval#doGenCode` and `DivideDTInterval#doGenCode`, rely on the operand variable names provided by `nullSafeCodeGen` rather than calling `genCode` on the operands twice. ### Why are the changes needed? `DivideYMInterval#doGenCode` and `DivideDTInterval#doGenCode` call `genCode` on the operands twice (once directly, and once indirectly via `nullSafeCodeGen`). However, if you call `genCode` on an operand twice, you might not get back the same variable name for both calls (e.g., when the operand is not a `BoundReference` or if whole-stage codegen is turned off). When that happens, `nullSafeCodeGen` generates initialization code for one set of variables, but the divide expression generates usage code for another set of variables, resulting in compilation errors like this: ``` spark-sql> create or replace temp view v1 as > select * FROM VALUES > (interval '10' months, interval '10' day, 2) > as v1(period, duration, num); Time taken: 2.81 seconds spark-sql> cache table v1; Time taken: 2.184 seconds spark-sql> select period/(num + 3) from v1; 22/05/03 08:56:37 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 40, Column 44: Expression "project_value_2" is not an rvalue ... 22/05/03 08:56:37 WARN UnsafeProjection: Expr codegen error and falling back to interpreter mode ... 0-2 Time taken: 0.149 seconds, Fetched 1 row(s) spark-sql> select duration/(num + 3) from v1; 22/05/03 08:57:29 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 40, Column 54: Expression "project_value_2" is not an rvalue ... 22/05/03 08:57:29 WARN UnsafeProjection: Expr codegen error and falling back to interpreter mode ... 2 00:00:00.000000000 Time taken: 0.089 seconds, Fetched 1 row(s) ``` The error is not fatal (unless you have `spark.sql.codegen.fallback` set to `false`), but it muddies the log and can slow the query (since the expression is interpreted). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests (unit tests run with `spark.sql.codegen.fallback` set to `false`, so the new tests fail without the fix). Closes #36442 from bersprockets/interval_div_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit ca87bead23ca32a05c6a404a91cea47178f63e70) Signed-off-by: Gengliang Wang <gengliang@apache.org> 04 May 2022, 09:22:24 UTC
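A self-contained toy (none of this is Spark's actual codegen API) showing why generating the declaration from one `genCode`-style call and the usage from another yields a reference to an undeclared variable, which is exactly the "is not an rvalue" compile error quoted above:

```scala
// Each call mints a fresh variable name, mimicking how repeated genCode calls may not
// return the same name when the operand is not a simple bound reference.
object ToyCodegen {
  private var counter = 0
  def genCode(): String = { counter += 1; s"value_$counter" }

  def main(args: Array[String]): Unit = {
    val declared = genCode() // the wrapper (think nullSafeCodeGen) declares this name...
    val used     = genCode() // ...but the operator body was built from a second call
    println(s"int $declared = computeOperand();")
    println(s"int result = $used / 4;") // refers to an undeclared name -> would not compile
  }
}
```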
d3aadb4 [SPARK-39087][SQL][3.3] Improve messages of error classes ### What changes were proposed in this pull request? In the PR, I propose to modify error messages of the following error classes: - INVALID_JSON_SCHEMA_MAP_TYPE - INCOMPARABLE_PIVOT_COLUMN - INVALID_ARRAY_INDEX_IN_ELEMENT_AT - INVALID_ARRAY_INDEX - DIVIDE_BY_ZERO This is a backport of https://github.com/apache/spark/pull/36428. ### Why are the changes needed? To improve readability of error messages. ### Does this PR introduce _any_ user-facing change? Yes. It changes user-facing error messages. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "sql/testOnly *QueryCompilationErrorsSuite*" $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite*" $ build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite" $ build/sbt "test:testOnly *SparkThrowableSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 040526391a45ad610422a48c05aa69ba5133f922) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36439 from MaxGekk/error-class-improve-msg-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 04 May 2022, 05:45:03 UTC
0515536 Preparing development version 3.3.1-SNAPSHOT 03 May 2022, 18:15:51 UTC