https://github.com/apache/spark

Revision Author Date Message Commit Date
29853ec Preparing Spark release v3.0.0-rc2 18 May 2020, 13:21:37 UTC
740da34 [SPARK-31738][SQL][DOCS] Describe 'L' and 'M' month pattern letters ### What changes were proposed in this pull request? 1. Describe standard 'M' and stand-alone 'L' text forms 2. Add examples for all supported number of month letters <img width="1047" alt="Screenshot 2020-05-18 at 08 57 31" src="https://user-images.githubusercontent.com/1580697/82178856-b16f1000-98e5-11ea-87c0-456ef94dcd43.png"> ### Why are the changes needed? To improve docs and show how to use month patterns. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By building docs and checking by eyes. Closes #28558 from MaxGekk/describe-L-M-date-pattern. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b3686a762281ce9bf595bf790f8e4198d3d186b4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 May 2020, 12:07:13 UTC
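A minimal spark-shell sketch of the two forms this doc change describes (not taken from the PR; outputs assume Spark's default US locale, and the distinction between the forms only becomes visible in locales that inflect month names):

```scala
// 'M' is the standard (formatting) month form, 'L' the stand-alone form; with the US locale both
// print the plain month name, while e.g. many Slavic locales render them differently.
spark.sql("SELECT date_format(date'2020-05-18', 'MMMM')").show()  // May (standard full name)
spark.sql("SELECT date_format(date'2020-05-18', 'LLLL')").show()  // stand-alone full name
spark.sql("SELECT date_format(date'2020-05-18', 'MM')").show()    // 05 (two letters pad with zero)
```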
2cdf4eb [SPARK-31746][YARN][TESTS] Show the actual error message in LocalityPlacementStrategySuite This PR proposes to show the actual traceback when "handle large number of containers and tasks (SPARK-18750)" test fails in `LocalityPlacementStrategySuite`. **It does not fully resolve the JIRA SPARK-31746 yet**. I tried to reproduce in my local by controlling the factors in the tests but I couldn't. I double checked the changes in SPARK-18750 are still valid. This test is flaky for an unknown reason (see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122768/testReport/org.apache.spark.deploy.yarn/LocalityPlacementStrategySuite/handle_large_number_of_containers_and_tasks__SPARK_18750_/): ``` sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError did not equal null at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) ``` After this PR, it will help to investigate the root cause: **Before**: ``` [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (824 milliseconds) [info] java.lang.StackOverflowError did not equal null (LocalityPlacementStrategySuite.scala:49) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) [info] at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:49) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157) [info] at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) [info] at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) [info] at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) ... 
``` **After**: ``` [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (825 milliseconds) [info] StackOverflowError should not be thrown; however, got: [info] [info] java.lang.StackOverflowError [info] at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) ... ``` No, dev-only. Manually tested by reverting https://github.com/apache/spark/commit/76db394f2baedc2c7b7a52c05314a64ec9068263 locally. Closes #28566 from HyukjinKwon/SPARK-31746. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 3bf7bf99e96fab754679a4f3c893995263161341) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 18 May 2020, 05:56:24 UTC
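The reporting change boils down to failing with the captured error's full stack trace instead of asserting it equals null. A minimal sketch of that pattern, assuming a ScalaTest suite that captured the background error into a variable (illustrative, not the suite's exact code):

```scala
import org.scalatest.Assertions.fail

// If the background placement work captured a Throwable, surface its full trace in the test report
// instead of the uninformative "java.lang.StackOverflowError did not equal null".
def reportIfFailed(error: Throwable): Unit = if (error != null) {
  val sw = new java.io.StringWriter()
  error.printStackTrace(new java.io.PrintWriter(sw, true))
  fail(s"${error.getClass.getSimpleName} should not be thrown; however, got:\n\n$sw")
}
```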
88e00c3 Revert "[SPARK-31746][YARN][TESTS] Show the actual error message in LocalityPlacementStrategySuite" This reverts commit cbd8568ad7588cf14e5519c5aabf88d0b7fb0e33. 18 May 2020, 05:55:31 UTC
cbd8568 [SPARK-31746][YARN][TESTS] Show the actual error message in LocalityPlacementStrategySuite ### What changes were proposed in this pull request? This PR proposes to show the actual traceback when "handle large number of containers and tasks (SPARK-18750)" test fails in `LocalityPlacementStrategySuite`. **It does not fully resolve the JIRA SPARK-31746 yet**. I tried to reproduce in my local by controlling the factors in the tests but I couldn't. I double checked the changes in SPARK-18750 are still valid. ### Why are the changes needed? This test is flaky for an unknown reason (see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122768/testReport/org.apache.spark.deploy.yarn/LocalityPlacementStrategySuite/handle_large_number_of_containers_and_tasks__SPARK_18750_/): ``` sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError did not equal null at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) ``` After this PR, it will help to investigate the root cause: **Before**: ``` [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (824 milliseconds) [info] java.lang.StackOverflowError did not equal null (LocalityPlacementStrategySuite.scala:49) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) [info] at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:49) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157) [info] at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) [info] at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) [info] at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) ... 
``` **After**: ``` [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (825 milliseconds) [info] StackOverflowError should not be thrown; however, got: [info] [info] java.lang.StackOverflowError [info] at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) ... ``` ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested by reverting https://github.com/apache/spark/commit/76db394f2baedc2c7b7a52c05314a64ec9068263 locally. Closes #28566 from HyukjinKwon/SPARK-31746. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 3bf7bf99e96fab754679a4f3c893995263161341) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 18 May 2020, 05:35:23 UTC
88630a3 [SPARK-31399][CORE][TEST-HADOOP3.2][TEST-JAVA11] Support indylambda Scala closure in ClosureCleaner ### What changes were proposed in this pull request? This PR proposes to enhance Spark's `ClosureCleaner` to support "indylambda" style of Scala closures to the same level as the existing implementation for the old (inner class) style ones. The goal is to reach feature parity with the support of the old style Scala closures, with as close to bug-for-bug compatibility as possible. Specifically, this PR addresses one lacking support for indylambda closures vs the inner class closures: - When a closure is declared in a Scala REPL and captures the enclosing REPL line object, such closure should be cleanable (unreferenced fields on the enclosing REPL line object should be cleaned) This PR maintains the same limitations in the new indylambda closure support as the old inner class closures, in particular the following two: - Cleaning is only available for one level of REPL line object. If a closure captures state from a REPL line object further out from the immediate enclosing one, it won't be subject to cleaning. See example below. - "Sibling" closures are not handled yet. A "sibling" closure is defined here as a closure that is directly or indirectly referenced by the starting closure, but isn't lexically enclosing. e.g. ```scala { val siblingClosure = (x: Int) => x + this.fieldA // captures `this`, references `fieldA` on `this`. val startingClosure = (y: Int) => y + this.fieldB + siblingClosure(y) // captures `this` and `siblingClosure`, references `fieldB` on `this`. } ``` The changes are intended to be minimal, with further code cleanups planned in separate PRs. Jargons: - old, inner class style Scala closures, aka `delambdafy:inline`: default in Scala 2.11 and before - new, "indylambda" style Scala closures, aka `delambdafy:method`: default in Scala 2.12 and later ### Why are the changes needed? There had been previous effortsto extend Spark's `ClosureCleaner` to support "indylambda" Scala closures, which is necessary for proper Scala 2.12 support. Most notably the work done for [SPARK-14540](https://issues.apache.org/jira/browse/SPARK-14540). But the previous efforts had missed one import scenario: a Scala closure declared in a Scala REPL, and it captures the enclosing `this` -- a REPL line object. e.g. in a Spark Shell: ```scala :pa class NotSerializableClass(val x: Int) val ns = new NotSerializableClass(42) val topLevelValue = "someValue" val func = (j: Int) => { (1 to j).flatMap { x => (1 to x).map { y => y + topLevelValue } } } <Ctrl+D> sc.parallelize(0 to 2).map(func).collect ``` In this example, `func` refers to a Scala closure that captures the enclosing `this` because it needs to access `topLevelValue`, which is in turn implemented as a field on the enclosing REPL line object. The existing `ClosureCleaner` in Spark supports cleaning this case in Scala 2.11-, and this PR brings feature parity to Scala 2.12+. Note that the existing cleaning logic only supported one level of REPL line object nesting. This PR does not go beyond that. When a closure references state declared a few commands earlier, the cleaning will fail in both Scala 2.11 and Scala 2.12. e.g. ```scala scala> :pa // Entering paste mode (ctrl-D to finish) class NotSerializableClass1(val x: Int) case class Foo(id: String) val ns = new NotSerializableClass1(42) val topLevelValue = "someValue" // Exiting paste mode, now interpreting. 
defined class NotSerializableClass1 defined class Foo ns: NotSerializableClass1 = NotSerializableClass1615b1baf topLevelValue: String = someValue scala> :pa // Entering paste mode (ctrl-D to finish) val closure2 = (j: Int) => { (1 to j).flatMap { x => (1 to x).map { y => y + topLevelValue } // 2 levels } } // Exiting paste mode, now interpreting. closure2: Int => scala.collection.immutable.IndexedSeq[String] = <function1> scala> sc.parallelize(0 to 2).map(closure2).collect org.apache.spark.SparkException: Task not serializable ... ``` in the Scala 2.11 / Spark 2.4.x case: ``` Caused by: java.io.NotSerializableException: NotSerializableClass1 Serialization stack: - object not serializable (class: NotSerializableClass1, value: NotSerializableClass1615b1baf) - field (class: $iw, name: ns, type: class NotSerializableClass1) - object (class $iw, $iw64df3f4b) - field (class: $iw, name: $iw, type: class $iw) - object (class $iw, $iw66e6e5e9) - field (class: $line14.$read, name: $iw, type: class $iw) - object (class $line14.$read, $line14.$readc310aa3) - field (class: $iw, name: $line14$read, type: class $line14.$read) - object (class $iw, $iw79224636) - field (class: $iw, name: $outer, type: class $iw) - object (class $iw, $iw636d4cdc) - field (class: $anonfun$1, name: $outer, type: class $iw) - object (class $anonfun$1, <function1>) ``` in the Scala 2.12 / Spark master case after this PR: ``` Caused by: java.io.NotSerializableException: NotSerializableClass1 Serialization stack: - object not serializable (class: NotSerializableClass1, value: NotSerializableClass16f3b4c9a) - field (class: $iw, name: ns, type: class NotSerializableClass1) - object (class $iw, $iw2945a3c1) - field (class: $iw, name: $iw, type: class $iw) - object (class $iw, $iw152705d0) - field (class: $line14.$read, name: $iw, type: class $iw) - object (class $line14.$read, $line14.$read7cf311eb) - field (class: $iw, name: $line14$read, type: class $line14.$read) - object (class $iw, $iwd980dac) - field (class: $iw, name: $outer, type: class $iw) - object (class $iw, $iw557d9532) - element of array (index: 0) - array (class [Ljava.lang.Object;, size 1) - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;) - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class $iw, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic $anonfun$closure2$1$adapted:(L$iw;Ljava/lang/Object;)Lscala/collection/immutable/IndexedSeq;, instantiatedMethodType=(Ljava/lang/Object;)Lscala/collection/immutable/IndexedSeq;, numCaptured=1]) - writeReplace data (class: java.lang.invoke.SerializedLambda) - object (class $Lambda$2103/815179920, $Lambda$2103/815179920569b57c4) ``` For more background of the new and old ways Scala lowers closures to Java bytecode, please see [A note on how NSC (New Scala Compiler) lowers lambdas](https://gist.github.com/rednaxelafx/e9ecd09bbd1c448dbddad4f4edf25d48#file-notes-md). For more background on how Spark's `ClosureCleaner` works and what's needed to make it support "indylambda" Scala closures, please refer to [A Note on Apache Spark's ClosureCleaner](https://gist.github.com/rednaxelafx/e9ecd09bbd1c448dbddad4f4edf25d48#file-spark_closurecleaner_notes-md). 
#### tl;dr The `ClosureCleaner` works like a mark-sweep algorithm on fields: - Finding (a chain of) outer objects referenced by the starting closure; - Scanning the starting closure and its inner closures and marking the fields on the outer objects accessed; - Cloning the outer objects, nulling out fields that are not accessed by any closure of concern. ##### Outer Objects For the old, inner class style Scala closures, the "outer objects" is defined as the lexically enclosing closures of the starting closure, plus an optional enclosing REPL line object if these closures are defined in a Scala REPL. All of them are on a singly-linked `$outer` chain. For the new, "indylambda" style Scala closures, the capturing implementation changed, so closures no longer refer to their enclosing closures via an `$outer` chain. However, a closure can still capture its enclosing REPL line object, much like the old style closures. The name of the field that captures this reference would be `arg$1` (instead of `$outer`). So what's missing in the `ClosureCleaner` for the "indylambda" support is find and potentially clone+clean the captured enclosing `this` REPL line object. That's what this PR implements. ##### Inner Closures The old, inner class style of Scala closures are compiled into separate inner classes, one per lambda body. So in order to discover the implementation (bytecode) of the inner closures, one has to jump over multiple classes. The name of such a class would contain the marker substring `$anonfun$`. The new, "indylambda" style Scala closures are compiled into **static methods** in the class where the lambdas were declared. So for lexically nested closures, their lambda bodies would all be compiled into static methods **in the same class**. This makes it much easier to discover the implementation (bytecode) of the nested lambda bodies. The name of such a static method would contain the marker substring `$anonfun$`. Discovery of inner closures involves scanning bytecode for certain patterns that represent the creation of a closure object for the inner closure. - For inner class style: the closure object creation site is like `new <InnerClassForTheClosure>(captured args)` - For "indylambda" style: the closure object creation site would be compiled into an `invokedynamic` instruction, with its "bootstrap method" pointing to the same one used by Java 8 for its serializable lambdas, and with the bootstrap method arguments pointing to the implementation method. ### Does this PR introduce _any_ user-facing change? Yes. Before this PR, Spark 2.4 / 3.0 / master on Scala 2.12 would not support Scala closures declared in a Scala REPL that captures anything from the REPL line objects. After this PR, such scenario is supported. ### How was this patch tested? Added new unit test case to `org.apache.spark.repl.SingletonReplSuite`. The new test case fails without the fix in this PR, and pases with the fix. Closes #28463 from rednaxelafx/closure-cleaner-indylambda. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit dc01b7556f74e4a9873ceb1f78bc7df4e2ab4a8a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 May 2020, 05:33:07 UTC
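For readers unfamiliar with the "indylambda" shape described above, the sketch below shows how a serializable Scala 2.12 lambda exposes its captured arguments through the standard `writeReplace` hook; the helper name is illustrative and is not Spark's internal API:

```scala
import java.lang.invoke.SerializedLambda

// A Scala 2.12 lambda compiled via LambdaMetafactory carries a private writeReplace method that
// returns a SerializedLambda; a captured REPL line object (the `arg$1` reference mentioned above)
// shows up among its captured arguments.
def serializationProxy(closure: AnyRef): SerializedLambda = {
  val m = closure.getClass.getDeclaredMethod("writeReplace")
  m.setAccessible(true)
  m.invoke(closure).asInstanceOf[SerializedLambda]
}
// serializationProxy(func).getCapturedArg(0)  // the enclosing `this`, e.g. a REPL line object, if captured
```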
3855b39 [SPARK-31740][K8S][TESTS] Use github URL instead of a broken link This PR aims to use GitHub URL instead of a broken link in `BasicTestsSuite.scala`. Currently, K8s integration test is broken: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20K8s%20Builds/job/spark-master-test-k8s/534/console ``` - Run SparkRemoteFileTest using a remote data file *** FAILED *** The code passed to eventually never returned normally. Attempted 130 times over 2.00109555135 minutes. Last failure message: false was not true. (KubernetesSuite.scala:370) ``` No. Pass the K8s integration test. Closes #28561 from williamhyun/williamhyun-patch-1. Authored-by: williamhyun <62487364+williamhyun@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5bb1a09b5f3a0f91409c7245847ab428c3c58322) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 May 2020, 05:13:45 UTC
5955933 [SPARK-31727][SQL] Fix error message of casting timestamp to int in ANSI non-codegen mode ### What changes were proposed in this pull request? Change timestamp casting to int in ANSI and non-codegen mode, and make the error message consistent with the error messages in the codegen mode. In particular, casting to int is implemented in the same way as casting to short and byte. ### Why are the changes needed? 1. The error message in the non-codegen mode diverges from the error message in the codegen mode. 2. The error message contains intermediate results that could be confusing. ### Does this PR introduce _any_ user-facing change? Yes. Before the changes, the error message of casting timestamp to int contains an intermediate result, but after the changes it contains the input value that causes the arithmetic overflow. ### How was this patch tested? By running the modified test suite `AnsiCastSuite`. Closes #28549 from MaxGekk/fix-error-msg-cast-timestamp. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fc5b90243ca1cd460988735dd4170b533d73e8e5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 May 2020, 05:01:04 UTC
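For context, a cast of the shape below exercises the changed path. This is a hedged illustration, assuming `spark.sql.ansi.enabled=true`, not an excerpt from the PR; the exact message wording is what the patch aligns between the two modes:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// The epoch-second value of this timestamp does not fit into Int, so ANSI mode raises an overflow
// error, which now reports the input value rather than an internal intermediate result.
spark.sql("SELECT CAST(TIMESTAMP '9999-12-31 23:59:59' AS INT)").show()
```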
a111597 [SPARK-31743][CORE] Add spark_info metric into PrometheusResource ### What changes were proposed in this pull request? This PR aims to add `spark_info` metric into `PrometheusResource`. ### Why are the changes needed? This exposes Apache Spark version and revision like the following. ![Screen Shot 2020-05-17 at 6 02 20 PM](https://user-images.githubusercontent.com/9700541/82165091-990ce000-9868-11ea-82d5-8ea344eef646.png) ![Screen Shot 2020-05-17 at 6 06 32 PM](https://user-images.githubusercontent.com/9700541/82165247-2cdeac00-9869-11ea-83aa-e8083fa12a92.png) ### Does this PR introduce _any_ user-facing change? Yes, but it's exposed as an additional metric. ### How was this patch tested? Manual. ``` $ bin/spark-shell --driver-memory 4G -c spark.ui.prometheus.enabled=true $ curl -s http://localhost:4041/metrics/executors/prometheus/ | head -n1 spark_info{version="3.1.0", revision="097d5098cca987e5f7bbb8394783c01517ebed0f"} 1.0 ``` Closes #28563 from dongjoon-hyun/SPARK-31743. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 64795f9e0c85a999bf808432d0d533843bea0a31) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 May 2020, 04:35:52 UTC
e56f9ab [SPARK-31742][TESTS] Increase the eventually time limit for Mino kdc in tests to fix flakiness ### What changes were proposed in this pull request? This PR is kind of a follow up of SPARK-31631. In some cases, it only attempts once for ~35 seconds. Seems 10 seconds are not enough to try multiple times - took a quick look and seems difficult to manipulate the socket configuration as well. It simply proposes to increase the time limit for now. It affects master and branch-3.0. ``` sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 1 times over 34.294744142999996 seconds. Last failure message: Address already in use. at org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432) at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439) at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479) at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:308) at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479) at org.apache.spark.deploy.security.HadoopDelegationTokenManagerSuite.$anonfun$new$4(HadoopDelegationTokenManagerSuite.scala:106) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157) at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59) at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59) at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393) at scala.collection.immutable.List.foreach(List.scala:392) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381) at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458) at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) at org.scalatest.Suite.run(Suite.scala:1124) at org.scalatest.Suite.run$(Suite.scala:1106) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233) at org.scalatest.SuperEngine.runImpl(Engine.scala:518) at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233) at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:59) at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:59) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) at sbt.ForkMain$Run$2.call(ForkMain.java:296) at sbt.ForkMain$Run$2.call(ForkMain.java:286) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: sbt.ForkMain$ForkError: java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:433) at sun.nio.ch.Net.bind(Net.java:425) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:198) at org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:51) at org.apache.mina.core.polling.AbstractPollingIoAcceptor.registerHandles(AbstractPollingIoAcceptor.java:547) at org.apache.mina.core.polling.AbstractPollingIoAcceptor.access$400(AbstractPollingIoAcceptor.java:68) at org.apache.mina.core.polling.AbstractPollingIoAcceptor$Acceptor.run(AbstractPollingIoAcceptor.java:422) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) ... 3 more ``` ### Why are the changes needed? To fix flakiness in the tests. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Jenkins will test it out. Closes #28562 from HyukjinKwon/SPARK-31742. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c6d13099624d0a7a73bfad29aa1fa70444ba9411) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 May 2020, 04:33:55 UTC
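The fix itself is essentially a longer ScalaTest `eventually` patience. A minimal sketch of the pattern, with an illustrative helper rather than the suite's real code:

```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Give the KDC startup far more than ~10 seconds and keep retrying, so a transient
// "Address already in use" can clear between attempts.
def startWithRetries(startKdc: () => Unit): Unit =
  eventually(timeout(2.minutes), interval(1.second)) { startKdc() }
```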
ebea296 [SPARK-31744][R][INFRA] Remove Hive dependency in AppVeyor build temporarily ### What changes were proposed in this pull request? This PR targets to remove Hive profile in SparkR build at AppVeyor in order to: - Speed up the build. Currently, SparkR build is [reaching the time limit](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/32853533). - Clean up the build profile. ### Why are the changes needed? We're hitting a time limit issue again and this PR could reduce the build time. Seems like we're [already skipping Hive related tests in SparkR](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/32853533) for some reasons, see below: ``` test_sparkSQL.R:307: skip: create DataFrame from RDD Reason: Hive is not build with SparkSQL, skipped test_sparkSQL.R:1341: skip: test HiveContext Reason: Hive is not build with SparkSQL, skipped test_sparkSQL.R:2813: skip: read/write ORC files Reason: Hive is not build with SparkSQL, skipped test_sparkSQL.R:2834: skip: read/write ORC files - compression option Reason: Hive is not build with SparkSQL, skipped test_sparkSQL.R:3727: skip: enableHiveSupport on SparkSession Reason: Hive is not build with SparkSQL, skipped ``` Although we build with Hive profile. So, the Hive profile is useless here. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? AppVeyor will test it out. Closes #28564 from HyukjinKwon/SPARK-31744. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit f352cef077456a7ee3fa44ca1e55b1545c9633c5) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 May 2020, 04:31:18 UTC
afe2247 [SPARK-31405][SQL][3.0] Fail by default when reading/writing legacy datetime values from/to Parquet/Avro files ### What changes were proposed in this pull request? When reading/writing datetime values that fall before the rebase switch day from/to Avro/Parquet files, fail by default and ask users to set a config to explicitly choose whether to rebase or not. ### Why are the changes needed? Rebasing or not rebasing leads to different behaviors, and we should let users decide explicitly. In most cases, users won't hit this exception as it only affects ancient datetime values. ### Does this PR introduce _any_ user-facing change? Yes, now users will see an error when reading/writing dates before 1582-10-15 or timestamps before 1900-01-01 from/to Parquet/Avro files, with an error message asking them to set a config. ### How was this patch tested? Updated tests. Closes #28526 from cloud-fan/backport. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 May 2020, 02:32:39 UTC
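A hedged spark-shell sketch of the new default; the exception class and config name here are recalled from Spark 3.0 and should be treated as assumptions to verify against the PR, and the output path is arbitrary:

```scala
import spark.implicits._

// Writing an ancient date now fails by default with an upgrade exception that asks the user to
// choose a rebase mode instead of silently picking one.
Seq(java.sql.Date.valueOf("1500-01-01")).toDF("d").write.parquet("/tmp/ancient_dates")
// => org.apache.spark.SparkUpgradeException, resolved by e.g.
// spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")  // or "LEGACY"
```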
3b2d394 [SPARK-31707][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax ### What changes were proposed in this pull request? This patch effectively reverts SPARK-30098 via the changes below: * Removed the config * Removed the changes done in the parser rule * Removed the usage of the config in tests * Removed tests which depend on the config * Rolled back some tests to before SPARK-30098 which were affected by SPARK-30098 * Reflected the change in the docs (migration doc, create table syntax) ### Why are the changes needed? SPARK-30098 brought confusion and frustration around using the CREATE TABLE DDL query, and we agreed on the bad effect of the change. Please go through the [discussion thread](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html) to see the details. ### Does this PR introduce _any_ user-facing change? No, compared to Spark 2.4.x. End users who experimented with the Spark 3.0.0 previews will see the behavior going back to Spark 2.4.x, but I believe we won't guarantee compatibility in preview releases. ### How was this patch tested? Existing UTs. Closes #28517 from HeartSaVioR/revert-SPARK-30098. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d2bec5e265e0aa4fa527c3f43cfe738cdbdc4598) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 May 2020, 02:27:36 UTC
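An illustration of what the revert means for the two CREATE TABLE forms (a sketch, not taken from the PR; table names are arbitrary):

```scala
// Without USING, CREATE TABLE is resolved as a Hive serde table again, matching the Spark 2.4 default
// (and requiring Hive support); a native datasource must be requested explicitly.
spark.sql("CREATE TABLE t_hive (i INT)")                   // Hive serde table (2.4-compatible default)
spark.sql("CREATE TABLE t_parquet (i INT) USING parquet")  // datasource table, chosen explicitly
```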
4f5df2c [SPARK-31725][CORE][SQL][TESTS] Set America/Los_Angeles time zone and Locale.US in tests by default ### What changes were proposed in this pull request? Set the default time zone and locale in the default constructor of `SparkFunSuite`: - Default time zone to `America/Los_Angeles` - Default locale to `Locale.US` ### Why are the changes needed? 1. To deduplicate code by moving common time zone and locale settings to one place, `SparkFunSuite`. 2. To have the same default time zone and locale in all tests. This should prevent errors like https://github.com/apache/spark/pull/28538 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running all affected test suites. Closes #28548 from MaxGekk/timezone-settings-SparkFunSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5539ecfdac6c06285c7449494196ea3b4eb4cf87) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 May 2020, 02:26:11 UTC
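Conceptually the change pins the JVM defaults once in the shared suite constructor. A minimal sketch of the idea (not the exact `SparkFunSuite` code):

```scala
import java.util.{Locale, TimeZone}

// Pin the JVM-wide defaults so every test sees the same time zone and locale.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
Locale.setDefault(Locale.US)
```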
e790206 [SPARK-31655][BUILD][3.0] Upgrade snappy-java to 1.1.7.5 ### What changes were proposed in this pull request? snappy-java has released v1.1.7.5; upgrade to the latest version. Fixed in v1.1.7.4 - Caching internal buffers for SnappyFramed streams #234 - Fixed the native lib for ppc64le to work with glibc 2.17 (Previously it depended on 2.22) Fixed in v1.1.7.5 - Fixes java.lang.NoClassDefFoundError: org/xerial/snappy/pool/DefaultPoolFactory in 1.1.7.4 https://github.com/xerial/snappy-java/compare/1.1.7.3...1.1.7.5 v1.1.7.5 release note: https://github.com/xerial/snappy-java/commit/edc4ec28bdb15a32b6c41ca9e8b195e635bec3a3 ### Why are the changes needed? Fix bugs (see above). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No need Closes #28509 from AngersZhuuuu/SPARK-31655-3.0. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 May 2020, 20:23:49 UTC
dfcba55 [SPARK-31732][TESTS] Disable some flaky tests temporarily ### What changes were proposed in this pull request? It's quite annoying to be blocked by flaky tests in several PRs. This PR disables them. The tests come from 3 PRs I'm recently watching: https://github.com/apache/spark/pull/28526 https://github.com/apache/spark/pull/28463 https://github.com/apache/spark/pull/28517 ### Why are the changes needed? To make PR builder more stable ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #28547 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 2012d5847520c8aba54e8e3e6a634976a3c7657d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 May 2020, 14:34:15 UTC
847eec4 [SPARK-31289][TEST][TEST-HIVE1.2] Eliminate org.apache.spark.sql.hive.thriftserver.CliSuite flakiness ### What changes were proposed in this pull request? CliSuite seems to be flaky while using a metastoreDir per test. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120470/testReport/org.apache.spark.sql.hive.thriftserver/CliSuite/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120470/testReport/junit/org.apache.spark.sql.hive.thriftserver/CliSuite/history/ According to the error stack trace in the failed test, the test failed to instantiate a Hive metastore client because of Derby requirements. ```scala Caused by: ERROR XBM0A: The database directory '/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-9249ce52-0a06-42b6-a3df-e6295e880df0' exists. However, it does not contain the expected 'service.properties' file. Perhaps Derby was brought down in the middle of creating this database. You may want to delete this directory and try creating the database again. ``` Derby requires that the metastore dir not exist, but it does exist, probably because the preceding test case failed to clear the metastore dir. In this PR, the metastore is shared across the tests of CliSuite except those that explicitly ask for a separate metastore env of their own. ### Why are the changes needed? CliSuite seems to be flaky while using a metastoreDir per test. To eliminate test flakiness. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Modified test Closes #28055 from yaooqinn/clisuite. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1d66085a93a875247f19d710a5b5458ce1842c73) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 May 2020, 09:11:32 UTC
f9486d5 [SPARK-31663][SQL] Grouping sets with having clause returns the wrong result - Resolve the having condition while expanding the GROUPING SETS/CUBE/ROLLUP expressions together in `ResolveGroupingAnalytics`: - Change the operation's resolving direction to top-down. - Try resolving the condition of the filter as though it is in the aggregate clause by reusing the function in `ResolveAggregateFunctions` - Push the aggregate expressions into the aggregate which contains the expanded operations. - Use UnresolvedHaving for all having clauses. Correctness bug fix. See the demo and analysis in SPARK-31663. Yes, correctness bug fix for HAVING with GROUPING SETS. New UTs added. Closes #28501 from xuanyuanking/SPARK-31663. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 86bd37f37eb1e534c520dc9a02387debf9fa05a1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 May 2020, 05:02:22 UTC
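An illustration of the affected query shape: a GROUPING SETS aggregate filtered by a HAVING clause over an aggregate function. The data and threshold below are made up; see SPARK-31663 for the actual reproduction:

```scala
spark.sql("""
  SELECT k, SUM(v) AS s
  FROM VALUES ('a', 1), ('a', 2), ('b', 3) AS t(k, v)
  GROUP BY k GROUPING SETS ((k), ())
  HAVING SUM(v) > 2
""").show()
```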
bdb5785 [SPARK-31620][SQL] Fix reference binding failure in case a final agg contains a subquery ### What changes were proposed in this pull request? Instead of using `child.output` directly, we should use `inputAggBufferAttributes` from the current agg expression for `Final` and `PartialMerge` aggregates to bind references for their `mergeExpression`. ### Why are the changes needed? When planning aggregates, the partial aggregate uses agg funcs' `inputAggBufferAttributes` as its output, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L105 For the final `HashAggregateExec`, we need to bind the `DeclarativeAggregate.mergeExpressions` with the output of the partial aggregate operator, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L348 This is usually fine. However, if we copy the agg func somehow after agg planning, like in `PlanSubqueries`, the `DeclarativeAggregate` will be replaced by a new instance with new `inputAggBufferAttributes` and `mergeExpressions`. Then we can't bind the `mergeExpressions` with the output of the partial aggregate operator, as it uses the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before the copy. Note that `ImperativeAggregate` doesn't have this problem, as we don't need to bind its `mergeExpressions`. It has a different mechanism to access buffer values, via `mutableAggBufferOffset` and `inputAggBufferOffset`. ### Does this PR introduce _any_ user-facing change? Yes, users previously hit an error but can run the query successfully after this change. ### How was this patch tested? Added a regression test. Closes #28496 from Ngone51/spark-31620. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d8b001fa872f735df4344321c33780a892da9b41) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 May 2020, 15:36:41 UTC
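A hedged sketch of the kind of query that exercises the bug: a final aggregate whose expression embeds a scalar subquery, so `PlanSubqueries` re-instantiates the `DeclarativeAggregate` after planning. The table and column names are illustrative, not the regression test itself:

```scala
// The outer SUM is planned as partial + final aggregates; the scalar subquery inside it is planned
// afterwards, which is where the copied agg function used to lose its original inputAggBufferAttributes.
spark.sql("SELECT SUM(IF(v > (SELECT AVG(v) FROM t), v, 0)) FROM t").show()
```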
435f126 [SPARK-31716][SQL] Use fallback versions in HiveExternalCatalogVersionsSuite ### What changes were proposed in this pull request? This PR aims to provide a fallback version instead of `Nil` in `HiveExternalCatalogVersionsSuite`. The provided fallback Spark versions recover Jenkins jobs instead of letting them fail. ### Why are the changes needed? Currently, `HiveExternalCatalogVersionsSuite` is aborted in all Jenkins jobs except JDK11 Jenkins jobs, which don't have old Spark releases supporting JDK11. ``` HiveExternalCatalogVersionsSuite: org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** Exception encountered when invoking run on a nested suite - Fail to get the lates Spark versions to test. (HiveExternalCatalogVersionsSuite.scala:180) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins. Closes #28536 from dongjoon-hyun/SPARK-HiveExternalCatalogVersionsSuite. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5d90886523c415768c65ea9cba7db24bc508a23b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 May 2020, 07:30:39 UTC
d270be4 [SPARK-31715][SQL][TEST] Fix flaky SparkSQLEnvSuite that sometimes varies single derby instance standard ### What changes were proposed in this pull request? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122622/testReport/junit/org.apache.spark.sql.hive.thriftserver/SparkSQLEnvSuite/SPARK_29604_external_listeners_should_be_initialized_with_Spark_classloader/history/?start=25 According to the test report history of SparkSQLEnvSuite,this test fails frequently which is caused by single derby instance restriction. ```java Caused by: sbt.ForkMain$ForkError: org.apache.derby.iapi.error.StandardException: Another instance of Derby may have already booted the database /home/jenkins/workspace/SparkPullRequestBuilder/sql/hive-thriftserver/metastore_db. at org.apache.derby.iapi.error.StandardException.newException(Unknown Source) at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source) at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source) at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source) at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source) at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source) at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source) at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source) at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source) at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source) at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source) at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source) at org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source) at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source) at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(Unknown Source) at org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(Unknown Source) at org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Unknown Source) ... 138 more ``` This PR adds a separate directory to locate the metastore_db for this test which runs in a dedicated JVM. Besides, diable the UI for the potential race on `spark.ui.port` which may also let the test case become flaky. ### Why are the changes needed? test fix ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? SparkSQLEnvSuite itself. Closes #28537 from yaooqinn/SPARK-31715. 
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 503faa24d33eb52da79ad99e39c8c011597499ea) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 May 2020, 06:36:43 UTC
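A sketch of the isolation idea described above; the property names below are illustrative of the intent (a private Derby location plus a disabled UI), not the suite's actual code:

```scala
// Point Derby at a per-JVM temporary directory so concurrent suites cannot fight over metastore_db,
// and turn off the UI so there is no race on spark.ui.port.
val derbyHome = java.nio.file.Files.createTempDirectory("metastore_db").toFile
System.setProperty("derby.system.home", derbyHome.getCanonicalPath)
System.setProperty("spark.ui.enabled", "false")
```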
635feaa [SPARK-31712][SQL][TESTS] Check casting timestamps before the epoch to Byte/Short/Int/Long types ### What changes were proposed in this pull request? Added tests to check casting timestamps before 1970-01-01 00:00:00Z to ByteType, ShortType, IntegerType and LongType in ansi and non-ansi modes. ### Why are the changes needed? To improve test coverage and prevent errors while modifying the CAST expression code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suites: ``` $ ./build/sbt "test:testOnly *CastSuite" ``` Closes #28531 from MaxGekk/test-cast-timestamp-to-byte. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c7ce37dfa713f80c5f0157719f0e3d9bf0d271dd) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 May 2020, 04:25:17 UTC
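One of the values such tests exercise, shown here as an illustrative example (assuming a UTC session time zone):

```scala
spark.conf.set("spark.sql.session.timeZone", "UTC")
// One second before the epoch: casting to LONG yields the (negative) epoch seconds.
spark.sql("SELECT CAST(TIMESTAMP '1969-12-31 23:59:59' AS LONG)").show()  // -1
```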
8ba1578 [SPARK-31713][INFRA] Make test-dependencies.sh detect version string correctly ### What changes were proposed in this pull request? This PR makes `test-dependencies.sh` detect the version string correctly by ignoring all the other lines. ### Why are the changes needed? Currently, all SBT jobs are broken like the following. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-3.0-test-sbt-hadoop-3.2-hive-2.3/476/console ``` [error] running /home/jenkins/workspace/spark-branch-3.0-test-sbt-hadoop-3.2-hive-2.3/dev/test-dependencies.sh ; received return code 1 Build step 'Execute shell' marked build as failure ``` The reason is that the script detects the old version like `Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT` when `build/mvn` did fallback. Specifically, in the script, `OLD_VERSION` became `Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT` instead of `3.1.0-SNAPSHOT` if build/mvn did fallback. Then, `pom.xml` file is corrupted like the following at the end and the exit code become `1` instead of `0`. It causes Jenkins jobs fails ``` - <version>3.1.0-SNAPSHOT</version> + <version>Falling</version> ``` **NO FALLBACK** ``` $ build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn 3.1.0-SNAPSHOT ``` **FALLBACK** ``` $ build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec Falling back to archive.apache.org to download Maven Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn 3.1.0-SNAPSHOT ``` **In the script** ``` $ echo $(build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec) Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT ``` This PR will prevent irrelevant logs like `Falling back to archive.apache.org to download Maven`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the PR Builder. Closes #28532 from dongjoon-hyun/SPARK-31713. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit cd5fbcf9a0151f10553f67bcaa22b8122b3cf263) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 May 2020, 02:28:42 UTC
cf708f9 Revert "[SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener" This reverts commit 512cb2f0246a0d020f0ba726b4596555b15797c6. 14 May 2020, 19:06:13 UTC
ca9cde8 [SPARK-31696][DOCS][FOLLOWUP] Update version in documentation ### What changes were proposed in this pull request? This PR is a follow-up to fix the version in the configuration documentation. ### Why are the changes needed? The original PR is backported to branch-3.0. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Manual. Closes #28530 from dongjoon-hyun/SPARK-31696-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7ce3f76af6b72e88722a89f792e6fded9c586795) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 May 2020, 17:25:35 UTC
541d451 [SPARK-31696][K8S] Support driver service annotation in K8S ### What changes were proposed in this pull request? This PR aims to add `spark.kubernetes.driver.service.annotation`, like the existing `spark.kubernetes.driver.annotation`. ### Why are the changes needed? Annotations are used in many ways. One example is that the Prometheus monitoring system searches for metric endpoints via annotations. - https://github.com/helm/charts/tree/master/stable/prometheus#scraping-pod-metrics-via-annotations ### Does this PR introduce _any_ user-facing change? Yes. The documentation is added. ### How was this patch tested? Pass Jenkins with the updated unit tests. Closes #28518 from dongjoon-hyun/SPARK-31696. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c8f3bd861d96cf3f7b01cd9f864c181a57e1c77a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 May 2020, 17:13:33 UTC
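Usage follows the same pattern as the existing pod annotation configs; the annotation key below is only an example of a Prometheus-style key, not something mandated by the PR:

```scala
import org.apache.spark.SparkConf

// Everything after the "spark.kubernetes.driver.service.annotation." prefix becomes the annotation
// name on the driver's Kubernetes service.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.service.annotation.prometheus.io/scrape", "true")
```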
6834f46 [SPARK-31681][ML][PYSPARK] Python multiclass logistic regression evaluate should return LogisticRegressionSummary ### What changes were proposed in this pull request? Return LogisticRegressionSummary for multiclass logistic regression evaluate in PySpark ### Why are the changes needed? Currently we have ``` since("2.0.0") def evaluate(self, dataset): if not isinstance(dataset, DataFrame): raise ValueError("dataset must be a DataFrame but got %s." % type(dataset)) java_blr_summary = self._call_java("evaluate", dataset) return BinaryLogisticRegressionSummary(java_blr_summary) ``` we should return LogisticRegressionSummary for multiclass logistic regression ### Does this PR introduce _any_ user-facing change? Yes return LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python ### How was this patch tested? unit test Closes #28503 from huaxingao/lr_summary. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit e10516ae63cfc58f2d493e4d3f19940d45c8f033) Signed-off-by: Sean Owen <srowen@gmail.com> 14 May 2020, 15:54:50 UTC
00e6acc [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) ### What changes were proposed in this pull request? In QuantileDiscretizer.getDistinctSplits, before invoking distinct, normalize all -0.0 and 0.0 to be 0.0 ``` for (i <- 0 until splits.length) { if (splits(i) == -0.0) { splits(i) = 0.0 } } ``` ### Why are the changes needed? Fix bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. #### Manually test: ~~~scala import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) import spark.implicits._ val df1 = sc.parallelize(a1, 2).toDF("id") import org.apache.spark.ml.feature.QuantileDiscretizer val qd = new QuantileDiscretizer().setInputCol("id").setOutputCol("out").setNumBuckets(200).setRelativeError(0.0) val model = qd.fit(df1) // will raise error in spark master. ~~~ ### Explain scala `0.0 == -0.0` is True but `0.0.hashCode == -0.0.hashCode()` is False. This break the contract between equals() and hashCode() If two objects are equal, then they must have the same hash code. And array.distinct will rely on elem.hashCode so it leads to this error. Test code on distinct ``` import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) a1.distinct.sorted.foreach(x => print(x.toString + "\n")) ``` Then you will see output like: ``` ... -0.009292684662246975 -0.0033280686465135823 -0.0 0.0 0.0022219556032221366 0.02217419561977274 ... ``` Closes #28498 from WeichenXu123/SPARK-31676. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit b2300fca1e1a22d74c6eeda37942920a6c6299ff) Signed-off-by: Sean Owen <srowen@gmail.com> 14 May 2020, 14:25:04 UTC
f5cf11c [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination … ### What changes were proposed in this pull request? This PR tries to fix a bug in `org.apache.spark.sql.hive.execution.ScriptTransformationExec`. This bug appears in our online cluster. `ScriptTransformationExec` should throw an exception when a user uses a Python script which contains a parse error. But the current implementation may miss this case of failure. ### Why are the changes needed? When a user uses a Python script which contains a parse error, there will be no output. So `scriptOutputReader.next(scriptOutputWritable) <= 0` matches, then we use `checkFailureAndPropagate()` to check the `proc`. But the `proc` may still be alive and `writerThread.exception` is not defined, so `checkFailureAndPropagate` cannot check this case of failure. In the end, the Spark SQL job runs successfully and returns no result. In fact, the Spark SQL job should fail and show the exception properly. For example, the erroneous Python script is below. ``` python # encoding: utf8 import unknow_module import sys for line in sys.stdin: print line ``` The bug can be reproduced by running the following code in our cluster. ``` spark.range(100*100).toDF("index").createOrReplaceTempView("test") spark.sql("select TRANSFORM(index) USING 'python error_python.py' as new_index from test").collect.foreach(println) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT Closes #27724 from slamke/transformation. Authored-by: sunke.03 <sunke.03@bytedance.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ddbce4edee6d4de30e6900bc0f03728a989aef0a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 May 2020, 13:55:39 UTC
d639a12 [SPARK-31692][SQL] Pass hadoop confs specifed via Spark confs to URLStreamHandlerfactory ### What changes were proposed in this pull request? Pass hadoop confs specifed via Spark confs to URLStreamHandlerfactory ### Why are the changes needed? **BEFORE** ``` ➜ spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem scala> spark.sharedState res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState5793cd84 scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream res1: java.io.InputStream = org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream22846025 scala> import org.apache.hadoop.fs._ import org.apache.hadoop.fs._ scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration) res2: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.LocalFileSystem5a930c03 ``` **AFTER** ``` ➜ spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem scala> spark.sharedState res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState5c24a636 scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream res1: java.io.InputStream = org.apache.hadoop.fs.FSDataInputStream2ba8f528 scala> import org.apache.hadoop.fs._ import org.apache.hadoop.fs._ scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration) res2: org.apache.hadoop.fs.FileSystem = LocalFS scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration).getClass res3: Class[_ <: org.apache.hadoop.fs.FileSystem] = class org.apache.hadoop.fs.RawLocalFileSystem ``` The type of FileSystem object created(you can check the last statement in the above snippets) in the above two cases are different, which should not have been the case ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested locally. Added Unit test Closes #28516 from karuppayya/SPARK-31692. Authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 72601460ada41761737f39d5dff8e69444fce2ba) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 May 2020, 06:18:58 UTC
54ab70b [SPARK-31632][CORE][WEBUI] Enrich the exception message when application information is unavailable ### What changes were proposed in this pull request? This PR caught the `NoSuchElementException` and enriched the error message for `AppStatusStore.applicationInfo()` when Spark is starting up and the application information is unavailable. ### Why are the changes needed? During the initialization of `SparkContext`, it first starts the Web UI and then set up the `LiveListenerBus` thread for dispatching the `SparkListenerApplicationStart` event (which will trigger writing the requested `ApplicationInfo` to `InMemoryStore`). If the Web UI is accessed before this info's being written to `InMemoryStore`, the following `NoSuchElementException` will be thrown. ``` WARN org.eclipse.jetty.server.HttpChannel: /jobs/ java.util.NoSuchElementException at java.util.Collections$EmptyIterator.next(Collections.java:4191) at org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467) at org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39) at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266) at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80) at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.Server.handle(Server.java:505) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804) at java.lang.Thread.run(Thread.java:748) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested. This can be reproduced: 1. `./bin/spark-shell` 2. 
at the same time, open `http://localhost:4040/jobs/` in your browser and refresh it quickly. Closes #28444 from xccui/SPARK-31632. Authored-by: Xingcan Cui <xccui@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 42951e6786319481220ba4abfad015a8d11749f3) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 14 May 2020, 03:07:43 UTC
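A hedged sketch of the kind of guard the PR describes (simplified store type and message, not the exact patch to `AppStatusStore.applicationInfo()`): catch the `NoSuchElementException` at the lookup site and rethrow it with a message explaining that the application info may simply not have been written yet.

```scala
// Simplified stand-in for the key-value store lookup; not Spark's actual classes.
class InMemoryAppStore(appInfo: Option[String] = None) {
  def applicationInfo(): String =
    try {
      appInfo.iterator.next()  // mirrors store.view(...).iterator().next()
    } catch {
      case _: NoSuchElementException =>
        // Enrich the error instead of letting the bare exception reach the Jetty log.
        throw new NoSuchElementException(
          "Failed to get the application information. " +
            "If you are starting up Spark, please wait a while until it's ready.")
    }
}
```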
f71edf1 [SPARK-31701][R][SQL] Bump up the minimum Arrow version as 0.15.1 in SparkR ### What changes were proposed in this pull request? This PR proposes to set the minimum Arrow version as 0.15.1 to be consistent with the PySpark side. ### Why are the changes needed? It will reduce the maintenance overhead to match the Arrow versions, and minimize the supported range. SparkR Arrow optimization is still experimental. ### Does this PR introduce _any_ user-facing change? No, this change is in unreleased branches only. ### How was this patch tested? 0.15.x was already tested at SPARK-29378, and we're testing the latest version of SparkR currently in AppVeyor. I already manually tested too. Closes #28520 from HyukjinKwon/SPARK-31701. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit e1315cd65631823123af0d14771b0f699809251b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 May 2020, 17:03:23 UTC
da71e30 [SPARK-31697][WEBUI] HistoryServer should set Content-Type ### What changes were proposed in this pull request? This PR changes HistoryServer to set Content-Type. I noticed that we will get html as plain text when we access to wrong URLs which represent non-existence appId on HistoryServer. ``` <html> <head> <meta http-equiv="Content-type" content="text/html; charset=utf-8"/><meta name="viewport" content="width=device-width, initial-scale=1"/><link rel="stylesheet" href="/static/bootstrap.min.css" type="text/css"/><link rel="stylesheet" href="/static/vis-timeline-graph2d.min.css" type="text/css"/><link rel="stylesheet" href="/static/webui.css" type="text/css"/><link rel="stylesheet" href="/static/timeline-view.css" type="text/css"/><script src="/static/sorttable.js"></script><script src="/static/jquery-3.4.1.min.js"></script><script src="/static/vis-timeline-graph2d.min.js"></script><script src="/static/bootstrap.bundle.min.js"></script><script src="/static/initialize-tooltips.js"></script><script src="/static/table.js"></script><script src="/static/timeline-view.js"></script><script src="/static/log-view.js"></script><script src="/static/webui.js"></script><script>setUIRoot('')</script> <link rel="shortcut icon" href="/static/spark-logo-77x50px-hd.png"></link> <title>Not Found</title> </head> <body> <div class="container-fluid"> <div class="row"> <div class="col-12"> <h3 style="vertical-align: middle; display: inline-block;"> <a style="text-decoration: none" href="/"> <img src="/static/spark-logo-77x50px-hd.png"/> <span class="version" style="margin-right: 15px;">3.1.0-SNAPSHOT</span> </a> Not Found </h3> </div> </div> <div class="row"> <div class="col-12"> <div class="row">Application local-1589239 not found.</div> </div> </div> </div> </body> </html> ``` The reason is Content-Type not set. I confirmed it with `curl -I http://localhost:18080/history/<wrong-appId>` ``` HTTP/1.1 404 Not Found Date: Wed, 13 May 2020 06:59:29 GMT Cache-Control: no-cache, no-store, must-revalidate X-Frame-Options: SAMEORIGIN X-XSS-Protection: 1; mode=block X-Content-Type-Options: nosniff Content-Length: 1778 Server: Jetty(9.4.18.v20190429) ``` ### Why are the changes needed? This is a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a test case for this issue. Closes #28519 from sarutak/fix-content-type. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7952f44dacd18891e4f78c91146c1cf37dda6a46) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 May 2020, 08:46:52 UTC
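The root cause is a plain servlet-level omission: an HTML body was written without a `Content-Type` header, so clients fall back to plain text. A minimal, hypothetical servlet sketch (not the HistoryServer code) of the missing call:

```scala
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

class NotFoundServlet extends HttpServlet {
  override def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
    // Without this header the HTML error page is rendered as plain text.
    resp.setContentType("text/html;charset=utf-8")
    resp.setStatus(HttpServletResponse.SC_NOT_FOUND)
    resp.getWriter.write("<html><body><h3>Application not found.</h3></body></html>")
  }
}
```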
ce52f61 [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn` ### What changes were proposed in this pull request? This PR adds the `i` option to ignore additional `build/mvn` output which is irrelevant to the version string. ### Why are the changes needed? SPARK-28963 added an additional output message, `Falling back to archive.apache.org to download Maven`, in build/mvn. This breaks `dev/create-release/release-build.sh`, and currently the Spark Packaging Jenkins job is hitting this issue consistently and is broken. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2912/console ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This happens only when the mirror fails. So, this is verified manually by hijacking the script. It works like the following. ``` $ echo 'Falling back to archive.apache.org to download Maven' > out $ build/mvn help:evaluate -Dexpression=project.version >> out Using `mvn` from path: /Users/dongjoon/PRS/SPARK_RELEASE_2/build/apache-maven-3.6.3/bin/mvn $ cat out | grep -v INFO | grep -v WARNING | grep -v Download Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT $ cat out | grep -v INFO | grep -v WARNING | grep -vi Download 3.1.0-SNAPSHOT ``` Closes #28514 from dongjoon-hyun/SPARK_RELEASE_2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3772154442e6341ea97a2f41cd672413de918e85) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 May 2020, 21:25:17 UTC
e892a01 [SPARK-31683][CORE] Make Prometheus output consistent with DropWizard 4.1 result ### What changes were proposed in this pull request? This PR aims to update Prometheus-related output format to be consistent with DropWizard 4.1 result. - Add `Number` metrics for gauges metrics. - Add `type` labels. ### Why are the changes needed? SPARK-29032 added Prometheus support. After that, SPARK-29674 upgraded DropWizard for JDK9+ support and this caused difference in output labels and number of keys for Guage metrics. The current status is different from Apache Spark 2.4.5. Since we cannot change DropWizard, this PR aims to be consistent in Apache Spark 3.0.0 only. **DropWizard 3.x** ``` metrics_master_aliveWorkers_Value 1.0 ``` **DropWizard 4.1** ``` metrics_master_aliveWorkers_Value{type="gauges",} 1.0 metrics_master_aliveWorkers_Number{type="gauges",} 1.0 ``` ### Does this PR introduce _any_ user-facing change? Yes, but this is a new feature in 3.0.0. ### How was this patch tested? Manually check the output like the following. **JMXExporter Result** ``` $ curl -s http://localhost:8088/ | grep "^metrics_master" | sort metrics_master_aliveWorkers_Number{type="gauges",} 1.0 metrics_master_aliveWorkers_Value{type="gauges",} 1.0 metrics_master_apps_Number{type="gauges",} 0.0 metrics_master_apps_Value{type="gauges",} 0.0 metrics_master_waitingApps_Number{type="gauges",} 0.0 metrics_master_waitingApps_Value{type="gauges",} 0.0 metrics_master_workers_Number{type="gauges",} 1.0 metrics_master_workers_Value{type="gauges",} 1.0 ``` **This PR** ``` $ curl -s http://localhost:8080/metrics/master/prometheus/ | grep master metrics_master_aliveWorkers_Number{type="gauges"} 1 metrics_master_aliveWorkers_Value{type="gauges"} 1 metrics_master_apps_Number{type="gauges"} 0 metrics_master_apps_Value{type="gauges"} 0 metrics_master_waitingApps_Number{type="gauges"} 0 metrics_master_waitingApps_Value{type="gauges"} 0 metrics_master_workers_Number{type="gauges"} 1 metrics_master_workers_Value{type="gauges"} 1 ``` Closes #28510 from dongjoon-hyun/SPARK-31683. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit 07209f3e2deab824f04484fa6b8bab0ec0a635d6) Signed-off-by: DB Tsai <d_tsai@apple.com> 12 May 2020, 19:58:09 UTC
512cb2f [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener ### What changes were proposed in this pull request? The update methods in HiveThriftServer2Listener now check if the parameter operation/session ID actually exist in the `sessionList` and `executionList` respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log. Also, in HiveSessionImpl.close(), we catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed. ### Why are the changes needed? The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this hampers with the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException. In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests Closes #28155 from alismess-db/hive-thriftserver-listener-update-safer. Authored-by: Ali Smesseim <ali.smesseim@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 6994c64efd5770a8fd33220cbcaddc1d96fed886) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 12 May 2020, 16:14:47 UTC
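A hedged sketch of the guarded-update pattern described above (hypothetical helper and field names, not the actual listener code): look the ID up first, log a warning when it is unknown, and never dereference a missing entry.

```scala
import scala.collection.concurrent.TrieMap

object ListenerGuardDemo {
  final case class ExecutionInfo(var state: String)
  private val executionList = TrieMap.empty[String, ExecutionInfo]

  // Apply f only if the operation ID is known; otherwise warn instead of throwing an NPE.
  private def updateExecution(operationId: String)(f: ExecutionInfo => Unit): Unit =
    executionList.get(operationId) match {
      case Some(info) => f(info)
      case None => println(s"WARN: update called with unknown operation id: $operationId")
    }

  def onOperationClosed(operationId: String): Unit =
    updateExecution(operationId)(_.state = "CLOSED")
}
```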
b50d53b [SPARK-31610][SPARK-31668][ML] Address hashingTF saving&loading bug and expose hashFunc property in HashingTF ### What changes were proposed in this pull request? Expose hashFunc property in HashingTF Some third-party library such as mleap need to access it. See background description here: https://github.com/combust/mleap/pull/665#issuecomment-621258623 ### Why are the changes needed? See https://github.com/combust/mleap/pull/665#issuecomment-621258623 ### Does this PR introduce any user-facing change? No. Only add a package private constructor. ### How was this patch tested? N/A Closes #28413 from WeichenXu123/hashing_tf_expose_hashfunc. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com> (cherry picked from commit e248bc7af6086cde7dd89a51459ae6a221a600c8) Signed-off-by: Xiangrui Meng <meng@databricks.com> 12 May 2020, 15:55:19 UTC
cbe75bb [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator ### What changes were proposed in this pull request? Generates java.time.Instant/java.time.LocalDate for DateType/TimestampType by `RandomDataGenerator.forType` when the SQL config `spark.sql.datetime.java8API.enabled` is set to `true`. ### Why are the changes needed? To improve test coverage, and check java.time.Instant/java.time.LocalDate types in round trip tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running modified test suites `RowEncoderSuite`, `RandomDataGeneratorSuite` and `HadoopFsRelationTest`. Closes #28502 from MaxGekk/random-java8-datetime. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a3fafddf390fd180047a0b9ef46f052a9b6813e0) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2020, 14:05:41 UTC
2549e38 [SPARK-31678][SQL] Print error stack trace for Spark SQL CLI when error occurs ### What changes were proposed in this pull request? When I was finding the root cause for SPARK-31675, I noticed that it was very difficult for me to see what was actually going on, since it output nothing else but only ```sql Error in query: java.lang.IllegalArgumentException: Wrong FS: blablah/.hive-staging_blahbla, expected: hdfs://cluster1 ``` It is really hard for us to find causes through such a simple error message without a certain amount of experience. In this PR, I propose to print all of the stack traces when AnalysisException occurs if there are underlying root causes, also we can escape this via `-S` option. ### Why are the changes needed? In SPARK-11188, >For analysis exceptions in the sql-shell, we should only print the error message to the screen. The stacktrace will never have useful information since this error is used to signify an error with the query. But nowadays, some `AnalysisException`s do have useful information for us to debug, e.g. the `AnalysisException` below may contain exceptions from hive or Hadoop side. https://github.com/apache/spark/blob/a28ed86a387b286745b30cd4d90b3d558205a5a7/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L97-L112 ```scala at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:468) at org.apache.hadoop.hive.common.FileUtils.isSubDir(FileUtils.java:626) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2850) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398) at org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1593) ``` ### Does this PR introduce _any_ user-facing change? Yes, `bin/spark-sql` will print all the stack trace when an AnalysisException which contains root causes occurs, before this fix, only the message will be printed. 
#### before ```scala Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10; ``` #### After ```scala Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs:..., expected: hdfs://hz-cluster10; org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Wrong FS: ..., expected: hdfs://hz-cluster10; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109) at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:890) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:312) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:101) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106) at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:376) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:486) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:480) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:282) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:934) at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:165) at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:163) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:163) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1013) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1022) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.IllegalArgumentException: Wrong FS: ..., expected: hdfs://hz-cluster10 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649) at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194) at org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:2093) at org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:289) at org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1221) at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2607) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2892) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1398) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.sql.hive.client.Shim_v0_14.loadPartition(HiveShim.scala:927) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadPartition$1(HiveClientImpl.scala:870) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276) at org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:860) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadPartition$1(HiveExternalCatalog.scala:911) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) ... 52 more ``` You can use `-S` option to restore old behavior if you find the error is too verbose. ### How was this patch tested? Existing CliSuite - `SPARK-11188 Analysis error reporting` Add new test and verify manually Closes #28499 from yaooqinn/SPARK-31678. 
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ce714d81894a48e2d06c530674c2190e0483e1b4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2020, 13:37:24 UTC
cb253b1 [SPARK-31393][SQL] Show the correct alias in schema for expression ### What changes were proposed in this pull request? Some expression aliases are not displayed correctly in the schema. This PR fixes them. - `TimeWindow` - `MaxBy` - `MinBy` - `UnaryMinus` - `BitwiseCount` This PR also fixes a typo issue, please look at https://github.com/apache/spark/blob/b7cde42b04b21c9bfee6535199cf385855c15853/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala#L142 Note: 1. `MaxBy` and `MinBy` extend `MaxMinBy`, and the latter adds a method `funcName` that is not needed. We can reuse `prettyName` to replace `funcName`. 2. Some Spark SQL functions have inelegant implementations. For example, `BitwiseCount` overrides the sql method as shown below: `override def sql: String = s"bit_count(${child.sql})"` I don't think it's elegant enough, because `Expression` gives the following definition. ``` def sql: String = { val childrenSQL = children.map(_.sql).mkString(", ") s"$prettyName($childrenSQL)" } ``` By this definition, `BitwiseCount` should override the `prettyName` method. ### Why are the changes needed? Improve the implementation of some expressions. ### Does this PR introduce any user-facing change? 'Yes'. This PR lets users see the correct alias in the schema. ### How was this patch tested? Jenkins test. Closes #28164 from beliefer/elegant-pretty-name-for-function. Lead-authored-by: beliefer <beliefer@163.com> Co-authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit a89006aba03a623960e5c4c6864ca8c899c81db9) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 12 May 2020, 01:25:21 UTC
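A self-contained sketch of the design point in note 2 (a simplified trait, not Spark's real `Expression` hierarchy): because the default `sql` is already built from `prettyName`, overriding `prettyName` alone keeps the alias consistent everywhere the name is used.

```scala
// Simplified stand-in for the Expression contract quoted above.
trait Expr {
  def children: Seq[Expr]
  def prettyName: String = getClass.getSimpleName.toLowerCase
  def sql: String = s"$prettyName(${children.map(_.sql).mkString(", ")})"
}

case class Col(name: String) extends Expr {
  def children: Seq[Expr] = Nil
  override def sql: String = name
}

// Instead of overriding sql with s"bit_count(${child.sql})", override prettyName only.
case class BitwiseCount(child: Expr) extends Expr {
  def children: Seq[Expr] = Seq(child)
  override def prettyName: String = "bit_count"
}

object PrettyNameDemo extends App {
  println(BitwiseCount(Col("a")).sql)         // bit_count(a)
  println(BitwiseCount(Col("a")).prettyName)  // bit_count, used for the schema alias as well
}
```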
9469831 [SPARK-31559][YARN] Re-obtain tokens at the startup of AM for yarn cluster mode if principal and keytab are available ### What changes were proposed in this pull request? This patch re-obtain tokens at the start of AM for yarn cluster mode, if principal and keytab are available. It basically transfers the credentials from the original user, so this patch puts the new tokens into credentials from the original user via overwriting. To obtain tokens from providers in user application, this patch leverages the user classloader as context classloader while initializing token manager in the startup of AM. ### Why are the changes needed? Submitter will obtain delegation tokens for yarn-cluster mode, and add these credentials to the launch context. AM will be launched with these credentials, and AM and driver are able to leverage these tokens. In Yarn cluster mode, driver is launched in AM, which in turn initializes token manager (while initializing SparkContext) and obtain delegation tokens (+ schedule to renew) if both principal and keytab are available. That said, even we provide principal and keytab to run application with yarn-cluster mode, AM always starts with initial tokens from launch context until token manager runs and obtains delegation tokens. So there's a "gap", and if user codes (driver) access to external system with delegation tokens (e.g. HDFS) before initializing SparkContext, it cannot leverage the tokens token manager will obtain. It will make the application fail if AM is killed "after" the initial tokens are expired and relaunched. This is even a regression: see below codes in branch-2.4: https://github.com/apache/spark/blob/branch-2.4/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala https://github.com/apache/spark/blob/branch-2.4/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/AMCredentialRenewer.scala In Spark 2.4.x, AM runs AMCredentialRenewer at initialization, and AMCredentialRenew obtains tokens and merge with credentials being provided with launch context of AM. So it guarantees new tokens in driver run. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested with specifically crafted application (simple reproducer) - https://github.com/HeartSaVioR/spark-delegation-token-experiment/blob/master/src/main/scala/net/heartsavior/spark/example/LongRunningAppWithHDFSConfig.scala Before this patch, new AM attempt failed when I killed AM after the expiration of tokens. After this patch the new AM attempt runs fine. Closes #28336 from HeartSaVioR/SPARK-31559. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@apache.org> (cherry picked from commit 842b1dcdff0ecab4af9f292c2ff7b2b9ae1ac40a) Signed-off-by: Marcelo Vanzin <vanzin@apache.org> 12 May 2020, 00:25:53 UTC
7e226a2 [SPARK-31671][ML] Wrong error message in VectorAssembler ### What changes were proposed in this pull request? When input column lengths can not be inferred and handleInvalid = "keep", VectorAssembler will throw a runtime exception. However the error message with this exception is not consistent. I change the content of this error message to make it work properly. ### Why are the changes needed? This is a bug. Here is a simple example to reproduce it. ``` // create a df without vector size val df = Seq( (Vectors.dense(1.0), Vectors.dense(2.0)) ).toDF("n1", "n2") // only set vector size hint for n1 column val hintedDf = new VectorSizeHint() .setInputCol("n1") .setSize(1) .transform(df) // assemble n1, n2 val output = new VectorAssembler() .setInputCols(Array("n1", "n2")) .setOutputCol("features") .setHandleInvalid("keep") .transform(hintedDf) // because only n1 has vector size, the error message should tell us to set vector size for n2 too output.show() ``` Expected error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2]. ``` Actual error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2]. ``` This introduce difficulties when I try to resolve this exception, for I do not know which column required vectorSizeHint. This is especially troublesome when you have a large number of columns to deal with. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test in VectorAssemblerSuite. Closes #28487 from fan31415/SPARK-31671. Lead-authored-by: fan31415 <fan12356789@gmail.com> Co-authored-by: yijiefan <fanyije@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853) Signed-off-by: Sean Owen <srowen@gmail.com> 11 May 2020, 23:23:34 UTC
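A small, self-contained sketch of the corrected message construction (hypothetical names, not the VectorAssembler internals): only the columns whose vector length is still unknown belong in the error message.

```scala
object VectorSizeMessageDemo extends App {
  // Hypothetical inferred sizes: n1 got a VectorSizeHint, n2 did not.
  val inferredSizes: Map[String, Option[Int]] = Map("n1" -> Some(1), "n2" -> None)

  // Collect only the columns with an unknown length.
  val missing = inferredSizes.collect { case (col, None) => col }.toSeq.sorted
  if (missing.nonEmpty) {
    throw new RuntimeException(
      "Can not infer column lengths with handleInvalid = \"keep\". Consider using VectorSizeHint " +
        s"to add metadata for columns: ${missing.mkString("[", ", ", "]")}.")
  }
}
```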
37c352c [SPARK-31456][CORE] Fix shutdown hook priority edge cases ### What changes were proposed in this pull request? Fix application order for shutdown hooks for the priorities of Int.MaxValue, Int.MinValue ### Why are the changes needed? The bug causes out-of-order execution of shutdown hooks if their priorities were Int.MinValue or Int.MaxValue ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a test covering the change. Closes #28494 from oleg-smith/SPARK-31456_shutdown_hook_priority. Authored-by: oleg <oleg@nexla.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit d7c3e9e53e01011f809b6cb145349ee8a9c5e5f0) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 May 2020, 20:10:56 UTC
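One classic way such edge cases arise (a hypothetical illustration of the pitfall, not necessarily the exact code that was fixed) is ordering priorities by subtraction, which overflows at `Int.MinValue`/`Int.MaxValue`; `Integer.compare` avoids the overflow.

```scala
object PriorityCompareDemo extends App {
  def byDifference(a: Int, b: Int): Int = a - b                   // overflows for extreme values
  def bySafeCompare(a: Int, b: Int): Int = Integer.compare(a, b)

  // MaxValue should compare greater than MinValue, but the subtraction wraps around.
  println(byDifference(Int.MaxValue, Int.MinValue))   // -1 (wrong sign)
  println(bySafeCompare(Int.MaxValue, Int.MinValue))  //  1 (correct)
}
```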
7b567e4 [SPARK-31665][SQL][TESTS] Check parquet dictionary encoding of random dates/timestamps ### What changes were proposed in this pull request? Modified `RandomDataGenerator.forType` for DateType and TimestampType to generate special date//timestamp values with 0.5 probability. This will trigger dictionary encoding in Parquet datasource test HadoopFsRelationTest "test all data types". Currently, dictionary encoding is tested only for numeric types like ShortType. ### Why are the changes needed? To extend test coverage. Currently, probability of testing of dictionary encoding in the test HadoopFsRelationTest "test all data types" for DateType and TimestampType is close to zero because dates/timestamps are uniformly distributed in wide range, and the chance of generating the same values is pretty low. In this way, parquet datasource cannot apply dictionary encoding for such column types. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `ParquetHadoopFsRelationSuite` and `JsonHadoopFsRelationSuite`. Closes #28481 from MaxGekk/test-random-parquet-dict-enc. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 32a5398b659695c338cd002d9094bdf19a89a716) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 May 2020, 12:59:52 UTC
e2bf140 [SPARK-31674][CORE][DOCS] Make Prometheus metric endpoints experimental ### What changes were proposed in this pull request? This PR aims to make the new Prometheus-format metric endpoints experimental in Apache Spark 3.0.0. ### Why are the changes needed? Although the new metrics are disabled by default, we had better mark them experimental explicitly in Apache Spark 3.0.0 since the output format is still not fixed. We can finalize it in Apache Spark 3.1.0. ### Does this PR introduce _any_ user-facing change? Only the doc change is visible to users. ### How was this patch tested? Manually checked the code since this is a documentation and class annotation change. Closes #28495 from dongjoon-hyun/SPARK-31674. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b80309bdb4d26556bd3da6a61cac464cdbdd1fe1) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 May 2020, 05:32:37 UTC
5c6a4fc [SPARK-31672][SQL] Fix loading of timestamps before 1582-10-15 from dictionary encoded Parquet columns Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle especially `TimestampType` when the passed parameter `rebaseDateTime` is true. In that case, decoded milliseconds/microseconds are rebased from the hybrid calendar to Proleptic Gregorian calendar using `RebaseDateTime`.`rebaseJulianToGregorianMicros()`. This fixes the bug of loading timestamps before the cutover day from dictionary encoded column in parquet files. The code below forces dictionary encoding: ```scala spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") scala> Seq.tabulate(8)(_ => "1001-01-01 01:02:03.123").toDF("tsS") .select($"tsS".cast("timestamp").as("ts")).repartition(1) .write .option("parquet.enable.dictionary", true) .parquet(path) ``` Load the dates back: ```scala scala> spark.read.parquet(path).show(false) +-----------------------+ |ts | +-----------------------+ |1001-01-07 00:32:20.123| ... |1001-01-07 00:32:20.123| +-----------------------+ ``` Expected values **must be 1001-01-01 01:02:03.123** but not 1001-01-07 00:32:20.123. Yes. After the changes: ```scala scala> spark.read.parquet(path).show(false) +-----------------------+ |ts | +-----------------------+ |1001-01-01 01:02:03.123| ... |1001-01-01 01:02:03.123| +-----------------------+ ``` Modified the test `SPARK-31159: rebasing timestamps in write` in `ParquetIOSuite` to checked reading dictionary encoded dates. Closes #28489 from MaxGekk/fix-ts-rebase-parquet-dict-enc. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5d5866be12259c40972f7404f64d830cab87401f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 May 2020, 05:01:37 UTC
6f7c719 [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps ### What changes were proposed in this pull request? Shift non-existing dates in Proleptic Gregorian calendar by 1 day. The reason for that is `RowEncoderSuite` generates random dates/timestamps in the hybrid calendar, and some dates/timestamps don't exist in Proleptic Gregorian calendar like 1000-02-29 because 1000 is not leap year in Proleptic Gregorian calendar. ### Why are the changes needed? This makes RowEncoderSuite much stable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running RowEncoderSuite and set non-existing date manually: ```scala val date = new java.sql.Date(1000 - 1900, 1, 29) Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY)) ``` Closes #28486 from MaxGekk/fix-RowEncoderSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 9f768fa9916dec3cc695e3f28ec77148d81d335f) Signed-off-by: Sean Owen <srowen@gmail.com> 10 May 2020, 19:22:28 UTC
6786500 [SPARK-31636][SQL][DOCS] Remove HTML syntax in SQL reference ### What changes were proposed in this pull request? Remove the unneeded embedded inline HTML markup by using the basic markdown syntax. Please see #28414 ### Why are the changes needed? Make the doc cleaner and easily editable by MD editors. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually build and check Closes #28451 from huaxingao/html_cleanup. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit a75dc80a76ef7666330be2cb1de87e89a4103d95) Signed-off-by: Sean Owen <srowen@gmail.com> 10 May 2020, 17:57:38 UTC
eb278ba [SPARK-31662][SQL] Fix loading of dates before 1582-10-15 from dictionary encoded Parquet columns ### What changes were proposed in this pull request? Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to specially handle `DateType` when the passed parameter `rebaseDateTime` is true. In that case, decoded days are rebased from the hybrid calendar to the Proleptic Gregorian calendar using `RebaseDateTime`.`rebaseJulianToGregorianDays()`. ### Why are the changes needed? This fixes the bug of loading dates before the cutover day from dictionary encoded columns in parquet files. The code below forces dictionary encoding: ```scala spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true) Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS") .select($"dateS".cast("date").as("date")).repartition(1) .write .option("parquet.enable.dictionary", true) .parquet(path) ``` Load the dates back: ```scala spark.read.parquet(path).show(false) +----------+ |date | +----------+ |1001-01-07| ... |1001-01-07| +----------+ ``` The expected values **must be 1001-01-01**, not 1001-01-07. ### Does this PR introduce _any_ user-facing change? Yes. After the changes: ```scala spark.read.parquet(path).show(false) +----------+ |date | +----------+ |1001-01-01| ... |1001-01-01| +----------+ ``` ### How was this patch tested? Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` to check reading of dictionary encoded dates. Closes #28479 from MaxGekk/fix-datetime-rebase-parquet-dict-enc. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit ce63bef1dac91f82cda5ebb47b38bd98eaf8164f) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 10 May 2020, 04:31:49 UTC
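The six-day shift seen above is exactly the difference between the Julian and Proleptic Gregorian calendars in the year 1001. A standalone `java.util.GregorianCalendar` sketch (independent of Spark's `RebaseDateTime`) that reproduces it:

```scala
import java.util.{Calendar, Date, GregorianCalendar, TimeZone}

object CalendarRebaseDemo extends App {
  val utc = TimeZone.getTimeZone("UTC")

  // 1001-01-01 in the hybrid calendar (Julian before 1582-10-15), as java.sql.Date uses.
  val hybrid = new GregorianCalendar(utc)
  hybrid.clear()
  hybrid.set(1001, Calendar.JANUARY, 1)
  val millis = hybrid.getTimeInMillis

  // The same instant labelled with a proleptic Gregorian calendar.
  val proleptic = new GregorianCalendar(utc)
  proleptic.setGregorianChange(new Date(Long.MinValue))  // never switch to Julian rules
  proleptic.setTimeInMillis(millis)
  printf("%04d-%02d-%02d%n",
    proleptic.get(Calendar.YEAR),
    proleptic.get(Calendar.MONTH) + 1,
    proleptic.get(Calendar.DAY_OF_MONTH))                // prints 1001-01-07
}
```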
ba43922 [SPARK-31658][SQL] Fix SQL UI not showing write commands of AQE plan Show write commands on SQL UI of an AQE plan Currently the leaf node of an AQE plan is always a `AdaptiveSparkPlan` which is not true when it's a child of a write command. Hence, the node of the write command as well as its metrics are not shown on the SQL UI. ![image](https://user-images.githubusercontent.com/1191767/81288918-1893f580-9098-11ea-9771-e3d0820ba806.png) ![image](https://user-images.githubusercontent.com/1191767/81289008-3a8d7800-9098-11ea-93ec-516bbaf25d2d.png) No Add UT. Closes #28474 from manuzhang/aqe-ui. Lead-authored-by: manuzhang <owenzhang1990@gmail.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 77c690a7252b22c9dd8f3cb7ac32f79fd6845cad) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 08 May 2020, 17:35:47 UTC
57b55ea [SPARK-31646][FOLLOWUP][TESTS] Add cleanup code and disable irrelevant conf (cherry picked from commit 24fac1e0c70a783b4d240607639ff20d7dd24191) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 May 2020, 00:53:33 UTC
36b1a79 [SPARK-31646][SHUFFLE] Remove unused registeredConnections counter from ShuffleMetrics ### What changes were proposed in this pull request? Remove unused `registeredConnections` counter from `ExternalBlockHandler#ShuffleMetrics` This was added by SPARK-25642 at 3.0.0 - https://github.com/apache/spark/commit/8dd29fe36b781d115213b1d6a8446ad04e9239bb ### Why are the changes needed? It's `registeredConnections` counter created in `TransportContext` that's really counting the numbers and it's misleading for people who want to add new metrics like `registeredConnections`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add UTs to ensure all expected metrics are registered for `ExternalShuffleService` and `YarnShuffleService` Closes #28457 from manuzhang/spark-31611-pre. Lead-authored-by: tianlzhang <tianlzhang@ebay.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit dad61ed46521157c42144920b42b91bcba8295d3) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 May 2020, 22:22:25 UTC
fafe0f3 [SPARK-31631][TESTS] Fix test flakiness caused by MiniKdc which throws 'address in use' BindException with retry ### What changes were proposed in this pull request? The `Kafka*Suite`s are flaky because of the Hadoop MiniKdc issue - https://issues.apache.org/jira/browse/HADOOP-12656 > Looking at MiniKdc implementation, if port is 0, the constructor use ServerSocket to find an unused port, assign the port number to the member variable port and close the ServerSocket object; later, in initKDCServer(), instantiate a TcpTransport object and bind at that port. > It appears that the port may be used in between, and then throw the exception. Related test failures are suspected, such as https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122225/testReport/org.apache.spark.sql.kafka010/KafkaDelegationTokenSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ ```scala [info] org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED *** (15 seconds, 426 milliseconds) [info] java.net.BindException: Address already in use [info] at sun.nio.ch.Net.bind0(Native Method) [info] at sun.nio.ch.Net.bind(Net.java:433) [info] at sun.nio.ch.Net.bind(Net.java:425) [info] at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) [info] at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) [info] at org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:198) [info] at org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:51) [info] at org.apache.mina.core.polling.AbstractPollingIoAcceptor.registerHandles(AbstractPollingIoAcceptor.java:547) [info] at org.apache.mina.core.polling.AbstractPollingIoAcceptor.access$400(AbstractPollingIoAcceptor.java:68) [info] at org.apache.mina.core.polling.AbstractPollingIoAcceptor$Acceptor.run(AbstractPollingIoAcceptor.java:422) [info] at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:748) ``` After comparing the error stack trace with similar issues reported in different projects, such as https://issues.apache.org/jira/browse/KAFKA-3453 https://issues.apache.org/jira/browse/HBASE-14734 We can be sure that they are caused by the same problem issued in HADOOP-12656. In the PR, We apply the approach from HBASE first before we finally drop Hadoop 2.7.x ### Why are the changes needed? fix test flakiness ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? the test itself passing Jenkins Closes #28442 from yaooqinn/SPARK-31631. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 07 May 2020, 12:03:01 UTC
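A hedged sketch of the retry approach borrowed from HBase (a hypothetical helper, not the actual test code): when binding fails because the pre-selected port was grabbed in the meantime, tear down and retry so that a fresh ephemeral port gets picked.

```scala
import java.net.BindException

object BindRetryDemo {
  // Retry body up to `attempts` times when it fails with "Address already in use".
  def withBindRetry[T](attempts: Int)(body: => T): T =
    try body catch {
      case _: BindException if attempts > 1 =>
        Thread.sleep(100)  // brief pause before the next attempt
        withBindRetry(attempts - 1)(body)
    }

  // Usage sketch: val kdc = withBindRetry(3) { startMiniKdcOnFreshPort() }  // hypothetical starter
}
```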
dfb916f [SPARK-31361][SQL][TESTS][FOLLOWUP] Check non-vectorized Parquet reader while date/timestamp rebasing ### What changes were proposed in this pull request? In this PR, I propose to modify two tests of `ParquetIOSuite`: - SPARK-31159: rebasing timestamps in write - SPARK-31159: rebasing dates in write to check the non-vectorized Parquet reader together with the vectorized reader. ### Why are the changes needed? To improve test coverage and make sure that the non-vectorized reader behaves similarly to the vectorized reader. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `ParquetIOSuite`: ``` $ ./build/sbt "test:testOnly *ParquetIOSuite" ``` Closes #28466 from MaxGekk/test-novec-rebase-ParquetIOSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 272d229005b7166ab83bbb8f44a4d5e9d89424a1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 May 2020, 07:52:41 UTC
dc7324e [SPARK-31365][SQL][FOLLOWUP] Refine config document for nested predicate pushdown ### What changes were proposed in this pull request? This is a followup to address the https://github.com/apache/spark/pull/28366#discussion_r420611872 by refining the SQL config document. ### Why are the changes needed? Make developers less confusing. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Only doc change. Closes #28468 from viirya/SPARK-31365-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 9bf738724a3895551464d8ba0d455bc90868983f) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 07 May 2020, 00:57:26 UTC
43d8d54 [SPARK-31361][SQL][FOLLOWUP] Use LEGACY_PARQUET_REBASE_DATETIME_IN_READ instead of avro config in ParquetIOSuite ### What changes were proposed in this pull request? Replace the Avro SQL config `LEGACY_AVRO_REBASE_DATETIME_IN_READ ` by `LEGACY_PARQUET_REBASE_DATETIME_IN_READ ` in `ParquetIOSuite`. ### Why are the changes needed? Avro config is not relevant to the parquet tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `ParquetIOSuite` via ``` ./build/sbt "test:testOnly *ParquetIOSuite" ``` Closes #28461 from MaxGekk/fix-conf-in-ParquetIOSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 3d38bc2605ab01d61127c09e1bf6ed6a6683ed3e) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 07 May 2020, 00:47:17 UTC
f8a20c4 [SPARK-31647][SQL] Deprecate 'spark.sql.optimizer.metadataOnly' configuration ### What changes were proposed in this pull request? This PR proposes to deprecate 'spark.sql.optimizer.metadataOnly' configuration and remove it in the future release. ### Why are the changes needed? This optimization can cause a potential correctness issue, see also SPARK-26709. Also, it seems difficult to extend the optimization. Basically you should whitelist all available functions. It costs some maintenance overhead, see also SPARK-31590. Looks we should just better let users use `SparkSessionExtensions` instead if they must use, and remove it in Spark side. ### Does this PR introduce _any_ user-facing change? Yes, setting `spark.sql.optimizer.metadataOnly` will show a deprecation warning: ```scala scala> spark.conf.unset("spark.sql.optimizer.metadataOnly") ``` ``` 20/05/06 12:57:23 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to inject it as a custom rule. ``` ```scala scala> spark.conf.set("spark.sql.optimizer.metadataOnly", "true") ``` ``` 20/05/06 12:57:44 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to inject it as a custom rule. ``` ### How was this patch tested? Manually tested. Closes #28459 from HyukjinKwon/SPARK-31647. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 5c5dd77d6a29b014b3fe4b4015f5c7199650a378) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 07 May 2020, 00:01:19 UTC
c1b5a4f [SPARK-31650][SQL] Fix wrong UI in case of AdaptiveSparkPlanExec has unmanaged subqueries ### What changes were proposed in this pull request? Make the non-subquery `AdaptiveSparkPlanExec` update UI again after execute/executeCollect/executeTake/executeTail if the `AdaptiveSparkPlanExec` has subqueries which do not belong to any query stages. ### Why are the changes needed? If there're subqueries do not belong to any query stages of the main query, the main query could get final physical plan and update UI before those subqueries finished. As a result, the UI can not reflect the change from the subqueries, e.g. new nodes generated from subqueries. Before: <img width="335" alt="before_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149758-671a9480-8fb1-11ea-84c4-9a4520e2b08e.png"> After: <img width="546" alt="after_aqe_ui" src="https://user-images.githubusercontent.com/16397174/81149752-63870d80-8fb1-11ea-9852-f41e11afe216.png"> ### Does this PR introduce _any_ user-facing change? No(AQE feature hasn't been released). ### How was this patch tested? Tested manually. Closes #28460 from Ngone51/fix_aqe_ui. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b16ea8e1ab58bd24c50d31ce0dfc6c79c87fa3b2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 May 2020, 12:53:04 UTC
860849d [SPARK-31365][SQL] Enable nested predicate pushdown per data sources ### What changes were proposed in this pull request? This patch proposes to replace `NESTED_PREDICATE_PUSHDOWN_ENABLED` with `NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST` which can configure which v1 data sources are enabled with nested predicate pushdown. ### Why are the changes needed? We added nested predicate pushdown feature that is configured by `NESTED_PREDICATE_PUSHDOWN_ENABLED`. However, this config is all or nothing config, and applies on all data sources. In order to not introduce API breaking change after enabling nested predicate pushdown, we'd like to set nested predicate pushdown per data sources. Please also refer to the comments https://github.com/apache/spark/pull/27728#discussion_r410829720. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added/Modified unit tests. Closes #28366 from viirya/SPARK-31365. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4952f1a03cc48d9f1c3d2539ffa19bf051e398bf) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 May 2020, 04:50:17 UTC
fcd566e [SPARK-31595][SQL] Spark sql should allow unescaped quote mark in quoted string ### What changes were proposed in this pull request? `def splitSemiColon` cannot handle unescaped quote mark like "'" or '"' correctly. When there are unmatched quotes in a string, `splitSemiColon` will not drop off semicolon as expected. ### Why are the changes needed? Some regex expression will use quote mark in string. We should process semicolon correctly. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added Unit test and also manual test. Closes #28393 from adrian-wang/unescaped. Authored-by: Daoyuan Wang <me@daoyuan.wang> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 53a9bf8fece7322312cbe93c9224c04f645a0f5e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 May 2020, 04:34:53 UTC
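A simplified, self-contained illustration of quote-aware splitting (not Spark's actual `splitSemiColon`): a quote character of one kind inside a string delimited by the other kind must not toggle the split state, otherwise the trailing semicolon is never treated as a separator.

```scala
object QuoteAwareSplitDemo extends App {
  // Split on ';' only when outside single- and double-quoted strings.
  def splitStatements(line: String): Seq[String] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[String]
    val cur = new StringBuilder
    var inSingle = false
    var inDouble = false
    line.foreach {
      case '\'' if !inDouble => inSingle = !inSingle; cur += '\''
      case '"'  if !inSingle => inDouble = !inDouble; cur += '"'
      case ';'  if !inSingle && !inDouble => out += cur.toString; cur.clear()
      case c => cur += c
    }
    if (cur.nonEmpty) out += cur.toString
    out.toSeq
  }

  // The unescaped single quote inside the double-quoted literal no longer breaks the split.
  splitStatements("""select "it's ok" as s; select 1""").foreach(println)
}
```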
45ed712 [SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters ### What changes were proposed in this pull request? Metadata-only queries should not include subquery in partition filters. ### Why are the changes needed? Apply the `OptimizeMetadataOnlyQuery` rule again, will get the exception `Cannot evaluate expression: scalar-subquery`. ### Does this PR introduce any user-facing change? Yes. When `spark.sql.optimizer.metadataOnly` is enabled, it succeeds when the queries include subquery in partition filters. ### How was this patch tested? add UT Closes #28383 from cxzl25/fix_SPARK-31590. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 588966d696373c11e963116a0e08ee33c30f0dfb) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 06 May 2020, 01:56:35 UTC
b763c23 [SPARK-31643][TEST] Fix flaky o.a.s.scheduler.BarrierTaskContextSuite.barrier task killed, interrupt ### What changes were proposed in this pull request? Make sure the task has nearly reached `context.barrier()` before killing. ### Why are the changes needed? In case of the task is killed before it reaches `context.barrier()`, the task will not create the expected file. ``` Error Message org.scalatest.exceptions.TestFailedException: new java.io.File(dir, killedFlagFile).exists() was false Expect barrier task being killed. Stacktrace sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: new java.io.File(dir, killedFlagFile).exists() was false Expect barrier task being killed. at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) at org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$testBarrierTaskKilled$1(BarrierTaskContextSuite.scala:266) at org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$testBarrierTaskKilled$1$adapted(BarrierTaskContextSuite.scala:226) at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:163) at org.apache.spark.scheduler.BarrierTaskContextSuite.testBarrierTaskKilled(BarrierTaskContextSuite.scala:226) at org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$new$29(BarrierTaskContextSuite.scala:277) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ``` [Here's](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122273/testReport/org.apache.spark.scheduler/BarrierTaskContextSuite/barrier_task_killed__interrupt/) the full error messages. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #28454 from Ngone51/fix_kill_interrupt. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 61a6ca5d3f623c2a8b49277ac62d77bf4dbfa84f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 05 May 2020, 19:36:57 UTC
5b2d1f4 [SPARK-31644][BUILD] Make Spark's guava version configurable from the command line ### What changes were proposed in this pull request? This adds the maven property guava.version which can be used to control the guava version for a build. It does not change the current version. ### Why are the changes needed? All future Hadoop releases are going to be built with a later guava version, including Hadoop 3.1.4. This means that to run the Spark tests with that release you need to update Spark's guava version. This patch lets whoever builds Spark do this locally. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ran the hadoop-cloud module tests with the 3.1.4 RC0 ``` mvn -T 1 -Phadoop-3.2 -Dhadoop.version=3.1.4 -Psnapshots-and-staging -Phadoop-cloud,yarn,kinesis-asl test --pl hadoop-cloud ``` observed the linkage problem ``` java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357) ``` made the version configurable, retested with ``` -Phadoop-3.2 -Dhadoop.version=3.1.4 -Psnapshots-and-staging -Dguava.version=27.0-jre ``` all good. Closes #28455 from steveloughran/SPARK-31644-guava-version. Authored-by: Steve Loughran <stevel@cloudera.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 86c4e4352565cf271a96b650f533ed89a725322c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 05 May 2020, 19:17:43 UTC
080c51e [SPARK-31641][SQL] Fix days conversions by JSON legacy parser ### What changes were proposed in this pull request? Perform days rebasing while converting days from JSON string field. In Spark 2.4 and earlier versions, the days are interpreted as days since the epoch in the hybrid calendar (Julian + Gregorian since 1582-10-15). Since Spark 3.0, the base calendar was switched to Proleptic Gregorian calendar, so, the days should be rebased to represent the same local date. ### Why are the changes needed? The changes fix a bug and restore compatibility with Spark 2.4 in which: ```scala scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show +----------+ | d| +----------+ |1582-01-01| +----------+ ``` ### Does this PR introduce _any_ user-facing change? Yes. Before: ```scala scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show +----------+ | d| +----------+ |1582-01-11| +----------+ ``` After: ```scala scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show +----------+ | d| +----------+ |1582-01-01| +----------+ ``` ### How was this patch tested? Add a test to `JsonSuite`. Closes #28453 from MaxGekk/json-rebase-legacy-days. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit bd264299317bba91f2dc1dc27fd51e6bc0609d66) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 May 2020, 14:15:43 UTC
b29bbed [SPARK-31630][SQL] Fix perf regression by skipping timestamps rebasing after some threshold ### What changes were proposed in this pull request? Skip timestamps rebasing after a global threshold when there is no difference between Julian and Gregorian calendars. This allows to avoid checking hash maps of switch points, and fixes perf regressions in `toJavaTimestamp()` and `fromJavaTimestamp()`. ### Why are the changes needed? The changes fix perf regressions of conversions to/from external type `java.sql.Timestamp`. Before (see the PR's results https://github.com/apache/spark/pull/28440): ``` ================================================================================================ Conversion from/to external types ================================================================================================ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 376 388 10 13.3 75.2 1.1X Collect java.sql.Timestamp 1878 1937 64 2.7 375.6 0.2X ``` After: ``` ================================================================================================ Conversion from/to external types ================================================================================================ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ From java.sql.Timestamp 249 264 24 20.1 49.8 1.7X Collect java.sql.Timestamp 1503 1523 24 3.3 300.5 0.3X ``` Perf improvements in average of: 1. From java.sql.Timestamp is ~ 34% 2. To java.sql.Timestamps is ~16% ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites `DateTimeUtilsSuite` and `RebaseDateTimeSuite`. Closes #28441 from MaxGekk/opt-rebase-common-threshold. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit bef5828e12500630d7efc8e0c005182b25ef2b7f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 May 2020, 14:12:06 UTC
8c077b0 [SPARK-31621][CORE] Fixing Spark Master UI Issue when application is waiting for workers to launch driver ### What changes were proposed in this pull request? Fixing an issue where the Spark Master UI fails to load if the application is waiting for workers to launch the driver. **Root Cause:** This happens because the submitted application is waiting for a worker to be free to run the driver. Because of this, the resource is set to null in the formatResourcesAddresses method, which runs into a null pointer exception. ![image](https://user-images.githubusercontent.com/31816865/80801557-77ee9300-8bca-11ea-92b7-b8df58b68de3.png) **Fix:** Added a null check before forming a resource address, displaying "None" if the driver isn't launched yet. ### Why are the changes needed? The Spark Master UI should load as expected when applications are waiting for workers to run the driver. ### Does this PR introduce _any_ user-facing change? The worker column in the Spark Master UI will show "None" if the driver hasn't been launched yet. ![image](https://user-images.githubusercontent.com/31816865/80801671-be43f200-8bca-11ea-86c3-381925f82cc7.png) ### How was this patch tested? Tested on a local setup. Launched 2 applications and ensured that the Spark Master UI loads fine. ![image](https://user-images.githubusercontent.com/31816865/80801883-5b9f2600-8bcb-11ea-8a1a-cc597aabc4c2.png) Closes #28429 from akshatb1/MasterUIBug. Authored-by: Akshat Bordia <akshat.bordia31@gmail.com> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit c71198ab6c8c9ded6a52eb97859b39dc2119b5fd) Signed-off-by: Thomas Graves <tgraves@apache.org> 05 May 2020, 13:59:26 UTC
ccde0a1 [SPARK-31030][DOCS][FOLLOWUP] Replace HTML Table by Markdown Table ### What changes were proposed in this pull request? This PR is to clean up the markdown file in remaining pages in sql reference. The first one was done by gatorsmile in [28415](https://github.com/apache/spark/pull/28415) - Replace HTML table by MD table - **sql-ref-ansi-compliance.md** <img width="967" alt="Screen Shot 2020-05-01 at 4 36 35 PM" src="https://user-images.githubusercontent.com/14225158/80848981-1cbca080-8bca-11ea-8a5d-63174b31c800.png"> - **sql-ref-datatypes.md (Scala)** <img width="967" alt="Screen Shot 2020-05-01 at 4 37 30 PM" src="https://user-images.githubusercontent.com/14225158/80849057-6a390d80-8bca-11ea-8866-ab08bab31432.png"> <img width="967" alt="Screen Shot 2020-05-01 at 4 39 18 PM" src="https://user-images.githubusercontent.com/14225158/80849061-6c9b6780-8bca-11ea-834c-eb93d3ab47ae.png"> - **sql-ref-datatypes.md (Java)** <img width="967" alt="Screen Shot 2020-05-01 at 4 41 24 PM" src="https://user-images.githubusercontent.com/14225158/80849138-b3895d00-8bca-11ea-9d3b-555acad2086c.png"> <img width="967" alt="Screen Shot 2020-05-01 at 4 41 39 PM" src="https://user-images.githubusercontent.com/14225158/80849140-b6844d80-8bca-11ea-9ca9-1812b6a76c02.png"> - **sql-ref-datatypes.md (Python)** <img width="967" alt="Screen Shot 2020-05-01 at 4 43 36 PM" src="https://user-images.githubusercontent.com/14225158/80849202-0400ba80-8bcb-11ea-96a5-7caecbf9dbbf.png"> <img width="967" alt="Screen Shot 2020-05-01 at 4 43 54 PM" src="https://user-images.githubusercontent.com/14225158/80849205-06fbab00-8bcb-11ea-8f00-6df52b151684.png"> - **sql-ref-datatypes.md (R)** <img width="967" alt="Screen Shot 2020-05-01 at 4 45 16 PM" src="https://user-images.githubusercontent.com/14225158/80849288-5fcb4380-8bcb-11ea-8277-8589b5bb31bc.png"> <img width="967" alt="Screen Shot 2020-05-01 at 4 45 36 PM" src="https://user-images.githubusercontent.com/14225158/80849294-62c63400-8bcb-11ea-9438-b4f1193bc757.png"> - **sql-ref-datatypes.md (SQL)** <img width="967" alt="Screen Shot 2020-05-01 at 4 48 02 PM" src="https://user-images.githubusercontent.com/14225158/80849336-986b1d00-8bcb-11ea-9736-5fb40496b681.png"> - **sql-ref-syntax-qry-select-tvf.md** <img width="967" alt="Screen Shot 2020-05-01 at 4 49 32 PM" src="https://user-images.githubusercontent.com/14225158/80849399-d10af680-8bcb-11ea-8dc2-e3e750e21a59.png"> ### Why are the changes needed? Make the doc cleaner and easily editable by MD editors ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually using jekyll serve Closes #28433 from dilipbiswal/sql-doc-table-cleanup. Authored-by: Dilip Biswal <dkbiswal@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 5052d9557d964c07d0b8bd2e2b08ede7c6958118) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 05 May 2020, 06:21:44 UTC
edc0cb9 [SPARK-31467][SQL][TEST] Refactor the sql tests to prevent TableAlreadyExistsException ### What changes were proposed in this pull request? If we add a UT in hive/SQLQuerySuite or other SQL test suites and use a table named `test`, we may hit TableAlreadyExistsException. ``` org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or view 'test' already exists in database 'default' ``` The reason is that some tests do not clean up their tables/views. In this PR, I add `withTempViews` for these tests. ### Why are the changes needed? To fix the TableAlreadyExistsException issue when adding a UT that uses a table named `test` (or similar) in some SQL test suites, such as hive/SQLQuerySuite. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Closes #28239 from turboFei/SPARK-31467. Authored-by: turbofei <fwang12@ebay.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 8d1f7d2a4ae08135df56cc32462b29703be46c42) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 05 May 2020, 06:14:56 UTC
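A sketch of the test pattern this entry describes, assuming a suite built on Spark's internal test utilities (`QueryTest` with `SharedSparkSession`, which mixes in a `withTempView` helper); the suite and test names are hypothetical:

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

class ExampleSuite extends QueryTest with SharedSparkSession {
  test("query over a temp view named 'test'") {
    // withTempView drops the view even if the body fails, so later tests that
    // reuse the name 'test' do not hit TableAlreadyExistsException.
    withTempView("test") {
      spark.range(10).createOrReplaceTempView("test")
      checkAnswer(sql("SELECT count(*) FROM test"), Row(10L))
    }
  }
}
```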
d2a7b17 [SPARK-31623][SQL][TESTS] Benchmark rebasing of INT96 and TIMESTAMP_MILLIS timestamps in read/write ### What changes were proposed in this pull request? Add new benchmarks to `DateTimeRebaseBenchmark` for reading/writing timestamps of INT96 and TIMESTAMP_MICROS column types. Here are benchmark results for reading timestamps after 1582 year with default settings (rebasing is off for TIMESTAMP_MICROS/TIMESTAMP_MILLIS, and rebasing on for INT96): timestamp type | vectorized off (ns/row) | vectorized on (ns/row) --|--|-- TIMESTAMP_MICROS| 160.1 | 50.2 INT96 | 215.6 | 117.8 TIMESTAMP_MILLIS | 159.9 | 60.6 ### Why are the changes needed? To compare default timestamp type `TIMESTAMP_MICROS` with other types in the case if an user decides to switch on them. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the benchmarks via: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` in the environment: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_252-8u252 and OpenJDK 64-Bit Server VM 11.0.7+10 | Closes #28431 from MaxGekk/parquet-timestamps-DateTimeRebaseBenchmark. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 735771e7b4f7c6dfe70a1a6f59b0646dcc6bacd7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 May 2020, 05:40:28 UTC
b8143cb [SPARK-27963][FOLLOW-UP][DOCS][CORE] Remove `for testing` because CleanerListener is used ExecutorMonitor during dynamic allocation ### What changes were proposed in this pull request? This PR aims to remove `for testing` from `CleanerListener` class description to promote this private class more clearly. ### Why are the changes needed? After SPARK-27963 (Allow dynamic allocation without a shuffle service), `CleanerListener` is used in `ExecutorMonitor` during dynamic allocation. Specifically, `CleanerListener.shuffleCleaned` is used. - https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala#L385-L392 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a private doc-only change. Closes #28452 from dongjoon-hyun/SPARK-MINOR. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 0907f2e7b505adf4e96a1fa7a80629680c3bf5bf) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 05 May 2020, 01:07:44 UTC
567e65e [SPARK-31372][SQL][TEST][FOLLOW-UP] Improve ExpressionsSchemaSuite so that easy to track the diff ### What changes were proposed in this pull request? This PR follows up https://github.com/apache/spark/pull/28194. As discussed at https://github.com/apache/spark/pull/28194/files#r418418796. This PR will improve `ExpressionsSchemaSuite` so that easy to track the diff. Although `ExpressionsSchemaSuite` at line https://github.com/apache/spark/blob/b7cde42b04b21c9bfee6535199cf385855c15853/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala#L165 just want to compare the total size between expected output size and the newest output size, the scalatest framework will output the extra information contains all the content of expected output and newest output. This PR will try to avoid this issue. After this PR, the exception looks like below: ``` [info] - Check schemas for expression examples *** FAILED *** (7 seconds, 336 milliseconds) [info] 340 did not equal 341 Expected 332 blocks in result file but got 333. Try regenerate the result files. (ExpressionsSchemaSuite.scala:167) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) [info] at org.apache.spark.sql.ExpressionsSchemaSuite.$anonfun$new$1(ExpressionsSchemaSuite.scala:167) ``` ### Why are the changes needed? Make the exception more concise and clear. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #28430 from beliefer/improve-expressions-schema-suite. Authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b9494206a50d39973b46f32f2d44cc8099c078d4) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 05 May 2020, 01:04:38 UTC
c4b292e [SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default ### What changes were proposed in this pull request? This reverts commit https://github.com/apache/spark/commit/43a73e387cb843486adcf5b8bbd8b99010ce6e02. It sets `INT96` as the timestamp type while saving timestamps to parquet files. ### Why are the changes needed? To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites. Closes #28450 from MaxGekk/parquet-int96. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 372ccba0632a76a7b02cb2c558a3ecd4fae839e5) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 05 May 2020, 00:27:17 UTC
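For users who still want the Parquet logical type after this revert, a hedged sketch of opting back in per session; treating `spark.sql.parquet.outputTimestampType` as the relevant knob is an assumption here, and the output path is illustrative:

```scala
// Opt back into the logical type instead of relying on the default,
// which this revert sets back to INT96. Config key assumed, path illustrative.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
spark.range(1)
  .selectExpr("timestamp'2020-05-05 00:00:00' AS ts")
  .write.mode("overwrite").parquet("/tmp/ts_micros_parquet")
```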
8a97e4e [SPARK-31633][BUILD] Upgrade SLF4J from 1.7.16 to 1.7.30 ### What changes were proposed in this pull request? This PR aims to upgrade SLF4J from 1.7.16 to 1.7.30. ### Why are the changes needed? SLF4J 1.7.23+ is required to enable `slf4j-log4j12` with MDC feature to run under Java 9. Also, this will bring all latest bug fixes. - http://www.slf4j.org/news.html > When running under Java 9, log4j version 1.2.x is unable to correctly parse the "java.version" system property. Assuming an inccorect Java version, it proceeded to disable its MDC functionality. The slf4j-log4j12 module shipping in this release fixes the issue by tweaking MDC internals by reflection, allowing log4j to run under Java 9. See also SLF4J-393. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #28446 from dongjoon-hyun/SPARK-31633. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit e7995c2ddcfd43c8cd99d2a54009139752b66e69) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 04 May 2020, 15:14:26 UTC
357fdb2 [SPARK-31624] Fix SHOW TBLPROPERTIES for V2 tables that leverage the session catalog ### What changes were proposed in this pull request? SHOW TBLPROPERTIES does not get the correct table properties for tables using the Session Catalog. This PR fixes that, by explicitly falling back to the V1 implementation if the table is in fact a V1 table. We also hide the reserved table properties for V2 tables, as users do not have control over setting these table properties. Hence, if they cannot be set or controlled by the user, they shouldn't be displayed. ### Why are the changes needed? Without the fix, SHOW TBLPROPERTIES shows incorrect table properties, i.e. only what exists in the Hive MetaStore, for V2 tables that may have table properties outside of the MetaStore. ### Does this PR introduce _any_ user-facing change? Fixes a bug ### How was this patch tested? Regression test Closes #28434 from brkyvz/ddlCommands. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 02a319d7e157c195d0a2b8c2bb992d980dde7d5c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 May 2020, 12:22:39 UTC
04b3699 [MINOR][DOCS] Fix typo in documents Fixed typo in `docs` directory and in `project/MimaExcludes.scala` Better readability of documents No No test needed Closes #28447 from kiszk/typo_20200504. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 35fcc8d5c58626836cd4d99a472e7350ea3acd0d) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 04 May 2020, 08:02:27 UTC
b3d2eaa [SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase ### What changes were proposed in this pull request? Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly. ### Why are the changes needed? Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare). ### Does this PR introduce any user-facing change? no ### How was this patch tested? Run part of the `DateTimeRebaseBenchmark` locally. The results: before this patch ``` [info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X [info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X [info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X [info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X [info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X [info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X [info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X [info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X ``` After this patch ``` [info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X [info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X [info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X [info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X [info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X [info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X [info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X [info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X ``` Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase. Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase. Closes #28406 from cloud-fan/perf. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit f72220b8ab256e8e6532205a4ce51d50b69c26e9) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 04 May 2020, 06:30:24 UTC
8584f4d [SPARK-31626][SQL] Port HIVE-10415: hive.start.cleanup.scratchdir configuration is not taking effect ### What changes were proposed in this pull request? This PR ports [HIVE-10415](https://issues.apache.org/jira/browse/HIVE-10415), which fixes the `hive.start.cleanup.scratchdir` configuration not taking effect. ### Why are the changes needed? I encountered this issue: ![image](https://user-images.githubusercontent.com/5399861/80869375-aeafd080-8cd2-11ea-8573-93ec4b422be1.png) Making `hive.start.cleanup.scratchdir` effective helps reduce this issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test Closes #28436 from wangyum/SPARK-31626. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 7ef0b69a92db91d0c09e65eb9dcfb973def71814) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 04 May 2020, 05:59:46 UTC
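A hedged sketch of handing the ported flag to a Hive-enabled session; whether the `spark.hadoop.*` passthrough is the right way to deliver `hive.start.cleanup.scratchdir` in this setup is an assumption:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: route the Hive flag through the spark.hadoop.* passthrough so the
// scratch directory is cleaned up at startup once HIVE-10415 is ported.
val spark = SparkSession.builder()
  .appName("scratchdir-cleanup-sketch")
  .config("spark.hadoop.hive.start.cleanup.scratchdir", "true")
  .enableHiveSupport()
  .getOrCreate()
```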
7d48215 [SPARK-31267][SQL] Flaky test: WholeStageCodegenSparkSubmitSuite.Generated code on driver should not embed platform-specific constant ### What changes were proposed in this pull request? Allow customized timeouts for `runSparkSubmit`, which will make flaky tests more likely to pass by using a larger timeout value. I was able to reproduce the test failure on my laptop, which took 1.5 - 2 minutes to finish the test. After increasing the timeout, the test now can pass locally. ### Why are the changes needed? This allows slow tests to use a larger timeout, so they are more likely to succeed. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The test was able to pass on my local env after the change. Closes #28438 from tianshizz/SPARK-31267. Authored-by: Tianshi Zhu <zhutianshirea@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit a222644e1df907d0aba19634a166e146dfb4f551) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 04 May 2020, 05:50:57 UTC
d400880 [SPARK-31527][SQL][TESTS][FOLLOWUP] Fix the number of rows in `DateTimeBenchmark` ### What changes were proposed in this pull request? - Changed the number of rows in benchmark cases from 3 to the actual number `N`. - Regenerated benchmark results in the environment: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 | ### Why are the changes needed? The changes are needed to have: - Correct benchmark results - A baseline for other perf improvements that can be checked in the same environment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the benchmark and checking its output. Closes #28440 from MaxGekk/SPARK-31527-DateTimeBenchmark-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 2fb85f6b684843f337b6e73ba57ee9e57a53496d) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 04 May 2020, 00:40:07 UTC
9c101e9 [SPARK-31571][R] Overhaul stop/message/warning calls to be more canonical ### What changes were proposed in this pull request? Internal usages like `{stop,warning,message}({paste,paste0,sprintf}` and `{stop,warning,message}(some_literal_string_as_variable` have been removed and replaced as appropriate. ### Why are the changes needed? CRAN policy recommends against using such constructions to build error messages, in particular because it makes the process of creating portable error messages for the package more onerous. ### Does this PR introduce any user-facing change? There may be some small grammatical changes visible in error messaging. ### How was this patch tested? Not done Closes #28365 from MichaelChirico/r-stop-paste. Authored-by: Michael Chirico <michael.chirico@grabtaxi.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit f53d8c63e80172295e2fbc805c0c391bdececcaa) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 May 2020, 03:40:35 UTC
3aa659c [MINOR][SQL][TESTS] Disable UI in SQL benchmarks by default ### What changes were proposed in this pull request? Set `spark.ui.enabled` to `false` in `SqlBasedBenchmark.getSparkSession`. This disables the UI in all SQL benchmarks by default. ### Why are the changes needed? UI overhead lowers numbers in the `Relative` column and affects `Stdev` in benchmark results. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Checked by running `DateTimeRebaseBenchmark`. Closes #28432 from MaxGekk/ui-off-in-benchmarks. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 13dddee9a8490ead00ff00bd741db4a170dfd759) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 02 May 2020, 08:55:05 UTC
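A sketch of the change described above, shown as a standalone method for spark-shell; assuming `SqlBasedBenchmark.getSparkSession` builds its session with `SparkSession.builder()` (the exact trait layout is an assumption):

```scala
import org.apache.spark.sql.SparkSession

// Benchmark sessions do not need a UI; turning it off removes UI bookkeeping
// that would otherwise skew the Relative and Stdev columns.
def getSparkSession: SparkSession = {
  SparkSession.builder()
    .master("local[1]")
    .appName("sql-benchmark")
    .config("spark.ui.enabled", "false")
    .getOrCreate()
}
```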
7ddaaed [MINOR][SQL][DOCS] Remove two leading spaces from sql tables ### What changes were proposed in this pull request? Remove two leading spaces from sql tables. ### Why are the changes needed? Follow the format of other references such as https://docs.snowflake.com/en/sql-reference/constructs/join.html, https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_10002.htm, https://www.postgresql.org/docs/10/sql-select.html. ### Does this PR introduce any user-facing change? before ``` SELECT * FROM test; +-+ ... +-+ ``` after ``` SELECT * FROM test; +-+ ... +-+ ``` ### How was this patch tested? Manually build and check Closes #28348 from huaxingao/sql-format. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 75da05038b68839c2b665675c80455826fc426b5) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 01 May 2020, 17:12:02 UTC
1795a70 [SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements ### What changes were proposed in this pull request? The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinaryType this is not the case. Example: ```scala import org.apache.spark.sql.functions._ import org.apache.spark.sql.expressions.Window case class R(id: String, value: String, bytes: Array[Byte]) def makeR(id: String, value: String) = R(id, value, value.getBytes) val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF() // In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected). df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "byteSet").show(truncate=false) // The same problem is displayed when using window functions. val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) val result = df.select( collect_set('value).over(win) as "stringSet", collect_set('bytes).over(win) as "bytesSet" ) .select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize") .show() ``` We use a HashSet buffer to accumulate the results. The problem is that array equality in Scala doesn't behave as expected: arrays are just plain Java arrays, and their equality doesn't compare the content of the arrays, so Array(1, 2, 3) == Array(1, 2, 3) => false. The result is that duplicates are not removed in the HashSet. The proposed solution is that in the last stage, when we have all the data in the HashSet buffer, we remove duplicates by changing the type of the elements and then transform them back to the original type. This transformation is only applied when we have a BinaryType. ### Why are the changes needed? Fix the bug explained above. ### Does this PR introduce any user-facing change? Yes. Now `collect_set()` correctly deduplicates arrays of bytes. ### How was this patch tested? Unit testing Closes #28351 from planga82/feature/SPARK-31500_COLLECT_SET_bug. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 4fecc20f6ecdfe642890cf0a368a85558c40a47c) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 01 May 2020, 13:09:38 UTC
7c6b970 [SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite ### What changes were proposed in this pull request? This PR is a follow-up PR to update the golden file of `ExpressionsSchemaSuite`. ### Why are the changes needed? To recover tests in branch-3.0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #28427 from maropu/SPARK-31372-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 01 May 2020, 09:37:41 UTC
cafd67e [SPARK-31549][PYSPARK][FOLLOW-UP] Remove a newline between class methods for linter check 01 May 2020, 03:12:30 UTC
78df2ca [SPARK-31619][CORE] Rename config "spark.dynamicAllocation.shuffleTimeout" to "spark.dynamicAllocation.shuffleTracking.timeout" ### What changes were proposed in this pull request? The "spark.dynamicAllocation.shuffleTimeout" configuration only takes effect if "spark.dynamicAllocation.shuffleTracking.enabled" is true, so we should re-namespace that configuration so that it's nested under the "shuffleTracking" one. ### How was this patch tested? Covered by current existing test cases. Closes #28426 from jiangxb1987/confName. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b7cde42b04b21c9bfee6535199cf385855c15853) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 01 May 2020, 02:46:40 UTC
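A small sketch of how the renamed key pairs with the tracking flag on a `SparkConf`; the timeout value is illustrative:

```scala
import org.apache.spark.SparkConf

// The timeout only takes effect when shuffle tracking is enabled, which is why
// the key is now nested under the shuffleTracking namespace.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.timeout", "30min")
```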
7be1484 [SPARK-28424][TESTS][FOLLOW-UP] Add test cases for all interval units Add test cases covering all interval units: MICROSECOND MILLISECOND SECOND MINUTE HOUR DAY WEEK MONTH YEAR For test coverage. No. Test only. Closes #28418 from xuanyuanking/SPARK-28424. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit aec8b694359a5a7fb42adfcbfe2f3cef2d307e28) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 01 May 2020, 01:34:44 UTC
a281b9c [SPARK-31549][PYSPARK] Add a developer API invoking collect on Python RDD with user-specified job group ### What changes were proposed in this pull request? I add a new API in the pyspark RDD class: def collectWithJobGroup(self, groupId, description, interruptOnCancel=False) This API does the same thing as `rdd.collect`, but it can specify the job group for the collect. The purpose of adding this API is that if we use: ``` sc.setJobGroup("group-id...") rdd.collect() ``` the `setJobGroup` API in pyspark won't work correctly. This is related to a bug discussed in https://issues.apache.org/jira/browse/SPARK-31549 Note: This PR is a rather temporary workaround for `PYSPARK_PIN_THREAD`, and a step to migrate to `PYSPARK_PIN_THREAD` smoothly. It targets Spark 3.0. - `PYSPARK_PIN_THREAD` is unstable at this moment and affects whole PySpark applications. - It is impossible to make it a runtime configuration as it has to be set before the JVM is launched. - There is a thread leak issue between Python and the JVM. We should address it, but it's not a release blocker for Spark 3.0 since the feature is experimental. I plan to handle this after Spark 3.0 due to stability. Once `PYSPARK_PIN_THREAD` is enabled by default, we should ideally remove this API. I will target deprecating this API in Spark 3.1. ### Why are the changes needed? Fix bug. ### Does this PR introduce any user-facing change? A developer API in pyspark: `pyspark.RDD.collectWithJobGroup` ### How was this patch tested? Unit test. Closes #28395 from WeichenXu123/collect_with_job_group. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit ee1de66fe4e05754ea3f33b75b83c54772b00112) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 01 May 2020, 01:08:37 UTC
f7c1feb [SPARK-31612][SQL][DOCS] SQL Reference clean up ### What changes were proposed in this pull request? SQL Reference cleanup ### Why are the changes needed? To complete SQL Reference ### Does this PR introduce _any_ user-facing change? updated sql-ref-syntax-qry.html before <img width="1100" alt="Screen Shot 2020-04-29 at 11 08 25 PM" src="https://user-images.githubusercontent.com/13592258/80677799-70b27280-8a6e-11ea-8e3f-a768f29d0377.png"> after <img width="1100" alt="Screen Shot 2020-04-29 at 11 05 55 PM" src="https://user-images.githubusercontent.com/13592258/80677803-74de9000-8a6e-11ea-880c-aa05c53254a6.png"> ### How was this patch tested? Manually build and check Closes #28417 from huaxingao/cleanup. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 2410a45703b829391211caaf1a745511f95298ad) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 30 April 2020, 21:30:54 UTC
f5e018e [SPARK-28806][DOCS][FOLLOW-UP] Remove unneeded HTML from the MD file ### What changes were proposed in this pull request? This PR is to clean up the markdown file in SHOW COLUMNS page. - remove the unneeded embedded inline HTML markup by using the basic markdown syntax. - use the ``` sql for highlighting the SQL syntax. ### Why are the changes needed? Make the doc cleaner and easily editable by MD editors. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? **Before** ![Screen Shot 2020-04-29 at 5 20 11 PM](https://user-images.githubusercontent.com/11567269/80661963-fa4d4a80-8a44-11ea-9dea-c43cda6de010.png) **After** ![Screen Shot 2020-04-29 at 6 03 50 PM](https://user-images.githubusercontent.com/11567269/80661940-f15c7900-8a44-11ea-9943-a83e8d8618fb.png) Closes #28414 from gatorsmile/cleanupShowColumns. Lead-authored-by: Xiao Li <gatorsmile@gmail.com> Co-authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit b5ecc41c73018bbc742186d2e752101a99cfe852) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 30 April 2020, 16:35:11 UTC
9d4a864 [SPARK-31557][SQL] Fix timestamps rebasing in legacy parsers In the PR, I propose to fix two legacy timestamp formatters, `LegacySimpleTimestampFormatter` and `LegacyFastTimestampFormatter`, to perform micros rebasing in parsing/formatting from/to strings. Legacy timestamp formatters operate on the hybrid calendar (Julian + Gregorian), so the input micros should be rebased to have the same date-time fields as in the Proleptic Gregorian calendar used by Spark SQL; see SPARK-26651. Yes Added tests to `TimestampFormatterSuite` Closes #28408 from MaxGekk/fix-rebasing-in-legacy-timestamp-formatter. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c09cfb9808d0e399b97781aae0da50332ba4b49b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 April 2020, 12:48:23 UTC
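A hedged sketch of when these legacy formatters come into play from user code; treating `spark.sql.legacy.timeParserPolicy` set to `LEGACY` as the opt-in is an assumption:

```scala
// Assumption: selecting the legacy time parser policy routes parsing through the
// legacy (hybrid-calendar) formatters whose rebasing this entry fixes.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
spark.sql("SELECT to_timestamp('1000-01-01 01:02:03', 'yyyy-MM-dd HH:mm:ss')").show()
```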
e9ca660 [SPARK-27340][SS][TESTS][FOLLOW-UP] Rephrase API comments and simplify tests ### What changes were proposed in this pull request? - Rephrase the API doc for `Column.as` - Simplify the UTs ### Why are the changes needed? Address comments in https://github.com/apache/spark/pull/28326 ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT added. Closes #28390 from xuanyuanking/SPARK-27340-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 7195a18bf24d9506d2f8d9d4d93ff679b3d21b65) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 April 2020, 06:24:20 UTC
4864680 [SPARK-31030][DOCS][FOLLOWUP] Replace HTML Table by Markdown Table ### What changes were proposed in this pull request? This PR is to clean up the markdown file in datetime-pattern page. - Replace HTML table by MD table ### Why are the changes needed? Make the doc cleaner and easily editable by MD editors. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? **Before** ![Screen Shot 2020-04-29 at 7 59 10 PM](https://user-images.githubusercontent.com/11567269/80668093-c9294600-8a55-11ea-9dca-d558203298f8.png) **After** ![Screen Shot 2020-04-29 at 8 13 38 PM](https://user-images.githubusercontent.com/11567269/80668146-f1b14000-8a55-11ea-8d47-8dc8a0378271.png) Closes #28415 from gatorsmile/cleanupUDFPage. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f56c6630fbb3230f0eb9549b45105a1a015abd4c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 April 2020, 05:47:54 UTC
75fc384 [SPARK-31372][SQL][TEST] Display expression schema for double check ### What changes were proposed in this pull request? Although SPARK-30184 implemented a helper method for aliasing functions, developers often forget to use this improvement. We need to add more powerful guarantees so that aliases output by built-in functions are correct. This PR extracts the SQL from the expression examples and outputs the SQL and its schema into one golden file. By checking the golden file, we can find the expressions whose aliases are not displayed correctly, and then fix them. ### Why are the changes needed? Ensure that the output aliases are correct. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Jenkins test. Closes #28194 from beliefer/check-expression-schema. Lead-authored-by: beliefer <beliefer@163.com> Co-authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1d1bb79bc695dbaa00699e6fc9073233b60ed395) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 April 2020, 03:58:18 UTC
82b8f7f [SPARK-31601][K8S] Fix spark.kubernetes.executor.podNamePrefix to work ### What changes were proposed in this pull request? This PR aims to fix `spark.kubernetes.executor.podNamePrefix` to work. ### Why are the changes needed? Currently, the configuration is broken like the following. ``` bin/spark-submit \ --master k8s://$K8S_MASTER \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ -c spark.kubernetes.container.image=spark:pr \ -c spark.kubernetes.driver.pod.name=mypod \ -c spark.kubernetes.executor.podNamePrefix=mypod \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar ``` **BEFORE SPARK-31601** ``` pod/mypod 1/1 Running 0 9s pod/spark-pi-7469dd71c499fafb-exec-1 1/1 Running 0 4s pod/spark-pi-7469dd71c499fafb-exec-2 1/1 Running 0 4s ``` **AFTER SPARK-31601** ``` pod/mypod 1/1 Running 0 8s pod/mypod-exec-1 1/1 Running 0 3s pod/mypod-exec-2 1/1 Running 0 3s ``` ### Does this PR introduce any user-facing change? Yes. This is a bug fix. The conf will work as described in the documentation. ### How was this patch tested? Pass the Jenkins and run the above comment manually. Closes #28401 from dongjoon-hyun/SPARK-31601. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Prashant Sharma <prashsh1@in.ibm.com> (cherry picked from commit 85dad37f69ebb617c8ac015dbbbda11054170298) Signed-off-by: Prashant Sharma <prashsh1@in.ibm.com> 30 April 2020, 03:46:41 UTC
281200f [SPARK-31553][SQL][TESTS][FOLLOWUP] Tests for collection elem types of `isInCollection` ### What changes were proposed in this pull request? - Add tests for different element types of collections that could be passed to `isInCollection`. Added tests for types that can pass the check `In`.`checkInputDataTypes()`. - Test different switch thresholds in the `isInCollection: Scala Collection` test. ### Why are the changes needed? To prevent regressions like introduced by https://github.com/apache/spark/pull/25754 and reverted by https://github.com/apache/spark/pull/28388 ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing and new tests in `ColumnExpressionSuite` Closes #28405 from MaxGekk/test-isInCollection. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 91648654da259c63178f3fb3f94e3e62e1ef1e45) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 April 2020, 03:20:21 UTC
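For context, a minimal usage sketch of the `isInCollection` API these tests exercise; the data is made up:

```scala
import org.apache.spark.sql.functions.col

// isInCollection takes a Scala Iterable; small collections are planned as In and
// larger ones as InSet, which is the switch threshold the tests vary.
val df = spark.range(10).toDF("id")
df.filter(col("id").isInCollection(Seq(1L, 3L, 5L))).show()
```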
9ccf828 [SPARK-31582][YARN] Being able to not populate Hadoop classpath ### What changes were proposed in this pull request? We are adding a new Spark Yarn configuration, `spark.yarn.populateHadoopClasspath`, to allow not populating the Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath`. ### Why are the changes needed? The Spark Yarn client populates an extra Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to a Yarn Hadoop cluster. However, for a `with-hadoop` Spark build that embeds the Hadoop runtime, this can cause jar conflicts because the Spark distribution can contain a different version of the Hadoop jars. One case we have is when a user uses an Apache Spark distribution with its own embedded Hadoop and submits a job to Cloudera or Hortonworks Yarn clusters; because of two different incompatible Hadoop jars in the classpath, it runs into errors. Not populating the Hadoop classpath from the clusters addresses this issue. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? A UT is added, but it is very hard to add a new integration test since this requires using different incompatible versions of Hadoop. We also manually tested this PR, and we are able to submit a Spark job using a Spark distribution built with Apache Hadoop 2.10 to CDH 5.6 without populating the CDH classpath. Closes #28376 from dbtsai/yarn-classpath. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit ecfee82fda5f0403024ff64f16bc767b8d1e3e3d) Signed-off-by: DB Tsai <d_tsai@apple.com> 29 April 2020, 21:10:58 UTC
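A minimal sketch of switching the new flag off programmatically for a with-hadoop build; the same key can equally be passed with `--conf` at submit time:

```scala
import org.apache.spark.SparkConf

// Skip populating yarn.application.classpath / mapreduce.application.classpath
// from the cluster when the Spark build already embeds its own Hadoop jars.
val conf = new SparkConf()
  .set("spark.yarn.populateHadoopClasspath", "false")
```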
06e1c0d [SPARK-29339][R][FOLLOW-UP] Remove requireNamespace1 workaround for arrow ### What changes were proposed in this pull request? `requireNamespace1` was used to get `SparkR` on CRAN while Suggesting `arrow` while `arrow` was not yet available on CRAN. ### Why are the changes needed? Now `arrow` is on CRAN, we can properly use `requireNamespace` without triggering CRAN failures. ### Does this PR introduce any user-facing change? No ### How was this patch tested? AppVeyor will test, and CRAN check in Jenkins build. Closes #28387 from MichaelChirico/r-require-arrow. Authored-by: Michael Chirico <michael.chirico@grabtaxi.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 226301a6bc9e675f00e0c66ae31bc2d297e3649f) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 29 April 2020, 09:42:52 UTC
b4e63ac [SPARK-31557][SQL][TESTS][FOLLOWUP] Check rebasing in all legacy formatters ### What changes were proposed in this pull request? - Check all available legacy formats in the tests added by https://github.com/apache/spark/pull/28345 - Check dates rebasing in legacy parsers for only one direction either days -> string or string -> days. ### Why are the changes needed? Round trip tests can hide issues in dates rebasing. For example, if we remove rebasing from legacy parsers (from `parse()` and `format()`) the tests will pass. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running `DateFormatterSuite`. Closes #28398 from MaxGekk/test-rebasing-in-legacy-date-formatter. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 73eac7565d7cd185d12a46637703ebff73649e40) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 April 2020, 07:19:45 UTC
dde2dd6 [SPARK-30282][SQL][FOLLOWUP] SHOW TBLPROPERTIES should support views ### What changes were proposed in this pull request? This PR addresses two things: - `SHOW TBLPROPERTIES` should support views (a regression introduced by #26921) - `SHOW TBLPROPERTIES` on a temporary view should return an empty result (the 2.4 behavior) instead of throwing `AnalysisException`. ### Why are the changes needed? It's a bug. ### Does this PR introduce any user-facing change? Yes, now `SHOW TBLPROPERTIES` works on views: ``` scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1") scala> sql("SHOW TBLPROPERTIES view").show(truncate=false) +---------------------------------+-------------+ |key |value | +---------------------------------+-------------+ |view.catalogAndNamespace.numParts|2 | |view.query.out.col.0 |c1 | |view.query.out.numCols |1 | |p2 |v2 | |view.catalogAndNamespace.part.0 |spark_catalog| |p1 |v1 | |view.catalogAndNamespace.part.1 |default | +---------------------------------+-------------+ ``` And for a temporary view: ``` scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1") scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false) +---+-----+ |key|value| +---+-----+ +---+-----+ ``` ### How was this patch tested? Added tests. Closes #28375 from imback82/show_tblproperties_followup. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 36803031e850b08d689df90d15c75e1a1eeb28a8) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 April 2020, 07:06:57 UTC