https://github.com/apache/spark

Revision Author Date Message Commit Date
a3cffc9 Preparing Spark release v2.4.6-rc1 06 May 2020, 23:13:21 UTC
a00eddc [SPARK-31653][BUILD] Setuptools is needed before installing any other python packages ### What changes were proposed in this pull request? Allow the docker build to succeed ### Why are the changes needed? The base packages depend on having setuptools installed now ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran the release script, pip installs succeeded Closes #28467 from holdenk/SPARK-31653-setuptools-needs-to-be-isntalled-before-anything-else. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com> 06 May 2020, 21:56:42 UTC
579b33b [SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters ### What changes were proposed in this pull request? Metadata-only queries should not include subqueries in partition filters. ### Why are the changes needed? Applying the `OptimizeMetadataOnlyQuery` rule again to such a plan raises the exception `Cannot evaluate expression: scalar-subquery`. ### Does this PR introduce any user-facing change? Yes. When `spark.sql.optimizer.metadataOnly` is enabled, queries whose partition filters include a subquery now succeed. ### How was this patch tested? Added a unit test. Closes #28383 from cxzl25/fix_SPARK-31590. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 588966d696373c11e963116a0e08ee33c30f0dfb) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 06 May 2020, 01:56:54 UTC
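An illustrative sketch of the scenario this commit fixes (run in spark-shell; the table, columns and data below are made up, only the config key comes from the commit):

```scala
// A metadata-only style query (DISTINCT over a partition column) whose partition
// filter contains a scalar subquery -- the shape of query SPARK-31590 addresses.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")

spark.sql("CREATE TABLE events (id INT, p INT) USING parquet PARTITIONED BY (p)")
spark.sql("INSERT INTO events VALUES (1, 1), (2, 2)")

// Before the fix this kind of query could fail with
// "Cannot evaluate expression: scalar-subquery"; with the fix it succeeds.
spark.sql("SELECT DISTINCT p FROM events WHERE p = (SELECT max(p) FROM events)").show()
```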
1222ce0 [SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements ### What changes were proposed in this pull request? The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinaryType this is not the case. Example: ```scala import org.apache.spark.sql.functions._ import org.apache.spark.sql.expressions.Window case class R(id: String, value: String, bytes: Array[Byte]) def makeR(id: String, value: String) = R(id, value, value.getBytes) val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF() // In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected). df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "byteSet").show(truncate=false) // The same problem is displayed when using window functions. val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) val result = df.select( collect_set('value).over(win) as "stringSet", collect_set('bytes).over(win) as "bytesSet" ) .select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize") .show() ``` We use a HashSet buffer to accumulate the results. The problem is that array equality in Scala doesn't behave as expected: arrays are just plain Java arrays, and their equality doesn't compare the content of the arrays (`Array(1, 2, 3) == Array(1, 2, 3)` is `false`), so duplicates are not removed from the HashSet. The proposed solution is that in the last stage, when we have all the data in the HashSet buffer, we remove duplicates by changing the type of the elements and then transform them back to the original type. This transformation is only applied when the type is BinaryType. ### Why are the changes needed? Fix the bug explained above. ### Does this PR introduce any user-facing change? Yes. Now `collect_set()` correctly deduplicates arrays of bytes. ### How was this patch tested? Unit testing Closes #28351 from planga82/feature/SPARK-31500_COLLECT_SET_bug. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 4fecc20f6ecdfe642890cf0a368a85558c40a47c) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 01 May 2020, 13:10:12 UTC
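The root cause comes down to Java array equality; a tiny standalone Scala illustration (not part of the patch):

```scala
// Java arrays compare by reference, so a HashSet cannot deduplicate byte arrays
// that merely have equal contents.
val a = Array[Byte](1, 2, 3)
val b = Array[Byte](1, 2, 3)
println(a == b)                      // false: reference equality
println(a.sameElements(b))           // true: content equality
println(Set(a, b).size)              // 2: the set keeps both arrays
println(Set(a.toSeq, b.toSeq).size)  // 1: wrapping in a Seq restores value equality
```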
a2897d2 [SPARK-31601][K8S] Fix spark.kubernetes.executor.podNamePrefix to work This PR aims to fix `spark.kubernetes.executor.podNamePrefix` to work. Currently, the configuration is broken like the following. ``` bin/spark-submit \ --master k8s://$K8S_MASTER \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ -c spark.kubernetes.container.image=spark:pr \ -c spark.kubernetes.driver.pod.name=mypod \ -c spark.kubernetes.executor.podNamePrefix=mypod \ local:///opt/spark/examples/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar ``` **BEFORE SPARK-31601** ``` pod/mypod 1/1 Running 0 9s pod/spark-pi-7469dd71c499fafb-exec-1 1/1 Running 0 4s pod/spark-pi-7469dd71c499fafb-exec-2 1/1 Running 0 4s ``` **AFTER SPARK-31601** ``` pod/mypod 1/1 Running 0 8s pod/mypod-exec-1 1/1 Running 0 3s pod/mypod-exec-2 1/1 Running 0 3s ``` Yes. This is a bug fix. The conf will work as described in the documentation. Passed the Jenkins tests and ran the above command manually. Closes #28401 from dongjoon-hyun/SPARK-31601. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Prashant Sharma <prashsh1@in.ibm.com> (cherry picked from commit 85dad37f69ebb617c8ac015dbbbda11054170298) Signed-off-by: Prashant Sharma <prashsh1@in.ibm.com> (cherry picked from commit 82b8f7fc9d21f4ac506d8cd613158f0511f5cb1d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 April 2020, 04:55:36 UTC
da3748f [SPARK-31449][SQL][2.4] Fix getting time zone offsets from local milliseconds ### What changes were proposed in this pull request? Replace the current implementation of `getOffsetFromLocalMillis()` by the code from JDK https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/aa318070b27849f1fe00d14684b2a40f7b29bf79/jdk/src/share/classes/java/util/GregorianCalendar.java#L2795-L2801: ```java if (zone instanceof ZoneInfo) { ((ZoneInfo)zone).getOffsetsByWall(millis, zoneOffsets); } else { int gmtOffset = isFieldSet(fieldMask, ZONE_OFFSET) ? internalGet(ZONE_OFFSET) : zone.getRawOffset(); zone.getOffsets(millis - gmtOffset, zoneOffsets); } ``` ### Why are the changes needed? Spark's current home-grown implementation of `getOffsetFromLocalMillis()` is incompatible with other date-time functions used from JDK's `GregorianCalendar` like `ZoneInfo.getOffsets`, and can return wrong results as demonstrated in SPARK-31449. For example, the function currently returns a 1h offset while the JDK function returns 0h: ``` Europe/Paris 1916-10-01 23:50:39.0 3600000 0 ``` The timestamp actually falls in a DST transition that shifts wall clocks back by 1 hour: Year | Date & Time | Abbreviation | Time Change | Offset After -- | -- | -- | -- | -- 1916 |Tue, 14 Jun, 23:00 | WET → WEST | +1 hour (DST start) | UTC+1h 1916 |Sun, 2 Oct, 00:00 | WEST → WET | -1 hour (DST end) | UTC According to the default JDK policy, the latest timestamp should be taken in the case of overlapping wall-clock times, but the current implementation takes the earliest one, which makes it incompatible with other JDK calls. ### Does this PR introduce any user-facing change? Yes, see the differences in SPARK-31449. ### How was this patch tested? By the existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`. Closes #28410 from MaxGekk/fix-tz-offset-by-wallclock-2.4. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 April 2020, 03:16:37 UTC
3b17ad3 [SPARK-31582][YARN][2.4] Being able to not populate Hadoop classpath ### What changes were proposed in this pull request? We are adding a new Spark Yarn configuration, `spark.yarn.populateHadoopClasspath` to not populate Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath`. ### Why are the changes needed? Spark Yarn client populates extra Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to a Yarn Hadoop cluster. However, for `with-hadoop` Spark build that embeds Hadoop runtime, it can cause jar conflicts because Spark distribution can contain different version of Hadoop jars. One case we have is when a user uses an Apache Spark distribution with its-own embedded hadoop, and submits a job to Cloudera or Hortonworks Yarn clusters, because of two different incompatible Hadoop jars in the classpath, it runs into errors. By not populating the Hadoop classpath from the clusters can address this issue. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? An UT is added, but very hard to add a new integration test since this requires using different incompatible versions of Hadoop. We also manually tested this PR, and we are able to submit a Spark job using Spark distribution built with Apache Hadoop 2.10 to CDH 5.6 without populating CDH classpath. Closes #28411 from dbtsai/SPARK-31582. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com> 29 April 2020, 21:39:03 UTC
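A hedged sketch of how the new flag can be set (only the config key comes from the commit; in practice it would typically be passed with --conf or spark-defaults.conf at submit time, and is shown on SparkConf purely for illustration):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// For a with-hadoop Spark build submitted to a vendor YARN cluster, do not add
// yarn.application.classpath / mapreduce.application.classpath to the classpath.
val conf = new SparkConf()
  .setAppName("no-cluster-hadoop-classpath")
  .set("spark.yarn.populateHadoopClasspath", "false")

val spark = SparkSession.builder.config(conf).getOrCreate()
```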
4be3390 [SPARK-31519][SQL][2.4] Datetime functions in having aggregate expressions returns the wrong result ### What changes were proposed in this pull request? Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)). ### Why are the changes needed? The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator. It works for simple cases in a very tricky way as it relies on rule execution order: 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggregate operator is still unresolved. 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate operator resolved. 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns. In the example query, I put a datetime function `hour`, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts the next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns. See the demo below: ``` SELECT SUM(a) AS b, '2020-01-01 12:12:12' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10 ``` The query's result is ``` +---+-------------------+ | b| fake| +---+-------------------+ | 2|2020-01-01 12:12:12| +---+-------------------+ ``` But if we use the `hour` function, it will return an empty result. ``` SELECT SUM(a) AS b, hour('2020-01-01 12:12:12') AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10 ``` ### Does this PR introduce any user-facing change? Yes, bug fix for datetime functions in HAVING aggregate expressions. ### How was this patch tested? New UT added. Closes #28397 from xuanyuanking/SPARK-31519-backport. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 29 April 2020, 16:18:40 UTC
2828495 [SPARK-31563][SQL][FOLLOWUP] Create literals directly from Catalyst's internal value in InSet.sql ### What changes were proposed in this pull request? In the PR, I propose to simplify the code of `InSet.sql` and create `Literal` instances directly from Catalyst's internal values by using the default `Literal` constructor. ### Why are the changes needed? This simplifies code and avoids unnecessary conversions to external types. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing test `SPARK-31563: sql of InSet for UTF8String collection` in `ColumnExpressionSuite`. Closes #28399 from MaxGekk/fix-InSet-sql-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 86761861c28e1c854b4f78f1c078591b11b7daf3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 April 2020, 06:45:00 UTC
eab7b88 [SPARK-31589][INFRA][2.4] Use `r-lib/actions/setup-r` in GitHub Action ### What changes were proposed in this pull request? This PR aims to use `r-lib/actions/setup-r` because it's more stable and maintained by 3rd party. ### Why are the changes needed? This will recover the current outage. In addition, this will be more robust in the future. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the GitHub Actions, especially `Linter R` and `Generate Documents`. Closes #28384 from dongjoon-hyun/SPARK-31589-2.4. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 April 2020, 04:58:05 UTC
4a23bd0 [SPARK-31568][R][DOCS] Add detail about func/key in gapply to documentation Improve documentation for `gapply` in `SparkR`. Spent a long time this weekend trying to figure out just what exactly `key` is in `gapply`'s `func`. I had assumed it would be a _named_ list, but apparently not -- the examples are working because `schema` is applying the name and the names of the output `data.frame` don't matter. As near as I can tell the description I've added is correct, namely, that `key` is an unnamed list. No user-facing change in code; documentation only. Not tested beyond the documentation change. Closes #28350 from MichaelChirico/r-gapply-key-doc. Authored-by: Michael Chirico <michael.chirico@grabtaxi.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 27 April 2020, 08:04:44 UTC
9275865 [SPARK-31485][CORE][2.4] Avoid application hang if only partial barrier tasks launched ### What changes were proposed in this pull request? Use `TaskSetManger.abort` to abort a barrier stage instead of throwing exception within `resourceOffers`. ### Why are the changes needed? Any non fatal exception thrown within Spark RPC framework can be swallowed: https://github.com/apache/spark/blob/100fc58da54e026cda87832a10e2d06eaeccdf87/core/src/main/scala/org/apache/spark/rpc/netty/Inbox.scala#L202-L211 The method `TaskSchedulerImpl.resourceOffers` is also within the scope of Spark RPC framework. Thus, throw exception inside `resourceOffers` won't fail the application. As a result, if a barrier stage fail the require check at `require(addressesWithDescs.size == taskSet.numTasks, ...)`, the barrier stage will fail the check again and again util all tasks from `TaskSetManager` dequeued. But since the barrier stage isn't really executed, the application will hang. The issue can be reproduced by the following test: ```scala initLocalClusterSparkContext(2) val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2) val dep = new OneToOneDependency[Int](rdd0) val rdd = new MyRDD(sc, 2, List(dep), Seq(Seq("executor_h_0"),Seq("executor_h_0"))) rdd.barrier().mapPartitions { iter => BarrierTaskContext.get().barrier() iter }.collect() ``` ### Does this PR introduce any user-facing change? Yes, application hang previously but fail-fast after this fix. ### How was this patch tested? Added a regression test. Closes #28357 from Ngone51/bp-spark-31485-24. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 April 2020, 06:10:02 UTC
218c114 [SPARK-25595][2.4] Ignore corrupt Avro files if flag IGNORE_CORRUPT_FILES enabled ## What changes were proposed in this pull request? With flag `IGNORE_CORRUPT_FILES` enabled, schema inference should ignore corrupt Avro files, which is consistent with Parquet and Orc data source. ## How was this patch tested? Unit test Closes #28334 from gengliangwang/SPARK-25595-2.4. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 27 April 2020, 03:51:25 UTC
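A hedged sketch of the behavior this backport enables (the directory path is made up; assumes the spark-avro module is on the classpath):

```scala
// With ignoreCorruptFiles enabled, schema inference over a directory that
// contains a corrupt .avro file should skip the bad file instead of failing.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val df = spark.read.format("avro").load("/tmp/avro_dir_with_corrupt_files")
df.printSchema()
```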
61fc1f7 [SPARK-31552][SQL][2.4] Fix ClassCastException in ScalaReflection arrayClassFor This PR backports https://github.com/apache/spark/pull/28324 to branch-2.4 ### What changes were proposed in this pull request? the 2 method `arrayClassFor` and `dataTypeFor` in `ScalaReflection` call each other circularly, the cases in `dataTypeFor` are not fully handled in `arrayClassFor` For example: ```scala scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = ExpressionEncoder() newArrayEncoder: [T <: Array[_]](implicit evidence$1: reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T] scala> val decOne = Decimal(1, 38, 18) decOne: org.apache.spark.sql.types.Decimal = 1E-18 scala> val decTwo = Decimal(2, 38, 18) decTwo: org.apache.spark.sql.types.Decimal = 2E-18 scala> val decSpark = Array(decOne, decTwo) decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18) scala> Seq(decSpark).toDF() java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be cast to org.apache.spark.sql.types.ObjectType at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) at org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120) at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88) at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879) at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) at org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57) at newArrayEncoder(<console>:57) ... 53 elided scala> ``` In this PR, we add the missing cases to `arrayClassFor` ### Why are the changes needed? bugfix as described above ### Does this PR introduce any user-facing change? no ### How was this patch tested? add a test for array encoders Closes #28341 from yaooqinn/SPARK-31552-24. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 April 2020, 06:14:03 UTC
5e6bcca [SPARK-31563][SQL] Fix failure of InSet.sql for collections of Catalyst's internal types In the PR, I propose to fix the `InSet.sql` method for the cases when the input collection contains values of Catalyst's internal types, for instance `UTF8String`. Elements of the input set `hset` are converted to Scala types and wrapped in `Literal` to properly form the SQL view of the input collection. The change fixes the bug in `InSet.sql` that made a wrong assumption about the types of collection elements. See more details in SPARK-31563. Highly likely, this introduces no user-facing change. Added a test to `ColumnExpressionSuite`. Closes #28343 from MaxGekk/fix-InSet-sql. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7d8216a6642f40af0d1b623129b1d5f4c86bec68) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 25 April 2020, 16:35:35 UTC
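A minimal sketch of the approach described above, using Catalyst's internal APIs (illustrative only, not the exact code of the patch):

```scala
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// Internal values, as they would appear in InSet.hset for a string column.
val hset: Set[Any] = Set(UTF8String.fromString("a"), UTF8String.fromString("b"))

// Convert each internal value back to an external Scala type and let Literal
// render it, instead of interpolating the raw internal value into the SQL text.
val rendered = hset.map(v => Literal(CatalystTypeConverters.convertToScala(v, StringType)).sql)
println(rendered.mkString("(", ", ", ")"))  // e.g. ('a', 'b')
```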
0477f21 [SPARK-31532][SPARK-31234][SQL][2.4][FOLLOWUP] Use lowercases for GLOBAL_TEMP_DATABASE config in SparkSessionBuilderSuite ### What changes were proposed in this pull request? This PR intends to fix test code for using lowercases for the `GLOBAL_TEMP_DATABASE` config in `SparkSessionBuilderSuite`. The handling of the config is different between branch-3.0+ and branch-2.4. In branch-3.0+, Spark always lowercases a value in the config, so I think we had better always use lowercases for it in the test. This comes from the dongjoon-hyun comment: https://github.com/apache/spark/pull/28316#issuecomment-619303160 ### Why are the changes needed? To fix the test failure in branch-2.4. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Fixed the test. Closes #28339 from maropu/SPARK-31532. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 25 April 2020, 05:23:57 UTC
a2a0c52 [SPARK-31532][SQL] Builder should not propagate static sql configs to the existing active or default SparkSession ### What changes were proposed in this pull request? SparkSessionBuilder should not propagate static SQL configurations to the existing active/default SparkSession. This seems to be a long-standing bug. ```scala scala> spark.sql("set spark.sql.warehouse.dir").show +--------------------+--------------------+ | key| value| +--------------------+--------------------+ |spark.sql.warehou...|file:/Users/kenty...| +--------------------+--------------------+ scala> spark.sql("set spark.sql.warehouse.dir=2"); org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.warehouse.dir; at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154) at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42) at org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100) at org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602) ... 47 elided scala> import org.apache.spark.sql.SparkSession import org.apache.spark.sql.SparkSession scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get getClass getOrCreate scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").getOrCreate 20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect. res7: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6403d574 scala> spark.sql("set spark.sql.warehouse.dir").show +--------------------+-----+ | key|value| +--------------------+-----+ |spark.sql.warehou...| xyz| +--------------------+-----+ scala> ``` ### Why are the changes needed? Bug fix as shown in the previous section. ### Does this PR introduce any user-facing change? Yes, static SQL configurations set with SparkSession.builder.config do not propagate to any existing or new SparkSession instances. ### How was this patch tested? New unit test. Closes #28316 from yaooqinn/SPARK-31532. 
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 8424f552293677717da7411ed43e68e73aa7f0d6) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 24 April 2020, 23:53:48 UTC
2fefb60 [SPARK-30199][DSTREAM] Recover `spark.(ui|blockManager).port` from checkpoint ### What changes were proposed in this pull request? This is a backport of #26827. This PR aims to recover `spark.ui.port` and `spark.blockManager.port` from checkpoint like `spark.driver.port`. ### Why are the changes needed? When the user configures these values, we can respect them. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #28320 from dongjoon-hyun/SPARK-30199-2.4. Authored-by: Aaruna <aaruna@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 April 2020, 04:23:46 UTC
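A hedged sketch of the setup this backport targets (checkpoint path and port values are made up; only the config keys come from the commit):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf()
    .setAppName("checkpointed-streaming-app")
    .set("spark.ui.port", "4050")             // user-chosen ports that should
    .set("spark.blockManager.port", "31111")  // survive recovery from a checkpoint
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("/tmp/streaming-checkpoint")
  ssc  // input streams and output operations would be defined here in a real app
}

// On restart, the context recovered from the checkpoint now respects the ports above.
val ssc = StreamingContext.getOrCreate("/tmp/streaming-checkpoint", createContext _)
```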
5183984 [SPARK-31503][SQL][2.4] fix the SQL string of the TRIM functions backport https://github.com/apache/spark/pull/28281 to 2.4 This backport has one difference: there is no `EXTRACT(... FROM ...)` SQL syntax in 2.4, so this PR just uses the common function call syntax. Closes #28299 from cloud-fan/pick. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 April 2020, 16:35:45 UTC
43cc620 [SPARK-31256][SQL] DataFrameNaFunctions.drop should work for nested columns For example, for the following `df`: ``` val schema = new StructType() .add("c1", new StructType() .add("c1-1", StringType) .add("c1-2", StringType)) val data = Seq(Row(Row(null, "a2")), Row(Row("b1", "b2")), Row(null)) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.show +--------+ | c1| +--------+ | [, a2]| |[b1, b2]| | null| +--------+ ``` In Spark 2.4.4, ``` df.na.drop("any", Seq("c1.c1-1")).show +--------+ | c1| +--------+ |[b1, b2]| +--------+ ``` In Spark 2.4.5 or Spark 3.0.0-preview2, if nested columns are specified, they are ignored. ``` df.na.drop("any", Seq("c1.c1-1")).show +--------+ | c1| +--------+ | [, a2]| |[b1, b2]| | null| +--------+ ``` This seems like a regression. Now, the nested column can be specified: ``` df.na.drop("any", Seq("c1.c1-1")).show +--------+ | c1| +--------+ |[b1, b2]| +--------+ ``` Also, if `*` is specified as a column, it will throw an `AnalysisException` that `*` cannot be resolved, which was the behavior in 2.4.4. Currently, in master, it has no effect. Updated existing tests. Closes #28266 from imback82/SPARK-31256. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d7499aed9cc943304e2ec89379d3651410f6ca90) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 April 2020, 03:07:24 UTC
1af3b48 [SPARK-31234][SQL][2.4] ResetCommand should not affect static SQL Configuration ### What changes were proposed in this pull request? This PR is to backport the fix of https://github.com/apache/spark/pull/28003, add a migration guide, update the PR description and add an end-to-end test case. Before this PR, the SQL `RESET` command will reset the values of static SQL configuration to the default and remove the cached values of Spark Context Configurations in the current session. This PR fixes the bugs. After this PR, the `RESET` command follows its definition and only updates the runtime SQL configuration values to the default. ### Why are the changes needed? When we introduced the feature of Static SQL Configuration, we did not update the implementation of SQL `RESET` command. The static SQL configuration should not be changed by any command at runtime. However, the `RESET` command resets the values to the default. We should fix them. ### Does this PR introduce any user-facing change? Before Spark 2.4.6, the `RESET` command resets both the runtime and static SQL configuration values to the default. It also removes the cached values of Spark Context Configurations in the current session, although these configuration values are for displaying/querying only. ### How was this patch tested? Added an end-to-end test and a unit test Closes #28262 from gatorsmile/spark-31234followup2.4. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 April 2020, 23:24:56 UTC
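A hedged illustration of the documented behavior (run in spark-shell; the config keys are standard Spark SQL configs, not taken from the patch):

```scala
// Runtime SQL configs are restored to their defaults by RESET...
spark.sql("SET spark.sql.shuffle.partitions=10")
spark.sql("RESET")
spark.sql("SET spark.sql.shuffle.partitions").show(truncate = false)  // back to the default (200)

// ...while static SQL configs such as spark.sql.warehouse.dir are left untouched by RESET.
spark.sql("SET spark.sql.warehouse.dir").show(truncate = false)
```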
9416b7c Apply appropriate RPC handler to receive, receiveStream when auth enabled 18 April 2020, 15:20:43 UTC
ea75c15 [SPARK-31420][WEBUI][FOLLOWUP] Make locale of timeline-view 'en' ### What changes were proposed in this pull request? This change explicitly sets the locale of the timeline view to 'en' to keep the same appearance as before upgrading vis-timeline. ### Why are the changes needed? We upgraded vis-timeline in #28192 and the upgraded version differs from the one used before in its notation of dates. The notation seems to be dependent on locale. The following is the appearance in my Japanese environment. <img width="557" alt="locale-changed" src="https://user-images.githubusercontent.com/4736016/79265314-de886700-7ed0-11ea-8641-fa76b993c0d9.png"> Although the notation is in Japanese, the default format is a little bit unnatural (e.g. 4月9日 05:39 is natural rather than 9 四月 05:39). I found we can get the same appearance as before by explicitly setting the locale to 'en'. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? I visited JobsPage, JobPage and StagePage and confirmed that the timeline view shows dates with the 'en' locale. 
<img width="735" alt="fix-date-appearance" src="https://user-images.githubusercontent.com/4736016/79267107-8bfc7a00-7ed3-11ea-8a25-f6681d04a83c.png"> NOTE: #28192 will be backported to branch-2.4 and branch-3.0, so this PR should follow #28214 and #28213. Closes #28218 from sarutak/fix-locale-issue. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit df27350142d81a3e8941939870bfc0ab50e37a43) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 16 April 2020, 17:32:37 UTC
d34590c [SPARK-31441][PYSPARK][SQL][2.4] Support duplicated column names for toPandas with arrow execution ### What changes were proposed in this pull request? This is to backport #28210. This PR is adding support duplicated column names for `toPandas` with Arrow execution. ### Why are the changes needed? When we execute `toPandas()` with Arrow execution, it fails if the column names have duplicates. ```py >>> spark.sql("select 1 v, 1 v").toPandas() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas pdf = table.to_pandas() File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager columns = _deserialize_column_index(table, all_columns, column_indexes) File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index columns = _flatten_single_level_multiindex(columns) File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex raise ValueError('Found non-unique column index') ValueError: Found non-unique column index ``` ### Does this PR introduce any user-facing change? Yes, previously we will face an error above, but after this PR, we will see the result: ```py >>> spark.sql("select 1 v, 1 v").toPandas() v v 0 1 1 ``` ### How was this patch tested? Added and modified related tests. Closes #28221 from ueshin/issues/SPARK-31441/2.4/to_pandas. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 15 April 2020, 18:41:43 UTC
49abdc4 [SPARK-31186][PYSPARK][SQL][2.4] toPandas should not fail on duplicate column names ### What changes were proposed in this pull request? When `toPandas` API works on duplicate column names produced from operators like join, we see the error like: ``` ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). ``` This patch fixes the error in `toPandas` API. This is the backport of original patch to branch-2.4. ### Why are the changes needed? To make `toPandas` work on dataframe with duplicate column names. ### Does this PR introduce any user-facing change? Yes. Previously calling `toPandas` API on a dataframe with duplicate column names will fail. After this patch, it will produce correct result. ### How was this patch tested? Unit test. Closes #28219 from viirya/SPARK-31186-2.4. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 15 April 2020, 04:57:23 UTC
775e958 [SPARK-31420][WEBUI][2.4] Infinite timeline redraw in job details page ### What changes were proposed in this pull request? This PR backports #28192 to branch-2.4. ### Why are the changes needed? SPARK-31420 affects branch-2.4 too. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Same as #28192 Closes #28214 from sarutak/SPARK-31420-branch-2.4. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 15 April 2020, 04:33:22 UTC
1c65b1d [SPARK-31422][CORE][FOLLOWUP] Fix a test case compilation error 11 April 2020, 15:53:57 UTC
26d5e8f [SPARK-31422][CORE] Fix NPE when BlockManagerSource is used after BlockManagerMaster stops ### What changes were proposed in this pull request? This PR (SPARK-31422) aims to return empty result in order to avoid `NullPointerException` at `getStorageStatus` and `getMemoryStatus` which happens after `BlockManagerMaster` stops. The empty result is consistent with the current status of `SparkContext` because `BlockManager` and `BlockManagerMaster` is already stopped. ### Why are the changes needed? In `SparkEnv.stop`, the following stop sequence is used and `metricsSystem.stop` invokes `sink.stop`. ``` blockManager.master.stop() metricsSystem.stop() --> sinks.foreach(_.stop) ``` However, some sink can invoke `BlockManagerSource` and ends up with `NullPointerException` because `BlockManagerMaster` is already stopped and `driverEndpoint` became `null`. ``` java.lang.NullPointerException at org.apache.spark.storage.BlockManagerMaster.getStorageStatus(BlockManagerMaster.scala:170) at org.apache.spark.storage.BlockManagerSource$$anonfun$10.apply(BlockManagerSource.scala:63) at org.apache.spark.storage.BlockManagerSource$$anonfun$10.apply(BlockManagerSource.scala:63) at org.apache.spark.storage.BlockManagerSource$$anon$1.getValue(BlockManagerSource.scala:31) at org.apache.spark.storage.BlockManagerSource$$anon$1.getValue(BlockManagerSource.scala:30) ``` Since `SparkContext` registers and forgets `BlockManagerSource` without deregistering, we had better avoid `NullPointerException` inside `BlockManagerMaster` preventively. ```scala _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager)) ``` ### Does this PR introduce any user-facing change? Yes. This will remove NPE for the users who uses `BlockManagerSource`. ### How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #28187 from dongjoon-hyun/SPARK-31422. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a6e6fbf2ca23e51d43f175907ce6f29c946e1acf) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 April 2020, 15:28:25 UTC
9657575 [SPARK-31382][BUILD] Show a better error message for different python and pip installation mistake ### What changes were proposed in this pull request? This PR proposes to show a better error message when a user mistakenly installs `pyspark` from PIP but the default `python` does not point out the corresponding `pip`. See https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560 as an example. It can be reproduced as below: I have two Python executables. `python` is Python 3.7, `pip` binds with Python 3.7 and `python2.7` is Python 2.7. ```bash pip install pyspark ``` ```bash pyspark ``` ``` ... Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.4.5 /_/ Using Python version 3.7.3 (default, Mar 27 2019 09:23:15) SparkSession available as 'spark'. ... ``` ```bash PYSPARK_PYTHON=python2.7 pyspark ``` ``` Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin'] /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 24: /bin/load-spark-env.sh: No such file or directory /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 77: /bin/spark-submit: No such file or directory /usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin/pyspark: line 77: exec: /bin/spark-submit: cannot execute: No such file or directory ``` ### Why are the changes needed? There are multiple questions outside about this error and they have no idea what's going on. See: - https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560 - https://stackoverflow.com/questions/45991888/path-issue-could-not-find-valid-spark-home-while-searching - https://stackoverflow.com/questions/49707239/pyspark-could-not-find-valid-spark-home - https://stackoverflow.com/questions/55569985/pyspark-could-not-find-valid-spark-home - https://stackoverflow.com/questions/48296474/error-could-not-find-valid-spark-home-while-searching-pycharm-in-windows - https://github.com/ContinuumIO/anaconda-issues/issues/8076 The answer is usually setting `SPARK_HOME`; however this isn't completely correct. It works if you set `SPARK_HOME` because `pyspark` executable script directly imports the library by using `SPARK_HOME` (see https://github.com/apache/spark/blob/master/bin/pyspark#L52-L53) instead of the default package location specified via `python` executable. So, this way you use a package installed in a different Python, which isn't ideal. ### Does this PR introduce any user-facing change? Yes, it fixes the error message better. **Before:** ``` Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin'] ... ``` **After:** ``` Could not find valid SPARK_HOME while searching ['/Users', '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/bin'] Did you install PySpark via a package manager such as pip or Conda? If so, PySpark was not found in your Python environment. It is possible your Python environment does not properly bind with your package manager. Please check your default 'python' and if you set PYSPARK_PYTHON and/or PYSPARK_DRIVER_PYTHON environment variables, and see if you can import PySpark, for example, 'python -c 'import pyspark'. 
If you cannot import, you can install by using the Python executable directly, for example, 'python -m pip install pyspark [--user]'. Otherwise, you can also explicitly set the Python executable, that has PySpark installed, to PYSPARK_PYTHON or PYSPARK_DRIVER_PYTHON environment variables, for example, 'PYSPARK_PYTHON=python3 pyspark'. ... ``` ### How was this patch tested? Manually tested as described above. Closes #28152 from HyukjinKwon/SPARK-31382. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 0248b329726527a07d688122f56dd2ada0e51337) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 09 April 2020, 02:05:11 UTC
1a2c233 [SPARK-31327][SQL][2.4] Write Spark version into Avro file metadata Backport https://github.com/apache/spark/commit/6b1ca886c0066f4e10534336f3fce64cdebc79a5, similar to https://github.com/apache/spark/pull/28142 ### What changes were proposed in this pull request? Write Spark version into Avro file metadata ### Why are the changes needed? The version info is very useful for backward compatibility. This is also done in parquet/orc. ### Does this PR introduce any user-facing change? no ### How was this patch tested? new test Closes #28150 from cloud-fan/pick. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 April 2020, 15:27:06 UTC
fd6b0bc [SPARK-25102][SQL][2.4] Write Spark version to ORC/Parquet file metadata ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/22932 . Currently, Spark writes Spark version number into Hive Table properties with `spark.sql.create.version`. ``` parameters:{ spark.sql.sources.schema.part.0={ "type":"struct", "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}] }, transient_lastDdlTime=1541142761, spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.4.0 } ``` This PR aims to write Spark versions to ORC/Parquet file metadata with `org.apache.spark.sql.create.version` because we used `org.apache.` prefix in Parquet metadata already. It's different from Hive Table property key `spark.sql.create.version`, but it seems that we cannot change Hive Table property for backward compatibility. After this PR, ORC and Parquet file generated by Spark will have the following metadata. **ORC (`native` and `hive` implmentation)** ``` $ orc-tools meta /tmp/o File Version: 0.12 with ... ... User Metadata: org.apache.spark.sql.create.version=3.0.0 ``` **PARQUET** ``` $ parquet-tools meta /tmp/p ... creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) extra: org.apache.spark.sql.create.version = 3.0.0 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]} ``` ### Why are the changes needed? This backport helps us handle this files differently in Apache Spark 3.0.0. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with newly added test cases. Closes #28142 from dongjoon-hyun/SPARK-25102-2.4. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 April 2020, 20:09:30 UTC
aa9701b [SPARK-31231][BUILD] Unset setuptools version in pip packaging test ### What changes were proposed in this pull request? This PR unsets `setuptools` version in CI. This was fixed in the 0.46.1.2+ `setuptools` - pypa/setuptools#2046. `setuptools` 0.46.1.0 and 0.46.1.1 still have this problem. ### Why are the changes needed? To test the latest setuptools out to see if users can actually install and use it. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Jenkins will test. Closes #28111 from HyukjinKwon/SPARK-31231-revert. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 9f58f0385758f31179943a681e4353eb9279cb20) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 April 2020, 23:09:48 UTC
22e0a5a [SPARK-31312][SQL][2.4] Cache Class instance for the UDF instance in HiveFunctionWrapper ### What changes were proposed in this pull request? This patch proposes to cache Class instance for the UDF instance in HiveFunctionWrapper to fix the case where Hive simple UDF is somehow transformed (expression is copied) and evaluated later with another classloader (for the case current thread context classloader is somehow changed). In this case, Spark throws CNFE as of now. It's only occurred for Hive simple UDF, as HiveFunctionWrapper caches the UDF instance whereas it doesn't do for `UDF` type. The comment says Spark has to create instance every time for UDF, so we cannot simply do the same. This patch caches Class instance instead, and switch current thread context classloader to which loads the Class instance. This patch extends the test boundary as well. We only tested with GenericUDTF for SPARK-26560, and this patch actually requires only UDF. But to avoid regression for other types as well, this patch adds all available types (UDF, GenericUDF, AbstractGenericUDAFResolver, UDAF, GenericUDTF) into the boundary of tests. Credit to cloud-fan as he discovered the problem and proposed the solution. ### Why are the changes needed? Above section describes why it's a bug and how it's fixed. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New UTs added. Closes #28086 from HeartSaVioR/SPARK-31312-branch-2.4. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 April 2020, 03:38:53 UTC
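A minimal sketch of the caching idea described above (class and member names are illustrative, not Spark's actual HiveFunctionWrapper implementation):

```scala
// Cache the resolved Class once, so that later evaluations of a copied expression
// do not re-resolve the UDF class against a different thread context classloader.
class FunctionWrapper(functionClassName: String) extends Serializable {
  @transient private var cachedClass: Class[_] = _

  private def udfClass: Class[_] = {
    if (cachedClass == null) {
      cachedClass = Thread.currentThread().getContextClassLoader.loadClass(functionClassName)
    }
    cachedClass
  }

  // A fresh UDF instance is still created on each call; only the Class lookup is cached.
  def createFunction[T](): T =
    udfClass.getConstructor().newInstance().asInstanceOf[T]
}
```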
e226f68 [SPARK-31306][DOCS] update rand() function documentation to indicate exclusive upper bound ### What changes were proposed in this pull request? A small documentation change to clarify that the `rand()` function produces values in `[0.0, 1.0)`. ### Why are the changes needed? `rand()` uses `Rand()` - which generates values in [0, 1) ([documented here](https://github.com/apache/spark/blob/a1dbcd13a3eeaee50cc1a46e909f9478d6d55177/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala#L71)). The existing documentation suggests that 1.0 is a possible value returned by rand (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so `U[0.0, 1.0]` suggests the value returned could include 1.0). ### Does this PR introduce any user-facing change? Only documentation changes. ### How was this patch tested? Documentation changes only. Closes #28071 from Smeb/master. Authored-by: Ben Ryves <benjamin.ryves@getyourguide.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 31 March 2020, 06:17:05 UTC
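A quick empirical check of the documented range (illustrative; run in spark-shell):

```scala
import org.apache.spark.sql.functions.{max, min, rand}

// rand() draws from U[0.0, 1.0): the observed maximum stays strictly below 1.0.
spark.range(1000000)
  .select(rand(seed = 42).as("r"))
  .agg(min("r"), max("r"))
  .show(truncate = false)
```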
4add8ad [SPARK-31101][BUILD][2.4] Upgrade Janino to 3.0.16 ### What changes were proposed in this pull request? This PR(SPARK-31101) proposes to upgrade Janino to 3.0.16 which is released recently. * Merged pull request janino-compiler/janino#114 "Grow the code for relocatables, and do fixup, and relocate". Please see the commit log. - https://github.com/janino-compiler/janino/commits/3.0.16 You can see the changelog from the link: http://janino-compiler.github.io/janino/changelog.html / though release note for Janino 3.0.16 is actually incorrect. ### Why are the changes needed? We got some report on failure on user's query which Janino throws error on compiling generated code. The issue is here: janino-compiler/janino#113 It contains the information of generated code, symptom (error), and analysis of the bug, so please refer the link for more details. Janino 3.0.16 contains the PR janino-compiler/janino#114 which would enable Janino to succeed to compile user's query properly. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Below test code fails on branch-2.4 and passes with this patch. (Note that there seems to be the case where another UT affects this UT to not fail - adding this to SQLQuerySuite won't fail this UT, but adding this to DateFunctionsSuite will fail this UT, and if you run this UT solely in SQLQuerySuite via `build/sbt "sql/testOnly *.SQLQuerySuite -- -z SPARK-31115"` then it fails.) ``` /** * NOTE: The test code tries to control the size of for/switch statement in expand_doConsume, * as well as the overall size of expand_doConsume, so that the query triggers known Janino * bug - https://github.com/janino-compiler/janino/issues/113. * * The expected exception message from Janino when we use switch statement for "ExpandExec": * - "Operand stack inconsistent at offset xxx: Previous size 1, now 0" * which will not happen when we use if-else-if statement for "ExpandExec". * * "The number of fields" and "The number of distinct aggregation functions" are the major * factors to increase the size of generated code: while these values should be large enough * to trigger the Janino bug, these values should not also too big; otherwise one of below * exceptions might be thrown: * - "expand_doConsume would be beyond 64KB" * - "java.lang.ClassFormatError: Too many arguments in method signature in class file" */ test("SPARK-31115 Lots of columns and distinct aggregations shouldn't break code generation") { withSQLConf( (SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true"), (SQLConf.WHOLESTAGE_MAX_NUM_FIELDS.key, "10000"), (SQLConf.CODEGEN_FALLBACK.key, "false"), (SQLConf.CODEGEN_LOGGING_MAX_LINES.key, "-1") ) { var df = Seq(("1", "2", 1), ("1", "2", 2), ("2", "3", 3), ("2", "3", 4)).toDF("a", "b", "c") // The value is tested under commit "244405fe57d7737d81c34ba9e8917df6285889eb": // the query fails with switch statement, whereas it passes with if-else statement. // Note that the value depends on the Spark logic as well - different Spark versions may // require different value to ensure the test failing with switch statement. 
val numNewFields = 100 df = df.withColumns( (1 to numNewFields).map { idx => s"a$idx" }, (1 to numNewFields).map { idx => when(col("c").mod(lit(2)).===(lit(0)), lit(idx)).otherwise(col("c")) } ) val aggExprs: Array[Column] = Range(1, numNewFields).map { idx => if (idx % 2 == 0) { coalesce(countDistinct(s"a$idx"), lit(0)) } else { coalesce(count(s"a$idx"), lit(0)) } }.toArray val aggDf = df .groupBy("a", "b") .agg(aggExprs.head, aggExprs.tail: _*) // We are only interested in whether the code compilation fails or not, so skipping // verification on outputs. aggDf.collect() } } ``` Closes #27997 from HeartSaVioR/SPARK-31101-branch-2.4. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 March 2020, 02:25:57 UTC
f05ac28 [SPARK-31293][DSTREAMS][KINESIS][DOC] Fix wrong examples and help messages for Kinesis integration This PR (SPARK-31293) fixes wrong command examples, parameter descriptions and help message format for Amazon Kinesis integration with Spark Streaming, to improve the usability of those commands. No user-facing change. I ran the fixed commands manually and confirmed they worked as expected. Closes #28063 from sekikn/SPARK-31293. Authored-by: Kengo Seki <sekikn@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 60dd1a690fed62b1d6442cdc8cf3f89ef4304d5a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 29 March 2020, 21:29:56 UTC
801d6a9 [SPARK-31261][SQL] Avoid npe when reading bad csv input with `columnNameCorruptRecord` specified ### What changes were proposed in this pull request? SPARK-25387 avoids npe for bad csv input, but when reading bad csv input with `columnNameCorruptRecord` specified, `getCurrentInput` is called and it still throws npe. ### Why are the changes needed? Bug fix. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Add a test. Closes #28029 from wzhfy/corrupt_column_npe. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 791d2ba346f3358fc280adbbbe27f2cd50fd3732) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 29 March 2020, 04:30:43 UTC
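A hedged sketch of the read pattern the fix covers (path and schema are made up; the DataFrameReader option is spelled `columnNameOfCorruptRecord`):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType()
  .add("a", IntegerType)
  .add("b", IntegerType)
  .add("_corrupt_record", StringType)  // receives the raw text of malformed rows

val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("/tmp/bad_input.csv")

df.show(truncate = false)
```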
4217f75 Revert "[SPARK-31258][BUILD] Pin the avro version in SBT" This reverts commit 916a25a46bca7196416372bacc3fc260a6ef658f. 26 March 2020, 20:49:39 UTC
844f207 [SPARK-31231][BUILD][FOLLOW-UP] Set the upper bound (before 46.1.0) for setuptools in pip package test This PR is a followup of apache/spark#27995. Rather than pinning the setuptools version, it sets an upper bound so that Python 3.5 with branch-2.4 tests can pass too. This is to make the CI build stable. No user-facing change; dev-only. Jenkins will test. Closes #28005 from HyukjinKwon/investigate-pip-packaging-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 178d472e1d9f9f61fa54866d00d0a5b88ee87619) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 March 2020, 03:34:12 UTC
916a25a [SPARK-31258][BUILD] Pin the avro version in SBT Add the avro dependency in SparkBuild to fix sbt unidoc failures like https://github.com/apache/spark/pull/28017#issuecomment-603828597: ```scala [warn] Multiple main classes detected. Run 'show discoveredMainClasses' to see the list [warn] Multiple main classes detected. Run 'show discoveredMainClasses' to see the list [info] Main Scala API documentation to /home/jenkins/workspace/SparkPullRequestBuilder6/target/scala-2.12/unidoc... [info] Main Java API documentation to /home/jenkins/workspace/SparkPullRequestBuilder6/target/javaunidoc... [error] /home/jenkins/workspace/SparkPullRequestBuilder6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123: value createDatumWriter is not a member of org.apache.avro.generic.GenericData [error] writerCache.getOrElseUpdate(schema, GenericData.get.createDatumWriter(schema)) [error] ^ [info] No documentation generated with unsuccessful compiler run [error] one error found ``` No user-facing change. Passed Jenkins and verified manually with `sbt dependencyTree`: ```scala kentyaohulk  ~/spark   dep  build/sbt dependencyTree | grep avro | grep -v Resolving [info] +-org.apache.avro:avro-mapred:1.8.2 [info] | +-org.apache.avro:avro-ipc:1.8.2 [info] | | +-org.apache.avro:avro:1.8.2 [info] +-org.apache.avro:avro:1.8.2 [info] | | +-org.apache.avro:avro:1.8.2 [info] org.apache.spark:spark-avro_2.12:3.1.0-SNAPSHOT [S] [info] | | | +-org.apache.avro:avro-mapred:1.8.2 [info] | | | | +-org.apache.avro:avro-ipc:1.8.2 [info] | | | | | +-org.apache.avro:avro:1.8.2 [info] | | | +-org.apache.avro:avro:1.8.2 [info] | | | | | +-org.apache.avro:avro:1.8.2 ``` Closes #28020 from yaooqinn/dep. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 336621e277794ab2e6e917391a928ec662498fcf) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 March 2020, 01:50:26 UTC
4381ad5 [SPARK-30494][SQL][2.4] Fix cached data leakage during replacing an existing view ### What changes were proposed in this pull request? This is backport of #27185 to branch-2.4. The cached RDD for plan "select 1" stays in memory forever until the session close. This cached data cannot be used since the view temp1 has been replaced by another plan. It's a memory leak. We can reproduce by below commands: ``` Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("create or replace temporary view temp1 as select 1") scala> spark.sql("cache table temp1") scala> spark.sql("create or replace temporary view temp1 as select 1, 2") scala> spark.sql("cache table temp1") scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 2")).isDefined) scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1")).isDefined) ``` ### Why are the changes needed? Fix the memory leak, specially for long running mode. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Add a unit test. Closes #28000 from LantaoJin/SPARK-30494_2.4. Lead-authored-by: lajin <lajin@ebay.com> Co-authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 March 2020, 20:29:07 UTC
e37f664 Revert "[SPARK-31231][BUILD] Explicitly setuptools version as 46.0.0 in pip package test" This reverts commit 223b9fb1eadeba0e05b1a300512c31c4f99f41e8. Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 March 2020, 18:55:57 UTC
223b9fb [SPARK-31231][BUILD] Explicitly setuptools version as 46.0.0 in pip package test ### What changes were proposed in this pull request? For a bit of background, the PIP packaging test started to fail (see [these logs](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/)) as of the setuptools 46.1.0 release. In https://github.com/pypa/setuptools/issues/1424, they decided not to keep the modes in `package_data`. In the PySpark pip installation, we keep the executable scripts in `package_data` https://github.com/apache/spark/blob/fc4e56a54c15e20baf085e6061d3d83f5ce1185d/python/setup.py#L199-L200, and expose their symbolic links as executable scripts. So, the symbolic links (or copied scripts) execute the scripts copied from `package_data`, which don't have the executable permission in their mode: ``` /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: Permission denied /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: cannot execute: Permission denied ``` The current issue is being tracked at https://github.com/pypa/setuptools/issues/2041 For what this PR proposes: It sets the upper bound in the PR builder for now to unblock other PRs. _This PR does not solve the issue yet. I will make a fix after monitoring https://github.com/pypa/setuptools/issues/2041_ ### Why are the changes needed? It currently affects users who use the latest setuptools. So, _users seem unable to use PySpark with the latest setuptools._ See also https://github.com/pypa/setuptools/issues/2041#issuecomment-602566667 ### Does this PR introduce any user-facing change? It makes CI pass for now. No user-facing change yet. ### How was this patch tested? Jenkins will test. Closes #27995 from HyukjinKwon/investigate-pip-packaging. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit c181c45f863ba55e15ab9b41f635ffbddad9bac0) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 24 March 2020, 09:00:25 UTC
244405f [SPARK-26293][SQL][2.4] Cast exception when having python udf in subquery ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/23248, which appears to have been mistakenly left unbackported. This is a regression introduced by https://github.com/apache/spark/pull/22104 at Spark 2.4.0. When we have a Python UDF in a subquery, we will hit an exception ``` Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.AttributeReference cannot be cast to org.apache.spark.sql.catalyst.expressions.PythonUDF at scala.collection.immutable.Stream.map(Stream.scala:414) at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:98) at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:815) ... ``` https://github.com/apache/spark/pull/22104 turned `ExtractPythonUDFs` from a physical rule into an optimizer rule. However, there is a difference between a physical rule and an optimizer rule. A physical rule always runs once, while an optimizer rule may be applied twice on a query tree even if the rule is located in a batch that only runs once. For a subquery, the `OptimizeSubqueries` rule will execute the entire optimizer on the query plan inside the subquery. Later on, the subquery will be turned into joins, and the optimizer rules will be applied to it again. Unfortunately, the `ExtractPythonUDFs` rule is not idempotent. When it's applied twice on a query plan inside a subquery, it will produce a malformed plan. It extracts Python UDFs from Python exec plans. This PR proposes 2 changes to be doubly safe: 1. `ExtractPythonUDFs` should skip python exec plans, to make the rule idempotent 2. `ExtractPythonUDFs` should skip subqueries ## How was this patch tested? A new test. Closes #27960 from HyukjinKwon/backport-SPARK-26293. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 20 March 2020, 03:16:38 UTC
73cc8b5 [SPARK-31164][SQL][2.4] Inconsistent rdd and output partitioning for bucket table when output doesn't contain all bucket columns ### What changes were proposed in this pull request? This is a backport for [pr#27924](https://github.com/apache/spark/pull/27924). For a bucketed table, when deciding output partitioning, if the output doesn't contain all bucket columns, the result is `UnknownPartitioning`. But when generating rdd, current Spark uses `createBucketedReadRDD` because it doesn't check if the output contains all bucket columns. So the rdd and its output partitioning are inconsistent. ### Why are the changes needed? To fix a bug. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Modified existing tests. Closes #27934 from wzhfy/inconsistent_rdd_partitioning_2.4. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 March 2020, 12:17:50 UTC
6a60c66 [MINOR][SQL] Update the DataFrameWriter.bucketBy comment ### What changes were proposed in this pull request? This PR intends to update the `DataFrameWriter.bucketBy` comment to clearly describe that the bucketBy scheme follows a Spark-specific one. I saw questions from users about the current bucketing compatibility with Hive in [SPARK-31162](https://issues.apache.org/jira/browse/SPARK-31162?focusedCommentId=17060408&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17060408) and [SPARK-17495](https://issues.apache.org/jira/browse/SPARK-17495?focusedCommentId=17059847&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17059847), and IMHO the comment is a bit confusing to users about the compatibility. ### Why are the changes needed? To help users understand it clearly. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #27930 from maropu/UpdateBucketByComment. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 124b4ce2e6e8f84294f8fc13d3e731a82325dacb) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 March 2020, 07:53:38 UTC
26ad3fe [SPARK-31163][SQL] TruncateTableCommand with acl/permission should handle non-existent path ### What changes were proposed in this pull request? This fixes #26956: wrap a try-catch around `fs.getFileStatus(path)` within the acl/permission handling, in case the path doesn't exist. ### Why are the changes needed? `truncate table` may fail to re-create the path in case of interruption or something else. As a result, the next time we `truncate table` on the same table with acl/permission, it will fail due to `FileNotFoundException`. It also brings a behavior change compared to previous Spark versions, which could still `truncate table` successfully even if the path doesn't exist. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Closes #27923 from Ngone51/fix_truncate. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit cb26f636b08aea4c5c6bf5035a359cd3cbf335c0) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 March 2020, 18:45:58 UTC
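A minimal sketch of the guarded pattern described above, using only the public Hadoop `FileSystem` API; this is illustrative, not the actual `TruncateTableCommand` code, and the helper name is made up:

```scala
import java.io.FileNotFoundException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

// Only read the permission when the path actually exists; otherwise there is nothing to preserve.
def permissionIfExists(path: Path, hadoopConf: Configuration): Option[FsPermission] = {
  val fs: FileSystem = path.getFileSystem(hadoopConf)
  try {
    Some(fs.getFileStatus(path).getPermission)
  } catch {
    case _: FileNotFoundException => None // path was never (re)created after a failed truncate
  }
}
```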
51ccb6f [SPARK-31144][SQL][2.4] Wrap Error with QueryExecutionException to notify QueryExecutionListener ### What changes were proposed in this pull request? When `java.lang.Error` is thrown during the execution, `ExecutionListenerManager` will wrap it with `QueryExecutionException` so that we can send it to `QueryExecutionListener.onFailure`, which only accepts `Exception`. ### Why are the changes needed? If `java.lang.Error` is thrown during the execution, QueryExecutionListener doesn't get notified right now. ### Does this PR introduce any user-facing change? No ### How was this patch tested? The newly added unit test. Closes #27904 from zsxwing/SPARK-31144-2.4. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 March 2020, 20:58:12 UTC
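For context, a minimal sketch of registering such a listener in 2.4 (illustrative only, not code from the patch): `onFailure` accepts an `Exception`, which is why a `java.lang.Error` has to be wrapped before it can be delivered.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().master("local[*]").appName("listener-example").getOrCreate()
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName succeeded in ${durationNs / 1e6} ms")
  // Only Exception fits here in 2.4, so an Error must arrive wrapped (e.g. in QueryExecutionException).
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: $exception")
})
```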
e6bcaaa [SPARK-31130][BUILD] Use the same version of `commons-io` in SBT This PR (SPARK-31130) aims to pin the `Commons IO` version to `2.4` in the SBT build like the Maven build. [HADOOP-15261](https://issues.apache.org/jira/browse/HADOOP-15261) upgraded `commons-io` from 2.4 to 2.5 at Apache Hadoop 3.1. In `Maven`, Apache Spark always uses `Commons IO 2.4` based on `pom.xml`. ``` $ git grep commons-io.version pom.xml: <commons-io.version>2.4</commons-io.version> pom.xml: <version>${commons-io.version}</version> ``` However, `SBT` chooses `2.5`. **branch-3.0** ``` $ build/sbt -Phadoop-3.2 "core/dependencyTree" | grep commons-io:commons-io | head -n1 [info] | | +-commons-io:commons-io:2.5 ``` **branch-2.4** ``` $ build/sbt -Phadoop-3.1 "core/dependencyTree" | grep commons-io:commons-io | head -n1 [info] | | +-commons-io:commons-io:2.5 ``` No. Pass the Jenkins with `[test-hadoop3.2]` (the default PR Builder is `SBT`) and manually do the following locally. ``` build/sbt -Phadoop-3.2 "core/dependencyTree" | grep commons-io:commons-io | head -n1 ``` Closes #27886 from dongjoon-hyun/SPARK-31130. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 972e23d18186c73026ebed95b37a886ca6eecf3e) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 March 2020, 16:08:15 UTC
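A generic sbt sketch of pinning a transitive dependency like this (illustrative only; not necessarily how `SparkBuild.scala` expresses the pin):

```scala
// In a build.sbt: force the resolved commons-io version regardless of what Hadoop pulls in.
dependencyOverrides += "commons-io" % "commons-io" % "2.4"
```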
c017422 [SPARK-29295][SQL][2.4] Insert overwrite to Hive external table partition should delete old data ### What changes were proposed in this pull request? This patch proposes to delete the old Hive external partition directory even if the partition does not exist in Hive, when doing an insert overwrite of a Hive external table partition. This is a backport of #25979 to branch-2.4. ### Why are the changes needed? When doing an insert overwrite to a Hive external table partition, if the partition does not exist, Hive will not check if the external partition directory exists or not before copying files. So if users drop the partition, and then do insert overwrite to the same partition, the partition will have both old and new data. For example: ```scala withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") { // test is an external Hive table. sql("INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 1") sql("ALTER TABLE test DROP PARTITION(name='n1')") sql("INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 2") sql("SELECT id FROM test WHERE name = 'n1' ORDER BY id") // Got both 1 and 2. } ``` ### Does this PR introduce any user-facing change? Yes. This fixes a correctness issue when users drop a partition of a Hive external table and then insert overwrite it. ### How was this patch tested? Added test. Closes #27887 from viirya/SPARK-29295-2.4. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 March 2020, 10:00:35 UTC
8e1021d [SPARK-31095][BUILD][2.4] Upgrade netty-all to 4.1.47.Final ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/27869. This PR aims to bring the bug fixes from the latest netty-all. ### Why are the changes needed? - 4.1.47.Final: https://github.com/netty/netty/milestone/222?closed=1 (15 patches or issues) - 4.1.46.Final: https://github.com/netty/netty/milestone/221?closed=1 (80 patches or issues) - 4.1.45.Final: https://github.com/netty/netty/milestone/220?closed=1 (23 patches or issues) - 4.1.44.Final: https://github.com/netty/netty/milestone/218?closed=1 (113 patches or issues) - 4.1.43.Final: https://github.com/netty/netty/milestone/217?closed=1 (63 patches or issues) ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #27870 from dongjoon-hyun/netty. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 March 2020, 18:50:03 UTC
f378c7f [SPARK-30941][PYSPARK] Add a note to asDict to document its behavior when there are duplicate fields ### What changes were proposed in this pull request? Adding a note to document `Row.asDict` behavior when there are duplicate fields. ### Why are the changes needed? When a row contains duplicate fields, `asDict` and `_get_item_` behave differently. We should document it to let users know the difference explicitly. ### Does this PR introduce any user-facing change? No. Only a documentation change. ### How was this patch tested? Existing test. Closes #27853 from viirya/SPARK-30941. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit d21aab403a0a32e8b705b38874c0b335e703bd5d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 March 2020, 18:07:25 UTC
7c237cc [MINOR][SQL] Remove an ignored test from JsonSuite ### What changes were proposed in this pull request? Remove the ignored and outdated test `Type conflict in primitive field values (Ignored)` from JsonSuite. ### Why are the changes needed? The test has not been maintained for a long time. It can be removed to reduce the size of JsonSuite and improve maintainability. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the command `./build/sbt "test:testOnly *JsonV2Suite"` Closes #27795 from MaxGekk/remove-ignored-test-in-JsonSuite. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit cf7c397ede05fd106697bd5cc8062f394623bf22) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 06 March 2020, 01:36:17 UTC
1c17ede [MINOR][CORE] Expose the alias -c flag of --conf for spark-submit ### What changes were proposed in this pull request? -c is short for --conf; it has existed since v1.1.0 but has been hidden from users until now. ### Why are the changes needed? To expose the hidden feature. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #27802 from yaooqinn/conf. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 3edab6cc1d70c102093e973a2cf97208db19be8c) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 March 2020, 04:38:23 UTC
0ea91da [MINOR][DOCS] ForeachBatch java example fix ### What changes were proposed in this pull request? ForEachBatch Java example was incorrect ### Why are the changes needed? Example did not compile ### Does this PR introduce any user-facing change? Yes, to docs. ### How was this patch tested? In IDE. Closes #27740 from roland1982/foreachwriter_java_example_fix. Authored-by: roland-ondeviceresearch <roland@ondeviceresearch.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit a4aaee01fa8e71d51f49b24889d862422e0727c7) Signed-off-by: Sean Owen <srowen@gmail.com> 03 March 2020, 15:25:10 UTC
f4c8c48 [SPARK-30998][SQL][2.4] ClassCastException when a generator has nested inner generators ### What changes were proposed in this pull request? The query below failed in branch-2.4: ``` scala> sql("select array(array(1, 2), array(3)) ar").select(explode(explode($"ar"))).show() 20/03/01 13:51:56 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1] java.lang.ClassCastException: scala.collection.mutable.ArrayOps$ofRef cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:222) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ... ``` This PR modifies the `hasNestedGenerator` code in `ExtractGenerator` to correctly catch nested inner generators. This backport PR comes from https://github.com/apache/spark/pull/27750# ### Why are the changes needed? A bug fix. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests. Closes #27769 from maropu/SPARK-20998-BRANCH-2.4. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 03 March 2020, 14:47:40 UTC
7216099 [SPARK-30993][SQL][2.4] Use its sql type for UDT when checking the type of length (fixed/var) or mutable ### What changes were proposed in this pull request? This patch fixes a bug in UnsafeRow, which fails to handle UDTs specifically in `isFixedLength` and `isMutable`. These methods don't check the underlying SQL type of a UDT, always treating a UDT as variable-length and non-mutable. This doesn't cause any issue if the UDT represents a complicated type, but when the UDT maps to a fixed-length SQL type, it opens the chance of correctness issues, as this information sometimes decides how the value should be handled. We got a report from the user mailing list suspecting that mapGroupsWithState handles UDTs incorrectly, but after some investigation the cause turned out to be GenerateUnsafeRowJoiner in the shuffle phase. https://github.com/apache/spark/blob/0e2ca11d80c3921387d7b077cb64c3a0c06b08d7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala#L32-L43 Here, updating the position should not happen on a fixed-length column, but due to this bug, the value of a UDT whose SQL type is fixed-length would be modified, which actually corrupts the value. ### Why are the changes needed? Misclassifying the length type of a UDT can corrupt the value when the row is presented to the input of GenerateUnsafeRowJoiner, which brings a correctness issue. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New UT added. Closes #27761 from HeartSaVioR/SPARK-30993-branch-2.4. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 March 2020, 09:50:43 UTC
0b71b4d [SPARK-31003][TESTS] Fix incorrect uses of assume() in tests ### What changes were proposed in this pull request? This patch fixes several incorrect uses of `assume()` in our tests. If a call to `assume(condition)` fails then it will cause the test to be marked as skipped instead of failed: this feature allows test cases to be skipped if certain prerequisites are missing. For example, we use this to skip certain tests when running on Windows (or when Python dependencies are unavailable). In contrast, `assert(condition)` will fail the test if the condition doesn't hold. If `assume()` is accidentally substituted for `assert()` then the resulting test will be marked as skipped in cases where it should have failed, undermining the purpose of the test. This patch fixes several such cases, replacing certain `assume()` calls with `assert()`. Credit to ahirreddy for spotting this problem. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #27754 from JoshRosen/fix-assume-vs-assert. Lead-authored-by: Josh Rosen <rosenville@gmail.com> Co-authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f0010c81e2ef9b8859b39917bb62b48d739a4a22) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 March 2020, 23:21:24 UTC
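A small ScalaTest sketch (not taken from the Spark test suite) showing the difference the patch relies on; the environment-variable prerequisite is hypothetical:

```scala
import org.scalatest.FunSuite

class PrerequisiteExampleSuite extends FunSuite {
  test("skipped when a prerequisite is missing") {
    // assume() cancels (skips) the test when the condition is false.
    assume(sys.env.contains("SOME_OPTIONAL_TOOL")) // hypothetical prerequisite
    assert(1 + 1 == 2)
  }

  test("failed when the checked condition does not hold") {
    // assert() fails the test when the condition is false.
    assert(List(1, 2, 3).sum == 6)
  }
}
```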
cd8f86a [SPARK-30813][ML] Fix Matrices.sprand comments ### What changes were proposed in this pull request? Fix mistakes in comments ### Why are the changes needed? There are mistakes in comments ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #27564 from xwu99/fix-mllib-sprand-comment. Authored-by: Wu, Xiaochang <xiaochang.wu@intel.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit ac122762f5091a7abf62d17595e0f5a99374ac5c) Signed-off-by: Sean Owen <srowen@gmail.com> 02 March 2020, 14:56:49 UTC
0d1664c [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq) ### What changes were proposed in this pull request? This PR fixes a thread-safety bug in `SparkSession.createDataset(Seq)`: if the caller-supplied `Encoder` is used in multiple threads then createDataset's usage of the encoder may lead to incorrect / corrupt results because the Encoder's internal mutable state will be updated from multiple threads. Here is an example demonstrating the problem: ```scala import org.apache.spark.sql._ val enc = implicitly[Encoder[(Int, Int)]] val datasets = (1 to 100).par.map { _ => val pairs = (1 to 100).map(x => (x, x)) spark.createDataset(pairs)(enc) } datasets.reduce(_ union _).collect().foreach { pair => require(pair._1 == pair._2, s"Pair elements are mismatched: $pair") } ``` Before this PR's change, the above example fails because Spark produces corrupted records where different input records' fields have been co-mingled. This bug is similar to SPARK-22355 / #19577, a similar problem in `Dataset.collect()`. The fix implemented here is based on #24735's updated version of the `Datataset.collect()` bugfix: use `.copy()`. For consistency, I used same [code comment](https://github.com/apache/spark/blob/d841b33ba3a9b0504597dbccd4b0d11fa810abf3/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3414) / explanation as that PR. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Tested manually using the example listed above. Thanks to smcnamara-stripe for identifying this bug. Closes #26076 from JoshRosen/SPARK-29419. Authored-by: Josh Rosen <rosenville@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit f4499f678dc2e9f72c3ee5d2af083aa6b98f3fc2) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 02 March 2020, 01:19:53 UTC
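A hedged workaround sketch for versions without the fix: give each thread its own `Encoder` instead of sharing one, so no mutable encoder state is shared across threads. This is not the PR's fix (which copies rows inside `createDataset`), just a way to avoid the race from user code:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("encoder-example").getOrCreate()
val datasets = (1 to 100).par.map { _ =>
  val enc: Encoder[(Int, Int)] = Encoders.product[(Int, Int)] // fresh, thread-local encoder
  val pairs = (1 to 100).map(x => (x, x))
  spark.createDataset(pairs)(enc)
}
datasets.reduce(_ union _).collect().foreach { pair =>
  require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
}
```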
ff5ba49 [SPARK-30970][K8S][CORE][2.4] Fix NPE while resolving k8s master url ### What changes were proposed in this pull request? Backport of https://github.com/apache/spark/pull/27721 ``` bin/spark-sql --master k8s:///https://kubernetes.docker.internal:6443 --conf spark.kubernetes.container.image=yaooqinn/spark:v2.4.4 Exception in thread "main" java.lang.NullPointerException at org.apache.spark.util.Utils$.checkAndGetK8sMasterUrl(Utils.scala:2739) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:261) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ``` Although `k8s:///https://kubernetes.docker.internal:6443` is a wrong master URL, it should not throw an NPE. The `case null` will never be touched. https://github.com/apache/spark/blob/3f4060c340d6bac412e8819c4388ccba226efcf3/core/src/main/scala/org/apache/spark/util/Utils.scala#L2772-L2776 ### Why are the changes needed? Bug fix. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a UT case. Closes #27736 from yaooqinn/SPARK-30970-2. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 February 2020, 19:15:26 UTC
7574d99 [SPARK-30968][BUILD] Upgrade aws-java-sdk-sts to 1.11.655 ### What changes were proposed in this pull request? This PR aims to upgrade `aws-java-sdk-sts` to `1.11.655`. ### Why are the changes needed? [SPARK-29677](https://github.com/apache/spark/pull/26333) upgrades AWS Kinesis Client to 1.12.0 for Apache Spark 2.4.5 and 3.0.0. Since AWS Kinesis Client 1.12.0 is using AWS SDK 1.11.665, `aws-java-sdk-sts` should be consistent with Kinesis client dependency. - https://github.com/awslabs/amazon-kinesis-client/releases/tag/v1.12.0 ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins. Closes #27720 from dongjoon-hyun/SPARK-30968. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 3995728c3ce9d85b0436c0220f957b9d9133d64a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 February 2020, 01:06:26 UTC
2749043 [MINOR][ML] Fix confusing error message in VectorAssembler ### What changes were proposed in this pull request? When VectorAssembler encounters a NULL with handleInvalid="error", it throws an exception. This exception, though, has a typo making it confusing. Yet apparently, this same exception for NaN values is fine. Fixed it to look like the right one. ### Why are the changes needed? Encountering this error with such a message was very confusing! I hope to save fellow engineers' time by improving it. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? It's just an error message... Closes #27709 from Saluev/patch-1. Authored-by: Tigran Saluev <tigran@saluev.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 6f4a2e4c99dadf111e43c6e5b4d7ee5e4d4bd8f6) Signed-off-by: Sean Owen <srowen@gmail.com> 27 February 2020, 17:06:25 UTC
a549a07 [SPARK-23435][INFRA][FOLLOW-UP] Remove unnecessary dependency in AppVeyor ### What changes were proposed in this pull request? The `testthat` version was pinned to `1.0.2` at https://github.com/apache/spark/commit/f15102b1702b64a54233ae31357e32335722f4e5 due to a compatibility issue in SparkR. The compatibility issue is finally fixed as of https://github.com/apache/spark/commit/298d0a5102e54ddc24f114e83d2b936762722eec and we now use the latest testthat version. Now we don't need to install 'crayon', 'praise' and 'R6' as they are dependencies of testthat (https://github.com/r-lib/testthat/blob/master/DESCRIPTION). ### Why are the changes needed? To minimise the build specification and prevent dependency confusion. ### Does this PR introduce any user-facing change? No. Dev only change. ### How was this patch tested? AppVeyor build will test it out. Closes #27717 from HyukjinKwon/SPARK-23435-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b2a74107d50a0f69d727049581e01d9e2f6b4778) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 February 2020, 08:19:41 UTC
b0a2c17 [MINOR][BUILD] Fix make-distribution.sh to show usage without 'echo' cmd ### What changes were proposed in this pull request? turn off `x` mode and do not print the command while printing usage of `make-distribution.sh` ### Why are the changes needed? improve dev tools ### Does this PR introduce any user-facing change? for only developers, clearer hints #### after ``` ./dev/make-distribution.sh --hel +++ dirname ./dev/make-distribution.sh ++ cd ./dev/.. ++ pwd + SPARK_HOME=/Users/kentyao/spark + DISTDIR=/Users/kentyao/spark/dist + MAKE_TGZ=false + MAKE_PIP=false + MAKE_R=false + NAME=none + MVN=/Users/kentyao/spark/build/mvn + (( 1 )) + case $1 in + echo 'Error: --hel is not supported' Error: --hel is not supported + exit_with_usage + set +x make-distribution.sh - tool for making binary distributions of Spark usage: make-distribution.sh [--name] [--tgz] [--pip] [--r] [--mvn <mvn-command>] <maven build options> See Spark's "Building Spark" doc for correct Maven options. ``` #### before ``` +++ dirname ./dev/make-distribution.sh ++ cd ./dev/.. ++ pwd + SPARK_HOME=/Users/kentyao/spark + DISTDIR=/Users/kentyao/spark/dist + MAKE_TGZ=false + MAKE_PIP=false + MAKE_R=false + NAME=none + MVN=/Users/kentyao/spark/build/mvn + (( 1 )) + case $1 in + echo 'Error: --hel is not supported' Error: --hel is not supported + exit_with_usage + echo 'make-distribution.sh - tool for making binary distributions of Spark' make-distribution.sh - tool for making binary distributions of Spark + echo '' + echo usage: usage: + cl_options='[--name] [--tgz] [--pip] [--r] [--mvn <mvn-command>]' + echo 'make-distribution.sh [--name] [--tgz] [--pip] [--r] [--mvn <mvn-command>] <maven build options>' make-distribution.sh [--name] [--tgz] [--pip] [--r] [--mvn <mvn-command>] <maven build options> + echo 'See Spark'\''s "Building Spark" doc for correct Maven options.' See Spark's "Building Spark" doc for correct Maven options. + echo '' + exit 1 ``` ### How was this patch tested? manually Closes #27706 from yaooqinn/build. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a6026c830a582af75a0d95d18f7759922a086334) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 February 2020, 22:41:01 UTC
56fa200 [SPARK-30759][SQL][3.0] Fix cache initialization in StringRegexExpression In the PR, I propose to fix `cache` initialization in `StringRegexExpression` by changing the expected value type in `case Literal(value: String, StringType)` from `String` to `UTF8String`. This is a backport of #27502 and #27547. Actually, the case doesn't work at all because `Literal`'s value has type `UTF8String`, see <img width="649" alt="Screen Shot 2020-02-08 at 22 45 50" src="https://user-images.githubusercontent.com/1580697/74091681-0d4a2180-4acb-11ea-8a0d-7e8c65f4214e.png"> No. Added a new test in `RegexpExpressionsSuite`. Closes #27713 from MaxGekk/str-regexp-foldable-pattern-backport. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit cfc48a8a3068972791410e8e36ff9cf1ba5af445) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 February 2020, 22:27:21 UTC
b302caf [SPARK-30944][BUILD] Update URL for Google Cloud Storage mirror of Maven Central This PR is a followup to #27307: per https://travis-ci.community/t/maven-builds-that-use-the-gcs-maven-central-mirror-should-update-their-paths/5926, the Google Cloud Storage mirror of Maven Central has updated its URLs: the new paths are updated more frequently. The new paths are listed on https://storage-download.googleapis.com/maven-central/index.html This patch updates our build files to use these new URLs. No. Existing build + tests. Closes #27688 from JoshRosen/update-gcs-mirror-url. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 25 February 2020, 08:07:52 UTC
8fdd039 [MINOR][DOCS] Fix ForEachWriter Java example ### What changes were proposed in this pull request? Structured streaming documentation example fix ### Why are the changes needed? Currently the java example uses incorrect syntax ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? In IDE Closes #27671 from roland1982/foreachwriter_java_example_fix. Authored-by: roland-ondeviceresearch <roland@ondeviceresearch.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 22 February 2020, 00:54:00 UTC
1b793ed [SPARK-30907][DOCS] Revise the doc of spark.ui.retainedTasks ### What changes were proposed in this pull request? Revise the documentation of `spark.ui.retainedTasks` to make it clear that the configuration is for one stage. ### Why are the changes needed? There are configurations for the limitation of UI data. `spark.ui.retainedJobs`, `spark.ui.retainedStages` and `spark.worker.ui.retainedExecutors` are the total max number for one application, while the configuration `spark.ui.retainedTasks` is the max number for one stage. ### Does this PR introduce any user-facing change? No ### How was this patch tested? None, just doc. Closes #27660 from gengliangwang/reviseRetainTask. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 2a695e6d159fe49efe7601383fec4f3f172c5a97) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 February 2020, 01:07:18 UTC
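An illustrative way to set these limits (the values below are made up); per the note above, `spark.ui.retainedTasks` caps the tasks kept per stage, while the jobs/stages settings are per-application totals:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ui-retention-example")
  .config("spark.ui.retainedTasks", "50000")  // per-stage limit (illustrative value)
  .config("spark.ui.retainedStages", "500")   // per-application limit (illustrative value)
  .getOrCreate()
```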
8597a56 [SPARK-30556][SQL][BACKPORT-2.4] Reset the status changed in SQLExecution withThreadLocalCaptured ### What changes were proposed in this pull request? Follow up for #27267, reset the status changed in SQLExecution.withThreadLocalCaptured. ### Why are the changes needed? For code safety. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. (cherry picked from commit a6b91d2bf727e175d0e175295001db85647539b1) Closes #27633 from xuanyuanking/SPARK-30556-backport. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 February 2020, 14:05:55 UTC
c80b79f [SPARK-30763][SQL][2.4] Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract ### What changes were proposed in this pull request? This PR follows https://github.com/apache/spark/pull/27508 and applies it to Spark 2.4. ### Why are the changes needed? Fixes a bug: `java.lang.IndexOutOfBoundsException No group 1` ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? New UT. Closes #27631 from beliefer/fix-2.4-regexp_extract-bug. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 February 2020, 12:34:53 UTC
6e0c116 [SPARK-30731] Update deprecated Mkdocs option Split from #27534. ### What changes were proposed in this pull request? This PR updates a deprecated Mkdocs option to use the new name. ### Why are the changes needed? This change will prevent the docs from failing to build when we update to a version of Mkdocs that no longer supports the deprecated option. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? I built the docs locally and reviewed them in my browser. Closes #27626 from nchammas/SPARK-30731-mkdocs-dep-opt. Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 2ab8d674ba98db4068c4cdefdc4c424dd39c071c) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 19 February 2020, 08:30:12 UTC
7285eea [SPARK-30811][SQL][2.4] CTE should not cause stack overflow when it refers to non-existent table with same name ### Why are the changes needed? A query with Common Table Expressions can cause a stack overflow when it contains a CTE that refers to a non-existing table with the same name. The name of the table needs to have a database qualifier. This is caused by a couple of things: - `CTESubstitution` runs analysis on the CTE, but this does not throw an exception because the table has a database qualifier. The reason we don't fail is that we re-attempt to resolve the relation in a later rule; - The `CTESubstitution` replace logic does not check if the table it is replacing has a database; it shouldn't replace the relation if it does. So now we will happily replace `nonexist.t` with `t`; Note that this is **not** an issue for master or the spark-3.0 branch. This PR fixes this by checking that the relation name does not have a database, and it also reverses the transformation order. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added regression test to `DataFrameSuite`. Closes #27562 from hvanhovell/SPARK-30811. Authored-by: herman <herman@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 February 2020, 22:02:38 UTC
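An illustrative query shape for the bug (the database and table names are hypothetical): before the fix, substitution could replace the database-qualified `nonexist.t` with the CTE `t` itself and recurse until a StackOverflowError; with the fix the query simply fails to resolve the non-existent table.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cte-example").getOrCreate()
// The CTE `t` shares its name with a database-qualified table that does not exist.
spark.sql(
  """
    |WITH t AS (SELECT 1 FROM nonexist.t)
    |SELECT * FROM t
  """.stripMargin).show()
```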
86e01c0 [SPARK-30857][SQL][2.4] Fix truncations of timestamps before the epoch to hours and days ### What changes were proposed in this pull request? In the PR, I propose to replace `%` by `Math.floorMod` in `DateTimeUtils.truncTimestamp` for the `HOUR` and `DAY` levels. ### Why are the changes needed? This fixes the issue of incorrect truncation of timestamps before the epoch `1970-01-01T00:00:00.000000Z` to the `HOUR` and `DAY` levels. For example, timestamps after the epoch are truncated by cutting off the rest part of the timestamp: ```sql spark-sql> select date_trunc('HOUR', '2020-02-11 00:01:02.123'), date_trunc('HOUR', '2020-02-11 00:01:02.789'); 2020-02-11 00:00:00 2020-02-11 00:00:00 ``` but hours in the truncated timestamp before the epoch are increased by 1: ```sql spark-sql> select date_trunc('HOUR', '1960-02-11 00:01:02.123'), date_trunc('HOUR', '1960-02-11 00:01:02.789'); 1960-02-11 01:00:00 1960-02-11 01:00:00 ``` ### Does this PR introduce any user-facing change? Yes. After the changes, the example above outputs correct result: ```sql spark-sql> select date_trunc('HOUR', '1960-02-11 00:01:02.123'); 1960-02-11 00:00:00 ``` ### How was this patch tested? Added new tests to `DateFunctionsSuite`. Closes #27612 from MaxGekk/fix-hour-day-truc-2.4. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 February 2020, 13:30:22 UTC
fecc7d5 [SPARK-30793][SQL][2.4] Fix truncations of timestamps before the epoch to minutes and seconds ### What changes were proposed in this pull request? In the PR (a backport of #27543), I propose to replace `%` by `Math.floorMod` in `DateTimeUtils.truncTimestamp` for the `SECOND` and `MINUTE` levels. ### Why are the changes needed? This fixes the issue of incorrect truncation of timestamps before the epoch `1970-01-01T00:00:00.000000Z` to the `SECOND` and `MINUTE` levels. For example, timestamps after the epoch are truncated by cutting off the rest part of the timestamp: ```sql spark-sql> select date_trunc('SECOND', '2020-02-11 00:01:02.123'); 2020-02-11 00:01:02 ``` but seconds in the truncated timestamp before the epoch are increased by 1: ```sql spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123'); 1960-02-11 00:01:03 ``` ### Does this PR introduce any user-facing change? Yes. After the changes, the example above outputs correct result: ```sql spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123'); 1960-02-11 00:01:02 ``` ### How was this patch tested? Added new tests to `DateFunctionsSuite`. Closes #27611 from MaxGekk/fix-second-minute-truc-2.4. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 February 2020, 03:35:23 UTC
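A small sketch of why `%` misbehaves in both of the truncation fixes above (this is not the `DateTimeUtils` code itself): timestamps before the epoch are negative microsecond offsets, and `%` rounds toward zero while `Math.floorMod` always yields a non-negative remainder.

```scala
val microsPerMinute = 60L * 1000 * 1000
val ts = -61L * 1000 * 1000 // 1969-12-31 23:58:59 UTC, i.e. before the epoch

// `%` keeps the sign of the dividend, so truncation lands on 23:59:00, one minute too high.
val wrong = ts - ts % microsPerMinute               // -60000000L
// Math.floorMod returns a non-negative remainder, so truncation lands on 23:58:00 as expected.
val right = ts - Math.floorMod(ts, microsPerMinute) // -120000000L
```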
c8f9ce8 [SPARK-30834][DOCS][PYTHON][2.4] Add note for recommended pandas and pyarrow versions ### What changes were proposed in this pull request? Add doc for recommended pandas and pyarrow versions. ### Why are the changes needed? The recommended versions are those that have been thoroughly tested by Spark CI. Other versions may be used at the discretion of the user. ### Does this PR introduce any user-facing change? No ### How was this patch tested? NA Closes #27586 from BryanCutler/python-doc-rec-pandas-pyarrow-SPARK-30834. Lead-authored-by: Bryan Cutler <cutlerb@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 17 February 2020, 02:09:35 UTC
6294cb8 [SPARK-30826][SQL] Respect reference case in `StringStartsWith` pushed down to parquet ### What changes were proposed in this pull request? In the PR, I propose to convert the attribute name of `StringStartsWith` pushed down to the Parquet datasource to column reference via the `nameToParquetField` map. Similar conversions are performed for other source filters pushed down to parquet. ### Why are the changes needed? This fixes the bug described in [SPARK-30826](https://issues.apache.org/jira/browse/SPARK-30826). The query from an external table: ```sql CREATE TABLE t1 (col STRING) USING parquet OPTIONS (path '$path') ``` created on top of written parquet files by `Seq("42").toDF("COL").write.parquet(path)` returns wrong empty result: ```scala spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show +---+ |col| +---+ +---+ ``` ### Does this PR introduce any user-facing change? Yes. After the changes the result is correct for the example above: ```scala spark.sql("SELECT * FROM t1 WHERE col LIKE '4%'").show +---+ |col| +---+ | 42| +---+ ``` ### How was this patch tested? Added a test to `ParquetFilterSuite` Closes #27574 from MaxGekk/parquet-StringStartsWith-case-sens. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 8b73b92aadd685b29ef3d9b33366f5e1fd3dae99) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 February 2020, 12:11:07 UTC
a312b85 [SPARK-30823][PYTHON][DOCS] Set `%PYTHONPATH%` when building PySpark documentation on Windows This commit is published into the public domain. ### What changes were proposed in this pull request? In analogy to `python/docs/Makefile`, which has > export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.10.8.1-src.zip) on line 10, this PR adds > set PYTHONPATH=..;..\lib\py4j-0.10.8.1-src.zip to `make2.bat`. Since there is no `realpath` in default installations of Windows, I left the relative paths unresolved. Per the instructions on how to build docs, `make.bat` is supposed to be run from `python/docs` as the working directory, so this should probably not cause issues (`%BUILDDIR%` is a relative path as well.) ### Why are the changes needed? When building the PySpark documentation on Windows, by changing directory to `python/docs` and running `make.bat` (which runs `make2.bat`), the majority of the documentation may not be built if pyspark is not in the default `%PYTHONPATH%`. Sphinx then reports that `pyspark` (and possibly dependencies) cannot be imported. If `pyspark` is in the default `%PYTHONPATH%`, I suppose it is that version of `pyspark` – as opposed to the version found above the `python/docs` directory – that is considered when building the documentation, which may result in documentation that does not correspond to the development version one is trying to build. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual tests on my Windows 10 machine. Additional tests with other environments very welcome! Closes #27569 from DavidToneian/SPARK-30823. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b2134ee73cfad4d78aaf8f0a2898011ac0308e48) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 14 February 2020, 04:49:49 UTC
0fa72a0 [PYSPARK][DOCS][MINOR] Changed `:func:` to `:attr:` Sphinx roles, fixed links in documentation of `Data{Frame,Stream}{Reader,Writer}` This commit is published into the public domain. ### What changes were proposed in this pull request? This PR fixes the documentation of `DataFrameReader`, `DataFrameWriter`, `DataStreamReader`, and `DataStreamWriter`, where attributes of other classes were misrepresented as functions. Additionally, creation of hyperlinks across modules was fixed in these instances. ### Why are the changes needed? The old state produced documentation that suggested invalid usage of PySpark objects (accessing attributes as though they were callable.) ### Does this PR introduce any user-facing change? No, except for improved documentation. ### How was this patch tested? No test added; documentation build runs through. Closes #27553 from DavidToneian/docfix-DataFrameReader-DataFrameWriter-DataStreamReader-DataStreamWriter. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 25db8c71a2100c167b8c2d7a6c540ebc61db9b73) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 14 February 2020, 02:01:20 UTC
cf9f955 [SPARK-30797][SQL] Set traditional user/group/other permissions as ACL entries when setting up ACLs in truncate table ### What changes were proposed in this pull request? This is a follow-up to the PR #26956. In #26956, the patch proposed to preserve path permission when truncating a table. When setting up the original ACLs, we need to set user/group/other permissions as ACL entries too, otherwise if the path doesn't have default user/group/other ACL entries, the ACL API will complain with the error `Invalid ACL: the user, group and other entries are required.`. In short this change makes sure: 1. Permissions for user/group/other are always kept in the ACLs to work with the ACL API. 2. Other custom ACLs are still kept after TRUNCATE TABLE (#26956 did this). ### Why are the changes needed? Without this fix, `TRUNCATE TABLE` will get an error when setting up ACLs if there are no default user/group/other ACL entries. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Updated unit test. Manual test on a dev Spark cluster. Set ACLs for a table path without default user/group/other ACL entries: ``` hdfs dfs -setfacl --set 'user:liangchi:rwx,user::rwx,group::r--,other::r--' /user/hive/warehouse/test.db/test_truncate_table hdfs dfs -getfacl /user/hive/warehouse/test.db/test_truncate_table # file: /user/hive/warehouse/test.db/test_truncate_table # owner: liangchi # group: supergroup user::rwx user:liangchi:rwx group::r-- mask::rwx other::r-- ``` Then run `sql("truncate table test.test_truncate_table")`; it works by normally truncating the table and preserving the ACLs. Closes #27548 from viirya/fix-truncate-table-permission. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 5b76367a9d0aaca53ce96ab7e555a596567e8335) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 February 2020, 22:30:37 UTC
e627b0a [MINOR][SQL][DOCS][2.4] Fix the timestamp pattern in the example for `to_timestamp` ### What changes were proposed in this pull request? In the PR, I propose to change the description of the `to_timestamp()` function, and change the pattern in the example. ### Why are the changes needed? To inform users about valid patterns for `to_timestamp` function. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #27438 from MaxGekk/to_timestamp-z-2.4. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 February 2020, 21:40:13 UTC
40062e3 [SPARK-30737][SPARK-27262][R][BUILD] Reenable CRAN check with UTF-8 encoding to DESCRIPTION ### What changes were proposed in this pull request? This PR proposes to reenable CRAN check disabled at https://github.com/apache/spark/pull/27460. Given the tests https://github.com/apache/spark/pull/27468, seems we should also port https://github.com/apache/spark/pull/23823 together. ### Why are the changes needed? To check CRAN back. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? It was tested at https://github.com/apache/spark/pull/27468 and Jenkins should test it out. Closes #27472 from HyukjinKwon/SPARK-30737. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b95ccb1d8b726b11435789cdb5882df6643430ed) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 06 February 2020, 04:01:45 UTC
f674d88 [SPARK-30733][R][HOTFIX] Fix SparkR tests per testthat and R version upgrade, and disable CRAN ### What changes were proposed in this pull request? There are currently R test failures after upgrading `testthat` to 2.0.0 and the R version to 3.5.2 as of SPARK-23435. This PR targets to fix the tests and make them pass. See the explanations and causes below: ``` test_context.R:49: failure: Check masked functions length(maskedCompletely) not equal to length(namesOfMaskedCompletely). 1/1 mismatches [1] 6 - 4 == 2 test_context.R:53: failure: Check masked functions sort(maskedCompletely, na.last = TRUE) not equal to sort(namesOfMaskedCompletely, na.last = TRUE). 5/6 mismatches x[2]: "endsWith" y[2]: "filter" x[3]: "filter" y[3]: "not" x[4]: "not" y[4]: "sample" x[5]: "sample" y[5]: NA x[6]: "startsWith" y[6]: NA ``` From my cursory look, R base and R's version are mismatched. I fixed accordingly and Jenkins will test it out. ``` test_includePackage.R:31: error: include inside function package or namespace load failed for 'plyr': package 'plyr' was installed by an R version with different internals; it needs to be reinstalled for use with this R version Seems it's a package installation issue. Looks like plyr has to be re-installed. ``` From my cursory look, the previously installed `plyr` remains and it's not compatible with the new R version. I fixed accordingly and Jenkins will test it out. ``` test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA as date and time Your system is mis-configured: '/etc/localtime' is not a symlink ``` Seems an env problem. I suppressed the warnings for now. ``` test_sparkSQL.R:499: warning: SPARK-17811: can create DataFrame containing NA as date and time It is strongly recommended to set environment variable TZ to 'America/Los_Angeles' (or equivalent) ``` Seems an env problem. I suppressed the warnings for now. ``` test_sparkSQL.R:1814: error: string operators unable to find an inherited method for function 'startsWith' for signature '"character"' 1: expect_true(startsWith("Hello World", "Hello")) at /home/jenkins/workspace/SparkPullRequestBuilder2/R/pkg/tests/fulltests/test_sparkSQL.R:1814 2: quasi_label(enquo(object), label) 3: eval_bare(get_expr(quo), get_env(quo)) 4: startsWith("Hello World", "Hello") 5: (function (classes, fdef, mtable) { methods <- .findInheritedMethods(classes, fdef, mtable) if (length(methods) == 1L) return(methods[[1L]]) else if (length(methods) == 0L) { cnames <- paste0("\"", vapply(classes, as.character, ""), "\"", collapse = ", ") stop(gettextf("unable to find an inherited method for function %s for signature %s", sQuote(fdefgeneric), sQuote(cnames)), domain = NA) } else stop("Internal error in finding inherited methods; didn't return a unique method", domain = NA) })(list("character"), new("nonstandardGenericFunction", .Data = function (x, prefix) { standardGeneric("startsWith") }, generic = structure("startsWith", package = "SparkR"), package = "SparkR", group = list(), valueClass = character(0), signature = c("x", "prefix"), default = NULL, skeleton = (function (x, prefix) stop("invalid call in method dispatch to 'startsWith' (no default method)", domain = NA))(x, prefix)), <environment>) 6: stop(gettextf("unable to find an inherited method for function %s for signature %s", sQuote(fdefgeneric), sQuote(cnames)), domain = NA) ``` From my cursory look, R base and R's version are mismatched. I fixed accordingly and Jenkins will test it out. Also, this PR causes a CRAN check failure as below: ``` * creating vignettes ... ERROR Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics: package 'htmltools' was installed by an R version with different internals; it needs to be reinstalled for use with this R version ``` This PR disables it for now. ### Why are the changes needed? To unblock other PRs. ### Does this PR introduce any user-facing change? No. Test only and dev only. ### How was this patch tested? No. I am going to use Jenkins to test. Closes #27460 from HyukjinKwon/r-test-failure. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit e2d984aa1c79eb389cc8d333f656196b17af1c32) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 05 February 2020, 07:46:28 UTC
9bf11ed Preparing development version 2.4.6-SNAPSHOT 02 February 2020, 19:23:14 UTC
cee4ecb Preparing Spark release v2.4.5-rc2 02 February 2020, 19:23:09 UTC
cb4a736 [SPARK-30704][INFRA] Use jekyll-redirect-from 0.15.0 instead of the latest This PR aims to pin the version of `jekyll-redirect-from` to 0.15.0. This is a release blocker for both Apache Spark 3.0.0 and 2.4.5. `jekyll-redirect-from` released 0.16.0 a few days ago and that requires Ruby 2.4.0. - https://github.com/jekyll/jekyll-redirect-from/releases/tag/v0.16.0 ``` $ cd dev/create-release/spark-rm/ $ docker build -t spark:test . ... ERROR: Error installing jekyll-redirect-from: jekyll-redirect-from requires Ruby version >= 2.4.0. ... ``` No. Manually do the above command to build `spark-rm` Docker image. ``` ... Successfully installed jekyll-redirect-from-0.15.0 Parsing documentation for jekyll-redirect-from-0.15.0 Installing ri documentation for jekyll-redirect-from-0.15.0 Done installing documentation for jekyll-redirect-from after 0 seconds 1 gem installed Successfully installed rouge-3.15.0 Parsing documentation for rouge-3.15.0 Installing ri documentation for rouge-3.15.0 Done installing documentation for rouge after 4 seconds 1 gem installed Removing intermediate container e0ec7c77b69f ---> 32dec37291c6 ``` Closes #27434 from dongjoon-hyun/SPARK-30704. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 1adf3520e3c753e6df8dccb752e8239de682a09a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 February 2020, 08:46:20 UTC
c7c2bda [SPARK-30065][SQL][2.4] DataFrameNaFunctions.drop should handle duplicate columns (Backport of #26700) ### What changes were proposed in this pull request? `DataFrameNaFunctions.drop` doesn't handle duplicate columns even when column names are not specified. ```Scala val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") val df = left.join(right, Seq("col1")) df.printSchema df.na.drop("any").show ``` produces ``` root |-- col1: string (nullable = true) |-- col2: string (nullable = true) |-- col2: string (nullable = true) org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.; at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) ``` The reason for the above failure is that columns are resolved by name and if there are multiple columns with the same name, it will fail due to ambiguity. This PR updates `DataFrameNaFunctions.drop` such that if the columns to drop are not specified, it will resolve ambiguity gracefully by applying `drop` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity). ### Why are the changes needed? If column names are not specified, `drop` should not fail due to ambiguity since it should still be able to apply `drop` to the eligible columns. ### Does this PR introduce any user-facing change? Yes, now all the rows with nulls are dropped in the above example: ``` scala> df.na.drop("any").show +----+----+----+ |col1|col2|col2| +----+----+----+ +----+----+----+ ``` ### How was this patch tested? Added new unit tests. Closes #27411 from imback82/backport-SPARK-30065. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 January 2020, 07:01:57 UTC
4c3c1d6 [SPARK-29890][SQL][2.4] DataFrameNaFunctions.fill should handle duplicate columns (Backport of #26593) ### What changes were proposed in this pull request? `DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified. ```Scala val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") val df = left.join(right, Seq("col1")) df.printSchema df.na.fill("hello").show ``` produces ``` root |-- col1: string (nullable = true) |-- col2: string (nullable = true) |-- col2: string (nullable = true) org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.; at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121) at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221) at org.apache.spark.sql.Dataset.col(Dataset.scala:1268) ``` The reason for the above failure is that columns are looked up with `DataSet.col()` which tries to resolve a column by name and if there are multiple columns with the same name, it will fail due to ambiguity. This PR updates `DataFrameNaFunctions.fill` such that if the columns to fill are not specified, it will resolve ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity). ### Why are the changes needed? If column names are not specified, `fill` should not fail due to ambiguity since it should still be able to apply `fill` to the eligible columns. ### Does this PR introduce any user-facing change? Yes, now the above example displays the following: ``` +----+-----+-----+ |col1| col2| col2| +----+-----+-----+ | 1|hello| 2| | 3| 4|hello| +----+-----+-----+ ``` ### How was this patch tested? Added new unit tests. Closes #27407 from imback82/backport-SPARK-29890. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 January 2020, 00:52:25 UTC
eeef0e7 [SPARK-29578][TESTS][2.4] Add "8634" as another skipped day for Kwajalein timezone due to more recent timezone updates in later JDK 8 (Backport of https://github.com/apache/spark/pull/26236) ### What changes were proposed in this pull request? Recent timezone definition changes in very new JDK 8 (and beyond) releases cause test failures. The below was observed on JDK 1.8.0_232. As before, the easy fix is to allow for these inconsequential variations in test results due to differing definitions of timezones. ### Why are the changes needed? Keeps tests passing on the latest JDK releases. ### Does this PR introduce any user-facing change? None ### How was this patch tested? Existing tests Closes #27386 from srowen/SPARK-29578.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 January 2020, 01:20:32 UTC
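A minimal standalone sketch of the test-hardening pattern described above, not the actual suite code: rather than asserting one hard-coded skipped day for the Kwajalein timezone, the test accepts any member of a small set of values that differ only because of the timezone database shipped with a given JDK 8 update. Apart from 8634 (taken from the title), the day numbers below are placeholders.

```scala
// Placeholder day numbers; only 8634 comes from the commit title, the rest are illustrative.
val acceptableSkippedDays = Set(8632, 8633, 8634)

// Whatever the JDK's bundled tzdata yields on this machine for the Kwajalein case.
val observedDay = 8634

assert(acceptableSkippedDays.contains(observedDay),
  s"day $observedDay is not among the results accepted across JDK tzdata versions")
```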
b93f250 [SPARK-29367][DOC][2.4] Add compatibility note for Arrow 0.15.0 to SQL guide ### What changes were proposed in this pull request? Add documentation to the SQL programming guide on using PyArrow >= 0.15.0 with current versions of Spark. ### Why are the changes needed? Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility. ### Does this PR introduce any user-facing change? Yes. ![arrow](https://user-images.githubusercontent.com/9700541/73404705-faec0e80-42a6-11ea-952a-25c544a6d90b.png) ### How was this patch tested? Ran pandas_udfs tests using PyArrow 0.15.0 with the environment variable set. Closes #27383 from BryanCutler/arrow-document-legacy-IPC-fix-SPARK-29367-branch24. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 January 2020, 22:54:00 UTC
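A minimal sketch of the workaround the guide documents, assuming a local `SparkSession`: with PyArrow >= 0.15.0 and Spark 2.4.x, the `ARROW_PRE_0_15_IPC_FORMAT=1` environment variable has to reach the Python workers (shown here via `spark.executorEnv.*`; it can equally be exported in `conf/spark-env.sh`) so Arrow falls back to the pre-0.15 IPC stream format.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: propagate the legacy-IPC-format flag to executors so pandas_udf
// exchanges keep working when the cluster's PyArrow is 0.15.0 or newer.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("arrow-compat-sketch")
  .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
  .getOrCreate()
```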
c7c9f9e [SPARK-30310][CORE][2.4] Resolve missing match case in SparkUncaughtExceptionHandler and added tests ### What changes were proposed in this pull request? Backport of SPARK-30310 from master (f5f05d549efd8f9a81376bfc31474292be7aaa8a) 1) Added the missing match case to SparkUncaughtExceptionHandler, so that it would not halt the process when the exception doesn't match any of the match case statements. 2) Added a log message before halting the process. During debugging it wasn't obvious why the Worker process would go DEAD (until we set SPARK_NO_DAEMONIZE=1), because the shell scripts put the process into the background and essentially absorb the exit code. 3) Added SparkUncaughtExceptionHandlerSuite. Basically we create a Spark exception-throwing application with SparkUncaughtExceptionHandler and then check its exit code. ### Why are the changes needed? SPARK-30310, because the process would halt unexpectedly. ### How was this patch tested? All unit tests (mvn test) were run and passed. Closes #27384 from n-marion/branch-2.4_30310-backport. Authored-by: git <tinto@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 January 2020, 22:40:51 UTC
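A standalone sketch of the idea, not Spark's actual handler: without a catch-all case, a throwable that matches none of the cases raises a `scala.MatchError` inside the handler itself and the process still goes down; the fix adds the catch-all and logs before any halt. The exit codes below are illustrative.

```scala
// Illustrative exit codes, not Spark's real ones.
class UncaughtHandlerSketch(exitOnUncaughtException: Boolean)
    extends Thread.UncaughtExceptionHandler {

  override def uncaughtException(thread: Thread, exception: Throwable): Unit = {
    exception match {
      case _: OutOfMemoryError =>
        Console.err.println(s"Halting: OutOfMemoryError in thread ${thread.getName}")
        Runtime.getRuntime.halt(52)
      case _ if exitOnUncaughtException =>
        Console.err.println(s"Halting: uncaught $exception in thread ${thread.getName}")
        Runtime.getRuntime.halt(50)
      case _ =>
        // The previously missing catch-all: log and keep the process alive instead of
        // letting a MatchError escalate to an unexpected halt.
        Console.err.println(s"Ignoring uncaught $exception in thread ${thread.getName}")
    }
  }
}
```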
12f4492 [SPARK-30512] Added a dedicated boss event loop group ### What changes were proposed in this pull request? Adding a dedicated boss event loop group to the Netty pipeline in the External Shuffle Service to avoid the delay in channel registration. ``` EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1, conf.getModuleName() + "-boss"); EventLoopGroup workerGroup = NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server"); bootstrap = new ServerBootstrap() .group(bossGroup, workerGroup) .channel(NettyUtils.getServerChannelClass(ioMode)) .option(ChannelOption.ALLOCATOR, allocator) ``` ### Why are the changes needed? We have been seeing a large number of SASL authentication RPC requests timing out with the external shuffle service. ``` java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task. at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160) at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278) at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181) at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141) at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218) ``` The investigation that we have done is described here: https://github.com/netty/netty/issues/9890 After adding `LoggingHandler` to the Netty pipeline, we saw that the registration of the channel was getting delayed because the worker threads were busy with the existing channels. ### Does this PR introduce any user-facing change? No ### How was this patch tested? We have tested the patch on our clusters and with a stress testing tool. After this change, we didn't see any SASL requests timing out. Existing unit tests pass. Closes #27240 from otterc/SPARK-30512. Authored-by: Chandni Singh <chsingh@linkedin.com> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit 6b47ace27d04012bcff47951ea1eea2aa6fb7d60) Signed-off-by: Thomas Graves <tgraves@apache.org> 29 January 2020, 21:13:20 UTC
6c29070 [SPARK-23435][2.4][SPARKR][TESTS] Update testthat to >= 2.0.0 ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/27359: - Update `testthat` to >= 2.0.0 - Replace `testthat:::run_tests` with `testthat:::test_package_dir` - Add trivial assertions for tests, without any expectations, to avoid skipping. - Update related docs. ### Why are the changes needed? `testthat` version has been frozen by [SPARK-22817](https://issues.apache.org/jira/browse/SPARK-22817) / https://github.com/apache/spark/pull/20003, but 1.0.2 is pretty old, and we shouldn't keep things in this state forever. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? - Existing CI pipeline: - Windows build on AppVeyor, R 3.6.2, testthat 2.3.1 - Linux build on Jenkins, R 3.1.x, testthat 1.0.2 - Additional builds with testthat 2.3.1 using [sparkr-build-sandbox](https://github.com/zero323/sparkr-build-sandbox) on c7ed64af9e697b3619779857dd820832176b3be3 R 3.4.4 (image digest ec9032f8cf98) ``` docker pull zero323/sparkr-build-sandbox:3.4.4 docker run zero323/sparkr-build-sandbox:3.4.4 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc ``` 3.5.3 (image digest 0b1759ee4d1d) ``` docker pull zero323/sparkr-build-sandbox:3.5.3 docker run zero323/sparkr-build-sandbox:3.5.3 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc ``` and 3.6.2 (image digest 6594c8ceb72f) ``` docker pull zero323/sparkr-build-sandbox:3.6.2 docker run zero323/sparkr-build-sandbox:3.6.2 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc ``` Corresponding [asciicasts](https://asciinema.org/) are available as 10.5281/zenodo.3629431 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3629431.svg)](https://doi.org/10.5281/zenodo.3629431) (a bit too large to burden asciinema.org, but can be run locally via `asciinema play`). ---------------------------- Continued from #27328 Closes #27379 from HyukjinKwon/testthat-2.0. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 29 January 2020, 02:39:15 UTC
ad9f578 [SPARK-30633][SQL] Append L to seed when type is LongType ### What changes were proposed in this pull request? Allow for using longs as a seed for xxHash. ### Why are the changes needed? Codegen fails when passing a seed to xxHash that is > 2^31. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests pass. Should more be added? Closes #27354 from patrickcording/fix_xxhash_seed_bug. Authored-by: Patrick Cording <patrick.cording@datarobot.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c5c580ba0d253a04a3df5bbfd5acf6b5d23cdc1c) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 January 2020, 18:32:30 UTC
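A hypothetical sketch of the codegen issue behind this fix, not Spark's actual API: a seed larger than `Int.MaxValue` is not a valid Java int literal, so when the seed expression is of `LongType` the generated literal needs an `L` suffix to compile.

```scala
// Illustrative helper: render the seed as a literal for generated Java code.
def javaSeedLiteral(seed: Long, seedIsLong: Boolean): String =
  if (seedIsLong) s"${seed}L" else seed.toString

// A seed above 2^31 only compiles in generated code with the "L" suffix.
println(javaSeedLiteral(2147483648L, seedIsLong = true))  // 2147483648L
println(javaSeedLiteral(42L, seedIsLong = false))         // 42
```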
5f1cb2f Revert "[SPARK-29777][FOLLOW-UP][SPARKR] Remove no longer valid test for recursive calls" This reverts commit 81ea5a4cc1617de0dbbc61811847320724b3644f. 26 January 2020, 05:43:49 UTC
81ea5a4 [SPARK-29777][FOLLOW-UP][SPARKR] Remove no longer valid test for recursive calls ### What changes were proposed in this pull request? Disabling the test for cleaning the closure of a recursive function. ### Why are the changes needed? As of https://github.com/apache/spark/commit/9514b822a70d77a6298ece48e6c053200360302c this test is no longer valid, and recursive calls, even simple ones: ``` f <- function(x) { if(x > 0) { f(x - 1) } else { x } } ``` lead to ``` Error: node stack overflow ``` This issue is silenced when tested with `testthat` 1.x (reason unknown), but causes failures when using `testthat` 2.x (the issue can be reproduced outside the test context). The problem is known and tracked by [SPARK-30629](https://issues.apache.org/jira/browse/SPARK-30629). Therefore, keeping this test active doesn't make sense, as it will lead to continuous test failures when `testthat` is updated (https://github.com/apache/spark/pull/27359 / SPARK-23435). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. CC falaki Closes #27363 from zero323/SPARK-29777-FOLLOWUP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 January 2020, 05:41:15 UTC
1b3ddcf [SPARK-30645][SPARKR][TESTS][WINDOWS] Move Unicode test data to external file ### What changes were proposed in this pull request? Reference data for "collect() support Unicode characters" has been moved to an external file, to make the test OS- and locale-independent. ### Why are the changes needed? As-is, embedded data is not properly encoded on Windows: ``` library(SparkR) SparkR::sparkR.session() Sys.info() # sysname release version # "Windows" "Server x64" "build 17763" # nodename machine login # "WIN-5BLT6Q610KH" "x86-64" "Administrator" # user effective_user # "Administrator" "Administrator" Sys.getlocale() # [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252" lines <- c("{\"name\":\"안녕하세요\"}", "{\"name\":\"您好\", \"age\":30}", "{\"name\":\"こんにちは\", \"age\":19}", "{\"name\":\"Xin chào\"}") jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp") writeLines(lines, jsonPath) system(paste0("cat ", jsonPath)) # {"name":"<U+C548><U+B155><U+D558><U+C138><U+C694>"} # {"name":"<U+60A8><U+597D>", "age":30} # {"name":"<U+3053><U+3093><U+306B><U+3061><U+306F>", "age":19} # {"name":"Xin chào"} # [1] 0 df <- read.df(jsonPath, "json") printSchema(df) # root # |-- _corrupt_record: string (nullable = true) # |-- age: long (nullable = true) # |-- name: string (nullable = true) head(df) # _corrupt_record age name # 1 <NA> NA <U+C548><U+B155><U+D558><U+C138><U+C694> # 2 <NA> 30 <U+60A8><U+597D> # 3 <NA> 19 <U+3053><U+3093><U+306B><U+3061><U+306F> # 4 {"name":"Xin ch<U+FFFD>o"} NA <NA> ``` This can be reproduced outside tests (Windows Server 2019, English locale), and causes failures when `testthat` is updated to 2.x (https://github.com/apache/spark/pull/27359). Somehow the problem is not picked up when the test is executed on `testthat` 1.0.2. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Running the modified test, manual testing. ### Note An alternative seems to be to use bytes, but it hasn't been properly tested. ``` test_that("collect() support Unicode characters", { lines <- markUtf8(c( '{"name": "안녕하세요"}', '{"name": "您好", "age": 30}', '{"name": "こんにちは", "age": 19}', '{"name": "Xin ch\xc3\xa0o"}' )) jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp") writeLines(lines, jsonPath, useBytes = TRUE) expected <- regmatches(lines, regexec('(?<="name": ").*?(?=")', lines, perl = TRUE)) df <- read.df(jsonPath, "json") rdf <- collect(df) expect_true(is.data.frame(rdf)) rdf$name <- markUtf8(rdf$name) expect_equal(rdf$name[1], expected[[1]]) expect_equal(rdf$name[2], expected[[2]]) expect_equal(rdf$name[3], expected[[3]]) expect_equal(rdf$name[4], expected[[4]]) df1 <- createDataFrame(rdf) expect_equal( collect( where(df1, df1$name == expected[[2]]) )$name, expected[[2]] ) }) ``` Closes #27362 from zero323/SPARK-30645. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 40b1f4d87e0f24e4e7e2fd6fe37cc5398ae778f8) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 January 2020, 04:00:12 UTC
d7be535 [SPARK-30630][ML][2.4] Deprecate numTrees in GBT in 2.4.5 ### What changes were proposed in this pull request? Deprecate numTrees in GBT in 2.4.5 so it can be removed in 3.0.0 ### Why are the changes needed? Currently, GBT has ``` /** * Number of trees in ensemble */ @Since("2.0.0") val getNumTrees: Int = trees.length ``` and ``` /** Number of trees in ensemble */ val numTrees: Int = trees.length ``` I think we should remove one of them. I will deprecate it in 2.4.5 and remove it in 3.0.0 ### Does this PR introduce any user-facing change? Deprecate numTrees in 2.4.5 ### How was this patch tested? Existing tests Closes #27352 from huaxingao/spark-tree. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 January 2020, 10:01:49 UTC
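A minimal standalone sketch of the deprecation pattern described above (not the actual GBT model class, and the deprecation message is illustrative): keep the duplicate `val` for 2.4.5, mark it deprecated, and point callers at `getNumTrees` so the duplicate can go away in 3.0.0.

```scala
// Sketch only: a stand-in for an ensemble model with two equivalent tree counters.
class EnsembleModelSketch(trees: Array[AnyRef]) {
  /** Number of trees in ensemble */
  val getNumTrees: Int = trees.length

  /** Number of trees in ensemble */
  @deprecated("Use getNumTrees instead; this field will be removed in 3.0.0.", "2.4.5")
  val numTrees: Int = trees.length
}
```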
2fc562c [SPARK-30556][SQL][2.4] Copy sparkContext.localProperties to child thread in SubqueryExec.executionContext ### What changes were proposed in this pull request? In `org.apache.spark.sql.execution.SubqueryExec#relationFuture`, make a copy of `org.apache.spark.SparkContext#localProperties` and pass it to the sub-execution thread in `org.apache.spark.sql.execution.SubqueryExec#executionContext` ### Why are the changes needed? Local properties set via sparkContext are not available as TaskContext properties when executing jobs and the thread pools have idle threads which are reused. Explanation: when `SubqueryExec` runs, the relationFuture is evaluated on a separate thread. The threads inherit the `localProperties` from `sparkContext` as they are child threads. These threads are created in the `executionContext` (thread pools). Each thread pool has a default keepAliveSeconds of 60 seconds for idle threads. In scenarios where the thread pool has idle threads that are reused for a subsequent new query, the thread-local properties will not be inherited from the Spark context (thread properties are inherited only on thread creation) and hence end up being stale or missing. This will cause taskset properties to be missing when properties are transferred by the child thread via `sparkContext.runJob/submitJob` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added UT Closes #27340 from ajithme/subquerylocalprop2. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2020, 17:00:01 UTC
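A standalone sketch of the pattern the backport applies (not Spark's actual code; names are illustrative): capture the caller's thread-local properties on the submitting thread, then install the copy on the pooled thread before running the work, since `InheritableThreadLocal` values are inherited only at thread creation, not when an idle pool thread is reused.

```scala
import java.util.Properties
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object LocalPropsSketch {
  // Stand-in for SparkContext.localProperties.
  val localProps = new InheritableThreadLocal[Properties] {
    override def initialValue(): Properties = new Properties()
  }

  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

  def submitWithProps[T](body: => T): Future[T] = {
    // Copy on the calling thread, before handing off to the pool.
    val captured = localProps.get().clone().asInstanceOf[Properties]
    Future {
      // Install the snapshot on the (possibly reused) pool thread.
      localProps.set(captured)
      body
    }
  }
}
```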
c6b02cf [SPARK-30601][BUILD][2.4] Add Google Maven Central as a primary repository ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/27307 This PR proposes to address four things. Three issues and fixes were a bit mixed, so this PR sorts them out. See also http://apache-spark-developers-list.1001551.n3.nabble.com/Adding-Maven-Central-mirror-from-Google-to-the-build-td28728.html for the discussion in the mailing list. 1. Add the Google Maven Central mirror (GCS) as a primary repository. This will not only make development more stable but also make the GitHub Actions build (where it is always required to download jars) more stable. In the case of the Jenkins PR builder, it wouldn't be affected too much as it uses the pre-downloaded jars under `.m2`. - Google Maven Central seems stable for heavy workloads but is not synced very quickly (e.g., a new release can be missing) - Maven Central (default) seems less stable but is synced quickly. We already added this GCS mirror as a default additional remote repository at SPARK-29175. So I don't see an issue with adding it as a repo. https://github.com/apache/spark/blob/abf759a91e01497586b8bb6b7a314dd28fd6cff1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2111-L2118 2. Currently, we have the hard-coded repository in [`sbt-pom-reader`](https://github.com/JoshRosen/sbt-pom-reader/blob/v1.0.0-spark/src/main/scala/com/typesafe/sbt/pom/MavenPomResolver.scala#L32) and this seems to overwrite Maven's existing resolver with the same ID `central` using `http://` when the pom file is initially ported into the SBT instance. This uses `http://`, which Maven Central recently disallowed (see https://github.com/apache/spark/pull/27242) My speculation is that we just need to be able to load the plugin and let it convert the POM to an SBT instance with another fallback repo. After that, it _seems_ to use `central` with `https` properly. See also https://github.com/apache/spark/pull/27307#issuecomment-576720395. I double checked that we use `https` properly from the SBT build as well: ``` [debug] downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom ... [debug] public: downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom [debug] public: downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom.sha1 ``` This was fixed by adding the same repo (https://github.com/apache/spark/pull/27281), `central_without_mirror`, which is a bit awkward. Instead, this PR adds GCS as a main repo, and the community Maven Central as a fallback repo. So, presumably, the community Maven Central repo is used as a fallback when the plugin is loaded. 3. While I am here, I fix another issue. The GitHub Action at https://github.com/apache/spark/pull/27279 is failing. The reason seems to be that scalafmt 1.0.3 is in Maven Central but not in GCS. 
``` org.apache.maven.plugin.PluginResolutionException: Plugin org.antipathy:mvn-scalafmt_2.12:1.0.3 or one of its dependencies could not be resolved: Could not find artifact org.antipathy:mvn-scalafmt_2.12:jar:1.0.3 in google-maven-central (https://maven-central.storage-download.googleapis.com/repos/central/data/) at org.apache.maven.plugin.internal.DefaultPluginDependenciesResolver.resolve (DefaultPluginDependenciesResolver.java:131) ``` `mvn-scalafmt` exists in Maven Central: ```bash $ curl https://repo.maven.apache.org/maven2/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom ``` ```xml <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> <modelVersion>4.0.0</modelVersion> ... ``` whereas it is not in the GCS mirror: ```bash $ curl https://maven-central.storage-download.googleapis.com/repos/central/data/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom ``` ```xml <?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: maven-central/repos/central/data/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom</Details></Error>% ``` In this PR, simply make both repos accessible by adding both to `pluginRepositories`. 4. Remove the workarounds in GitHub Actions to switch mirrors, because now we have the same repos in the same order (Google Maven Central first, and Maven Central second) ### Why are the changes needed? To make the build and GitHub Actions more stable. ### Does this PR introduce any user-facing change? No, dev-only change. ### How was this patch tested? I roughly checked locally and against PRs on my fork (https://github.com/HyukjinKwon/spark/pull/2 and https://github.com/HyukjinKwon/spark/pull/3). Closes #27335 from HyukjinKwon/tmp. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2020, 16:56:39 UTC
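A hypothetical sbt-style sketch of the repository ordering described above, using the two URLs quoted in this message; the resolver names are illustrative and this is not the actual build change.

```scala
// build.sbt sketch: query the Google-hosted Maven Central mirror first, then fall
// back to the community Maven Central repository.
resolvers ++= Seq(
  "gcs-maven-central-mirror" at
    "https://maven-central.storage-download.googleapis.com/repos/central/data/",
  "maven-central-fallback" at
    "https://repo.maven.apache.org/maven2/"
)
```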