https://github.com/apache/spark

5830101 Preparing Spark release v2.4.1-rc9 26 March 2019, 04:38:19 UTC
c7fd233 [SPARK-26961][CORE] Enable parallel classloading capability ## What changes were proposed in this pull request? As per https://docs.oracle.com/javase/8/docs/api/java/lang/ClassLoader.html ``Class loaders that support concurrent loading of classes are known as parallel capable class loaders and are required to register themselves at their class initialization time by invoking the ClassLoader.registerAsParallelCapable method. Note that the ClassLoader class is registered as parallel capable by default. However, its subclasses still need to register themselves if they are parallel capable. `` i.e. we can have finer class loading locks by registering classloaders as parallel capable. (Refer to the deadlock due to the macro lock in https://issues.apache.org/jira/browse/SPARK-26961.) All the classloaders we have are wrappers of URLClassLoader, which is itself parallel capable. But this cannot be achieved from Scala code due to the static-registration limitation; refer to https://github.com/scala/bug/issues/11429 ## How was this patch tested? All existing UTs must pass. Closes #24126 from ajithme/driverlock. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit b61dce23d2ee7ca95770bc7c390029aae8c65f7e) Signed-off-by: Sean Owen <sean.owen@databricks.com> 26 March 2019, 00:07:51 UTC
6e4cd88 [SPARK-27274][DOCS] Fix references to scala 2.11 in 2.4.1+ docs; Note 2.11 support is deprecated in 2.4.1+ ## What changes were proposed in this pull request? Fix references to scala 2.11 in 2.4.x docs; should default to 2.12. Note 2.11 support is deprecated in 2.4.x. Note that this change isn't needed in master as it's already on 2.12 in docs by default. ## How was this patch tested? Docs build. Closes #24210 from srowen/Scala212docs24. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 26 March 2019, 00:06:17 UTC
f27c951 [SPARK-27198][CORE] Heartbeat interval mismatch in driver and executor ## What changes were proposed in this pull request? When the heartbeat interval is configured via spark.executor.heartbeatInterval without specifying units, the driver interprets the value in seconds while the executor interprets it in milliseconds, so the two sides disagree on the interval. ## How was this patch tested? Will add UTs Closes #24140 from ajithme/intervalissue. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 25 March 2019, 20:38:07 UTC
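A minimal configuration sketch (not part of the patch itself) of the workaround implied above: give the interval an explicit time unit so the driver and executor cannot parse it differently. The app name is illustrative.

```scala
// Minimal sketch, assuming an explicit unit sidesteps the seconds-vs-milliseconds ambiguity.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("heartbeat-interval-example") // illustrative name
  .set("spark.executor.heartbeatInterval", "10s") // explicit unit; a bare number is what triggered the mismatch
```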
704b75d [SPARK-27094][YARN][BRANCH-2.4] Work around RackResolver swallowing thread interrupt. To avoid the case where the YARN libraries would swallow the exception and prevent YarnAllocator from shutting down, call the offending code in a separate thread, so that the parent thread can respond appropriately to the shut down. As a safeguard, also explicitly stop the executor launch thread pool when shutting down the application, to prevent new executors from coming up after the application started its shutdown. Tested with unit tests + some internal tests on real cluster. Closes #24206 from vanzin/SPARK-27094-2.4. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: DB Tsai <d_tsai@apple.com> 25 March 2019, 18:13:20 UTC
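The pattern described above (calling the offending code in a separate thread so the parent stays responsive to interrupts) can be sketched generically as below; this is an illustration, not the actual YarnAllocator code, and `callInSeparateThread` is a made-up helper name.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hedged sketch: run a library call that may swallow thread interrupts in a helper thread,
// so the calling thread still sees its own interruption and can shut down promptly.
def callInSeparateThread[T](name: String)(body: => T): Option[T] = {
  val result = new AtomicReference[Option[T]](None)
  val worker = new Thread(new Runnable {
    override def run(): Unit = result.set(Some(body))
  }, name)
  worker.setDaemon(true)
  worker.start()
  try {
    worker.join() // an interrupt of the calling thread surfaces here instead of being swallowed
    result.get()
  } catch {
    case _: InterruptedException =>
      Thread.currentThread().interrupt() // preserve the interrupt flag for the caller
      None
  }
}
```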
0faf828 Revert "Revert "[SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client"" This reverts commit 3fc626d874d0201ada8387a7e5806672c79cd6b3. Closes #24192 from HeartSaVioR/WIP-testing-SPARK-26606-in-branch-2.4. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 March 2019, 23:46:36 UTC
0cfefa7 [SPARK-24935][SQL] fix Hive UDAF with two aggregation buffers ## What changes were proposed in this pull request? A Hive UDAF knows the aggregation mode when creating the aggregation buffer, so that it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107). However, the Hive UDAF adapter in Spark always creates the buffer with partial1 mode, which can only deal with one input: the original data. This PR fixes it. All credit goes to pgandhi999, who investigated the problem, studied the Hive UDAF behaviors, and wrote the tests. Closes https://github.com/apache/spark/pull/23778 ## How was this patch tested? A new test. Closes #24144 from cloud-fan/hive. Lead-authored-by: pgandhi <pgandhi@verizonmedia.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit a6c207c9c0c7aa057cfa27d16fe882b396440113) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 24 March 2019, 23:08:39 UTC
3fc626d Revert "[SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client" This reverts commit 6f1a8d8bfdd8dccc9af2d144ea5ad644ddc63a81. 23 March 2019, 19:19:34 UTC
f3ba73a [SPARK-27160][SQL] Fix DecimalType when building orc filters A DecimalType Literal should not be cast to Long. E.g., for `df.filter("x < 3.14")`, assuming df (x in DecimalType) reads from an ORC table and uses the native ORC reader with predicate pushdown enabled, we will push down the `x < 3.14` predicate to the ORC reader via a SearchArgument. OrcFilters will construct the SearchArgument, but does not handle the DecimalType correctly. The previous implementation would construct `x < 3` from `x < 3.14`. ``` $ sbt > sql/testOnly *OrcFilterSuite > sql/testOnly *OrcQuerySuite -- -z "27160" ``` Closes #24092 from sadhen/spark27160. Authored-by: Darcy Shen <sadhen@zoho.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 March 2019, 17:33:22 UTC
6f1a8d8 [SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client This patch fixes the issue that ClientEndpoint in a standalone cluster doesn't recognize driver options that are passed in SparkConf instead of as system properties. When `Client` is executed via the CLI they should be provided as system properties, but with `spark-submit` they can be provided in SparkConf. (SparkSubmit will call `ClientApp.start` with a SparkConf that contains these options.) Manually tested via the following steps: 1) setup a standalone cluster (launch master and worker via `./sbin/start-all.sh`) 2) submit one of the example apps with standalone cluster mode ``` ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master "spark://localhost:7077" --conf "spark.driver.extraJavaOptions=-Dfoo=BAR" --deploy-mode "cluster" --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 examples/jars/spark-examples*.jar 10 ``` 3) check whether `foo=BAR` is provided in system properties in the Spark UI (screenshot: https://user-images.githubusercontent.com/1317309/54728501-97db1700-4bc1-11e9-89da-078445c71e9b.png) Closes #24163 from HeartSaVioR/SPARK-26606. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 8a9eb05137cd4c665f39a54c30d46c0c4eb7d20b) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 22 March 2019, 22:12:35 UTC
95e73b3 [SPARK-27112][CORE] Create a resource ordering between threads to resolve the deadlocks encountered when trying to kill executors either due to dynamic allocation or blacklisting Closes #24072 from pgandhi999/SPARK-27112-2. Authored-by: pgandhi <pgandhi@verizonmedia.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> ## What changes were proposed in this pull request? There are two deadlocks as a result of the interplay between three different threads: **task-result-getter thread** **spark-dynamic-executor-allocation thread** **dispatcher-event-loop thread(makeOffers())** The fix ensures ordering synchronization constraint by acquiring lock on `TaskSchedulerImpl` before acquiring lock on `CoarseGrainedSchedulerBackend` in `makeOffers()` as well as killExecutors() method. This ensures resource ordering between the threads and thus, fixes the deadlocks. ## How was this patch tested? Manual Tests Closes #24134 from pgandhi999/branch-2.4-SPARK-27112. Authored-by: pgandhi <pgandhi@verizonmedia.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> 19 March 2019, 21:22:40 UTC
342e91f [SPARK-27178][K8S][BRANCH-2.4] adding nss package to fix tests ## What changes were proposed in this pull request? see also: https://github.com/apache/spark/pull/24111 while performing some tests on our existing minikube and k8s infrastructure, i noticed that the integration tests were failing. i dug in and discovered the following message buried at the end of the stacktrace: ``` Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so at sun.security.pkcs11.Secmod.initialize(Secmod.java:193) at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:218) ... 81 more ``` after i added the `nss` package to `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile`, everything worked. this is also impacting current builds. see: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/8959/console ## How was this patch tested? i tested locally before pushing, and the build system will test the rest. Closes #24137 from shaneknapp/add-nss-package. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 18 March 2019, 23:51:57 UTC
361c942 [SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array ## What changes were proposed in this pull request? Correct the logic to compute the distinct. Below is a small repro snippet. ``` scala> val df = Seq(Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))).toDF("array_col") df: org.apache.spark.sql.DataFrame = [array_col: array<array<int>>] scala> val distinctDF = df.select(array_distinct(col("array_col"))) distinctDF: org.apache.spark.sql.DataFrame = [array_distinct(array_col): array<array<int>>] scala> df.show(false) +----------------------------------------+ |array_col | +----------------------------------------+ |[[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]| +----------------------------------------+ ``` Error ``` scala> distinctDF.show(false) +-------------------------+ |array_distinct(array_col)| +-------------------------+ |[[1, 2], [1, 2], [1, 2]] | +-------------------------+ ``` Expected result ``` scala> distinctDF.show(false) +-------------------------+ |array_distinct(array_col)| +-------------------------+ |[[1, 2], [3, 4], [4, 5]] | +-------------------------+ ``` ## How was this patch tested? Added an additional test. Closes #24073 from dilipbiswal/SPARK-27134. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit aea9a574c44768d1d93ee7e8069729383859292c) Signed-off-by: Sean Owen <sean.owen@databricks.com> 16 March 2019, 19:31:17 UTC
a2f9684 [SPARK-27165][SPARK-27107][BRANCH-2.4][BUILD][SQL] Upgrade Apache ORC to 1.5.5 ## What changes were proposed in this pull request? This PR aims to update Apache ORC dependency to fix [SPARK-27107](https://issues.apache.org/jira/browse/SPARK-27107) . ``` [ORC-452] Support converting MAP column from JSON to ORC Improvement [ORC-447] Change the docker scripts to keep a persistent m2 cache [ORC-463] Add `version` command [ORC-475] ORC reader should lazily get filesystem [ORC-476] Make SearchAgument kryo buffer size configurable ``` ## How was this patch tested? Pass the Jenkins with the existing tests. Closes #24097 from dongjoon-hyun/SPARK-27165-2.4. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 March 2019, 03:13:21 UTC
2d4e9cf [SPARK-26742][K8S][BRANCH-2.4] Update k8s client version to 4.1.2 ## What changes were proposed in this pull request? Updates client version and fixes some related issues. ## How was this patch tested? Tested with the latest minikube version and k8s 1.13. KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. Run completed in 4 minutes, 20 seconds. Total number of tests run: 14 Suites: completed 2, aborted 0 Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0 All tests passed. [INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM 2.4.2-SNAPSHOT ............ SUCCESS [ 2.980 s] [INFO] Spark Project Tags ................................. SUCCESS [ 2.880 s] [INFO] Spark Project Local DB ............................. SUCCESS [ 1.954 s] [INFO] Spark Project Networking ........................... SUCCESS [ 3.369 s] [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 1.791 s] [INFO] Spark Project Unsafe ............................... SUCCESS [ 1.845 s] [INFO] Spark Project Launcher ............................. SUCCESS [ 3.725 s] [INFO] Spark Project Core ................................. SUCCESS [ 23.572 s] [INFO] Spark Project Kubernetes Integration Tests 2.4.2-SNAPSHOT SUCCESS [04:25 min] [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 05:08 min [INFO] Finished at: 2019-03-06T18:03:55Z [INFO] ------------------------------------------------------------------------ Closes #23993 from skonto/fix-k8s-version. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 14 March 2019, 16:29:52 UTC
7f5bdd7 [MINOR][CORE] Use https for bintray spark-packages repository ## What changes were proposed in this pull request? This patch changes the schema of url from http to https for bintray spark-packages repository. Looks like we already changed the schema of repository url for pom.xml but missed inside the code. ## How was this patch tested? Manually ran the `--package` via `./bin/spark-shell --verbose --packages "RedisLabs:spark-redis:0.3.2"` ``` ... Ivy Default Cache set to: /Users/jlim/.ivy2/cache The jars for the packages stored in: /Users/jlim/.ivy2/jars :: loading settings :: url = jar:file:/Users/jlim/WorkArea/ScalaProjects/spark/dist/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml RedisLabs#spark-redis added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-2fee2e18-7832-4a4d-9e97-7b3d0fef766d;1.0 confs: [default] found RedisLabs#spark-redis;0.3.2 in spark-packages found redis.clients#jedis;2.7.2 in central found org.apache.commons#commons-pool2;2.3 in central downloading https://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar ... [SUCCESSFUL ] RedisLabs#spark-redis;0.3.2!spark-redis.jar (824ms) downloading https://repo1.maven.org/maven2/redis/clients/jedis/2.7.2/jedis-2.7.2.jar ... [SUCCESSFUL ] redis.clients#jedis;2.7.2!jedis.jar (576ms) downloading https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.3/commons-pool2-2.3.jar ... [SUCCESSFUL ] org.apache.commons#commons-pool2;2.3!commons-pool2.jar (150ms) :: resolution report :: resolve 4586ms :: artifacts dl 1555ms :: modules in use: RedisLabs#spark-redis;0.3.2 from spark-packages in [default] org.apache.commons#commons-pool2;2.3 from central in [default] redis.clients#jedis;2.7.2 from central in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 3 | 3 | 3 | 0 || 3 | 3 | --------------------------------------------------------------------- ``` Closes #24061 from HeartSaVioR/MINOR-use-https-to-bintray-repository. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit f57af2286f85bf67706e14fecfbfd9ef034c2927) Signed-off-by: Sean Owen <sean.owen@databricks.com> 12 March 2019, 23:01:43 UTC
432ea69 [SPARK-26927][CORE] Ensure executor is active when processing events in dynamic allocation manager. There is a race condition in the `ExecutorAllocationManager` where the `SparkListenerExecutorRemoved` event is posted before the `SparkListenerTaskStart` event, which leads to an incorrect `executorIds` result. Then, when some executor idles, real executors will be removed even when the actual executor number equals `minNumExecutors`, due to the incorrect computation of `newExecutorTotal` (which may be greater than `minNumExecutors`), finally causing zero available executors while a wrong positive number of executorIds is kept in memory. What's more, even the `SparkListenerTaskEnd` event cannot release the fake `executorIds`, because a later idle event for the fake executors cannot cause the real removal of these executors: they are already removed and do not exist in the `executorDataMap` of `CoarseGrainedSchedulerBackend`, so the `onExecutorRemoved` method will never be called again. For details see https://issues.apache.org/jira/browse/SPARK-26927 This PR fixes this problem. Existing UTs and an added UT. Closes #23842 from liupc/Fix-race-condition-that-casues-dyanmic-allocation-not-working. Lead-authored-by: Liupengcheng <liupengcheng@xiaomi.com> Co-authored-by: liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit d5cfe08fdc7ad07e948f329c0bdeeca5c2574a18) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 12 March 2019, 21:13:20 UTC
dba5bac Preparing development version 2.4.2-SNAPSHOT 10 March 2019, 06:34:15 UTC
746b3dd Preparing Spark release v2.4.1-rc8 10 March 2019, 06:33:54 UTC
a017a1c [SPARK-27097][CHERRY-PICK 2.4] Avoid embedding platform-dependent offsets literally in whole-stage generated code ## What changes were proposed in this pull request? Spark SQL performs whole-stage code generation to speed up query execution. There are two steps to it: - Java source code is generated from the physical query plan on the driver. A single version of the source code is generated from a query plan, and sent to all executors. - It's compiled to bytecode on the driver to catch compilation errors before sending to executors, but currently only the generated source code gets sent to the executors. The bytecode compilation is for fail-fast only. - Executors receive the generated source code and compile to bytecode, then the query runs like a hand-written Java program. In this model, there's an implicit assumption about the driver and executors being run on similar platforms. Some code paths accidentally embedded platform-dependent object layout information into the generated code, such as: ```java Platform.putLong(buffer, /* offset */ 24, /* value */ 1); ``` This code expects a field to be at offset +24 of the `buffer` object, and sets a value to that field. But whole-stage code generation generally uses platform-dependent information from the driver. If the object layout is significantly different on the driver and executors, the generated code can be reading/writing to wrong offsets on the executors, causing all kinds of data corruption. One code pattern that leads to such problem is the use of `Platform.XXX` constants in generated code, e.g. `Platform.BYTE_ARRAY_OFFSET`. Bad: ```scala val baseOffset = Platform.BYTE_ARRAY_OFFSET // codegen template: s"Platform.putLong($buffer, $baseOffset, $value);" ``` This will embed the value of `Platform.BYTE_ARRAY_OFFSET` on the driver into the generated code. Good: ```scala val baseOffset = "Platform.BYTE_ARRAY_OFFSET" // codegen template: s"Platform.putLong($buffer, $baseOffset, $value);" ``` This will generate the offset symbolically -- `Platform.putLong(buffer, Platform.BYTE_ARRAY_OFFSET, value)`, which will be able to pick up the correct value on the executors. Caveat: these offset constants are declared as runtime-initialized `static final` in Java, so they're not compile-time constants from the Java language's perspective. It does lead to a slightly increased size of the generated code, but this is necessary for correctness. NOTE: there can be other patterns that generate platform-dependent code on the driver which is invalid on the executors. e.g. if the endianness is different between the driver and the executors, and if some generated code makes strong assumption about endianness, it would also be problematic. ## How was this patch tested? Added a new test suite `WholeStageCodegenSparkSubmitSuite`. This test suite needs to set the driver's extraJavaOptions to force the driver and executor use different Java object layouts, so it's run as an actual SparkSubmit job. Authored-by: Kris Mok <kris.mokdatabricks.com> Closes #24032 from gatorsmile/testFailure. Lead-authored-by: Kris Mok <kris.mok@databricks.com> Co-authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com> 10 March 2019, 06:00:36 UTC
53590f2 [SPARK-27111][SS] Fix a race that a continuous query may fail with InterruptedException Before a Kafka consumer gets assigned with partitions, its offset will contain 0 partitions. However, runContinuous will still run and launch a Spark job having 0 partitions. In this case, there is a race that epoch may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` will get interrupted unintentionally. To handle this case, this PR has the following changes: - Clean up the resources in `queryExecutionThread.runUninterruptibly`. This may increase the waiting time of `stop` but should be minor because the operations here are very fast (just sending an RPC message in the same process and stopping a very simple thread). - Clear the interrupted status at the end so that it won't impact the `runContinuous` call. We may clear the interrupted status set by `stop`, but it doesn't affect the query termination because `runActivatedStream` will check `state` and exit accordingly. I also updated the clean up codes to make sure exceptions thrown from `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` won't stop the clean up. Jenkins Closes #24034 from zsxwing/SPARK-27111. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> (cherry picked from commit 6e1c0827ece1cdc615196e60cb11c76b917b8eeb) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 09 March 2019, 22:33:34 UTC
c1b6fe4 [SPARK-27080][SQL] bug fix: mergeWithMetastoreSchema with uniform lower case comparison When reading a Parquet file and merging the metastore schema with the file schema, we should compare field names using a uniform case. The current implementation uses lowercase but misses one spot; this patch fixes it. Unit test Closes #24001 from codeborui/mergeSchemaBugFix. Authored-by: CodeGod <> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a29df5fa02111f57965be2ab5e208f5c815265fe) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 March 2019, 13:30:55 UTC
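An illustrative sketch of what a uniform lower-case comparison looks like when reconciling the two schemas; `mergeWithMetastore` and the field handling are assumptions for illustration, not the patched Spark code.

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Match file fields to metastore fields case-insensitively, keeping the metastore casing.
def mergeWithMetastore(metastore: StructType, parquet: StructType): StructType = {
  val parquetByLowerName = parquet.fields.map(f => f.name.toLowerCase -> f).toMap
  StructType(metastore.fields.map { ms =>
    parquetByLowerName.get(ms.name.toLowerCase) match {
      case Some(pq) => pq.copy(name = ms.name) // metastore name, file type/nullability
      case None     => ms
    }
  })
}
```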
0297ff5 Preparing development version 2.4.2-SNAPSHOT 08 March 2019, 20:38:02 UTC
e87fe15 Preparing Spark release v2.4.1-rc7 08 March 2019, 20:37:43 UTC
216eeec [SPARK-26604][CORE][BACKPORT-2.4] Clean up channel registration for StreamManager ## What changes were proposed in this pull request? This is mostly a clean backport of https://github.com/apache/spark/pull/23521 to branch-2.4 ## How was this patch tested? I've tested this with a hack in `TransportRequestHandler` to force `ChunkFetchRequest` to get dropped. Then making a number of `ExternalShuffleClient.fetchChunk` requests (which `OpenBlocks` then `ChunkFetchRequest`) and closing out of my test harness. A heap dump later reveals that the `StreamState` references are unreachable. I haven't run this through the unit test suite, but doing that now. Wanted to get this up as I think folks are waiting for it for 2.4.1 Closes #24013 from abellina/SPARK-26604_cherry_pick_2_4. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Alessandro Bellina <abellina@yahoo-inc.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 08 March 2019, 03:48:20 UTC
f7ad4ff [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats ## What changes were proposed in this pull request? `CodeGenerator.updateAndGetCompilationStats` throws an unsupported exception for empty code size statistics. This pr added code to check if it is empty or not. ## How was this patch tested? Pass Jenkins. Closes #23947 from maropu/SPARK-21871-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 07 March 2019, 08:38:48 UTC
9702915 [SPARK-27078][SQL] Fix NoSuchFieldError when reading Hive materialized views ## What changes were proposed in this pull request? This PR fixes a `NoSuchFieldError` when reading Hive materialized views from Hive 2.3.4. How to reproduce: Hive side: ```sql CREATE TABLE materialized_view_tbl (key INT); CREATE MATERIALIZED VIEW view_1 DISABLE REWRITE AS SELECT * FROM materialized_view_tbl; ``` Spark side: ```java bin/spark-sql --conf spark.sql.hive.metastore.version=2.3.4 --conf spark.sql.hive.metastore.jars=maven spark-sql> select * from view_1; 19/03/05 19:55:37 ERROR SparkSQLDriver: Failed in [select * from view_1] java.lang.NoSuchFieldError: INDEX_TABLE at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$3(HiveClientImpl.scala:438) at scala.Option.map(Option.scala:163) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$1(HiveClientImpl.scala:370) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:277) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:215) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:214) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:260) at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:368) ``` ## How was this patch tested? unit tests Closes #23984 from wangyum/SPARK-24360. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 32848eecc55946ad91e62e231d2e310a0270a63d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 March 2019, 00:58:03 UTC
35381dd [SPARK-27019][SQL][WEBUI] onJobStart happens after onExecutionEnd shouldn't overwrite kvstore ## What changes were proposed in this pull request? Currently, when event reordering happens, especially when an onJobStart event comes after an onExecutionEnd event, the SQL page in the UI displays weirdly (e.g., the test mentioned in the JIRA; this issue also occurs randomly when a TPCDS query fails due to broadcast timeout, etc.). The reason is that, in SQLAppStatusListener, we remove the liveExecutions entry once the execution ends. So if a jobStart event comes after that, we create a new liveExecution entry corresponding to the execId. Eventually this overwrites the kvstore and the UI displays confusing entries. ## How was this patch tested? Added UT; also manually tested with the event log of the failed query provided in the JIRA. Before fix: ![screenshot from 2019-03-03 03-05-52](https://user-images.githubusercontent.com/23054875/53687929-53e2b800-3d61-11e9-9dca-620fa41e605c.png) After fix: ![screenshot from 2019-03-03 02-40-18](https://user-images.githubusercontent.com/23054875/53687928-4f1e0400-3d61-11e9-86aa-584646ac68f9.png) Closes #23939 from shahidki31/SPARK-27019. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 62fd133f744ab2d1aa3c409165914b5940e4d328) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 06 March 2019, 22:02:45 UTC
7df5aa6 [SPARK-27065][CORE] avoid more than one active task set managers for a stage ## What changes were proposed in this pull request? This is another attempt to fix the more-than-one-active-task-set-managers bug. https://github.com/apache/spark/pull/17208 is the first attempt. It marks the TSM as zombie before sending a task completion event to DAGScheduler. This is necessary, because when the DAGScheduler gets the task completion event, and it's for the last partition, then the stage is finished. However, if it's a shuffle stage and it has missing map outputs, DAGScheduler will resubmit it(see the [code](https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1416-L1422)) and create a new TSM for this stage. This leads to more than one active TSM of a stage and fail. This fix has a hole: Let's say a stage has 10 partitions and 2 task set managers: TSM1(zombie) and TSM2(active). TSM1 has a running task for partition 10 and it completes. TSM2 finishes tasks for partitions 1-9, and thinks he is still active because he hasn't finished partition 10 yet. However, DAGScheduler gets task completion events for all the 10 partitions and thinks the stage is finished. Then the same problem occurs: DAGScheduler may resubmit the stage and cause more than one actice TSM error. https://github.com/apache/spark/pull/21131 fixed this hole by notifying all the task set managers when a task finishes. For the above case, TSM2 will know that partition 10 is already completed, so he can mark himself as zombie after partitions 1-9 are completed. However, #21131 still has a hole: TSM2 may be created after the task from TSM1 is completed. Then TSM2 can't get notified about the task completion, and leads to the more than one active TSM error. #22806 and #23871 are created to fix this hole. However the fix is complicated and there are still ongoing discussions. This PR proposes a simple fix, which can be easy to backport: mark all existing task set managers as zombie when trying to create a new task set manager. After this PR, #21131 is still necessary, to avoid launching unnecessary tasks and fix [SPARK-25250](https://issues.apache.org/jira/browse/SPARK-25250 ). #22806 and #23871 are its followups to fix the hole. ## How was this patch tested? existing tests. Closes #23927 from cloud-fan/scheduler. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> (cherry picked from commit cb20fbc43e7f54af1ed30b9eb6d76ca50b4eb750) Signed-off-by: Imran Rashid <irashid@cloudera.com> 06 March 2019, 18:01:07 UTC
db86ccb [SPARK-23433][SPARK-25250][CORE] Later created TaskSet should learn about the finished partitions ## What changes were proposed in this pull request? This is an alternative solution for #22806. #21131 first implemented the behavior that a previously completed task from a zombie TaskSetManager could also mark the partition as finished in the active TaskSetManager, based on the assumption that an active TaskSetManager always exists for that stage when this happens. But that's not always true, as an active TaskSetManager may not have been created yet when a previous task succeeds, and this is the reason why #22806 hit the issue. This PR extends #21131's behavior by adding `stageIdToFinishedPartitions` to TaskSchedulerImpl, which records the finished partitions whenever a task (from a zombie or active TaskSetManager) succeeds. Thus, a later-created active TaskSetManager can also learn about the finished partitions by looking into `stageIdToFinishedPartitions` and won't launch any duplicate tasks. ## How was this patch tested? Added tests. Closes #23871 from Ngone51/dev-23433-25250. Lead-authored-by: wuyi <ngone_5451@163.com> Co-authored-by: Ngone51 <ngone_5451@163.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> (cherry picked from commit e5c61436a5720f13eb6d530ebf80635522bd64c6) Signed-off-by: Imran Rashid <irashid@cloudera.com> 06 March 2019, 17:53:39 UTC
5ec4563 [SPARK-24669][SQL] Invalidate tables in case of DROP DATABASE CASCADE ## What changes were proposed in this pull request? Before dropping a database, refresh the tables of that database so that all cached entries associated with those tables are refreshed. We follow the same approach when dropping a table. A UT is added. Closes #23905 from Udbhav30/SPARK-24669. Authored-by: Udbhav30 <u.agrawal30@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 9bddf7180e9e76e1cabc580eee23962dd66f84c3) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 March 2019, 17:07:57 UTC
b583bfe [SPARK-26932][DOC] Add a warning for Hive 2.1.1 ORC reader issue Hive 2.1.1 cannot read ORC tables created by Spark 2.4.0 by default, so I added this information to sql-migration-guide-upgrade.md. For details, see [SPARK-26932](https://issues.apache.org/jira/browse/SPARK-26932). Doc build. Closes #23944 from haiboself/SPARK-26932. Authored-by: Bo Hai <haibo-self@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c27caead43423d1f994f42502496d57ea8389dc0) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 March 2019, 20:08:07 UTC
498fb70 [MINOR][DOCS] Clarify that Spark apps should mark Spark as a 'provided' dependency, not package it ## What changes were proposed in this pull request? Spark apps do not need to package Spark. In fact it can cause problems in some cases. Our examples should show depending on Spark as a 'provided' dependency. Packaging Spark makes the app much bigger by tens of megabytes. It can also bring in conflicting dependencies that wouldn't otherwise be a problem. https://issues.apache.org/jira/browse/SPARK-26146 was what reminded me of this. ## How was this patch tested? Doc build Closes #23938 from srowen/Provided. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 39092236819da097e9c8a3b2fa975105f08ae5b9) Signed-off-by: Sean Owen <sean.owen@databricks.com> 05 March 2019, 14:27:03 UTC
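A short sbt illustration of the recommendation above; the module list and version are examples, not the doc's exact snippet.

```scala
// Mark Spark modules as "provided" so they are not bundled into the application jar;
// the cluster supplies them at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.1" % "provided"
)
```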
ae462b1 [SPARK-27046][DSTREAMS] Remove SPARK-19185 related references from documentation ## What changes were proposed in this pull request? SPARK-19185 is resolved so the reference can be removed from the documentation. ## How was this patch tested? cd docs/ SKIP_API=1 jekyll build Manual webpage check. Closes #23959 from gaborgsomogyi/SPARK-27046. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 5252d8b9872cbf200651b0bb7b8c6edd649ebb58) Signed-off-by: Sean Owen <sean.owen@databricks.com> 04 March 2019, 15:32:19 UTC
3336a21 [SPARK-26990][SQL][BACKPORT-2.4] FileIndex: use user specified field names if possible ## What changes were proposed in this pull request? Back-port of #23894 to branch-2.4. With the following file structure: ``` /tmp/data └── a=5 ``` In the previous release: ``` scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema root |-- ID: long (nullable = true) |-- A: integer (nullable = true) ``` While in current code: ``` scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema root |-- ID: long (nullable = true) |-- a: integer (nullable = true) ``` We can see that the partition column name `a` is different from `A` as the user specified. This PR fixes the case and makes it more user-friendly. Closes #23894 from gengliangwang/fileIndexSchema. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> ## How was this patch tested? Unit test Closes #23909 from bersprockets/backport-SPARK-26990. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 February 2019, 01:37:07 UTC
b031f4a [MINOR][BUILD] Update all checkstyle dtd to use "https://checkstyle.org" ## What changes were proposed in this pull request? Below build failed with Java checkstyle test, but instead of violation it shows FileNotFound on dtd file. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102751/ Looks like the link of dtd file is dead `http://www.puppycrawl.com/dtds/configuration_1_3.dtd`. This patch updates the dtd link to "https://checkstyle.org/dtds/" given checkstyle repository also updated the URL path. https://github.com/checkstyle/checkstyle/issues/5601 ## How was this patch tested? Checked the new links. Closes #23887 from HeartSaVioR/java-checkstyle-dtd-change-url. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit c5de804093540509929f6de211dbbe644b33e6db) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 February 2019, 19:26:17 UTC
073c47b Preparing development version 2.4.2-SNAPSHOT 22 February 2019, 22:54:37 UTC
eb2af24 Preparing Spark release v2.4.1-rc5 22 February 2019, 22:54:15 UTC
ef67be3 [SPARK-26950][SQL][TEST] Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values ## What changes were proposed in this pull request? Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, but there exist more NaN values with different binary representations. ```scala scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array res1: Array[Byte] = Array(127, -64, 0, 0) scala> val x = java.lang.Float.intBitsToFloat(-6966608) x: Float = NaN scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array res2: Array[Byte] = Array(-1, -107, -78, -80) ``` Since users can have these values, `RandomDataGenerator` generates these NaN values. However, this causes `checkEvaluationWithUnsafeProjection` failures due to the difference in the `UnsafeRow` binary representation. The following is a UT failure instance. This PR aims to fix this UT flakiness. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/ ## How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #23851 from dongjoon-hyun/SPARK-26950. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ffef3d40741b0be321421aa52a6e17a26d89f541) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 February 2019, 04:27:17 UTC
b403612 Revert "[R][BACKPORT-2.3] update package description" This reverts commit 8d68d54f2e2cbbe55a4bb87c2216cff896add517. 22 February 2019, 02:14:56 UTC
8d68d54 [R][BACKPORT-2.3] update package description doesn't port cleanly to 2.3. we need this in branch-2.4 and branch-2.3 Closes #23861 from felixcheung/2.3rdesc. Authored-by: Felix Cheung <felixcheung_m@hotmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 36db45d5b90ddc3ce54febff2ed41cd29c0a8a04) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 February 2019, 02:13:38 UTC
3282544 Preparing development version 2.4.2-SNAPSHOT 21 February 2019, 23:02:17 UTC
79c1f7e Preparing Spark release v2.4.1-rc4 21 February 2019, 23:01:58 UTC
d857630 [R][BACKPORT-2.4] update package description #23852 doesn't port cleanly to 2.4. we need this in branch-2.4 and branch-2.3 Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #23860 from felixcheung/2.4rdesc. 21 February 2019, 16:42:15 UTC
0926f49 Preparing development version 2.4.2-SNAPSHOT 21 February 2019, 00:46:07 UTC
061185b Preparing Spark release v2.4.1-rc3 21 February 2019, 00:45:49 UTC
274142b [SPARK-26859][SQL] Fix field writer index bug in non-vectorized ORC deserializer ## What changes were proposed in this pull request? This happens in a schema evolution use case only when a user specifies the schema manually and use non-vectorized ORC deserializer code path. There is a bug in `OrcDeserializer.scala` that results in `null`s being set at the wrong column position, and for state from previous records to remain uncleared in next records. There are more details for when exactly the bug gets triggered and what the outcome is in the [JIRA issue](https://jira.apache.org/jira/browse/SPARK-26859). The high-level summary is that this bug results in severe data correctness issues, but fortunately the set of conditions to expose the bug are complicated and make the surface area somewhat small. This change fixes the problem and adds a respective test. ## How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #23766 from IvanVergiliev/fix-orc-deserializer. Lead-authored-by: Ivan Vergiliev <ivan.vergiliev@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 096552ae4d6fcef5e20c54384a2687db41ba2fa1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 February 2019, 13:53:40 UTC
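A hedged repro-style sketch of the code path described above: a manually specified (evolved) schema read through the native, non-vectorized ORC reader. The path and column names are invented, and a spark-shell style `spark` session is assumed; only the two configuration keys are real Spark settings.

```scala
// Force the native ORC implementation but disable the vectorized reader, then read
// with a user-specified schema that differs from the files' schema (schema evolution).
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")

val df = spark.read
  .schema("id LONG, name STRING, newly_added_col INT") // user-specified, evolved schema
  .orc("/tmp/orc_table")                               // hypothetical path
df.show()
```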
4c60056 Preparing development version 2.4.2-SNAPSHOT 19 February 2019, 21:54:45 UTC
229ad52 Preparing Spark release v2.4.1-rc2 19 February 2019, 21:54:26 UTC
383b662 [MINOR][DOCS] Fix the update rule in StreamingKMeansModel documentation ## What changes were proposed in this pull request? The formatting for the update rule (in the documentation) now appears as ![image](https://user-images.githubusercontent.com/14948437/52933807-5a0c7980-3309-11e9-8573-642a73e77c26.png) instead of ![image](https://user-images.githubusercontent.com/14948437/52933897-a8ba1380-3309-11e9-8e16-e47c27b4a044.png) Closes #23819 from joelgenter/patch-1. Authored-by: joelgenter <joelgenter@outlook.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 885aa553c5e8f478b370f8a733102b67f6cd2d99) Signed-off-by: Sean Owen <sean.owen@databricks.com> 19 February 2019, 14:41:29 UTC
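For context, the update rule that the fixed formatting is meant to render is, paraphrasing the MLlib documentation (treat the exact notation as an approximation rather than a quote):

```
% StreamingKMeansModel update rule (paraphrased from the MLlib docs):
%   c_t: previous center, n_t: points assigned to it so far,
%   x_t: centroid of the new batch, m_t: points added in the batch, \alpha: decay factor.
c_{t+1} = \frac{c_t n_t \alpha + x_t m_t}{n_t \alpha + m_t}, \qquad n_{t+1} = n_t + m_t
```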
633de74 [SPARK-26740][SQL][BRANCH-2.4] Read timestamp/date column stats written by Spark 3.0 ## What changes were proposed in this pull request? - Backport of #23662 to `branch-2.4` - Added `Timestamp`/`DateFormatter` - Set version of column stats to `1` to keep backward compatibility with previous versions ## How was this patch tested? The changes were tested by `StatisticsCollectionSuite` and by `StatisticsSuite`. Closes #23809 from MaxGekk/column-stats-time-date-2.4. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 February 2019, 03:46:42 UTC
094cabc [SPARK-26897][SQL][TEST][FOLLOW-UP] Remove workaround for 2.2.0 and 2.1.x in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? This pr just removed workaround for 2.2.0 and 2.1.x in HiveExternalCatalogVersionsSuite. ## How was this patch tested? Pass the Jenkins. Closes #23817 from maropu/SPARK-26607-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e2b8cc65cd579374ddbd70b93c9fcefe9b8873d9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 February 2019, 03:25:16 UTC
dfda97a [SPARK-26897][SQL][TEST] Update Spark 2.3.x testing from HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? The maintenance release of `branch-2.3` (v2.3.3) vote passed, so this issue updates PROCESS_TABLES.testingVersions in HiveExternalCatalogVersionsSuite ## How was this patch tested? Pass the Jenkins. Closes #23807 from maropu/SPARK-26897. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit dcdbd06b687fafbf29df504949db0a5f77608c8e) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 17 February 2019, 23:06:15 UTC
7270283 [SPARK-26864][SQL][BACKPORT-2.4] Query may return incorrect result when python udf is used as a join condition and the udf uses attributes from both legs of left semi join ## What changes were proposed in this pull request? n SPARK-25314, we supported the scenario of having a python UDF that refers to attributes from both legs of a join condition by rewriting the plan to convert an inner join or left semi join to a filter over a cross join. In case of left semi join, this transformation may cause incorrect results when the right leg of join condition produces duplicate rows based on the join condition. This fix disallows the rewrite for left semi join and raises an error in the case like we do for other types of join. In future, we should have separate rule in optimizer to convert left semi join to inner join (I am aware of one case we could do it if we leverage informational constraint i.e when we know the right side does not produce duplicates). **Python** ```SQL >>> from pyspark import SparkContext >>> from pyspark.sql import SparkSession, Column, Row >>> from pyspark.sql.functions import UserDefinedFunction, udf >>> from pyspark.sql.types import * >>> from pyspark.sql.utils import AnalysisException >>> >>> spark.conf.set("spark.sql.crossJoin.enabled", "True") >>> left = spark.createDataFrame([Row(lc1=1, lc2=1), Row(lc1=2, lc2=2)]) >>> right = spark.createDataFrame([Row(rc1=1, rc2=1), Row(rc1=1, rc2=1)]) >>> func = udf(lambda a, b: a == b, BooleanType()) >>> df = left.join(right, func("lc1", "rc1"), "leftsemi").show() 19/02/12 16:07:10 WARN PullOutPythonUDFInJoinCondition: The join condition:<lambda>(lc1#0L, rc1#4L) of the join plan contains PythonUDF only, it will be moved out and the join plan will be turned to cross join. +---+---+ |lc1|lc2| +---+---+ | 1| 1| | 1| 1| +---+---+ ``` **Scala** ```SQL scala> val left = Seq((1, 1), (2, 2)).toDF("lc1", "lc2") left: org.apache.spark.sql.DataFrame = [lc1: int, lc2: int] scala> val right = Seq((1, 1), (1, 1)).toDF("rc1", "rc2") right: org.apache.spark.sql.DataFrame = [rc1: int, rc2: int] scala> val equal = udf((p1: Integer, p2: Integer) => { | p1 == p2 | }) equal: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$2141/11016292394666f1b5,BooleanType,List(Some(Schema(IntegerType,true)), Some(Schema(IntegerType,true))),None,false,true) scala> val df = left.join(right, equal(col("lc1"), col("rc1")), "leftsemi") df: org.apache.spark.sql.DataFrame = [lc1: int, lc2: int] scala> df.show() +---+---+ |lc1|lc2| +---+---+ | 1| 1| +---+---+ ``` ## How was this patch tested? Modified existing tests. Closes #23780 from dilipbiswal/dkb_python_udf_2.4_2. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 February 2019, 09:05:10 UTC
fccc6d3 [SPARK-25922][K8S] Spark Driver/Executor "spark-app-selector" label mismatch (branch-2.4) In K8S cluster mode, the algorithm that generates the spark-app-selector/spark.app.id of the Spark driver is different from the one used for the executors. This patch makes sure the Spark driver and executors use the same spark-app-selector/spark.app.id if spark.app.id is set; otherwise it will use the superclass applicationId. In K8S client mode, spark-app-selector/spark.app.id for executors will use the superclass applicationId. Manually run. Closes #23779 from vanzin/SPARK-25922. Authored-by: suxingfate <suxingfate@163.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 15 February 2019, 18:08:33 UTC
bc1e960 [SPARK-26873][SQL] Use a consistent timestamp to build Hadoop Job IDs. ## What changes were proposed in this pull request? Updates FileFormatWriter to create a consistent Hadoop Job ID for a write. ## How was this patch tested? Existing tests for regressions. Closes #23777 from rdblue/SPARK-26873-fix-file-format-writer-job-ids. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 33334e2728f8d2e4cf7d542049435b589ed05a5e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 14 February 2019, 16:25:48 UTC
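A hedged illustration of the idea (not the actual FileFormatWriter code; the helper and formatting pattern are assumptions): capture one timestamp up front and derive every Hadoop job/task ID from it, so the driver and all task attempts agree on the same Job ID.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}

// One timestamp for the whole write, formatted like a classic Hadoop job tracker ID.
val jobTrackerId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date())
val jobId = new JobID(jobTrackerId, 0)

// Every task attempt derives its ID from the same jobId instead of formatting "now" again.
def taskAttemptIdFor(partition: Int): TaskAttemptID =
  new TaskAttemptID(new TaskID(jobId, TaskType.MAP, partition), 0)
```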
61b4787 [SPARK-26572][SQL] fix aggregate codegen result evaluation This PR is a correctness fix in `HashAggregateExec` code generation. It forces evaluation of result expressions before calling `consume()` to avoid multiple executions. This PR fixes a use case where an aggregate is nested into a broadcast join and appears on the "stream" side. The issue is that the broadcast join generates its own loop, and without forcing evaluation of the `resultExpressions` of `HashAggregateExec` before the join's loop, these expressions can be executed multiple times, giving incorrect results. A new UT was added. Closes #23731 from peter-toth/SPARK-26572. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 2228ee51ce3550d7e6740a1833aae21ab8596764) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 February 2019, 15:08:06 UTC
455a57d [MINOR][DOCS] Fix for contradiction in condition formula of keeping intermediate state of window in structured streaming docs This change resolves a contradiction in the Structured Streaming documentation, in the formula that decides whether a specific window will still be updated: the watermark is calculated and compared with the "T" parameter (intermediate state is cleared when (max event time seen by the engine - late threshold > T), and otherwise kept, written as "until"). From the later examples, "T" is the end of the window, not the start as the documentation first suggests. For more information please take a look at my question on Stack Overflow: https://stackoverflow.com/questions/54599594/understanding-window-with-watermark-in-apache-spark-structured-streaming Can be tested by building documentation. Closes #23765 from vitektarasenko/master. Authored-by: Viktor Tarasenko <v.tarasenko@vezet.ru> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 5894f767d1f159fc05e11d77d61089efcd0c50b4) Signed-off-by: Sean Owen <sean.owen@databricks.com> 13 February 2019, 14:01:49 UTC
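Restating the corrected condition in formula form (a paraphrase of the docs' rule, with T taken to be the window's end time as the fix clarifies):

```
% Intermediate state for a window ending at T is dropped once
%   max(event time seen by the engine) - late threshold > T
% and is kept (late rows are still merged into the window) until that point.
\text{drop state when}\quad \max(\text{event time}) - \text{late threshold} > T
```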
351b44d Preparing development version 2.4.2-SNAPSHOT 12 February 2019, 18:45:14 UTC
50eba0e Preparing Spark release v2.4.1-rc1 12 February 2019, 18:45:06 UTC
af3c711 [SPARK-26082][MESOS][FOLLOWUP][BRANCH-2.4] Add UT on fetcher cache option on MesosClusterScheduler ## What changes were proposed in this pull request? This patch adds UT on testing SPARK-26082 to avoid regression. While #23743 reduces the possibility to make a similar mistake, the needed lines of code for adding tests are not that huge, so I guess it might be worth to add them. ## How was this patch tested? Newly added UTs. Test "supports setting fetcher cache" fails when #23734 is not applied and succeeds when #23734 is applied. Closes #23753 from HeartSaVioR/SPARK-26082-branch-2.4. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 February 2019, 22:49:37 UTC
f691b2c Revert "[SPARK-26082][MESOS][FOLLOWUP] Add UT on fetcher cache option on MesosClusterScheduler" This reverts commit e645743ad57978823adac57d95fe02fa6f45dad0. 09 February 2019, 03:51:25 UTC
e645743 [SPARK-26082][MESOS][FOLLOWUP] Add UT on fetcher cache option on MesosClusterScheduler ## What changes were proposed in this pull request? This patch adds UT on testing SPARK-26082 to avoid regression. While #23743 reduces the possibility to make a similar mistake, the needed lines of code for adding tests are not that huge, so I guess it might be worth to add them. ## How was this patch tested? Newly added UTs. Test "supports setting fetcher cache" fails when #23743 is not applied and succeeds when #23743 is applied. Closes #23744 from HeartSaVioR/SPARK-26082-add-unit-test. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b4e1d145135445eeed85784dab0c2c088930dd26) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 February 2019, 16:52:11 UTC
c41a5e1 [SPARK-26082][MESOS] Fix mesos fetch cache config name ## What changes were proposed in this pull request? * change MesosClusterScheduler to use correct argument name for Mesos fetch cache (spark.mesos.fetchCache.enable -> spark.mesos.fetcherCache.enable) ## How was this patch tested? Not sure this requires a test, since it's just a string change. Closes #23734 from mwlon/SPARK-26082. Authored-by: mwlon <mloncaric@hmc.edu> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c0811e8b4d11892f60b7032ba4c8e3adc40fe82f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 February 2019, 09:22:15 UTC
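For reference, the corrected property name can be set like any other Spark configuration; a brief sketch (the value is illustrative):

```scala
import org.apache.spark.SparkConf

// Correct key per the fix above; the scheduler previously read the misspelled
// "spark.mesos.fetchCache.enable" and therefore ignored this setting.
val conf = new SparkConf().set("spark.mesos.fetcherCache.enable", "true")
```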
9b2eedc [SPARK-26734][STREAMING] Fix StackOverflowError with large block queue ## What changes were proposed in this pull request? SPARK-23991 introduced a bug in `ReceivedBlockTracker#allocateBlocksToBatch`: when more than a few thousand blocks are in the queue, serializing the queue throws a StackOverflowError. This change just adds `dequeueAll` to the new `clone` operation on the queue so that the fix in SPARK-23991 is preserved but the serialized data comes from an ArrayBuffer, which doesn't have the serialization problems that mutable.Queue has. ## How was this patch tested? A unit test was added. Closes #23716 from rlodge/SPARK-26734. Authored-by: Ross Lodge <rlodge@concentricsky.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 8427e9ba5cae28233d1bdc54208b46889b83a821) Signed-off-by: Sean Owen <sean.owen@databricks.com> 06 February 2019, 16:44:25 UTC
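A hedged sketch of the pattern (not the exact ReceivedBlockTracker code; `snapshotForSerialization` is a made-up name): snapshot the queue into a flat collection before serializing, rather than serializing the mutable.Queue itself, whose linked structure overflows the stack for very large queues.

```scala
import scala.collection.mutable

// Clone so the original queue is untouched, then drain the clone into a flat Seq
// (backed by an ArrayBuffer) that serializes without deep recursion.
def snapshotForSerialization[A](queue: mutable.Queue[A]): scala.collection.Seq[A] =
  queue.clone().dequeueAll(_ => true)
```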
570edc6 [SPARK-26677][FOLLOWUP][BRANCH-2.4] Update Parquet manifest with Hadoop-2.6 ## What changes were proposed in this pull request? During merging Parquet upgrade PR, `hadoop-2.6` profile dependency manifest is missed. ## How was this patch tested? Manual. ``` ./dev/test-dependencies.sh ``` Also, this will recover `branch-2.4` with `hadoop-2.6` build. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.6/281/ Closes #23738 from dongjoon-hyun/SPARK-26677-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 February 2019, 01:22:33 UTC
7187c01 [SPARK-26758][CORE] Idle executors are not getting killed after the spark.dynamicAllocation.executorIdleTimeout value ## What changes were proposed in this pull request? The **updateAndSyncNumExecutorsTarget** API should be called after the **initializing** flag is unset. ## How was this patch tested? Added UT and also manually tested. After fix: ![afterfix](https://user-images.githubusercontent.com/35216143/51983136-ed4a5000-24bd-11e9-90c8-c4a562c17a4b.png) Closes #23697 from sandeep-katta/executorIssue. Authored-by: sandeep-katta <sandeep.katta2007@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 1dd7419702c5bc7e36fee9fa1eec06b66f25806e) Signed-off-by: Sean Owen <sean.owen@databricks.com> 05 February 2019, 04:13:53 UTC
3d4aa5b [SPARK-26751][SQL] Fix memory leak when statement run in background and throw exception which is not HiveSQLException ## What changes were proposed in this pull request? When we run in the background and we get an exception which is not HiveSQLException, we may encounter a memory leak since the entry in handleToOperation will not be removed correctly. The reason is below: 1. When calling operation.run() in HiveSessionImpl#executeStatementInternal we throw an exception which is not HiveSQLException 2. Then the opHandle generated by SparkSQLOperationManager will not be added into the opHandleSet of HiveSessionImpl, and operationManager.closeOperation(opHandle) will not be called 3. When we close the session, operationManager.closeOperation(opHandle) will not be called for this handle either, since we did not add this opHandle into the opHandleSet. For the reasons above, the opHandle will always remain in SparkSQLOperationManager#handleToOperation, which causes the memory leak. More details and a case are attached on https://issues.apache.org/jira/browse/SPARK-26751 This patch makes the statement always throw HiveSQLException when running in the background. ## How was this patch tested? Existing UT Closes #23673 from caneGuy/zhoukang/fix-hivesessionimpl-leak. Authored-by: zhoukang <zhoukang199191@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 255faaf3436e1f41838062ed460f801bb0be40ec) Signed-off-by: Sean Owen <sean.owen@databricks.com> 03 February 2019, 14:46:24 UTC
190e48c [SPARK-26677][BUILD] Update Parquet to 1.10.1 with notEq pushdown fix. ## What changes were proposed in this pull request? Update to Parquet Java 1.10.1. ## How was this patch tested? Added a test from HyukjinKwon that validates the notEq case from SPARK-26677. Closes #23704 from rdblue/SPARK-26677-fix-noteq-parquet-bug. Lead-authored-by: Ryan Blue <blue@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Ryan Blue <rdblue@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit f72d2177882dc47b043fdc7dec9a46fe65df4ee9) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 02 February 2019, 17:18:08 UTC
bd4ce51 [SPARK-26806][SS] EventTimeStats.merge should handle zeros correctly ## What changes were proposed in this pull request? Right now, EventTimeStats.merge doesn't handle `zero.merge(zero)` correctly. This will make `avg` become `NaN`. And whatever gets merged with the result of `zero.merge(zero)`, `avg` will still be `NaN`. Then finally, we call `NaN.toLong` and get `0`, and the user will see the following incorrect report: ``` "eventTime" : { "avg" : "1970-01-01T00:00:00.000Z", "max" : "2019-01-31T12:57:00.000Z", "min" : "2019-01-30T18:44:04.000Z", "watermark" : "1970-01-01T00:00:00.000Z" } ``` This issue was reported by liancheng . This PR fixes the above issue. ## How was this patch tested? The new unit tests. Closes #23718 from zsxwing/merge-zero. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> (cherry picked from commit 03a928cbecaf38bbbab3e6b957fcbb542771cfbd) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 01 February 2019, 19:15:47 UTC
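A simplified model of the guard described in the SPARK-26806 entry above, not Spark's exact EventTimeStats implementation: merging a zero-count stats object must not contribute to the average, otherwise `avg` becomes `NaN` and poisons every later merge.

```scala
// Simplified sketch: skip zero-count sides so avg never becomes NaN.
case class EventTimeStats(var max: Long, var min: Long, var avg: Double, var count: Long) {
  def add(eventTime: Long): Unit = {
    max = math.max(max, eventTime)
    min = math.min(min, eventTime)
    count += 1
    avg += (eventTime - avg) / count
  }
  def merge(that: EventTimeStats): Unit = {
    if (that.count == 0) {
      // nothing to merge
    } else if (this.count == 0) {
      this.max = that.max; this.min = that.min; this.avg = that.avg; this.count = that.count
    } else {
      this.max = math.max(this.max, that.max)
      this.min = math.min(this.min, that.min)
      this.count += that.count
      this.avg += (that.avg - this.avg) * that.count / this.count   // weighted average
    }
  }
}
val zero = EventTimeStats(Long.MinValue, Long.MaxValue, 0.0, 0L)
```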
2a83431 [SPARK-26745][SPARK-24959][SQL][BRANCH-2.4] Revert count optimization in JSON datasource by ## What changes were proposed in this pull request? This PR reverts the JSON count optimization part of #21909. We cannot distinguish the cases below without parsing: ``` [{...}, {...}] ``` ``` [] ``` ``` {...} ``` ```bash # empty string ``` when we `count()`. A single input line can contain 0, 1, or multiple records, and this depends on each input. See also https://github.com/apache/spark/pull/23665#discussion_r251276720. ## How was this patch tested? Manually tested. Closes #23708 from HyukjinKwon/SPARK-26745-backport. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 01 February 2019, 02:22:05 UTC
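To see why the record count cannot be derived without parsing, note that one line of JSON input may expand to zero, one, or several rows; a small assumed example (requires an active SparkSession `spark`).

```scala
import spark.implicits._   // assumes a spark-shell / SparkSession context

// One input line, two records: counting correctly requires parsing the line.
val multiRecordLine = Seq("""[{"a": 1}, {"a": 2}]""").toDS()
spark.read.json(multiRecordLine).count()   // expected: 2

// One input line, zero records.
val emptyArrayLine = Seq("[]").toDS()
spark.read.json(emptyArrayLine).count()    // expected: 0
```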
2b5e033 [SPARK-26757][GRAPHX] Return 0 for `count` on empty Edge/Vertex RDDs ## What changes were proposed in this pull request? Previously a "java.lang.UnsupportedOperationException: empty collection" exception would be thrown due to using `reduce`, rather than `fold` or similar that can tolerate empty RDDs. This behaviour has existed for the Vertex RDDs since it was introduced in b30e0ae0351be1cbc0b1cf179293587b466ee026. It seems this behaviour was inherited by the Edge RDDs via copy-paste in ee29ef3800438501e0ff207feb00a28973fc0769. ## How was this patch tested? Two new unit tests. Closes #23681 from huonw/empty-graphx. Authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit da526985c7574dccdcc0cca7452e2e999a5b3012) Signed-off-by: Sean Owen <sean.owen@databricks.com> 31 January 2019, 23:27:46 UTC
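The underlying RDD behaviour in the SPARK-26757 entry above is easy to reproduce outside GraphX; a minimal sketch of the reduce-vs-fold difference on an empty RDD (assumes a live SparkContext `sc`, e.g. in spark-shell).

```scala
// Assumes an existing SparkContext `sc`.
val empty = sc.parallelize(Seq.empty[Long])

// reduce throws java.lang.UnsupportedOperationException: empty collection
// empty.map(_ => 1L).reduce(_ + _)

// fold tolerates the empty collection and returns the zero value
val count = empty.map(_ => 1L).fold(0L)(_ + _)   // 0
```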
d9403e4 [SPARK-26726] Synchronize the amount of memory used by the broadcast variable to the UI display ## What changes were proposed in this pull request? The amount of memory used by the broadcast variable is not synchronized to the UI display. I added the case for BroadcastBlockId and updated the memory usage. ## How was this patch tested? We can test this patch with unit tests. Closes #23649 from httfighter/SPARK-26726. Lead-authored-by: 韩田田00222924 <han.tiantian@zte.com.cn> Co-authored-by: han.tiantian@zte.com.cn <han.tiantian@zte.com.cn> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit f4a17e916b729f9dc46e859b50a416db1e37b92e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 31 January 2019, 17:17:58 UTC
710d81e [SPARK-26732][CORE][TEST] Wait for listener bus to process events in SparkContextInfoSuite. Otherwise the RDD data may be out of date by the time the test tries to check it. Tested with an artificial delay inserted in AppStatusListener. Closes #23654 from vanzin/SPARK-26732. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 6a2f3dcc2bd601fd1fe7610854bc0f5bf90300f4) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 30 January 2019, 15:11:09 UTC
ae0592d [SPARK-26718][SS][BRANCH-2.4] Fixed integer overflow in SS kafka rateLimit calculation ## What changes were proposed in this pull request? Fix the integer overflow issue in rateLimit. ## How was this patch tested? Pass the Jenkins with newly added UT for the possible case where integer could be overflowed. Closes #23652 from linehrr/fix/integer_overflow_rateLimit. Authored-by: ryne.yang <ryne.yang@acuityads.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 29 January 2019, 20:40:28 UTC
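The class of bug fixed in the SPARK-26718 entry above (Kafka source details aside) is plain Int overflow when handling large offset ranges; a generic, hypothetical sketch, not the Kafka source code.

```scala
// Hypothetical illustration of the overflow class of bug.
val begin = 0L
val untilOffset = 3000000000L   // Kafka offsets are Longs and can exceed Int range

// Buggy pattern: narrowing to Int makes the size wrap to a negative number.
val sizeAsInt: Int = (untilOffset - begin).toInt   // -1294967296, silently wrong

// Safer pattern: keep the arithmetic in Long end to end.
val size: Long = untilOffset - begin               // 3000000000
```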
d5cc890 [SPARK-26708][SQL][BRANCH-2.4] Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan ## What changes were proposed in this pull request? When performing non-cascading cache invalidation, `recache` is called on the other cache entries which are dependent on the cache being invalidated. It leads to the physical plans of those cache entries being re-compiled. For those cache entries, if the cache RDD has already been persisted, chances are there will be inconsistency between the data and the new plan. It can cause a correctness issue if the new plan's `outputPartitioning` or `outputOrdering` is different from that of the actual data, and meanwhile the cache is used by another query that asks for specific `outputPartitioning` or `outputOrdering` which happens to match the new plan but not the actual data. The fix is to keep the cache entry as it is if the data has been loaded, otherwise re-build the cache entry, with a new plan and an empty cache buffer. ## How was this patch tested? Added UT. Closes #23678 from maryannxue/spark-26708-2.4. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 29 January 2019, 12:33:46 UTC
448a063 [SPARK-26379][SS][FOLLOWUP] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp ## What changes were proposed in this pull request? Spark replaces `CurrentTimestamp` with `CurrentBatchTimestamp`. However, `CurrentBatchTimestamp` is `TimeZoneAwareExpression` while `CurrentTimestamp` isn't. Without TimeZoneId, `CurrentBatchTimestamp` becomes unresolved and raises `UnresolvedException`. Since `CurrentDate` is `TimeZoneAwareExpression`, there is no problem with `CurrentDate`. This PR reverts the [previous patch](https://github.com/apache/spark/pull/23609) on `MicroBatchExecution` and fixes the root cause. ## How was this patch tested? Pass the Jenkins with the updated test cases. Closes #23660 from dongjoon-hyun/SPARK-26379. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 1ca6b8bc3df19503c00414e62161227725a99520) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 January 2019, 18:05:08 UTC
46a9018 [SPARK-26379][SS] Fix issue on adding current_timestamp/current_date to streaming query ## What changes were proposed in this pull request? This patch proposes to fix an issue on adding `current_timestamp` / `current_date` to a streaming query. The root reason is that Spark transforms `CurrentTimestamp`/`CurrentDate` to `CurrentBatchTimestamp` in MicroBatchExecution, which leaves the transformed attributes not-yet-resolved. They will be resolved by IncrementalExecution. (In ContinuousExecution, Spark doesn't allow using `current_timestamp` and `current_date`, so it has been OK.) It's OK for the DataSource V1 sink because it simply leverages the transformed logical plan and doesn't evaluate it until the attributes are resolved, but for a DataSource V2 sink, Spark tries to extract the schema of the transformed logical plan prior to IncrementalExecution, and the unresolved attributes will raise errors. This patch fixes the issue via having a separate pre-resolved logical plan to pass the schema to StreamingWriteSupport safely. ## How was this patch tested? Added UT. Closes #23609 from HeartSaVioR/SPARK-26379. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 25 January 2019, 23:25:38 UTC
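For reference, the user-facing pattern that the SPARK-26379 fix above unblocks looks roughly like the following; the source and sink choices here are arbitrary examples, not taken from the patch, and an active SparkSession `spark` is assumed.

```scala
import org.apache.spark.sql.functions.{current_date, current_timestamp}

// Example streaming query that adds batch-time columns.
val stream = spark.readStream.format("rate").load()
  .withColumn("batch_ts", current_timestamp())
  .withColumn("batch_date", current_date())

val query = stream.writeStream
  .format("console")        // arbitrary example sink
  .start()
```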
08b6379 [SPARK-26427][BUILD][BACKPORT-2.4] Upgrade Apache ORC to 1.5.4 ## What changes were proposed in this pull request? This is a backport of #23364. To make Apache Spark 2.4.1 more robust, this PR aims to update the Apache ORC dependency to the latest version, 1.5.4, released on Dec. 20. ([Release Notes](https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12344187)) ``` [ORC-237] OrcFile.mergeFiles Specified block size is less than configured minimum value [ORC-409] Changes for extending MemoryManagerImpl [ORC-410] Fix a locale-dependent test in TestCsvReader [ORC-416] Avoid opening data reader when there is no stripe [ORC-417] Use dynamic Apache Maven mirror link [ORC-419] Ensure to call `close` at RecordReaderImpl constructor exception [ORC-432] openjdk 8 has a bug that prevents surefire from working [ORC-435] Ability to read stripes that are greater than 2GB [ORC-437] Make acid schema checks case insensitive [ORC-411] Update build to work with Java 10. [ORC-418] Fix broken docker build script ``` ## How was this patch tested? Build and pass Jenkins. Closes #23646 from dongjoon-hyun/SPARK-26427-2.4. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 25 January 2019, 17:08:00 UTC
8d957d7 [SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly ## What changes were proposed in this pull request? When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results: ``` sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)") sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)") sql("SELECT MAX(p1) FROM t") ``` The result is supposed to be `null`. However, with the optimization the result is `5`. The rule is originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule was disabled by default in a later release (https://issues.apache.org/jira/browse/HIVE-15397), due to the same problem. It is hard to completely avoid the correctness issue, because data sources like Parquet can be metadata-only and Spark can't tell whether a partition is empty without actually reading it. This PR disables the optimization by default. ## How was this patch tested? Unit test Closes #23635 from gengliangwang/optimizeMetadata. Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit f5b9370da2745a744f8b2f077f1690e0e7035140) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 25 January 2019, 02:25:56 UTC
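The optimization above is controlled by the `spark.sql.optimizer.metadataOnly` flag; with this change it is off by default, and users who accept the caveat can opt back in. A small illustrative snippet, assuming an active SparkSession `spark`.

```scala
// Inspect the current setting; after this change it is expected to default to "false".
spark.conf.get("spark.sql.optimizer.metadataOnly")

// Opt back in only if you are sure no partitions are empty at the file level.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
```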
e8e9b11 [SPARK-26680][SQL] Eagerly create inputVars while conditions are appropriate ## What changes were proposed in this pull request? When a user passes a Stream to groupBy, ```CodegenSupport.consume``` ends up lazily generating ```inputVars``` from a Stream, since the field ```output``` will be a Stream. At the time ```output.zipWithIndex.map``` is called, conditions are correct. However, by the time the map operation actually executes, conditions are no longer appropriate. The closure used by the map operation ends up using a reference to the partially created ```inputVars```. As a result, a StackOverflowError occurs. This PR ensures that ```inputVars``` is eagerly created while conditions are appropriate. It seems this was also an issue with the code path for creating ```inputVars``` from ```outputVars``` (SPARK-25767). I simply extended the solution for that code path to encompass both code paths. ## How was this patch tested? SQL unit tests new test python tests Closes #23617 from bersprockets/SPARK-26680_opt1. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> (cherry picked from commit d4a30fa9af81a8bbb50d75f495ca3787f68f10e4) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 24 January 2019, 10:19:00 UTC
63fa6f5 [SPARK-26682][SQL] Use taskAttemptID instead of attemptNumber for Hadoop. ## What changes were proposed in this pull request? Updates the attempt ID used by FileFormatWriter. Tasks in stage attempts use the same task attempt number and could conflict. Using Spark's task attempt ID guarantees that Hadoop TaskAttemptID instances are unique. ## How was this patch tested? Existing tests. Also validated that we no longer detect this failure case in our logs after deployment. Closes #23608 from rdblue/SPARK-26682-fix-hadoop-task-attempt-id. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d5a97c1c2c86ae335e91008fa25b3359c4560915) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 January 2019, 04:46:15 UTC
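The distinction in the SPARK-26682 entry above matters because the two identifiers carry different uniqueness guarantees; a small illustrative snippet (assumes an existing SparkContext `sc`, and only runs meaningfully inside a task).

```scala
import org.apache.spark.TaskContext

// Inspect both identifiers from inside a running task.
sc.parallelize(1 to 4, 2).foreachPartition { _ =>
  val ctx = TaskContext.get()
  // attemptNumber(): restart count within a stage attempt; values can repeat across stage attempts
  // taskAttemptId(): unique across all task attempts in the application, which is what the fix relies on
  println(s"attemptNumber=${ctx.attemptNumber()} taskAttemptId=${ctx.taskAttemptId()}")
}
```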
921c22b [SPARK-26706][SQL] Fix `illegalNumericPrecedence` for ByteType This PR contains a minor change in `Cast$mayTruncate` that fixes its logic for bytes. Right now, `mayTruncate(ByteType, LongType)` returns `false` while `mayTruncate(ShortType, LongType)` returns `true`. Consequently, `spark.range(1, 3).as[Byte]` and `spark.range(1, 3).as[Short]` behave differently. Potentially, this bug can silently corrupt someone's data. ```scala // executes silently even though Long is converted into Byte spark.range(Long.MaxValue - 10, Long.MaxValue).as[Byte] .map(b => b - 1) .show() +-----+ |value| +-----+ | -12| | -11| | -10| | -9| | -8| | -7| | -6| | -5| | -4| | -3| +-----+ // throws an AnalysisException: Cannot up cast `id` from bigint to smallint as it may truncate spark.range(Long.MaxValue - 10, Long.MaxValue).as[Short] .map(s => s - 1) .show() ``` This PR comes with a set of unit tests. Closes #23632 from aokolnychyi/cast-fix. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com> 24 January 2019, 00:30:55 UTC
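Roughly, the check in the SPARK-26706 entry above compares positions in the numeric-precedence order (Byte < Short < Int < Long < Float < Double); a simplified sketch of the corrected logic, not the exact `Cast` source.

```scala
import org.apache.spark.sql.types._

// Simplified sketch: the bug was that ByteType (index 0) slipped through a
// `toPrecedence > 0` style guard, so Long -> Byte was not flagged as truncating.
val numericPrecedence: IndexedSeq[DataType] =
  IndexedSeq(ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType)

def illegalNumericPrecedence(from: DataType, to: DataType): Boolean = {
  val fromPrecedence = numericPrecedence.indexOf(from)
  val toPrecedence = numericPrecedence.indexOf(to)
  toPrecedence >= 0 && fromPrecedence > toPrecedence   // >= 0 so ByteType is included
}

illegalNumericPrecedence(LongType, ByteType)   // true: down-cast may truncate
illegalNumericPrecedence(ByteType, LongType)   // false: widening is safe
```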
f36d0c5 [SPARK-26228][MLLIB] OOM issue encountered when computing Gramian matrix Avoid memory problems in closure cleaning when handling large Gramians (>= 16K rows/cols) by using null as the zeroValue. Existing tests. Note that it's hard to test the case that triggers this issue as it would require a large amount of memory and run for a while. I confirmed locally that a 16K x 16K Gramian failed with tons of driver memory before, and didn't fail upfront after this change. Closes #23600 from srowen/SPARK-26228. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 6dcad38ba3393188084f378b7ff6dfc12b685b13) Signed-off-by: Sean Owen <sean.owen@databricks.com> 23 January 2019, 01:25:18 UTC
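A minimal sketch of the technique the change above describes ("using null as zeroValue"), not RowMatrix's actual code; the `rows` input and the accumulation logic are assumptions for illustration only.

```scala
import org.apache.spark.rdd.RDD

// Sketch: by passing null as the zeroValue, the (n*n)-sized accumulator is never
// captured and serialized with the closure; each executor allocates it lazily.
def gramianSketch(rows: RDD[Array[Double]], n: Int): Array[Double] = {
  rows.treeAggregate(null.asInstanceOf[Array[Double]])(
    seqOp = (maybeAcc, row) => {
      val acc = if (maybeAcc == null) new Array[Double](n * n) else maybeAcc
      var i = 0
      while (i < n) { var j = 0; while (j < n) { acc(i * n + j) += row(i) * row(j); j += 1 }; i += 1 }
      acc
    },
    combOp = (a, b) =>
      if (a == null) b
      else if (b == null) a
      else { var i = 0; while (i < a.length) { a(i) += b(i); i += 1 }; a }
  )
}
```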
10d7713 [SPARK-26605][YARN] Update AM's credentials when creating tokens. This ensures new executors in client mode also get the new tokens, instead of being started with potentially expired tokens. Closes #23523 from vanzin/SPARK-26605. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 23 January 2019, 00:46:00 UTC
9814108 [SPARK-26665][CORE] Fix a bug that BlockTransferService.fetchBlockSync may hang forever ## What changes were proposed in this pull request? `ByteBuffer.allocate` may throw `OutOfMemoryError` when the block is large but not enough memory is available. However, when this happens, right now BlockTransferService.fetchBlockSync will just hang forever because its `BlockFetchingListener.onBlockFetchSuccess` doesn't complete the `Promise`. This PR catches `Throwable` and uses the error to complete the `Promise`. ## How was this patch tested? Added a unit test. Since I cannot make `ByteBuffer.allocate` throw `OutOfMemoryError`, I passed a negative size to make `ByteBuffer.allocate` fail. Although the error type is different, it should trigger the same code path. Closes #23590 from zsxwing/SPARK-26665. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> (cherry picked from commit 66450bbc1bb4397f06ca9a6ecba4d16c82d711fd) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 22 January 2019, 17:01:34 UTC
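A condensed sketch of the pattern described in the SPARK-26665 entry above; this is not the exact Spark code, and the `promise` and listener wiring are assumed context.

```scala
import java.nio.ByteBuffer
import scala.concurrent.Promise

// Sketch: any failure while copying the fetched block (including OutOfMemoryError
// from ByteBuffer.allocate) must fail the promise, otherwise the synchronous
// caller awaiting promise.future hangs forever.
def onBlockFetchSuccessSketch(
    promise: Promise[ByteBuffer],
    dataSize: Int,
    copyFrom: ByteBuffer => Unit): Unit = {
  try {
    val buffer = ByteBuffer.allocate(dataSize)   // may throw OutOfMemoryError for huge blocks
    copyFrom(buffer)
    promise.success(buffer)
  } catch {
    case t: Throwable => promise.failure(t)      // complete the promise so the error surfaces
  }
}
```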
123adbd [SPARK-26351][MLLIB] Update doc and minor correction in the mllib evaluation metrics ## What changes were proposed in this pull request? Currently, there are some minor inconsistencies in doc compared to the code. In this PR, I am correcting those inconsistencies. 1) Links related to the evaluation metrics in the docs are not working 2) Minor correction in the evaluation metrics formulas in docs. ## How was this patch tested? NA Closes #23589 from shahidki31/docCorrection. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 9a30e23211e165a44acc0dbe19693950f7a7cc73) Signed-off-by: Sean Owen <sean.owen@databricks.com> 21 January 2019, 00:13:49 UTC
5a2128c [SPARK-26638][PYSPARK][ML] Pyspark vector classes always return error for unary negation ## What changes were proposed in this pull request? Fix implementation of unary negation (`__neg__`) in Pyspark DenseVectors ## How was this patch tested? Existing tests, plus new doctest Closes #23570 from srowen/SPARK-26638. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 0b3abef1950f486001160ec578e4f628c199eeb4) Signed-off-by: Sean Owen <sean.owen@databricks.com> 17 January 2019, 20:24:54 UTC
d608325 [SPARK-26633][REPL] Add ExecutorClassLoader.getResourceAsStream ## What changes were proposed in this pull request? Add `ExecutorClassLoader.getResourceAsStream`, so that classes dynamically generated by the REPL can be accessed by user code as `InputStream`s for non-class-loading purposes, such as reading the class file for extracting method/constructor parameter names. Caveat: The convention in Java's `ClassLoader` is that `ClassLoader.getResourceAsStream()` should be considered as a convenience method of `ClassLoader.getResource()`, where the latter provides a `URL` for the resource, and the former invokes `openStream()` on it to serve the resource as an `InputStream`. The former should also catch `IOException` from `openStream()` and convert it to `null`. This PR breaks this convention by only overriding `ClassLoader.getResourceAsStream()` instead of also overriding `ClassLoader.getResource()`, so after this PR, it would be possible to get a non-null result from the former, but get a null result from the latter. This isn't ideal, but it's sufficient to cover the main use case and practically it shouldn't matter. To implement the convention properly, we'd need to register a URL protocol handler with Java to allow it to properly handle the `spark://` protocol, etc, which sounds like an overkill for the intent of this PR. Credit goes to zsxwing for the initial investigation and fix suggestion. ## How was this patch tested? Added new test case in `ExecutorClassLoaderSuite` and `ReplSuite`. Closes #23558 from rednaxelafx/executorclassloader-getresourceasstream. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit dc3b35c5da42def803dd05e2db7506714018e27b) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 16 January 2019, 23:22:11 UTC
1843c16 [SPARK-26629][SS] Fixed error with multiple file stream in a query + restart on a batch that has no data for one file stream ## What changes were proposed in this pull request? When a streaming query has multiple file streams, and there is a batch where one of the file streams doesn't have data in that batch, then if the query has to restart from that batch, it will throw the following error. ``` java.lang.IllegalStateException: batch 1 doesn't exist at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$.verifyBatchIds(HDFSMetadataLog.scala:300) at org.apache.spark.sql.execution.streaming.FileStreamSourceLog.get(FileStreamSourceLog.scala:120) at org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:181) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:294) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets$2.apply(MicroBatchExecution.scala:291) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:291) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:178) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:205) ``` The existing `HDFSMetadata.verifyBatchIds` threw an error whenever the `batchIds` list was empty. In the context of `FileStreamSource.getBatch` (where verify is called) and `FileStreamSourceLog` (subclass of `HDFSMetadata`), this is usually okay because, in a streaming query with one file stream, the `batchIds` can never be empty: - A batch is planned only when the `FileStreamSourceLog` has seen a new offset (that is, there are new data files). - So `FileStreamSource.getBatch` will be called on X to Y where Y will always be > X. This internally calls `HDFSMetadata.verifyBatchIds` with the ids X+1 to Y.
For example, `FileStreamSource.getBatch(4, 5)` will call `verify(batchIds = Seq(5), start = 5, end = 5)`. However, the invariant of Y > X does not hold when there are two file stream sources, as a batch may be planned even when only one of the file streams has data. So one of the file streams may not have data, which can call `FileStreamSource.getBatch(X, X)` -> `verify(batchIds = Seq.empty, start = X+1, end = X)` -> failure. Note that `FileStreamSource.getBatch(X, X)` gets called **only when restarting a query in a batch where a file source did not have data**. This is because in normal planning of batches, `MicroBatchExecution` avoids calling `FileStreamSource.getBatch(X, X)` when offset X has not changed. However, when restarting a stream at such a batch, `MicroBatchExecution.populateStartOffsets()` calls `FileStreamSource.getBatch(X, X)` (a DataSource V1 hack to initialize the source with the last known offsets), thus hitting this issue. The minimum solution here is to skip verification when `FileStreamSource.getBatch(X, X)` is called. Closes #23557 from tdas/SPARK-26629. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> (cherry picked from commit 06d5b173b687c23aa53e293ed6e12ec746393876) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 16 January 2019, 17:43:13 UTC
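The "minimum solution" in the SPARK-26629 entry above amounts to a guard like the following sketch; this is a hypothetical shape, not the patched FileStreamSource.

```scala
// Hypothetical sketch: when restart calls getBatch(X, X) for a source that had
// no data in that batch, there are no new batch ids to verify, so skip the check.
def getBatchSketch(
    startId: Long,
    endId: Long,
    verifyBatchIds: (Seq[Long], Long, Long) => Unit): Unit = {
  if (endId > startId) {
    val ids = (startId + 1) to endId
    verifyBatchIds(ids, startId + 1, endId)
  }
  // else: getBatch(X, X) on restart -- nothing new to read, nothing to verify
}
```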
3337477 [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing This PR proposes to explicitly document that SparkContext cannot be shared for multiprocessing, and that multi-processing execution is not guaranteed in PySpark. I have seen some cases where users attempt to use multiple processes via the `multiprocessing` module from time to time. For instance, see the example in the JIRA (https://issues.apache.org/jira/browse/SPARK-25992). Py4J itself does not support Python's multiprocessing out of the box (sharing the same JavaGateways for instance). In general, such a pattern can cause errors with somewhat arbitrary symptoms that are difficult to diagnose. For instance, see the error message in the JIRA: ``` Traceback (most recent call last): File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock self.process_request(request, client_address) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 348, in process_request self.finish_request(request, client_address) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 361, in finish_request self.RequestHandlerClass(request, client_address, self) File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 696, in __init__ self.handle() File "/usr/local/hadoop/spark2.3.1/python/pyspark/accumulators.py", line 238, in handle _accumulatorRegistry[aid] += update KeyError: 0 ``` The root cause of this was that the global `_accumulatorRegistry` is not shared across processes. Using threads instead of processes is quite easy in Python. See `threading` vs `multiprocessing` in Python - they can usually be direct replacements for each other. For instance, Python also supports a thread pool (`multiprocessing.pool.ThreadPool`) which can be a direct replacement for the process-based pool (`multiprocessing.Pool`). Manually tested, and manually built the doc. Closes #23564 from HyukjinKwon/SPARK-25992. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 670bc55f8d357a5cd894e290cc2834e952a7cfe0) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 January 2019, 15:28:06 UTC
e52acc2 [MINOR][PYTHON] Fix SQLContext to SparkSession in Python API main page ## What changes were proposed in this pull request? This PR proposes to fix deprecated `SQLContext` to `SparkSession` in Python API main page. **Before:** ![screen shot 2019-01-16 at 5 30 19 pm](https://user-images.githubusercontent.com/6477701/51239583-bac82f80-19b4-11e9-9129-8dae2c23ec79.png) **After:** ![screen shot 2019-01-16 at 5 29 54 pm](https://user-images.githubusercontent.com/6477701/51239577-b734a880-19b4-11e9-8539-592cb772168d.png) ## How was this patch tested? Manually checked the doc after building it. I also checked by `grep -r "SQLContext"` and looks this is the only instance left. Closes #23565 from HyukjinKwon/minor-doc-change. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e92088de4d6755f975eb8b44b4d75b81e5a0720e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 January 2019, 15:24:24 UTC
22ab94f [SPARK-26615][CORE] Fixing transport server/client resource leaks in the core unittests ## What changes were proposed in this pull request? Fixing resource leaks where TransportClient/TransportServer instances are not closed properly. In StandaloneSchedulerBackend a null check is added because during the SparkContextSchedulerCreationSuite "local-cluster" test it turned out that the client is not initialised, as org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend#start isn't called. It threw an NPE and some resources remained open. ## How was this patch tested? By executing the unit tests and using some extra temporary logging for counting created and closed TransportClient/TransportServer instances. Closes #23540 from attilapiros/leaks. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 819e5ea7c290f842c51ead8b4a6593678aeef6bf) Signed-off-by: Sean Owen <sean.owen@databricks.com> 16 January 2019, 15:00:54 UTC
743dedb [MINOR][BUILD] Remove binary license/notice files in a source release for branch-2.4+ only ## What changes were proposed in this pull request? To skip the steps that remove binary license/notice files in a source release for branch-2.3 (these files only exist in master/branch-2.4 now), this PR checks the Spark release version in `dev/create-release/release-build.sh`. ## How was this patch tested? Manually checked. Closes #23538 from maropu/FixReleaseScript. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit abc937b24756e5d7479bac7229b0b4c1dc82efeb) Signed-off-by: Sean Owen <sean.owen@databricks.com> 15 January 2019, 01:18:12 UTC
dde4d1d [SPARK-26538][SQL] Set default precision and scale for elements of postgres numeric array ## What changes were proposed in this pull request? When determining CatalystType for postgres columns with type `numeric[]` set the type of array element to `DecimalType(38, 18)` instead of `DecimalType(0,0)`. ## How was this patch tested? Tested with modified `org.apache.spark.sql.jdbc.JDBCSuite`. Ran the `PostgresIntegrationSuite` manually. Closes #23456 from a-shkarupin/postgres_numeric_array. Lead-authored-by: Oleksii Shkarupin <a.shkarupin@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5b37092311bfc1255f1d4d81127ae4242ba1d1aa) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 January 2019, 19:06:55 UTC
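Conceptually, the dialect change in the SPARK-26538 entry above maps an unconstrained postgres `numeric[]` column to an array of a sane default decimal; a rough sketch of the mapping, not the exact PostgresDialect code, with the non-decimal cases reduced to placeholders.

```scala
import org.apache.spark.sql.types._

// Rough sketch: a postgres `numeric[]` with no declared precision/scale reports
// precision 0, which is not a legal Spark DecimalType, so fall back to a default.
def postgresArrayElementType(typeName: String, precision: Int, scale: Int): DataType = typeName match {
  case "numeric" | "decimal" if precision == 0 => DecimalType(38, 18)   // instead of DecimalType(0, 0)
  case "numeric" | "decimal"                   => DecimalType(precision, scale)
  case "int4"                                  => IntegerType
  case _                                       => StringType            // placeholder for remaining mappings
}

postgresArrayElementType("numeric", 0, 0)   // DecimalType(38,18)
```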
bb97459 [SPARK-26607][SQL][TEST] Remove Spark 2.2.x testing from HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? The vote of final release of `branch-2.2` passed and the branch goes EOL. This PR removes Spark 2.2.x from the testing coverage. ## How was this patch tested? Pass the Jenkins. Closes #23526 from dongjoon-hyun/SPARK-26607. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3587a9a2275615b82492b89204b141636542ce52) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 January 2019, 06:54:16 UTC
bbf61eb [SPARK-26586][SS] Fix race condition that causes streams to run with unexpected confs ## What changes were proposed in this pull request? Fix race condition where streams can have unexpected conf values. New streaming queries should run with isolated SparkSessions so that they aren't affected by conf updates after they are started. In StreamExecution, the parent SparkSession is cloned and used to run each batch, but this cloning happens in a separate thread and may happen after DataStreamWriter.start() returns. If a stream is started and a conf key is set immediately after, the stream is likely to have the new value. ## How was this patch tested? New unit test that fails prior to the production change and passes with it. Closes #23513 from mukulmurthy/26586. Authored-by: Mukul Murthy <mukul.murthy@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> (cherry picked from commit ae382c94dd10ff494dde4de44e66182bf6dbe8f8) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 11 January 2019, 19:46:49 UTC
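The user-visible race in the SPARK-26586 entry above looks roughly like this; the conf key and sink are arbitrary examples, and an active SparkSession `spark` is assumed.

```scala
import org.apache.spark.sql.streaming.Trigger

val df = spark.readStream.format("rate").load()

val query = df.writeStream
  .format("console")                           // arbitrary example sink
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()

// Before the fix, a conf change racing with start() could leak into the
// already-started query, because the session clone happened asynchronously.
spark.conf.set("spark.sql.shuffle.partitions", "1")

// query.awaitTermination() or query.stop() in real code
```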
0e5b316 [SPARK-26551][SQL] Fix schema pruning error when selecting one complex field and having is not null predicate on another one ## What changes were proposed in this pull request? Schema pruning has errors when selecting one complex field and having is not null predicate on another one: ```scala val query = sql("select * from contacts") .where("name.middle is not null") .select( "id", "name.first", "name.middle", "name.last" ) .where("last = 'Jones'") .select(count("id")) ``` ``` java.lang.IllegalArgumentException: middle does not exist. Available: last [info] at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303) [info] at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119) [info] at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302) [info] at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58) [info] at scala.Option.map(Option.scala:163) [info] at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56) [info] at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32) [info] at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(Parque tSchemaPruning.scala:153) ``` ## How was this patch tested? Added tests. Closes #23474 from viirya/SPARK-26551. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit 50ebf3a43b84c8538ec60437189221c2c527990b) Signed-off-by: DB Tsai <d_tsai@apple.com> 11 January 2019, 19:24:06 UTC
b9eb0e8 [SPARK-26576][SQL] Broadcast hint not applied to partitioned table ## What changes were proposed in this pull request? Make sure broadcast hint is applied to partitioned tables. Since the issue exists in branch 2.0 to 2.4, but not in master, I created this PR for branch-2.4. ## How was this patch tested? - A new unit test in PruneFileSourcePartitionsSuite - Unit test suites touched by SPARK-14581: JoinOptimizationSuite, FilterPushdownSuite, ColumnPruningSuite, and PruneFiltersSuite cloud-fan davies rxin Closes #23507 from jzhuge/SPARK-26576. Authored-by: John Zhuge <jzhuge@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com> 11 January 2019, 17:21:13 UTC
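For context, the hint in the SPARK-26576 entry above is the standard broadcast-join hint; an illustrative usage where the table names and join key are placeholders, and an active SparkSession `spark` is assumed.

```scala
import org.apache.spark.sql.functions.broadcast

// Placeholder tables: `small` is a partitioned table that should be broadcast.
val small = spark.table("small_partitioned_table")
val large = spark.table("large_table")

// With the fix, the broadcast hint is honored even though `small` is partitioned.
val joined = large.join(broadcast(small), Seq("key"))
```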
da0b69f [SPARK-22128][CORE][BUILD] Add `paranamer` dependency to `core` module ## What changes were proposed in this pull request? With Scala-2.12 profile, Spark application fails while Spark is okay. For example, our documented `SimpleApp` Java example succeeds to compile but it fails at runtime because it doesn't use `paranamer 2.8` and hits [SPARK-22128](https://issues.apache.org/jira/browse/SPARK-22128). This PR aims to declare it explicitly for the Spark applications. Note that this doesn't introduce new dependency to Spark itself. https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html The following is the dependency tree from the Spark application. **BEFORE** ``` $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) simple --- [INFO] my.test:simple:jar:1.0-SNAPSHOT [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile [INFO] \- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile [INFO] \- org.apache.avro:avro:jar:1.8.2:compile [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile ``` **AFTER** ``` [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) simple --- [INFO] my.test:simple:jar:1.0-SNAPSHOT [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile [INFO] \- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.8:compile ``` ## How was this patch tested? Pass the Jenkins. And manually test with the sample app is running. Closes #23502 from dongjoon-hyun/SPARK-26583. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c7daa95d7f095500b416ba405660f98cd2a39727) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 January 2019, 08:40:36 UTC