https://github.com/apache/spark

c3e32bf Preparing Spark release v2.4.3-rc1 30 April 2019, 22:24:34 UTC
1323ddc Revert "[SPARK-24601][SPARK-27051][BACKPORT][CORE] Update to Jackson 2.9.8 ## What changes were proposed in this pull request? This reverts commit 6f394a20bf49f67b4d6329a1c25171c8024a2fae. In general, we need to be very cautious about the Jackson upgrade in the patch releases, especially when this upgrade could break the existing behaviors of the external packages or data sources, and generate different results after the upgrade. The external packages and data sources need to change their source code to keep the original behaviors. The upgrade requires more discussions before releasing it, I think. In the previous PR https://github.com/apache/spark/pull/22071, we turned off `spark.master.rest.enabled` by default and added the following claim in our security doc: > The Rest Submission Server and the MesosClusterDispatcher do not support authentication. You should ensure that all network access to the REST API & MesosClusterDispatcher (port 6066 and 7077 respectively by default) are restricted to hosts that are trusted to submit jobs. We need to understand whether this Jackson CVE applies to Spark. Before officially releasing it, we need more inputs from all of you. Currently, I would suggest to revert this upgrade from the upcoming 2.4.3 release, which is trying to fix the accidental default Scala version changes in pre-built artifacts. ## How was this patch tested? N/A Closes #24493 from gatorsmile/revert24418. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 April 2019, 15:55:41 UTC
3d49bd4 [SPARK-24935][SQL][FOLLOWUP] support INIT -> UPDATE -> MERGE -> FINISH in Hive UDAF adapter ## What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/24144 . #24144 missed one case: when hash aggregate falls back to sort aggregate, the life cycle of the UDAF is: INIT -> UPDATE -> MERGE -> FINISH. However, not all Hive UDAFs can support it. Hive UDAF knows the aggregation mode when creating the aggregation buffer, so that it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107). The buffer for UPDATE may not support MERGE. This PR updates the Hive UDAF adapter in Spark to support INIT -> UPDATE -> MERGE -> FINISH, by turning it to INIT -> UPDATE -> FINISH + INIT -> MERGE -> FINISH. ## How was this patch tested? a new test case Closes #24459 from cloud-fan/hive-udaf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 7432e7ded44cc0014590d229827546f5d8f93868) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 April 2019, 02:35:52 UTC
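A hypothetical sketch, in Scala, of how an adapted Hive UDAF might be exercised end to end; it assumes a Spark 2.4 build with Hive support, uses Hive's built-in `GenericUDAFMax`, and the function/table names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: requires a Spark build with Hive classes on the classpath.
val spark = SparkSession.builder()
  .appName("hive-udaf-adapter-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Register a Hive UDAF through Spark's adapter (GenericUDAFMax ships with Hive).
spark.sql(
  "CREATE TEMPORARY FUNCTION hive_max AS " +
  "'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax'")

// A wide group-by can make hash aggregation fall back to sort aggregation,
// which is the INIT -> UPDATE -> MERGE -> FINISH path this follow-up covers.
spark.range(0, 100000)
  .selectExpr("id % 1000 AS k", "id AS v")
  .createOrReplaceTempView("t")
spark.sql("SELECT k, hive_max(v) FROM t GROUP BY k").show(5)
```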
ba9e12d [SPARK-26745][SQL][TESTS] JsonSuite test case: empty line -> 0 record count This PR consists of the `test` components of #23665 only, minus the associated patch from that PR. It adds a new unit test to `JsonSuite` which verifies that the `count()` returned from a `DataFrame` loaded from JSON containing empty lines does not include those empty lines in the record count. The test runs `count` prior to otherwise reading data from the `DataFrame`, so as to catch future cases where a pre-parsing optimization might result in `count` results inconsistent with existing behavior. This PR is intended to be deployed alongside #23667; `master` currently causes the test to fail, as described in [SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745). Manual testing, existing `JsonSuite` unit tests. Closes #23674 from sumitsu/json_emptyline_count_test. Authored-by: Branden Smith <branden.smith@publicismedia.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 63bced9375ec1ec6ded220d768cd746050861a09) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 April 2019, 04:44:47 UTC
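A minimal sketch of the behavior the test pins down, assuming a local SparkSession; the temp-file path and JSON payload are illustrative only:

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("json-empty-lines").getOrCreate()

// Two JSON records separated by blank lines.
val path = Files.createTempFile("records", ".json")
Files.write(path, "{\"a\": 1}\n\n{\"a\": 2}\n\n".getBytes("UTF-8"))

val df = spark.read.json(path.toString)
// count() runs before any other action; blank lines must not be counted.
println(df.count()) // expected: 2
```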
fce9b2b [SPARK-25535][CORE][BRANCH-2.4] Work around bad error handling in commons-crypto. The commons-crypto library does some questionable error handling internally, which can lead to JVM crashes if some call into native code fails and cleans up state it should not. While the library is not fixed, this change adds some workarounds in Spark code so that when an error is detected in the commons-crypto side, Spark avoids calling into the library further. Tested with existing and added unit tests. Closes #24476 from vanzin/SPARK-25535-2.4. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 April 2019, 04:23:17 UTC
ec53a19 [SPARK-26891][BACKPORT-2.4][YARN] Fixing flaky test in YarnSchedulerBackendSuite ## What changes were proposed in this pull request? The test "RequestExecutors reflects node blacklist and is serializable" is flaky because of multi threaded access of the mock task scheduler. For details check [Mockito FAQ (occasional exceptions like: WrongTypeOfReturnValue)](https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe). So instead of mocking the task scheduler in the test TaskSchedulerImpl is simply subclassed. This multithreaded access of the `nodeBlacklist()` method is coming from: 1) the unit test thread via calling of the method `prepareRequestExecutors()` 2) the `DriverEndpoint.onStart` which runs a periodic task that ends up calling this method ## How was this patch tested? Existing unittest. (cherry picked from commit e4e4e2b842bffba6805623f2258b27b162b451ba) Closes #24474 from attilapiros/SPARK-26891-branch-2.4. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 April 2019, 21:49:49 UTC
29a4e04 [SPARK-27563][SQL][TEST] automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? We can get the latest downloadable Spark versions from https://dist.apache.org/repos/dist/release/spark/ ## How was this patch tested? manually. Closes #24454 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 April 2019, 08:46:14 UTC
ed0739a add missing import and fix compilation 26 April 2019, 07:33:20 UTC
705507f [SPARK-27494][SS] Null values don't work in Kafka source v2 ## What changes were proposed in this pull request? Right now Kafka source v2 doesn't support null values. The issue is in org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow which doesn't handle null values. ## How was this patch tested? add new unit tests Closes #24441 from uncleGen/SPARK-27494. Authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d2656aaecd4a7b5562d8d2065aaa66fdc72d253d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 April 2019, 06:28:25 UTC
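A hedged sketch of reading such a topic with the Kafka source; the broker address and topic name are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("kafka-null-values").getOrCreate()

// Tombstone records carry a null value; after the fix such rows are preserved
// with value = null instead of breaking the unsafe-row conversion.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                     // placeholder
  .load()
  .select(col("key").cast("string"), col("value").cast("string"))

val query = kafkaDf.writeStream.format("console").start()
```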
ca32108 [MINOR][TEST] switch from 2.4.1 to 2.4.2 in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? update `HiveExternalCatalogVersionsSuite` to test 2.4.2, as 2.4.1 will be removed from Mirror Network soon. ## How was this patch tested? N/A Closes #24452 from cloud-fan/release. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b7f9830670d3bf6c1f80c1a7310517dbc0052d1d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 April 2019, 02:27:22 UTC
34fd79d [SPARK-27550][TEST][BRANCH-2.4] Fix `test-dependencies.sh` not to use `kafka-0-8` profile for Scala-2.12 ## What changes were proposed in this pull request? Since SPARK-27274 deprecated Scala-2.11 at Spark 2.4.1, we need to test Scala-2.12 more. Since Kafka 0.8 doesn't have Scala-2.12 artifacts, e.g., `org.apache.kafka:kafka_2.12:jar:0.8.2.1`, this PR aims to fix the `test-dependencies.sh` script to understand the Scala binary version. ``` $ dev/change-scala-version.sh 2.12 $ dev/test-dependencies.sh Using `mvn` from path: /usr/local/bin/mvn Using `mvn` from path: /usr/local/bin/mvn Performing Maven install for hadoop-2.6 Using `mvn` from path: /usr/local/bin/mvn [ERROR] Failed to execute goal on project spark-streaming-kafka-0-8_2.12: Could not resolve dependencies for project org.apache.spark:spark-streaming-kafka-0-8_2.12:jar:spark-335572: Could not find artifact org.apache.kafka:kafka_2.12:jar:0.8.2.1 in central (https://repo.maven.apache.org/maven2) -> [Help 1] ``` ## How was this patch tested? Manually do `dev/change-scala-version.sh 2.12` and `dev/test-dependencies.sh`. The script should show a `DO NOT MATCH` message instead of a Maven `[ERROR]`. ``` $ dev/test-dependencies.sh Using `mvn` from path: /usr/local/bin/mvn ... Generating dependency manifest for hadoop-3.1 Using `mvn` from path: /usr/local/bin/mvn Spark's published dependencies DO NOT MATCH the manifest file (dev/spark-deps). To update the manifest file, run './dev/test-dependencies.sh --replace-manifest'. diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/pr-deps/spark-deps-hadoop-2.6 ... ``` Closes #24445 from dongjoon-hyun/SPARK-27550. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 April 2019, 14:30:19 UTC
b615f22 [SPARK-27544][PYTHON][TEST][BRANCH-2.4] Fix Python test script to work on Scala-2.12 build ## What changes were proposed in this pull request? Since [SPARK-27274](https://issues.apache.org/jira/browse/SPARK-27274) deprecated Scala-2.11 at Spark 2.4.1, we need to test Scala-2.12 more. This PR aims to fix the Python test script on Scala-2.12 build in `branch-2.4`. **BEFORE** ``` $ dev/change-scala-version.sh 2.12 $ build/sbt -Pscala-2.12 package $ python/run-tests.py --python-executables python2.7 --modules pyspark-sql Traceback (most recent call last): File "python/run-tests.py", line 70, in <module> raise Exception("Cannot find assembly build directory, please build Spark first.") Exception: Cannot find assembly build directory, please build Spark first. ``` **AFTER** ``` $ python/run-tests.py --python-executables python2.7 --modules pyspark-sql Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark/python/unit-tests.log Will test against the following Python executables: ['python2.7'] Will test the following Python modules: ['pyspark-sql'] Starting test(python2.7): pyspark.sql.tests ... ``` ## How was this patch tested? Manually do the above procedure because Jenkins doesn't test Scala-2.12 in `branch-2.4`. Closes #24439 from dongjoon-hyun/SPARK-27544. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 April 2019, 14:00:17 UTC
42cb4a2 [SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values ## What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/24286. As gatorsmile pointed out, the estimation for a column containing null values is inaccurate as well. ``` > select key from test; 2 NULL 1 spark-sql> desc extended test key; col_name key data_type int comment NULL min 1 max 2 num_nulls 1 distinct_count 2 ``` The distinct count should be distinct_count + 1 when the column contains null values. ## How was this patch tested? Existing tests & new UT added. Closes #24436 from pengbo/aggregation_estimation. Authored-by: pengbo <bo.peng1019@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d9b2ce0f0f71eb98ba556244ce50bdb57e566723) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 April 2019, 03:30:58 UTC
4472a9f [SPARK-27469][BUILD][BRANCH-2.4] Unify commons-beanutils deps to latest 1.9.3 ## What changes were proposed in this pull request? Unify commons-beanutils deps to latest 1.9.3 Backport of https://github.com/apache/spark/pull/24378 ## How was this patch tested? Existing tests. Closes #24433 from srowen/SPARK-27469.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 April 2019, 16:07:26 UTC
3ba71e9 [SPARK-27419][FOLLOWUP][DOCS] Add note about spark.executor.heartbeatInterval change to migration guide Add note about spark.executor.heartbeatInterval change to migration guide See also https://github.com/apache/spark/pull/24329 N/A Closes #24432 from srowen/SPARK-27419.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d4a16f46f71021178bfc7dca511e47390986197d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 April 2019, 04:05:57 UTC
33864a8 [SPARK-27496][CORE] Fatal errors should also be sent back to the sender ## What changes were proposed in this pull request? When a fatal error (such as StackOverflowError) throws from "receiveAndReply", we should try our best to notify the sender. Otherwise, the sender will hang until timeout. In addition, when a MessageLoop is dying unexpectedly, it should resubmit a new one so that Dispatcher is still working. ## How was this patch tested? New unit tests. Closes #24396 from zsxwing/SPARK-27496. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 009059e3c261a73d605bc49aee4aecb0eb0e8267) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 April 2019, 00:03:34 UTC
6f394a2 [SPARK-24601][SPARK-27051][BACKPORT][CORE] Update to Jackson 2.9.8 ## What changes were proposed in this pull request? This backports: https://github.com/apache/spark/commit/ab1650d2938db4901b8c28df945d6a0691a19d31 https://github.com/apache/spark/commit/7857c6d633f3df426a6ac4618316eb83b1cefe2b which collectively updates Jackson to 2.9.8. ## How was this patch tested? Existing tests. Closes #24418 from srowen/SPARK-24601.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 April 2019, 16:56:48 UTC
eaa88ae [SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x ## What changes were proposed in this pull request? have jenkins test against python3.6 (instead of 3.4). ## How was this patch tested? extensive testing on both the centos and ubuntu jenkins workers revealed that 2.4 doesn't like python 3.6... :( NOTE: this is just for branch-2.4 PLEASE DO NOT MERGE Closes #24379 from shaneknapp/update-python-executable. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com> 19 April 2019, 16:44:06 UTC
7f64963 [MINOR][TEST] Expand spark-submit test to allow python2/3 executable ## What changes were proposed in this pull request? This backports a tiny part of another change: https://github.com/apache/spark/commit/4bdfda92a1c570d7a1142ee30eb41e37661bc240#diff-3c792ce7265b69b448a984caf629c96bR161 ... which just works around the possibility that the local python interpreter is 'python3' or 'python2' when running the spark-submit tests. I'd like to backport to 2.3 too. This otherwise prevents this test from passing on my mac, though I have a custom install with brew. But may affect others. ## How was this patch tested? Existing tests. Closes #24407 from srowen/Python23check. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 April 2019, 21:30:22 UTC
7a8efc8 Preparing development version 2.4.3-SNAPSHOT 18 April 2019, 13:24:42 UTC
a44880b Preparing Spark release v2.4.2-rc1 18 April 2019, 13:24:38 UTC
2d276c0 [SPARK-27403][SQL] Fix `updateTableStats` to update table stats always with new stats or None ## What changes were proposed in this pull request? The system should update the table stats automatically if the user sets spark.sql.statistics.size.autoUpdate.enabled to true; currently this property has no effect whether it is enabled or disabled. This feature is similar to Hive's auto-gather feature where statistics are automatically computed by default if this feature is enabled. Reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev As part of the fix, the autoSizeUpdateEnabled validation is done up front so that the system will calculate the table size for the user automatically and record it in the metastore as the user expects. ## How was this patch tested? UT is written and manually verified in cluster. Tested with unit tests + some internal tests on real cluster. Before fix: ![image](https://user-images.githubusercontent.com/12999161/55688682-cd8d4780-5998-11e9-85da-e1a4e34419f6.png) After fix: ![image](https://user-images.githubusercontent.com/12999161/55688654-7d15ea00-5998-11e9-973f-1f4cee27018f.png) Closes #24315 from sujith71955/master_autoupdate. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 239082d9667a4fa4198bd9524d63c739df147e0e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 April 2019, 16:22:57 UTC
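A hedged sketch of the intended behavior, assuming an active SparkSession `spark`; the table name is illustrative:

```scala
// With the flag on, table size statistics should be refreshed automatically
// after data-changing commands instead of staying stale in the metastore.
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")

spark.sql("CREATE TABLE stats_demo (id INT) USING parquet")
spark.sql("INSERT INTO stats_demo VALUES (1), (2), (3)")

// The Statistics row (sizeInBytes) should now reflect the inserted data.
spark.sql("DESC EXTENDED stats_demo").filter("col_name = 'Statistics'").show(false)
```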
fb47b9b [SPARK-27479][BUILD] Hide API docs for org.apache.spark.util.kvstore ## What changes were proposed in this pull request? The API docs should not include the "org.apache.spark.util.kvstore" package because they are internal private APIs. See the doc link: https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/kvstore/LevelDB.html ## How was this patch tested? N/A Closes #24386 from gatorsmile/rmDoc. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 61feb1635217ef1d4ebceebc1e7c8829c5c11994) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 17 April 2019, 02:53:20 UTC
df9a506 [SPARK-27453] Pass partitionBy as options in DataFrameWriter Pass partitionBy columns as options and feature-flag this behavior. A new unit test. Closes #24365 from liwensun/partitionby. Authored-by: liwensun <liwen.sun@databricks.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> (cherry picked from commit 26ed65f4150db1fa37f8bfab24ac0873d2e42936) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 16 April 2019, 22:37:43 UTC
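A small sketch of the writer path involved, assuming an active SparkSession `spark`; the output path is a placeholder:

```scala
// partitionBy columns on DataFrameWriter; with this change they can also be
// forwarded to the underlying source as options behind a feature flag.
val sales = spark.range(0, 100).selectExpr("id", "id % 7 AS day")

sales.write
  .mode("overwrite")
  .partitionBy("day")
  .parquet("/tmp/sales_by_day") // placeholder path
```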
40668c5 [SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit… ## What changes were proposed in this pull request? The upper bound on the group-by row count is the product of the distinct counts of the group-by columns. However, a column containing only null values will cause the output row number to be 0, which is incorrect. Ex: col1 (distinct: 2, rowCount 2) col2 (distinct: 0, rowCount 2) => group by col1, col2 Actual: output rows: 0 Expected: output rows: 2 ## How was this patch tested? A corresponding unit test has been added, plus manual testing has been done in our TPC-DS benchmark environment. Closes #24286 from pengbo/master. Lead-authored-by: pengbo <bo.peng1019@gmail.com> Co-authored-by: mingbo_pb <mingbo.pb@alibaba-inc.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c58a4fed8d79aff9fbac9f9a33141b2edbfb0cea) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 April 2019, 22:37:21 UTC
ede02b6 Revert "[SPARK-23433][SPARK-25250][CORE] Later created TaskSet should learn about the finished partitions" This reverts commit db86ccb11821231d85b727fb889dec1d58b39e4d. 14 April 2019, 09:01:02 UTC
a8a2ba1 [SPARK-27394][WEBUI] Flush LiveEntity if necessary when receiving SparkListenerExecutorMetricsUpdate (backport 2.4) ## What changes were proposed in this pull request? This PR backports #24303 to 2.4. ## How was this patch tested? Jenkins Closes #24328 from zsxwing/SPARK-27394-2.4. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 10 April 2019, 22:17:04 UTC
3352803 [SPARK-27406][SQL] UnsafeArrayData serialization breaks when two machi… This PR is the branch-2.4 version for https://github.com/apache/spark/pull/24317 Closes #24324 from pengbo/SPARK-27406-branch-2.4. Authored-by: mingbo_pb <mingbo.pb@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 April 2019, 09:05:11 UTC
53658ab [SPARK-27419][CORE] Avoid casting heartbeat interval to seconds (2.4) ## What changes were proposed in this pull request? Right now as we cast the heartbeat interval to seconds, any value less than 1 second will be casted to 0. This PR just backports the changes of the heartbeat interval in https://github.com/apache/spark/pull/22473 from master. ## How was this patch tested? Jenkins Closes #24329 from zsxwing/SPARK-27419. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 April 2019, 01:41:36 UTC
baadfc8 [SPARK-27391][SS] Don't initialize a lazy val in ContinuousExecution job. ## What changes were proposed in this pull request? Fix a potential deadlock in ContinuousExecution by not initializing the toRDD lazy val. Closes #24301 from jose-torres/deadlock. Authored-by: Jose Torres <torres.joseph.f+github@gmail.com> Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com> (cherry picked from commit 4a5768b2a2adf87e3df278655918f72558f0b3b9) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2019, 23:37:23 UTC
ecb0109 [SPARK-27390][CORE][SQL][TEST] Fix package name mismatch This PR aims to clean up package name mismatches. Pass the Jenkins. Closes #24300 from dongjoon-hyun/SPARK-27390. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 982c4c8e3cfc25822e0d755d8d1daa324e6399b8) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2019, 18:51:39 UTC
5499902 [SPARK-27358][UI] Update jquery to 1.12.x to pick up security fixes Update jquery -> 1.12.4, datatables -> 1.10.18, mustache -> 2.3.12. Add missing mustache license I manually tested the UI locally with the javascript console open and didn't observe any problems or JS errors. The only 'risky' change seems to be mustache, but on reading its release notes, don't think the changes from 0.8.1 to 2.x would affect Spark's simple usage. Closes #24288 from srowen/SPARK-27358. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 23bde447976d0bb33cae67124bac476994634f04) Signed-off-by: Sean Owen <sean.owen@databricks.com> 05 April 2019, 17:59:38 UTC
835dd6a [MINOR][DOC] Fix html tag broken in configuration.md ## What changes were proposed in this pull request? This patch fixes wrong HTML tag in configuration.md which breaks the table tag. This is originally reported in dev mailing list: https://lists.apache.org/thread.html/744bdc83b3935776c8d91bf48fdf80d9a3fed3858391e60e343206f9%3Cdev.spark.apache.org%3E ## How was this patch tested? This change is one-liner and pretty obvious so I guess we may be able to skip testing. Closes #24304 from HeartSaVioR/MINOR-configuration-doc-html-tag-error. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a840b99daf97de06c9b1b66efed0567244ec4a01) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2019, 15:41:33 UTC
1a72c15 [SPARK-27216][CORE][BACKPORT-2.4] Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue ## What changes were proposed in this pull request? Back-port of #24264 to branch-2.4. HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But RoaringBitmap couldn't be serialized/deserialized with the unsafe KryoSerializer. It's a bug in RoaringBitmap 0.5.11 and is fixed in the latest version. ## How was this patch tested? Added a UT Closes #24290 from LantaoJin/SPARK-27216_BACKPORT-2.4. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 04 April 2019, 23:22:57 UTC
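A hedged sketch of the configuration that exercised the bug (unsafe Kryo serialization, which shuffle map statuses hit via RoaringBitmap); the app name is a placeholder:

```scala
import org.apache.spark.SparkConf

// Shuffle map statuses (HighlyCompressedMapStatus) embed RoaringBitmap, so
// this combination drove the ser/deser failure before the upgrade.
val conf = new SparkConf()
  .setAppName("kryo-unsafe-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.unsafe", "true")
```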
bcd56d0 [SPARK-27382][SQL][TEST] Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? Since Apache Spark 2.4.1 vote passed and is distributed into mirrors, we need to test 2.4.1. This should land on both `master` and `branch-2.4`. ## How was this patch tested? Pass the Jenkins. Closes #24292 from dongjoon-hyun/SPARK-27382. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 938d95437526e1108e6c09f09cec96e9800b2143) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 April 2019, 20:50:10 UTC
af0a4bb [SPARK-27338][CORE][FOLLOWUP] remove trailing space ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it. ## How was this patch tested? N/A Closes #24289 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 April 2019, 06:55:52 UTC
93c14c6 [SPARK-27338][CORE] Fix deadlock in UnsafeExternalSorter.SpillableIterator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager ## What changes were proposed in this pull request? `UnsafeExternalSorter.SpillableIterator#loadNext()` takes a lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed, which needs to take a lock on `TaskMemoryManager`. At the same time, another MemoryConsumer using the `UnsafeExternalSorter` as part of sorting can try to `allocatePage`, which needs to get a lock on `TaskMemoryManager`; that can cause a spill to happen, which requires a lock on `UnsafeExternalSorter` again, causing a deadlock. This is a classic deadlock situation, similar to SPARK-26265. To fix this, we can move the `freePage` call in `loadNext` outside of the `synchronized` block, similar to the fix in SPARK-26265. ## How was this patch tested? Manual tests were done; a test will also be added. Closes #24265 from venkata91/deadlock-sorter. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@qubole.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6c4552c65045cfe82ed95212ee7cff684e44288b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 April 2019, 01:58:49 UTC
ed3ffda [MINOR][DOC][SQL] Remove out-of-date doc about ORC in DataFrameReader and Writer ## What changes were proposed in this pull request? According to the current status, `orc` is available even when Hive support isn't enabled. This is a minor doc change to reflect it. ## How was this patch tested? Doc-only change. Closes #24280 from viirya/fix-orc-doc. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d04a7371daec4af046a35066f9664c5011162baa) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 April 2019, 16:11:27 UTC
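A quick sketch of what the updated doc describes, assuming a plain (non-Hive) local session; the output path is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// No enableHiveSupport(): the native ORC reader/writer still works.
val spark = SparkSession.builder().master("local[*]").appName("orc-without-hive").getOrCreate()

spark.range(0, 10).write.mode("overwrite").orc("/tmp/orc_demo")
spark.read.orc("/tmp/orc_demo").show()
```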
cf6bf0f [SPARK-27346][SQL] Loosen the newline assert condition on 'examples' field in ExpressionInfo ## What changes were proposed in this pull request? I haven't tested by myself on Windows and I am not 100% sure if this is going to cause an actual problem. However, this one line: https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java#L82 made me to investigate a lot today. Given my speculation, if Spark is built in Linux and it's executed on Windows, it looks possible for multiline strings, like, https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L146-L150 to throw an exception because the newline in the binary is `\n` but `System.lineSeparator` returns `\r\n`. I think this is not yet found because this particular codes are not released yet (see SPARK-26426). Looks just better to loosen the condition and forget about this stuff. This should be backported into branch-2.4 as well. ## How was this patch tested? N/A Closes #24274 from HyukjinKwon/SPARK-27346. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 949d71283932ba4ce50aa6b329665e0f8be7ecf1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 April 2019, 23:28:00 UTC
55e6f7a [SPARK-26998][CORE] Remove SSL configuration from executors ## What changes were proposed in this pull request? Different SSL passwords shown up as command line argument on executor side in standalone mode: * keyStorePassword * keyPassword * trustStorePassword In this PR I've removed SSL configurations from executors. ## How was this patch tested? Existing + additional unit tests. Additionally tested with standalone mode and checked the command line arguments: ``` [gaborsomogyi:~/spark] SPARK-26998(+4/-0,3)+ ± jps 94803 CoarseGrainedExecutorBackend 94818 Jps 90149 RemoteMavenServer 91925 Nailgun 94793 SparkSubmit 94680 Worker 94556 Master 398 [gaborsomogyi:~/spark] SPARK-26998(+4/-1,3)+ ± ps -ef | egrep "94556|94680|94793|94803" 502 94556 1 0 2:02PM ttys007 0:07.39 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host gsomogyi-MBP.local --port 7077 --webui-port 8080 --properties-file conf/spark-defaults.conf 502 94680 1 0 2:02PM ttys007 0:07.27 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 --properties-file conf/spark-defaults.conf spark://gsomogyi-MBP.local:7077 502 94793 94782 0 2:02PM ttys007 0:35.52 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://gsomogyi-MBP.local:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell 502 94803 94680 0 2:03PM ttys007 0:05.20 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1024M -Dspark.ssl.ui.port=0 -Dspark.driver.port=60902 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler172.30.65.186:60902 --executor-id 0 --hostname 172.30.65.186 --cores 8 --app-id app-20190326140311-0000 --worker-url spark://Worker172.30.65.186:60899 502 94910 57352 0 2:05PM ttys008 0:00.00 egrep 94556|94680|94793|94803 ``` Closes #24170 from gaborgsomogyi/SPARK-26998. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 57aff93886ac7d02b88294672ce0d2495b0942b8) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 02 April 2019, 16:19:13 UTC
66dfece [SPARK-27244][CORE][TEST][FOLLOWUP] toDebugString redacts sensitive information ## What changes were proposed in this pull request? This PR is a FollowUp of https://github.com/apache/spark/pull/24196. It improves the test case by using the parameters that are being used in the actual scenarios. ## How was this patch tested? N/A Closes #24257 from gatorsmile/followupSPARK-27244. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 92b6f86f6d25abbc2abbf374e77c0b70cd1779c7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 March 2019, 05:58:45 UTC
de1238a [MINOR][R] fix R project description update as per this NOTE when running CRAN check ``` The Title field should be in title case, current version then in title case: ‘R Front end for 'Apache Spark'’ ‘R Front End for 'Apache Spark'’ ``` Closes #24255 from felixcheung/rdesc. Authored-by: Felix Cheung <felixcheung_m@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit fa0f791d4d9f083a45ab631a2e9f88a6b749e416) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 March 2019, 04:01:55 UTC
b5c099e [SPARK-27267][FOLLOWUP][BRANCH-2.4] Update hadoop-2.6 dependency manifest ## What changes were proposed in this pull request? This updates `hadoop-2.6` dependency manifest in `branch-2.4`, too. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.7/351/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.6/345/ ## How was this patch tested? Pass the Jenkins. Or, `dev/test-dependencies.sh`. Closes #24254 from dongjoon-hyun/SPARK-27267. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 March 2019, 00:46:33 UTC
0ed87bf [SPARK-27267][CORE] Update snappy to avoid error when decompressing empty serialized data (See JIRA for problem statement) Update snappy 1.1.7.1 -> 1.1.7.3 to pick up an empty-stream and Java 9 fix. There appear to be no other changes of consequence: https://github.com/xerial/snappy-java/blob/master/Milestone.md Existing tests Closes #24242 from srowen/SPARK-27267. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 2ec650d84316d5820795d44dbc3574885b358698) Signed-off-by: Sean Owen <sean.owen@databricks.com> 30 March 2019, 07:42:52 UTC
b41f912 [SPARK-27301][DSTREAM] Shorten the FileSystem cached life cycle to the cleanup method inner scope ## What changes were proposed in this pull request? The cached FileSystem's token will expire if no tokens are explicitly added to it. ```scala 19/03/28 13:40:16 INFO storage.BlockManager: Removing RDD 83189 19/03/28 13:40:16 INFO rdd.MapPartitionsRDD: Removing RDD 82860 from persistence list 19/03/28 13:40:16 INFO spark.ContextCleaner: Cleaned shuffle 6005 19/03/28 13:40:16 INFO storage.BlockManager: Removing RDD 82860 19/03/28 13:40:16 INFO scheduler.ReceivedBlockTracker: Deleting batches: 19/03/28 13:40:16 INFO scheduler.InputInfoTracker: remove old batch metadata: 1553750250000 ms 19/03/28 13:40:17 WARN security.UserGroupInformation: PriviledgedActionException as:ursHADOOP.HZ.NETEASE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN security.UserGroupInformation: PriviledgedActionException as:ursHADOOP.HZ.NETEASE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-1396157959_1] for 53 seconds. Will retry shortly ... 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 at org.apache.hadoop.ipc.Client.call(Client.java:1468) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy11.renewLease(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571) at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy12.renewLease(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:878) at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442) at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298) at java.lang.Thread.run(Thread.java:748) ``` This PR shorten the FileSystem cached life cycle to the cleanup method inner scope in case of token expiry. ## How was this patch tested? existing ut Closes #24235 from yaooqinn/SPARK-27301. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit f4c73b7c685b901dd69950e4929c65e3b8dd3a55) Signed-off-by: Sean Owen <sean.owen@databricks.com> 30 March 2019, 07:36:16 UTC
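A hedged sketch of the pattern the fix applies, with made-up method and argument names: resolve the FileSystem inside the cleanup scope rather than caching it in a long-lived field, so an expired delegation token is not reused across batches:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def cleanupOldFiles(dir: String, olderThanMs: Long, hadoopConf: Configuration): Unit = {
  val path = new Path(dir)
  // Resolved per call, not held as a member across the streaming query's lifetime.
  val fs: FileSystem = path.getFileSystem(hadoopConf)
  fs.listStatus(path)
    .filter(_.getModificationTime < olderThanMs)
    .foreach(status => fs.delete(status.getPath, true))
}
```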
4edc535 [SPARK-27244][CORE] Redact Passwords While Using Option logConf=true ## What changes were proposed in this pull request? When logConf is set to true, config keys that contain passwords were printed in clear text in the driver log. This change uses the existing redact method in Utils to redact all passwords based on the redaction pattern in SparkConf and then prints the conf to the driver log, thus ensuring that sensitive information such as passwords is not printed in clear text. ## How was this patch tested? This patch was tested through `SparkConfSuite` and then the entire unit test suite through sbt. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24196 from ninadingole/SPARK-27244. Authored-by: Ninad Ingole <robert.wallis@example.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit dbc7ce18b934fbfd0743b1348fc1265778f07027) Signed-off-by: Sean Owen <sean.owen@databricks.com> 29 March 2019, 19:17:32 UTC
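A hedged sketch of the behavior, assuming the default `spark.redaction.regex`; the password value is obviously a placeholder:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// With spark.logConf=true, keys matching the redaction pattern (by default,
// anything containing "secret" or "password") should be logged redacted.
val conf = new SparkConf()
  .setAppName("redaction-sketch")
  .set("spark.logConf", "true")
  .set("spark.ssl.keyStorePassword", "not-a-real-password") // placeholder

val spark = SparkSession.builder().master("local[*]").config(conf).getOrCreate()
```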
298e4fa [SPARK-27275][CORE] Fix potential corruption in EncryptedMessage.transferTo (2.4) ## What changes were proposed in this pull request? Backport https://github.com/apache/spark/pull/24211 to 2.4 ## How was this patch tested? Jenkins Closes #24229 from zsxwing/SPARK-27275-2.4. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 March 2019, 18:13:11 UTC
f0a3e89 Preparing development version 2.4.2-SNAPSHOT 26 March 2019, 04:38:38 UTC
5830101 Preparing Spark release v2.4.1-rc9 26 March 2019, 04:38:19 UTC
c7fd233 [SPARK-26961][CORE] Enable parallel classloading capability ## What changes were proposed in this pull request? As per https://docs.oracle.com/javase/8/docs/api/java/lang/ClassLoader.html ``Class loaders that support concurrent loading of classes are known as parallel capable class loaders and are required to register themselves at their class initialization time by invoking the ClassLoader.registerAsParallelCapable method. Note that the ClassLoader class is registered as parallel capable by default. However, its subclasses still need to register themselves if they are parallel capable. `` i.e we can have finer class loading locks by registering classloaders as parallel capable. (Refer to deadlock due to macro lock https://issues.apache.org/jira/browse/SPARK-26961). All the classloaders we have are wrapper of URLClassLoader which by itself is parallel capable. But this cannot be achieved by scala code due to static registration Refer https://github.com/scala/bug/issues/11429 ## How was this patch tested? All Existing UT must pass Closes #24126 from ajithme/driverlock. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit b61dce23d2ee7ca95770bc7c390029aae8c65f7e) Signed-off-by: Sean Owen <sean.owen@databricks.com> 26 March 2019, 00:07:51 UTC
6e4cd88 [SPARK-27274][DOCS] Fix references to scala 2.11 in 2.4.1+ docs; Note 2.11 support is deprecated in 2.4.1+ ## What changes were proposed in this pull request? Fix references to scala 2.11 in 2.4.x docs; should default to 2.12. Note 2.11 support is deprecated in 2.4.x. Note that this change isn't needed in master as it's already on 2.12 in docs by default. ## How was this patch tested? Docs build. Closes #24210 from srowen/Scala212docs24. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 26 March 2019, 00:06:17 UTC
f27c951 [SPARK-27198][CORE] Heartbeat interval mismatch in driver and executor ## What changes were proposed in this pull request? When the heartbeat interval is configured via spark.executor.heartbeatInterval without specifying units, the driver interprets the value as seconds while the executor interprets it as milliseconds, causing a mismatch. ## How was this patch tested? Will add UTs Closes #24140 from ajithme/intervalissue. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 25 March 2019, 20:38:07 UTC
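A small sketch of a safer way to configure this regardless of the fix: give the interval an explicit unit so neither side has to guess (the app name is a placeholder):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("heartbeat-interval-sketch")
  .set("spark.executor.heartbeatInterval", "10s")  // explicit unit, no ambiguity
  .set("spark.network.timeout", "120s")            // keep this larger than the heartbeat
```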
704b75d [SPARK-27094][YARN][BRANCH-2.4] Work around RackResolver swallowing thread interrupt. To avoid the case where the YARN libraries would swallow the exception and prevent YarnAllocator from shutting down, call the offending code in a separate thread, so that the parent thread can respond appropriately to the shut down. As a safeguard, also explicitly stop the executor launch thread pool when shutting down the application, to prevent new executors from coming up after the application started its shutdown. Tested with unit tests + some internal tests on real cluster. Closes #24206 from vanzin/SPARK-27094-2.4. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: DB Tsai <d_tsai@apple.com> 25 March 2019, 18:13:20 UTC
0faf828 Revert "Revert "[SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client"" This reverts commits 3fc626d874d0201ada8387a7e5806672c79cd6b3. Closes #24192 from HeartSaVioR/WIP-testing-SPARK-26606-in-branch-2.4. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 March 2019, 23:46:36 UTC
0cfefa7 [SPARK-24935][SQL] fix Hive UDAF with two aggregation buffers ## What changes were proposed in this pull request? Hive UDAF knows the aggregation mode when creating the aggregation buffer, so that it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107). However, the Hive UDAF adapter in Spark always creates the buffer with partial1 mode, which can only deal with one input: the original data. This PR fixes it. All credit goes to pgandhi999, who investigated the problem, studied the Hive UDAF behaviors, and wrote the tests. Closes https://github.com/apache/spark/pull/23778 ## How was this patch tested? a new test Closes #24144 from cloud-fan/hive. Lead-authored-by: pgandhi <pgandhi@verizonmedia.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit a6c207c9c0c7aa057cfa27d16fe882b396440113) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 24 March 2019, 23:08:39 UTC
3fc626d Revert "[SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client" This reverts commit 6f1a8d8bfdd8dccc9af2d144ea5ad644ddc63a81. 23 March 2019, 19:19:34 UTC
f3ba73a [SPARK-27160][SQL] Fix DecimalType when building orc filters A DecimalType Literal should not be cast to Long. E.g., for `df.filter("x < 3.14")`, assuming df (with x in DecimalType) reads from an ORC table and uses the native ORC reader with predicate pushdown enabled, we will push down the `x < 3.14` predicate to the ORC reader via a SearchArgument. OrcFilters will construct the SearchArgument, but not handle the DecimalType correctly. The previous implementation would construct `x < 3` from `x < 3.14`. ``` $ sbt > sql/testOnly *OrcFilterSuite > sql/testOnly *OrcQuerySuite -- -z "27160" ``` Closes #24092 from sadhen/spark27160. Authored-by: Darcy Shen <sadhen@zoho.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 March 2019, 17:33:22 UTC
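A hedged repro sketch assuming an active SparkSession `spark`; the output path is illustrative and the expected count follows from the generated data:

```scala
// Write a decimal column to ORC, then filter with pushdown enabled.
spark.conf.set("spark.sql.orc.filterPushdown", "true")

spark.sql("SELECT CAST(id AS DECIMAL(10, 2)) / 100 AS x FROM range(1000)")
  .write.mode("overwrite").orc("/tmp/decimal_orc")

// Values are 0.00 .. 9.99; with a correct predicate, `x < 3.14` matches 314 rows.
// Per the message above, the previous implementation pushed `x < 3` instead.
println(spark.read.orc("/tmp/decimal_orc").filter("x < 3.14").count())
```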
6f1a8d8 [SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client This patch fixes the issue that ClientEndpoint in a standalone cluster doesn't recognize driver options that are passed to SparkConf instead of system properties. When `Client` is executed via the CLI they should be provided as system properties, but with `spark-submit` they can be provided as SparkConf. (SparkSubmit will call `ClientApp.start` with SparkConf which would contain these options.) Manually tested via the following steps: 1) setup standalone cluster (launch master and worker via `./sbin/start-all.sh`) 2) submit one of the example apps in standalone cluster mode ``` ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master "spark://localhost:7077" --conf "spark.driver.extraJavaOptions=-Dfoo=BAR" --deploy-mode "cluster" --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 examples/jars/spark-examples*.jar 10 ``` 3) check whether `foo=BAR` is provided in system properties in Spark UI <img width="877" alt="Screen Shot 2019-03-21 at 8 18 04 AM" src="https://user-images.githubusercontent.com/1317309/54728501-97db1700-4bc1-11e9-89da-078445c71e9b.png"> Closes #24163 from HeartSaVioR/SPARK-26606. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 8a9eb05137cd4c665f39a54c30d46c0c4eb7d20b) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 22 March 2019, 22:12:35 UTC
95e73b3 [SPARK-27112][CORE] Create a resource ordering between threads to resolve the deadlocks encountered when trying to kill executors either due to dynamic allocation or blacklisting Closes #24072 from pgandhi999/SPARK-27112-2. Authored-by: pgandhi <pgandhi@verizonmedia.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> ## What changes were proposed in this pull request? There are two deadlocks as a result of the interplay between three different threads: **task-result-getter thread** **spark-dynamic-executor-allocation thread** **dispatcher-event-loop thread(makeOffers())** The fix ensures an ordering constraint by acquiring the lock on `TaskSchedulerImpl` before acquiring the lock on `CoarseGrainedSchedulerBackend` in `makeOffers()` as well as in the `killExecutors()` method. This ensures resource ordering between the threads and thus fixes the deadlocks. ## How was this patch tested? Manual Tests Closes #24134 from pgandhi999/branch-2.4-SPARK-27112. Authored-by: pgandhi <pgandhi@verizonmedia.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> 19 March 2019, 21:22:40 UTC
342e91f [SPARK-27178][K8S][BRANCH-2.4] adding nss package to fix tests ## What changes were proposed in this pull request? see also: https://github.com/apache/spark/pull/24111 while performing some tests on our existing minikube and k8s infrastructure, i noticed that the integration tests were failing. i dug in and discovered the following message buried at the end of the stacktrace: ``` Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so at sun.security.pkcs11.Secmod.initialize(Secmod.java:193) at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:218) ... 81 more ``` after i added the `nss` package to `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile`, everything worked. this is also impacting current builds. see: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/8959/console ## How was this patch tested? i tested locally before pushing, and the build system will test the rest. Closes #24137 from shaneknapp/add-nss-package. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 18 March 2019, 23:51:57 UTC
361c942 [SPARK-27134][SQL] array_distinct function does not work correctly with columns containing array of array ## What changes were proposed in this pull request? Correct the logic to compute the distinct. Below is a small repro snippet. ``` scala> val df = Seq(Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))).toDF("array_col") df: org.apache.spark.sql.DataFrame = [array_col: array<array<int>>] scala> val distinctDF = df.select(array_distinct(col("array_col"))) distinctDF: org.apache.spark.sql.DataFrame = [array_distinct(array_col): array<array<int>>] scala> df.show(false) +----------------------------------------+ |array_col | +----------------------------------------+ |[[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]| +----------------------------------------+ ``` Error ``` scala> distinctDF.show(false) +-------------------------+ |array_distinct(array_col)| +-------------------------+ |[[1, 2], [1, 2], [1, 2]] | +-------------------------+ ``` Expected result ``` scala> distinctDF.show(false) +-------------------------+ |array_distinct(array_col)| +-------------------------+ |[[1, 2], [3, 4], [4, 5]] | +-------------------------+ ``` ## How was this patch tested? Added an additional test. Closes #24073 from dilipbiswal/SPARK-27134. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit aea9a574c44768d1d93ee7e8069729383859292c) Signed-off-by: Sean Owen <sean.owen@databricks.com> 16 March 2019, 19:31:17 UTC
a2f9684 [SPARK-27165][SPARK-27107][BRANCH-2.4][BUILD][SQL] Upgrade Apache ORC to 1.5.5 ## What changes were proposed in this pull request? This PR aims to update Apache ORC dependency to fix [SPARK-27107](https://issues.apache.org/jira/browse/SPARK-27107) . ``` [ORC-452] Support converting MAP column from JSON to ORC Improvement [ORC-447] Change the docker scripts to keep a persistent m2 cache [ORC-463] Add `version` command [ORC-475] ORC reader should lazily get filesystem [ORC-476] Make SearchAgument kryo buffer size configurable ``` ## How was this patch tested? Pass the Jenkins with the existing tests. Closes #24097 from dongjoon-hyun/SPARK-27165-2.4. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 March 2019, 03:13:21 UTC
2d4e9cf [SPARK-26742][K8S][BRANCH-2.4] Update k8s client version to 4.1.2 ## What changes were proposed in this pull request? Updates client version and fixes some related issues. ## How was this patch tested? Tested with the latest minikube version and k8s 1.13. KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. Run completed in 4 minutes, 20 seconds. Total number of tests run: 14 Suites: completed 2, aborted 0 Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0 All tests passed. [INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM 2.4.2-SNAPSHOT ............ SUCCESS [ 2.980 s] [INFO] Spark Project Tags ................................. SUCCESS [ 2.880 s] [INFO] Spark Project Local DB ............................. SUCCESS [ 1.954 s] [INFO] Spark Project Networking ........................... SUCCESS [ 3.369 s] [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 1.791 s] [INFO] Spark Project Unsafe ............................... SUCCESS [ 1.845 s] [INFO] Spark Project Launcher ............................. SUCCESS [ 3.725 s] [INFO] Spark Project Core ................................. SUCCESS [ 23.572 s] [INFO] Spark Project Kubernetes Integration Tests 2.4.2-SNAPSHOT SUCCESS [04:25 min] [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 05:08 min [INFO] Finished at: 2019-03-06T18:03:55Z [INFO] ------------------------------------------------------------------------ Closes #23993 from skonto/fix-k8s-version. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 14 March 2019, 16:29:52 UTC
7f5bdd7 [MINOR][CORE] Use https for bintray spark-packages repository ## What changes were proposed in this pull request? This patch changes the schema of url from http to https for bintray spark-packages repository. Looks like we already changed the schema of repository url for pom.xml but missed inside the code. ## How was this patch tested? Manually ran the `--package` via `./bin/spark-shell --verbose --packages "RedisLabs:spark-redis:0.3.2"` ``` ... Ivy Default Cache set to: /Users/jlim/.ivy2/cache The jars for the packages stored in: /Users/jlim/.ivy2/jars :: loading settings :: url = jar:file:/Users/jlim/WorkArea/ScalaProjects/spark/dist/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml RedisLabs#spark-redis added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-2fee2e18-7832-4a4d-9e97-7b3d0fef766d;1.0 confs: [default] found RedisLabs#spark-redis;0.3.2 in spark-packages found redis.clients#jedis;2.7.2 in central found org.apache.commons#commons-pool2;2.3 in central downloading https://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar ... [SUCCESSFUL ] RedisLabs#spark-redis;0.3.2!spark-redis.jar (824ms) downloading https://repo1.maven.org/maven2/redis/clients/jedis/2.7.2/jedis-2.7.2.jar ... [SUCCESSFUL ] redis.clients#jedis;2.7.2!jedis.jar (576ms) downloading https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.3/commons-pool2-2.3.jar ... [SUCCESSFUL ] org.apache.commons#commons-pool2;2.3!commons-pool2.jar (150ms) :: resolution report :: resolve 4586ms :: artifacts dl 1555ms :: modules in use: RedisLabs#spark-redis;0.3.2 from spark-packages in [default] org.apache.commons#commons-pool2;2.3 from central in [default] redis.clients#jedis;2.7.2 from central in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 3 | 3 | 3 | 0 || 3 | 3 | --------------------------------------------------------------------- ``` Closes #24061 from HeartSaVioR/MINOR-use-https-to-bintray-repository. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit f57af2286f85bf67706e14fecfbfd9ef034c2927) Signed-off-by: Sean Owen <sean.owen@databricks.com> 12 March 2019, 23:01:43 UTC
432ea69 [SPARK-26927][CORE] Ensure executor is active when processing events in dynamic allocation manager. There is a race condition in the `ExecutorAllocationManager` where the `SparkListenerExecutorRemoved` event is posted before the `SparkListenerTaskStart` event, which causes an incorrect `executorIds` result. Then, when some executor idles, real executors will be removed even when the actual executor number is equal to `minNumExecutors`, due to the incorrect computation of `newExecutorTotal` (which may be greater than `minNumExecutors`), which can finally lead to zero available executors while a wrong positive number of executorIds is kept in memory. What's more, even the `SparkListenerTaskEnd` event cannot release the fake `executorIds`, because a later idle event for the fake executors cannot cause their real removal, as they are already removed and do not exist in the `executorDataMap` of `CoarseGrainedSchedulerBackend`, so the `onExecutorRemoved` method will never be called again. For details see https://issues.apache.org/jira/browse/SPARK-26927 This PR fixes this problem. Existing UT and added UT Closes #23842 from liupc/Fix-race-condition-that-casues-dyanmic-allocation-not-working. Lead-authored-by: Liupengcheng <liupengcheng@xiaomi.com> Co-authored-by: liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit d5cfe08fdc7ad07e948f329c0bdeeca5c2574a18) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 12 March 2019, 21:13:20 UTC
dba5bac Preparing development version 2.4.2-SNAPSHOT 10 March 2019, 06:34:15 UTC
746b3dd Preparing Spark release v2.4.1-rc8 10 March 2019, 06:33:54 UTC
a017a1c [SPARK-27097][CHERRY-PICK 2.4] Avoid embedding platform-dependent offsets literally in whole-stage generated code ## What changes were proposed in this pull request? Spark SQL performs whole-stage code generation to speed up query execution. There are two steps to it: - Java source code is generated from the physical query plan on the driver. A single version of the source code is generated from a query plan, and sent to all executors. - It's compiled to bytecode on the driver to catch compilation errors before sending to executors, but currently only the generated source code gets sent to the executors. The bytecode compilation is for fail-fast only. - Executors receive the generated source code and compile to bytecode, then the query runs like a hand-written Java program. In this model, there's an implicit assumption about the driver and executors being run on similar platforms. Some code paths accidentally embedded platform-dependent object layout information into the generated code, such as: ```java Platform.putLong(buffer, /* offset */ 24, /* value */ 1); ``` This code expects a field to be at offset +24 of the `buffer` object, and sets a value to that field. But whole-stage code generation generally uses platform-dependent information from the driver. If the object layout is significantly different on the driver and executors, the generated code can be reading/writing to wrong offsets on the executors, causing all kinds of data corruption. One code pattern that leads to such problem is the use of `Platform.XXX` constants in generated code, e.g. `Platform.BYTE_ARRAY_OFFSET`. Bad: ```scala val baseOffset = Platform.BYTE_ARRAY_OFFSET // codegen template: s"Platform.putLong($buffer, $baseOffset, $value);" ``` This will embed the value of `Platform.BYTE_ARRAY_OFFSET` on the driver into the generated code. Good: ```scala val baseOffset = "Platform.BYTE_ARRAY_OFFSET" // codegen template: s"Platform.putLong($buffer, $baseOffset, $value);" ``` This will generate the offset symbolically -- `Platform.putLong(buffer, Platform.BYTE_ARRAY_OFFSET, value)`, which will be able to pick up the correct value on the executors. Caveat: these offset constants are declared as runtime-initialized `static final` in Java, so they're not compile-time constants from the Java language's perspective. It does lead to a slightly increased size of the generated code, but this is necessary for correctness. NOTE: there can be other patterns that generate platform-dependent code on the driver which is invalid on the executors. e.g. if the endianness is different between the driver and the executors, and if some generated code makes strong assumption about endianness, it would also be problematic. ## How was this patch tested? Added a new test suite `WholeStageCodegenSparkSubmitSuite`. This test suite needs to set the driver's extraJavaOptions to force the driver and executor use different Java object layouts, so it's run as an actual SparkSubmit job. Authored-by: Kris Mok <kris.mokdatabricks.com> Closes #24032 from gatorsmile/testFailure. Lead-authored-by: Kris Mok <kris.mok@databricks.com> Co-authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com> 10 March 2019, 06:00:36 UTC
53590f2 [SPARK-27111][SS] Fix a race that a continuous query may fail with InterruptedException Before a Kafka consumer gets assigned with partitions, its offset will contain 0 partitions. However, runContinuous will still run and launch a Spark job having 0 partitions. In this case, there is a race that epoch may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` will get interrupted unintentionally. To handle this case, this PR has the following changes: - Clean up the resources in `queryExecutionThread.runUninterruptibly`. This may increase the waiting time of `stop` but should be minor because the operations here are very fast (just sending an RPC message in the same process and stopping a very simple thread). - Clear the interrupted status at the end so that it won't impact the `runContinuous` call. We may clear the interrupted status set by `stop`, but it doesn't affect the query termination because `runActivatedStream` will check `state` and exit accordingly. I also updated the clean up codes to make sure exceptions thrown from `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` won't stop the clean up. Jenkins Closes #24034 from zsxwing/SPARK-27111. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> (cherry picked from commit 6e1c0827ece1cdc615196e60cb11c76b917b8eeb) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 09 March 2019, 22:33:34 UTC
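A hedged sketch of the clean-up pattern described above, with an illustrative signature rather than the actual `ContinuousExecution` code: run the shutdown steps even if a late interrupt arrives, and clear the interrupted flag afterwards so it cannot leak into the next run.

```scala
// Hedged sketch; stopEpochWrites stands in for the real RPC-based clean-up step.
def cleanUpAfterQuery(stopEpochWrites: () => Unit): Unit = {
  try {
    // Run the clean-up even if an epoch-related interrupt arrived while stopping.
    stopEpochWrites()
  } catch {
    case _: InterruptedException =>
      // The interrupt came from epoch coordination rather than stop();
      // don't let it abort the remaining clean-up.
  } finally {
    // Clear the interrupted status so it cannot affect the next runContinuous call.
    Thread.interrupted()
  }
}
```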
c1b6fe4 [SPARK-27080][SQL] bug fix: mergeWithMetastoreSchema with uniform lower case comparison When reading a Parquet file and merging the metastore schema with the file schema, we should compare field names using a uniform case. The current implementation uses lowercase comparison but missed one place; this patch fixes it. Unit test Closes #24001 from codeborui/mergeSchemaBugFix. Authored-by: CodeGod <> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a29df5fa02111f57965be2ab5e208f5c815265fe) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 March 2019, 13:30:55 UTC
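The comparison rule can be illustrated with a small, hypothetical helper (the actual fix is in `mergeWithMetastoreSchema`; this is only a sketch of the idea): index the file-schema fields by their lower-cased names before looking them up against the metastore schema.

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical illustration only: prefer the metastore field, but look up the
// corresponding Parquet field by comparing names case-insensitively.
def mergeCaseInsensitively(metastoreSchema: StructType, parquetSchema: StructType): StructType = {
  val parquetByLowerName: Map[String, StructField] =
    parquetSchema.fields.map(f => f.name.toLowerCase -> f).toMap
  StructType(metastoreSchema.fields.map { ms =>
    parquetByLowerName.getOrElse(ms.name.toLowerCase, ms)
  })
}
```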
0297ff5 Preparing development version 2.4.2-SNAPSHOT 08 March 2019, 20:38:02 UTC
e87fe15 Preparing Spark release v2.4.1-rc7 08 March 2019, 20:37:43 UTC
216eeec [SPARK-26604][CORE][BACKPORT-2.4] Clean up channel registration for StreamManager ## What changes were proposed in this pull request? This is mostly a clean backport of https://github.com/apache/spark/pull/23521 to branch-2.4 ## How was this patch tested? I've tested this with a hack in `TransportRequestHandler` to force `ChunkFetchRequest` to get dropped. Then making a number of `ExternalShuffleClient.fetchChunk` requests (which `OpenBlocks` then `ChunkFetchRequest`) and closing out of my test harness. A heap dump later reveals that the `StreamState` references are unreachable. I haven't run this through the unit test suite, but doing that now. Wanted to get this up as I think folks are waiting for it for 2.4.1 Closes #24013 from abellina/SPARK-26604_cherry_pick_2_4. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Alessandro Bellina <abellina@yahoo-inc.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 08 March 2019, 03:48:20 UTC
f7ad4ff [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats ## What changes were proposed in this pull request? `CodeGenerator.updateAndGetCompilationStats` throws an unsupported-operation exception when the code size statistics are empty. This PR adds a check for whether the statistics are empty. ## How was this patch tested? Pass Jenkins. Closes #23947 from maropu/SPARK-21871-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 07 March 2019, 08:38:48 UTC
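A minimal sketch of the guard (names are illustrative, not the actual `CodeGenerator` internals): taking `max` of an empty collection throws, so check for emptiness first.

```scala
// Illustrative guard only: summarize per-method code sizes defensively.
def maxMethodCodeSize(methodCodeSizes: Seq[Int]): Int =
  if (methodCodeSizes.isEmpty) 0 else methodCodeSizes.max
```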
9702915 [SPARK-27078][SQL] Fix NoSuchFieldError when read Hive materialized views ## What changes were proposed in this pull request? This pr fix `NoSuchFieldError` when reading Hive materialized views from Hive 2.3.4. How to reproduce: Hive side: ```sql CREATE TABLE materialized_view_tbl (key INT); CREATE MATERIALIZED VIEW view_1 DISABLE REWRITE AS SELECT * FROM materialized_view_tbl; ``` Spark side: ```java bin/spark-sql --conf spark.sql.hive.metastore.version=2.3.4 --conf spark.sql.hive.metastore.jars=maven spark-sql> select * from view_1; 19/03/05 19:55:37 ERROR SparkSQLDriver: Failed in [select * from view_1] java.lang.NoSuchFieldError: INDEX_TABLE at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$3(HiveClientImpl.scala:438) at scala.Option.map(Option.scala:163) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$1(HiveClientImpl.scala:370) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:277) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:215) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:214) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:260) at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:368) ``` ## How was this patch tested? unit tests Closes #23984 from wangyum/SPARK-24360. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 32848eecc55946ad91e62e231d2e310a0270a63d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 March 2019, 00:58:03 UTC
35381dd [SPARK-27019][SQL][WEBUI] onJobStart happens after onExecutionEnd shouldn't overwrite kvstore ## What changes were proposed in this pull request? Currently, when event reordering happens, in particular when an onJobStart event comes after an onExecutionEnd event, the SQL page in the UI displays incorrectly (for example, the test mentioned in the JIRA; the issue also occurs randomly when a TPCDS query fails due to a broadcast timeout, etc.). The reason is that in SQLAppStatusListener we remove the liveExecutions entry once the execution ends, so if a jobStart event comes after that, we create a new liveExecution entry for that execId. This eventually overwrites the kvstore and the UI displays confusing entries. ## How was this patch tested? Added a UT, and also manually tested with the event log of the failed query provided in the JIRA. Before fix: ![screenshot from 2019-03-03 03-05-52](https://user-images.githubusercontent.com/23054875/53687929-53e2b800-3d61-11e9-9dca-620fa41e605c.png) After fix: ![screenshot from 2019-03-03 02-40-18](https://user-images.githubusercontent.com/23054875/53687928-4f1e0400-3d61-11e9-86aa-584646ac68f9.png) Closes #23939 from shahidki31/SPARK-27019. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 62fd133f744ab2d1aa3c409165914b5940e4d328) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 06 March 2019, 22:02:45 UTC
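A hedged sketch of the idea with hypothetical names (the real change is in the SQL status listener): remember which executions have already ended so a late job-start event does not create a fresh live entry that later overwrites the stored one.

```scala
import scala.collection.mutable

// Simplified, hypothetical listener state; not the actual Spark class.
class ExecutionTracker {
  private val liveExecutions = mutable.Map[Long, String]()  // executionId -> live UI state
  private val endedExecutions = mutable.Set[Long]()         // executions already finalized

  def onExecutionEnd(executionId: Long): Unit = {
    liveExecutions.remove(executionId)
    endedExecutions += executionId
  }

  def onJobStart(executionId: Long): Unit = {
    // Guard: a reordered job-start for an already-ended execution must not
    // re-create a live entry that would later overwrite the stored result.
    if (!endedExecutions.contains(executionId)) {
      liveExecutions(executionId) = "live execution state"
    }
  }
}
```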
7df5aa6 [SPARK-27065][CORE] avoid more than one active task set managers for a stage ## What changes were proposed in this pull request? This is another attempt to fix the more-than-one-active-task-set-managers bug. https://github.com/apache/spark/pull/17208 is the first attempt. It marks the TSM as zombie before sending a task completion event to DAGScheduler. This is necessary, because when the DAGScheduler gets the task completion event, and it's for the last partition, then the stage is finished. However, if it's a shuffle stage and it has missing map outputs, DAGScheduler will resubmit it (see the [code](https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1416-L1422)) and create a new TSM for this stage. This leads to more than one active TSM for a stage and a failure. This fix has a hole: let's say a stage has 10 partitions and 2 task set managers: TSM1 (zombie) and TSM2 (active). TSM1 has a running task for partition 10 and it completes. TSM2 finishes tasks for partitions 1-9, and thinks it is still active because it hasn't finished partition 10 yet. However, DAGScheduler gets task completion events for all the 10 partitions and thinks the stage is finished. Then the same problem occurs: DAGScheduler may resubmit the stage and cause the more-than-one-active-TSM error. https://github.com/apache/spark/pull/21131 fixed this hole by notifying all the task set managers when a task finishes. For the above case, TSM2 will know that partition 10 is already completed, so it can mark itself as zombie after partitions 1-9 are completed. However, #21131 still has a hole: TSM2 may be created after the task from TSM1 is completed. Then TSM2 can't get notified about the task completion, which again leads to the more-than-one-active-TSM error. #22806 and #23871 were created to fix this hole. However the fix is complicated and there are still ongoing discussions. This PR proposes a simple fix, which is easy to backport: mark all existing task set managers as zombie when trying to create a new task set manager. After this PR, #21131 is still necessary, to avoid launching unnecessary tasks and fix [SPARK-25250](https://issues.apache.org/jira/browse/SPARK-25250). #22806 and #23871 are its followups to fix the hole. ## How was this patch tested? existing tests. Closes #23927 from cloud-fan/scheduler. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> (cherry picked from commit cb20fbc43e7f54af1ed30b9eb6d76ca50b4eb750) Signed-off-by: Imran Rashid <irashid@cloudera.com> 06 March 2019, 18:01:07 UTC
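The proposed fix can be sketched with heavily simplified, hypothetical types (the real change is in the scheduler's task-set-manager creation path): before registering a new manager for a stage, mark every existing manager of that stage as zombie.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for the scheduler types; not the real classes.
class TaskSetManagerStub(val stageId: Int) { var isZombie: Boolean = false }

class SchedulerStub {
  private val managersByStage = mutable.Map[Int, mutable.Buffer[TaskSetManagerStub]]()

  def createTaskSetManager(stageId: Int): TaskSetManagerStub = {
    // Mark every existing manager for this stage as zombie first, so at most one
    // manager per stage can ever be active.
    managersByStage.getOrElse(stageId, mutable.Buffer.empty[TaskSetManagerStub])
      .foreach(_.isZombie = true)
    val manager = new TaskSetManagerStub(stageId)
    managersByStage.getOrElseUpdate(stageId, mutable.Buffer.empty[TaskSetManagerStub]) += manager
    manager
  }
}
```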
db86ccb [SPARK-23433][SPARK-25250][CORE] Later created TaskSet should learn about the finished partitions ## What changes were proposed in this pull request? This is an optional solution for #22806. #21131 first implemented that a previously completed task from a zombie TaskSetManager could also mark the partition as completed in the active TaskSetManager, based on the assumption that an active TaskSetManager always exists for that stage when this happens. But that's not always true, as an active TaskSetManager may not have been created yet when a previous task succeeds, which is why #22806 hit the issue. This PR extends #21131's behavior by adding `stageIdToFinishedPartitions` to TaskSchedulerImpl, which records the finished partitions whenever a task (from a zombie or active manager) succeeds. Thus, a later created active TaskSetManager can also learn about the finished partitions by looking into `stageIdToFinishedPartitions` and won't launch any duplicate tasks. ## How was this patch tested? Added tests. Closes #23871 from Ngone51/dev-23433-25250. Lead-authored-by: wuyi <ngone_5451@163.com> Co-authored-by: Ngone51 <ngone_5451@163.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> (cherry picked from commit e5c61436a5720f13eb6d530ebf80635522bd64c6) Signed-off-by: Imran Rashid <irashid@cloudera.com> 06 March 2019, 17:53:39 UTC
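Under the same simplified assumptions (hypothetical names, not the actual `TaskSchedulerImpl` code), the bookkeeping described here amounts to recording finished partitions per stage so a manager created later can skip them.

```scala
import scala.collection.mutable

// Hypothetical, simplified bookkeeping sketch.
class FinishedPartitionTracker {
  private val stageIdToFinishedPartitions = mutable.Map[Int, mutable.HashSet[Int]]()

  // Record a finished partition whenever any task set (zombie or active) completes it.
  def markFinished(stageId: Int, partition: Int): Unit =
    stageIdToFinishedPartitions.getOrElseUpdate(stageId, mutable.HashSet.empty[Int]) += partition

  // A task set manager created later consults this to avoid launching duplicate tasks.
  def pendingPartitions(stageId: Int, numPartitions: Int): Seq[Int] = {
    val finished = stageIdToFinishedPartitions.getOrElse(stageId, mutable.HashSet.empty[Int])
    (0 until numPartitions).filterNot(finished.contains)
  }
}
```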
5ec4563 [SPARK-24669][SQL] Invalidate tables in case of DROP DATABASE CASCADE ## What changes were proposed in this pull request? Before dropping a database, refresh its tables so that all cached entries associated with those tables are invalidated. We already follow the same approach when dropping a single table. A UT is added. Closes #23905 from Udbhav30/SPARK-24669. Authored-by: Udbhav30 <u.agrawal30@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 9bddf7180e9e76e1cabc580eee23962dd66f84c3) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 March 2019, 17:07:57 UTC
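Roughly, the effect is what a user could previously only achieve by hand; a hedged sketch using the public Catalog API for illustration (database name and session setup are hypothetical, and the actual fix lives in the internal catalog code path):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("drop-db-example").getOrCreate()

// Illustrative only: refresh every table of the database before dropping it,
// which is effectively what the patched DROP DATABASE ... CASCADE now ensures.
val db = "some_db"  // hypothetical database name
spark.catalog.listTables(db).collect().foreach { t =>
  spark.catalog.refreshTable(s"$db.${t.name}")
}
spark.sql(s"DROP DATABASE $db CASCADE")
```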
b583bfe [SPARK-26932][DOC] Add a warning for Hive 2.1.1 ORC reader issue Hive 2.1.1 cannot read ORC tables created by Spark 2.4.0 by default, so this adds the information to sql-migration-guide-upgrade.md. For details see: [SPARK-26932](https://issues.apache.org/jira/browse/SPARK-26932) Tested with a doc build. Closes #23944 from haiboself/SPARK-26932. Authored-by: Bo Hai <haibo-self@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c27caead43423d1f994f42502496d57ea8389dc0) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 March 2019, 20:08:07 UTC
498fb70 [MINOR][DOCS] Clarify that Spark apps should mark Spark as a 'provided' dependency, not package it ## What changes were proposed in this pull request? Spark apps do not need to package Spark. In fact it can cause problems in some cases. Our examples should show depending on Spark as a 'provided' dependency. Packaging Spark makes the app much bigger by tens of megabytes. It can also bring in conflicting dependencies that wouldn't otherwise be a problem. https://issues.apache.org/jira/browse/SPARK-26146 was what reminded me of this. ## How was this patch tested? Doc build Closes #23938 from srowen/Provided. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 39092236819da097e9c8a3b2fa975105f08ae5b9) Signed-off-by: Sean Owen <sean.owen@databricks.com> 05 March 2019, 14:27:03 UTC
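For example, in an sbt build the Spark artifacts would be declared with `provided` scope so they are available at compile time but not packaged into the application jar (coordinates and versions here are illustrative):

```scala
// build.sbt (illustrative versions)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.1" % "provided"
)
```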
ae462b1 [SPARK-27046][DSTREAMS] Remove SPARK-19185 related references from documentation ## What changes were proposed in this pull request? SPARK-19185 is resolved so the reference can be removed from the documentation. ## How was this patch tested? cd docs/ SKIP_API=1 jekyll build Manual webpage check. Closes #23959 from gaborgsomogyi/SPARK-27046. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 5252d8b9872cbf200651b0bb7b8c6edd649ebb58) Signed-off-by: Sean Owen <sean.owen@databricks.com> 04 March 2019, 15:32:19 UTC
3336a21 [SPARK-26990][SQL][BACKPORT-2.4] FileIndex: use user specified field names if possible ## What changes were proposed in this pull request? Back-port of #23894 to branch-2.4. With the following file structure: ``` /tmp/data └── a=5 ``` In the previous release: ``` scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema root |-- ID: long (nullable = true) |-- A: integer (nullable = true) ``` While in current code: ``` scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema root |-- ID: long (nullable = true) |-- a: integer (nullable = true) ``` We can see that the partition column name `a` is different from `A` as the user specified. This PR is to fix the case and make it more user-friendly. Closes #23894 from gengliangwang/fileIndexSchema. Authored-by: Gengliang Wang <gengliang.wangdatabricks.com> Signed-off-by: Wenchen Fan <wenchendatabricks.com> ## How was this patch tested? Unit test Closes #23909 from bersprockets/backport-SPARK-26990. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 February 2019, 01:37:07 UTC
b031f4a [MINOR][BUILD] Update all checkstyle dtd to use "https://checkstyle.org" ## What changes were proposed in this pull request? Below build failed with Java checkstyle test, but instead of violation it shows FileNotFound on dtd file. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102751/ Looks like the link of dtd file is dead `http://www.puppycrawl.com/dtds/configuration_1_3.dtd`. This patch updates the dtd link to "https://checkstyle.org/dtds/" given checkstyle repository also updated the URL path. https://github.com/checkstyle/checkstyle/issues/5601 ## How was this patch tested? Checked the new links. Closes #23887 from HeartSaVioR/java-checkstyle-dtd-change-url. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit c5de804093540509929f6de211dbbe644b33e6db) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 February 2019, 19:26:17 UTC
073c47b Preparing development version 2.4.2-SNAPSHOT 22 February 2019, 22:54:37 UTC
eb2af24 Preparing Spark release v2.4.1-rc5 22 February 2019, 22:54:15 UTC
ef67be3 [SPARK-26950][SQL][TEST] Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values ## What changes were proposed in this pull request? Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, but there exist more NaN values with different binary representations. ```scala scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array res1: Array[Byte] = Array(127, -64, 0, 0) scala> val x = java.lang.Float.intBitsToFloat(-6966608) x: Float = NaN scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array res2: Array[Byte] = Array(-1, -107, -78, -80) ``` Since users can have these values, `RandomDataGenerator` generates these NaN values. However, this causes `checkEvaluationWithUnsafeProjection` failures due to differences in the `UnsafeRow` binary representation. The following is the UT failure instance. This PR aims to fix this UT flakiness. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/ ## How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #23851 from dongjoon-hyun/SPARK-26950. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ffef3d40741b0be321421aa52a6e17a26d89f541) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 February 2019, 04:27:17 UTC
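A hedged sketch of the normalization the title describes, with a hypothetical helper name (the actual change is inside `RandomDataGenerator`): draw an arbitrary bit pattern but collapse every NaN to the canonical one.

```scala
import scala.util.Random

// Hypothetical helper: generate a float from random bits, but collapse every
// NaN bit pattern to the single predefined Float.NaN.
def nextFloatCanonicalNaN(rng: Random): Float = {
  val f = java.lang.Float.intBitsToFloat(rng.nextInt())
  if (f.isNaN) Float.NaN else f
}
```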
b403612 Revert "[R][BACKPORT-2.3] update package description" This reverts commit 8d68d54f2e2cbbe55a4bb87c2216cff896add517. 22 February 2019, 02:14:56 UTC
8d68d54 [R][BACKPORT-2.3] update package description doesn't port cleanly to 2.3. we need this in branch-2.4 and branch-2.3 Closes #23861 from felixcheung/2.3rdesc. Authored-by: Felix Cheung <felixcheung_m@hotmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 36db45d5b90ddc3ce54febff2ed41cd29c0a8a04) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 February 2019, 02:13:38 UTC
3282544 Preparing development version 2.4.2-SNAPSHOT 21 February 2019, 23:02:17 UTC
79c1f7e Preparing Spark release v2.4.1-rc4 21 February 2019, 23:01:58 UTC
d857630 [R][BACKPORT-2.4] update package description #23852 doesn't port cleanly to 2.4. we need this in branch-2.4 and branch-2.3 Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #23860 from felixcheung/2.4rdesc. 21 February 2019, 16:42:15 UTC
0926f49 Preparing development version 2.4.2-SNAPSHOT 21 February 2019, 00:46:07 UTC
061185b Preparing Spark release v2.4.1-rc3 21 February 2019, 00:45:49 UTC
274142b [SPARK-26859][SQL] Fix field writer index bug in non-vectorized ORC deserializer ## What changes were proposed in this pull request? This happens in a schema evolution use case only when a user specifies the schema manually and use non-vectorized ORC deserializer code path. There is a bug in `OrcDeserializer.scala` that results in `null`s being set at the wrong column position, and for state from previous records to remain uncleared in next records. There are more details for when exactly the bug gets triggered and what the outcome is in the [JIRA issue](https://jira.apache.org/jira/browse/SPARK-26859). The high-level summary is that this bug results in severe data correctness issues, but fortunately the set of conditions to expose the bug are complicated and make the surface area somewhat small. This change fixes the problem and adds a respective test. ## How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #23766 from IvanVergiliev/fix-orc-deserializer. Lead-authored-by: Ivan Vergiliev <ivan.vergiliev@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 096552ae4d6fcef5e20c54384a2687db41ba2fa1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 February 2019, 13:53:40 UTC
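The affected read path can be sketched as follows; the path, schema, and column names are illustrative only, and the exact trigger conditions are more involved (see the JIRA): a manually specified schema read through the non-vectorized ORC deserializer.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-user-schema").getOrCreate()

// Illustrative shape of the affected scenario: force the non-vectorized ORC reader
// and supply the (evolved) schema manually instead of relying on schema inference.
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
val df = spark.read
  .schema("a INT, b STRING, c DOUBLE")  // hypothetical user-specified schema
  .orc("/path/to/orc/table")            // hypothetical path
df.show()
```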
4c60056 Preparing development version 2.4.2-SNAPSHOT 19 February 2019, 21:54:45 UTC
229ad52 Preparing Spark release v2.4.1-rc2 19 February 2019, 21:54:26 UTC
383b662 [MINOR][DOCS] Fix the update rule in StreamingKMeansModel documentation ## What changes were proposed in this pull request? The formatting for the update rule (in the documentation) now appears as ![image](https://user-images.githubusercontent.com/14948437/52933807-5a0c7980-3309-11e9-8573-642a73e77c26.png) instead of ![image](https://user-images.githubusercontent.com/14948437/52933897-a8ba1380-3309-11e9-8e16-e47c27b4a044.png) Closes #23819 from joelgenter/patch-1. Authored-by: joelgenter <joelgenter@outlook.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 885aa553c5e8f478b370f8a733102b67f6cd2d99) Signed-off-by: Sean Owen <sean.owen@databricks.com> 19 February 2019, 14:41:29 UTC
633de74 [SPARK-26740][SQL][BRANCH-2.4] Read timestamp/date column stats written by Spark 3.0 ## What changes were proposed in this pull request? - Backport of #23662 to `branch-2.4` - Added `Timestamp`/`DateFormatter` - Set version of column stats to `1` to keep backward compatibility with previous versions ## How was this patch tested? The changes were tested by `StatisticsCollectionSuite` and by `StatisticsSuite`. Closes #23809 from MaxGekk/column-stats-time-date-2.4. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 February 2019, 03:46:42 UTC
094cabc [SPARK-26897][SQL][TEST][FOLLOW-UP] Remove workaround for 2.2.0 and 2.1.x in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? This PR just removes the workaround for 2.2.0 and 2.1.x in HiveExternalCatalogVersionsSuite. ## How was this patch tested? Pass the Jenkins. Closes #23817 from maropu/SPARK-26607-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e2b8cc65cd579374ddbd70b93c9fcefe9b8873d9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 February 2019, 03:25:16 UTC