https://github.com/apache/spark

8c6f815 Preparing Spark release v2.3.4-rc1 25 August 2019, 14:38:17 UTC
adb5255 [SPARK-26895][CORE][2.3] prepareSubmitEnvironment should be called within doAs for proxy users This also includes the follow-up fix by Hyukjin Kwon: [SPARK-26895][CORE][FOLLOW-UP] Uninitializing log after `prepareSubmitEnvironment` in SparkSubmit `prepareSubmitEnvironment` performs globbing that will fail in the case where a proxy user (`--proxy-user`) doesn't have permission to the file. This is a bug also with 2.3, so we should backport, as currently you can't launch an application that for instance is passing a file under `--archives`, and that file is owned by the target user. The solution is to call `prepareSubmitEnvironment` within a doAs context if proxying. Manual tests running with `--proxy-user` and `--archives`, before and after, showing that the globbing is successful when the resource is owned by the target user. I've looked at writing unit tests, but I am not sure I can do that cleanly (perhaps with a custom FileSystem). Open to ideas. Closes #25517 from vanzin/SPARK-26895-2.3. Lead-authored-by: Alessandro Bellina <abellina@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 August 2019, 07:39:36 UTC
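A minimal Scala sketch of the doAs pattern described in the entry above, assuming Hadoop's `UserGroupInformation` API; this is not the actual SparkSubmit change, and `prepareEnvironment` is a hypothetical stand-in for `prepareSubmitEnvironment`:
```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Run a block as the proxy user so that filesystem operations (such as globbing
// of --archives/--files) are performed with that user's permissions.
def withProxyUser[T](proxyUser: String)(body: => T): T = {
  val ugi = UserGroupInformation.createProxyUser(proxyUser, UserGroupInformation.getCurrentUser)
  ugi.doAs(new PrivilegedExceptionAction[T] {
    override def run(): T = body
  })
}

// Hypothetical usage: prepareEnvironment() stands in for prepareSubmitEnvironment.
// val env = withProxyUser("targetUser") { prepareEnvironment() }
```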
5ce02fc [SPARK-28780][ML][2.3] deprecate LinearSVCModel.setWeightCol ### What changes were proposed in this pull request? deprecate `LinearSVCModel.setWeightCol`, and make it a no-op ### Why are the changes needed? `LinearSVCModel` should not provide this setter, moreover, this method is wrongly defined. ### Does this PR introduce any user-facing change? no, this method is only deprecated ### How was this patch tested? existing suites Closes #25546 from zhengruifeng/svc_model_deprecate_setWeight. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 August 2019, 07:20:00 UTC
c04f796 [SPARK-28844][SQL] Fix typo in SQLConf FILE_COMRESSION_FACTOR Fix minor typo in SQLConf. `FILE_COMRESSION_FACTOR` -> `FILE_COMPRESSION_FACTOR` Make conf more understandable. No. (`spark.sql.sources.fileCompressionFactor` is unchanged.) Pass the Jenkins with the existing tests. Closes #25538 from triplesheep/TYPO-FIX. Authored-by: triplesheep <triplesheep0419@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 48578a41b50308185c7eefd6e562bd0f6575a921) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 August 2019, 07:10:13 UTC
e6fd838 [SPARK-28699][CORE][2.3] Fix a corner case for aborting indeterminate stage ### What changes were proposed in this pull request? Change the logic for collecting the indeterminate stage: we should look at stages from mapStage, not failedStage, when handling FetchFailed. ### Why are the changes needed? In the fetch-failed error handling logic, the indeterminate stage was originally collected from the fetch-failed stage. When the fetch failure happens in the first task of that stage, this logic causes the indeterminate stage to be resubmitted only partially, which can eventually lead to a correctness bug. ### Does this PR introduce any user-facing change? It makes the corner case of the indeterminate stage abort as expected. ### How was this patch tested? New UT in DAGSchedulerSuite. Run the integrated test below with `local-cluster[5, 2, 5120]` and set `spark.sql.execution.sortBeforeRepartition`=false; it will abort the indeterminate stage as expected: ``` import scala.sys.process._ import org.apache.spark.TaskContext val res = spark.range(0, 10000 * 10000, 1).map{ x => (x % 1000, x)} // kill an executor in the stage that performs repartition(239) val df = res.repartition(113).map{ x => (x._1 + 1, x._2)}.repartition(239).map { x => if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && TaskContext.get.stageAttemptNumber == 0) { throw new Exception("pkill -f -n java".!!) } x } val r2 = df.distinct.count() ``` Closes #25508 from xuanyuanking/spark-28699-backport-2.3. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 August 2019, 06:35:32 UTC
79cda23 [SPARK-28699][SQL] Disable using radix sort for ShuffleExchangeExec in repartition case Disable using radix sort in ShuffleExchangeExec when we do repartition. In #20393, we fixed the indeterministic result in the shuffle repartition case by performing a local sort before repartitioning. But for the newly added sort operation, we use radix sort which is wrong because binary data can't be compared by only the prefix. This makes the sort unstable and fails to solve the indeterminate shuffle output problem. Fix the correctness bug caused by repartition after a shuffle. Yes, user will get the right result in the case of repartition stage rerun. Test with `local-cluster[5, 2, 5120]`, use the integrated test below, it can return a right answer 100000000. ``` import scala.sys.process._ import org.apache.spark.TaskContext val res = spark.range(0, 10000 * 10000, 1).map{ x => (x % 1000, x)} // kill an executor in the stage that performs repartition(239) val df = res.repartition(113).map{ x => (x._1 + 1, x._2)}.repartition(239).map { x => if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && TaskContext.get.stageAttemptNumber == 0) { throw new Exception("pkill -f -n java".!!) } x } val r2 = df.distinct.count() ``` Closes #25491 from xuanyuanking/SPARK-28699-fix. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 2d9cc42aa83beb5952bb44d3cd0327d4432d385e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 August 2019, 18:01:17 UTC
19f7b74 [SPARK-28777][PYTHON][DOCS] Fix format_string doc string with the correct parameters ### What changes were proposed in this pull request? The parameters doc string of the function format_string was changed from _col_, _d_ to _format_, _cols_, which is what the actual function declaration states. ### Why are the changes needed? The parameters stated by the documentation were inaccurate. ### Does this PR introduce any user-facing change? Yes. **BEFORE** ![before](https://user-images.githubusercontent.com/9700541/63310013-e21a0e80-c2ad-11e9-806b-1d272c5cde12.png) **AFTER** ![after](https://user-images.githubusercontent.com/9700541/63315812-6b870c00-c2c1-11e9-8165-82782628cd1a.png) ### How was this patch tested? N/A: documentation only Closes #25506 from darrentirto/SPARK-28777. Authored-by: darrentirto <darrentirto@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a787bc28840eafae53a08137a53ea56500bfd675) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 August 2019, 03:45:27 UTC
8b1067e [SPARK-28775][CORE][TESTS] Skip date 8633 in Kwajalein due to changes in tzdata2018i that only some JDK 8s use ### What changes were proposed in this pull request? Some newer JDKs use the tzdata2018i database, which changes how certain (obscure) historical dates and timezones are handled. As before, we can pretty safely ignore these in tests, as the value may vary by JDK. ### Why are the changes needed? The test otherwise fails using, for example, JDK 1.8.0_222. https://bugs.openjdk.java.net/browse/JDK-8215982 has a full list of JDKs which have this. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests Closes #25504 from srowen/SPARK-28775. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 3b4e345fa1afa0d4004988f8800b63150c305fd4) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 August 2019, 00:55:12 UTC
d97d0ac Revert "[SPARK-25474][SQL][2.3] Support `spark.sql.statistics.fallBackToHdfs` in data source tables" This reverts commit 416aba48b40d931da97e549fcffa86c47790e5f2. 18 August 2019, 14:45:24 UTC
8731194 [SPARK-28766][R][DOC] Fix CRAN incoming feasibility warning on invalid URL ### What changes were proposed in this pull request? This updates an URL in R doc to fix `Had CRAN check errors; see logs`. ### Why are the changes needed? Currently, this invalid link causes a warning during CRAN incoming feasibility. We had better fix this before submitting `3.0.0/2.4.4/2.3.4`. **BEFORE** ``` * checking CRAN incoming feasibility ... NOTE Maintainer: ‘Shivaram Venkataraman <shivaramcs.berkeley.edu>’ Found the following (possibly) invalid URLs: URL: https://wiki.apache.org/hadoop/HCFS (moved to https://cwiki.apache.org/confluence/display/hadoop/HCFS) From: man/spark.addFile.Rd Status: 404 Message: Not Found ``` **AFTER** ``` * checking CRAN incoming feasibility ... Note_to_CRAN_maintainers Maintainer: ‘Shivaram Venkataraman <shivaramcs.berkeley.edu>’ ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Check the warning message during R testing. ``` $ R/install-dev.sh $ R/run-tests.sh ``` Closes #25483 from dongjoon-hyun/SPARK-28766. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 5756a47a9fafca2d0b31de2b2374429f73b6e5e2) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 August 2019, 18:12:12 UTC
e6a008f [MINOR][DOC] Use `Java 8` instead of `Java 8+` as a running environment After Apache Spark 3.0.0 supports JDK11 officially, people will try JDK11 on old Spark releases (especially 2.4.4/2.3.4) in the same way because our document says `Java 8+`. We had better avoid that misleading situation. This PR aims to remove `+` from `Java 8+` in the documentation (master/2.4/2.3). Especially, 2.4.4 release and 2.3.4 release (cc kiszk ) On master branch, we will add JDK11 after [SPARK-24417.](https://issues.apache.org/jira/browse/SPARK-24417) This is a documentation only change. <img width="923" alt="java8" src="https://user-images.githubusercontent.com/9700541/63116589-e1504800-bf4e-11e9-8904-b160ec7a42c0.png"> Closes #25466 from dongjoon-hyun/SPARK-DOC-JDK8. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 123eb58d61ad7c7ebe540f1634088696a3cc85bc) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 August 2019, 18:24:54 UTC
31aed37 [SPARK-26859][SQL][BACKPORT-2.3] Fix field writer index bug in non-vectorized ORC deserializer ## What changes were proposed in this pull request? Back port of https://github.com/apache/spark/pull/23766 to branch-2.3 ## How was this patch tested? Added UT Closes #25384 from manuzhang/backport-pr-23766. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 August 2019, 21:55:18 UTC
e686178 [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fails on Python 3.7 This PR picks https://github.com/apache/spark/pull/25315 back up after removing the `Popen.wait` usage, which exists in Python 3 only. I misread the last test results and thought it had passed. Fix flaky test DaemonTests.do_termination_test, which fails on Python 3.7, by adding a sleep after the test connection to the daemon. Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fails on test "test_termination_sigterm", and we can see the daemon process does not exit. **After** Test passed Closes #25343 from HyukjinKwon/SPARK-28582. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b3394db1930b3c9f55438cb27bb2c584bf041f8e) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 August 2019, 01:35:59 UTC
d3a73df Revert "[SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7" This reverts commit 21ef0fdf7244a169ac0c3e701cb21c35f1038a5d. 02 August 2019, 17:07:13 UTC
21ef0fd [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fails on Python 3.7 Fix flaky test DaemonTests.do_termination_test, which fails on Python 3.7, by adding a sleep after the test connection to the daemon. Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fails on test "test_termination_sigterm", and we can see the daemon process does not exit. **After** Test passed Closes #25315 from WeichenXu123/fix_py37_daemon. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit fbeee0c5bcea32346b2279c5b67044f12e5faf7d) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 02 August 2019, 13:14:15 UTC
5a1e54d [SPARK-24352][CORE][TESTS] De-flake StandaloneDynamicAllocationSuite blacklist test The issue is that the test tried to stop an existing scheduler and replace it with a new one set up for the test. That can cause issues because both were sharing the same RpcEnv underneath, and unregistering RpcEndpoints is actually asynchronous (see comment in Dispatcher.unregisterRpcEndpoint). So that could lead to races where the new scheduler tried to register before the old one was fully unregistered. The updated test avoids the issue by using a separate RpcEnv / scheduler instance altogether, and also avoids a misleading NPE in the test logs. Closes #25318 from vanzin/SPARK-24352. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b3ffd8be14779cbb824d14b409f0a6eab93444ba) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 August 2019, 00:48:47 UTC
78d1bb1 [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress ## What changes were proposed in this pull request? The latest docs http://spark.apache.org/docs/latest/configuration.html contain the following description:

| Property Name | Default | Meaning |
| -- | -- | -- |
| spark.ui.showConsoleProgress | true | Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line. |

But the class `org.apache.spark.internal.config.UI` defines the config `spark.ui.showConsoleProgress` as below: ``` val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress") .doc("When true, show the progress bar in the console.") .booleanConf .createWithDefault(false) ``` So there is a small inconsistency here that may confuse readers. ## How was this patch tested? No UT needed. Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit dba4375359a2dfed1f009edc3b1bcf6b3253fe02) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 31 July 2019, 03:18:29 UTC
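Regardless of which default ends up documented, the behavior can be pinned explicitly; a small sketch (assuming a local session) of setting the flag before the context is created:
```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Set spark.ui.showConsoleProgress explicitly so the console progress bar does not
// depend on the default value, which differs between the docs and the code above.
val conf = new SparkConf().set("spark.ui.showConsoleProgress", "true")
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("console-progress-demo")
  .config(conf)
  .getOrCreate()
```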
998ac04 [SPARK-28156][SQL][BACKPORT-2.3] Self-join should not miss cached view Back-port of #24960 to branch-2.3. The issue is when self-join a cached view, only one side of join uses cached relation. The cause is in `ResolveReferences` we do deduplicate for a view to have new output attributes. Then in `AliasViewChild`, the rule adds extra project under a view. So it breaks cache matching. The fix is when dedup, we only dedup a view which has output different to its child plan. Otherwise, we dedup on the view's child plan. ```scala val df = Seq.tabulate(5) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", "c", "d") df.write.mode("overwrite").format("orc").saveAsTable("table1") sql("drop view if exists table1_vw") sql("create view table1_vw as select * from table1") val cachedView = sql("select a, b, c, d from table1_vw") cachedView.createOrReplaceTempView("cachedview") cachedView.persist() val queryDf = sql( s"""select leftside.a, leftside.b |from cachedview leftside |join cachedview rightside |on leftside.a = rightside.a """.stripMargin) ``` Query plan before this PR: ```scala == Physical Plan == *(2) Project [a#12664, b#12665] +- *(2) BroadcastHashJoin [a#12664], [a#12660], Inner, BuildRight :- *(2) Filter isnotnull(a#12664) : +- *(2) InMemoryTableScan [a#12664, b#12665], [isnotnull(a#12664)] : +- InMemoryRelation [a#12664, b#12665, c#12666, d#12667], StorageLevel(disk, memory, deserialized, 1 replicas) : +- *(1) FileScan orc default.table1[a#12660,b#12661,c#12662,d#12663] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryF ileIndex[file:/Users/viirya/repos/spark-1/sql/core/spark-warehouse/org.apache.spark.sql...., PartitionFilters: [], PushedFilters: [], ReadSchema: struc t<a:int,b:int,c:int,d:int> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) +- *(1) Project [a#12660] +- *(1) Filter isnotnull(a#12660) +- *(1) FileScan orc default.table1[a#12660] Batched: true, DataFilters: [isnotnull(a#12660)], Format: ORC, Location: InMemoryFileIndex[fil e:/Users/viirya/repos/spark-1/sql/core/spark-warehouse/org.apache.spark.sql...., PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struc t<a:int> ``` Query plan after this PR: ```scala == Physical Plan == *(2) Project [a#12664, b#12665] +- *(2) BroadcastHashJoin [a#12664], [a#12692], Inner, BuildRight :- *(2) Filter isnotnull(a#12664) : +- *(2) InMemoryTableScan [a#12664, b#12665], [isnotnull(a#12664)] : +- InMemoryRelation [a#12664, b#12665, c#12666, d#12667], StorageLevel(disk, memory, deserialized, 1 replicas) : +- *(1) FileScan orc default.table1[a#12660,b#12661,c#12662,d#12663] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/Users/viirya/repos/spark-1/sql/core/spark-warehouse/org.apache.spark.sql...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int,b:int,c:int,d:int> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) +- *(1) Filter isnotnull(a#12692) +- *(1) InMemoryTableScan [a#12692], [isnotnull(a#12692)] +- InMemoryRelation [a#12692, b#12693, c#12694, d#12695], StorageLevel(disk, memory, deserialized, 1 replicas) +- *(1) FileScan orc default.table1[a#12660,b#12661,c#12662,d#12663] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex[file:/Users/viirya/repos/spark-1/sql/core/spark-warehouse/org.apache.spark.sql...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int,b:int,c:int,d:int> ``` Added test. 
Closes #25293 from bersprockets/selfjoin_23_noextra. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 July 2019, 07:40:51 UTC
416aba4 [SPARK-25474][SQL][2.3] Support `spark.sql.statistics.fallBackToHdfs` in data source tables ## What changes were proposed in this pull request? Backport the commit https://github.com/apache/spark/commit/485ae6d1818e8756a86da38d6aefc8f1dbde49c2 into SPARK-2.3 branch. ## How was this patch tested? Added tests Closes #25285 from shahidki31/branch-2.3. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 July 2019, 16:23:30 UTC
9fbacda [SPARK-28545][SQL] Add the hash map size to the directional log of ObjectAggregationIterator ## What changes were proposed in this pull request? `ObjectAggregationIterator` shows a directional info message to increase `spark.sql.objectHashAggregate.sortBased.fallbackThreshold` when the size of the in-memory hash map grows too large and it falls back to sort-based aggregation. However, we don't know how much we need to increase. This PR adds the size of the current in-memory hash map size to the log message. **BEFORE** ``` 15:21:41.669 Executor task launch worker for task 0 INFO ObjectAggregationIterator: Aggregation hash map reaches threshold capacity (2 entries), ... ``` **AFTER** ``` 15:20:05.742 Executor task launch worker for task 0 INFO ObjectAggregationIterator: Aggregation hash map size 2 reaches threshold capacity (2 entries), ... ``` ## How was this patch tested? Manual. For example, run `ObjectHashAggregateSuite.scala`'s `typed_count fallback to sort-based aggregation` and search the above message in `target/unit-tests.log`. Closes #25276 from dongjoon-hyun/SPARK-28545. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d943ee0a881540aa356cdce533b693baaf7c644f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 July 2019, 01:56:06 UTC
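With the map size now in the log, the threshold can be raised accordingly; a hedged example for spark-shell (the value 4096 is arbitrary):
```scala
// Raise the fallback threshold so object hash aggregation keeps more entries
// in memory before falling back to sort-based aggregation.
spark.conf.set("spark.sql.objectHashAggregate.sortBased.fallbackThreshold", "4096")
```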
abff292 [SPARK-28535][CORE][TEST] Slow down tasks to de-flake JobCancellationSuite This test tries to detect correct behavior in racy code, where the event thread is racing with the executor thread that's trying to kill the running task. If the event that signals the stage end arrives first, any delay in the delivery of the message to kill the task causes the code to rapidly process elements, and may cause the test to assert. Adding a 10ms delay in LocalSchedulerBackend before the task kill makes the test run through ~1000 elements. A longer delay can easily cause the 10000 elements to be processed. Instead, by adding a small delay (10ms) in the test code that processes elements, there's a much lower probability that the kill event will not arrive before the end; that leaves a window of 100s for the event to be delivered to the executor. And because each element only sleeps for 10ms, the test is not really slowed down at all. Closes #25270 from vanzin/SPARK-28535. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 7f84104b3981dc69238730e0bed7c8c5bd113d76) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 July 2019, 18:07:09 UTC
e466662 [SPARK-28430][UI] Fix stage table rendering when some tasks' metrics are missing ## What changes were proposed in this pull request? The Spark UI's stages table misrenders the input/output metrics columns when some tasks are missing input metrics. See the screenshot below for an example of the problem: ![image](https://user-images.githubusercontent.com/50748/61420042-a3abc100-a8b5-11e9-8a92-7986563ee712.png) This is because those columns' are defined as ```scala {if (hasInput(stage)) { metricInfo(task) { m => ... <td>....</td> } } ``` where `metricInfo` renders the node returned by the closure in case metrics are defined or returns `Nil` in case metrics are not defined. If metrics are undefined then we'll fail to render the empty `<td></td>` tag, causing columns to become misaligned as shown in the screenshot. To fix this, this patch changes this to ```scala {if (hasInput(stage)) { <td>{ metricInfo(task) { m => ... Unparsed(...) } }</td> } ``` which is an idiom that's already in use for the shuffle read / write columns. ## How was this patch tested? It isn't. I'm arguing for correctness because the modifications are consistent with rendering methods that work correctly for other columns. Closes #25183 from JoshRosen/joshrosen/fix-task-table-with-partial-io-metrics. Authored-by: Josh Rosen <rosenville@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 3776fbdfdeac07d191f231b29cf906cabdc6de3f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 July 2019, 20:16:11 UTC
10a8ee9 [SPARK-28418][PYTHON][SQL] Wait for event process in 'test_query_execution_listener_on_collect' It fixes a flaky test: ``` ERROR [0.164s]: test_query_execution_listener_on_collect (pyspark.sql.tests.test_dataframe.QueryExecutionListenerTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 758, in test_query_execution_listener_on_collect "The callback from the query execution listener should be called after 'collect'") AssertionError: The callback from the query execution listener should be called after 'collect' ``` Seems it can be failed because the event was somehow delayed but checked first. Manually. Closes #25177 from HyukjinKwon/SPARK-28418. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 66179fa8426324e11819e04af4cf3eabf9f2627f) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 17 July 2019, 09:48:53 UTC
ad807fe [SPARK-28404][SS] Fix negative timeout value in RateStreamContinuousPartitionReader ## What changes were proposed in this pull request? `System.currentTimeMillis` read two times in a loop in `RateStreamContinuousPartitionReader`. If the test machine is slow enough and it spends quite some time between the `while` condition check and the `Thread.sleep` then the timeout value is negative and throws `IllegalArgumentException`. In this PR I've fixed this issue. ## How was this patch tested? Existing unit tests. Closes #25162 from gaborgsomogyi/SPARK-28404. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 8f7ccc5e9c8be723947947c4130a48781bf6e355) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 July 2019, 18:01:33 UTC
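The general shape of the fix for this class of bug is to read the clock once and clamp the sleep; a minimal sketch, not the actual reader code:
```scala
// Compute the remaining time from a single clock read and never pass a
// negative value to Thread.sleep.
def sleepUntil(deadlineMs: Long): Unit = {
  val remainingMs = deadlineMs - System.currentTimeMillis()
  if (remainingMs > 0) Thread.sleep(remainingMs)
}
```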
53c903c [SPARK-28361][SQL][TEST] Test equality of generated code with id in class name A code gen test in WholeStageCodeGenSuite was flaky because it used the codegen metrics class to test if the generated code for equivalent plans was identical under a particular flag. This patch switches the test to compare the generated code directly. N/A Closes #25131 from gatorsmile/WholeStageCodegenSuite. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 60b89cf8097ff583a29a6a19f1db4afa780f3109) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 July 2019, 23:23:29 UTC
8821a8c [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed ## What changes were proposed in this pull request? `SizeBasedRollingPolicy.shouldRollover` returns false when the size is equal to `rolloverSizeBytes`. ```scala /** Should rollover if the next set of bytes is going to exceed the size limit */ def shouldRollover(bytesToBeWritten: Long): Boolean = { logDebug(s"$bytesToBeWritten + $bytesWrittenSinceRollover > $rolloverSizeBytes") bytesToBeWritten + bytesWrittenSinceRollover > rolloverSizeBytes } ``` - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107553/testReport/org.apache.spark.util/FileAppenderSuite/rolling_file_appender___size_based_rolling__compressed_/ ``` org.scalatest.exceptions.TestFailedException: 1000 was not less than 1000 ``` ## How was this patch tested? Pass the Jenkins with the updated test. Closes #25125 from dongjoon-hyun/SPARK-28357. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 1c29212394adcbde2de4f4dfdc43a1cf32671ae1) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 12 July 2019, 09:40:47 UTC
81bf44d [SPARK-28015][SQL] Check stringToDate() consumes entire input for the yyyy and yyyy-[m]m formats Fix `stringToDate()` for the formats `yyyy` and `yyyy-[m]m` that assumes there are no additional chars after the last components `yyyy` and `[m]m`. In the PR, I propose to check that entire input was consumed for the formats. After the fix, the input `1999 08 01` will be invalid because it matches to the pattern `yyyy` but the strings contains additional chars ` 08 01`. Since Spark 1.6.3 ~ 2.4.3, the behavior is the same. ``` spark-sql> SELECT CAST('1999 08 01' AS DATE); 1999-01-01 ``` This PR makes it return NULL like Hive. ``` spark-sql> SELECT CAST('1999 08 01' AS DATE); NULL ``` Added new checks to `DateTimeUtilsSuite` for the `1999 08 01` and `1999 08` inputs. Closes #25097 from MaxGekk/spark-28015-invalid-date-format. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 17974e269d52a96932bc0fa8d95e95a618379b86) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 July 2019, 02:54:37 UTC
523818a [SPARK-28335][DSTREAMS][TEST] DirectKafkaStreamSuite wait for Kafka async commit `DirectKafkaStreamSuite.offset recovery from kafka` commits offsets to Kafka with the `Consumer.commitAsync` API (and then reads them back). Since this API is asynchronous, it may send notifications late (or not at all). The test assumed that if the data was sent and collected then the offset must have been committed as well, which is not true. In this PR I've made the following modifications: * Wait for the async offset commit before the context is stopped * Added a commit-succeeded log to see whether it arrived at all * Use a `ConcurrentHashMap` for committed offsets because two threads use the variable (`JobGenerator` and `ScalaTest...`) Existing unit test in a loop + jenkins runs. Closes #25100 from gaborgsomogyi/SPARK-28335. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 579edf472822802285b5cd7d07f63503015eff5a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 July 2019, 16:38:19 UTC
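A generic sketch of the "wait for the async commit before asserting" idea, assuming a simple polling helper rather than the suite's actual implementation (`committedOffsets` and `topicPartition` are illustrative names):
```scala
// Poll a condition until it holds or the deadline passes, instead of assuming
// an asynchronous commit has already completed when the data shows up.
def waitUntil(timeoutMs: Long, pollMs: Long = 100)(condition: => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (!condition && System.currentTimeMillis() < deadline) {
    Thread.sleep(pollMs)
  }
  condition
}

// Hypothetical usage in a test:
// waitUntil(timeoutMs = 30000) { committedOffsets.containsKey(topicPartition) }
```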
74caacf [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows ## What changes were proposed in this pull request? When using SparkLauncher to submit applications **concurrently** with multiple threads under **Windows**, some apps would show that "The process cannot access the file because it is being used by another process" and remains in LOST state at the end. The issue can be reproduced by this [demo](https://issues.apache.org/jira/secure/attachment/12973920/Main.scala). After digging into the code, I find that, Windows cmd `%RANDOM%` would return the same number if we call it instantly(e.g. < 500ms) after last call. As a result, SparkLauncher would get same output file(spark-class-launcher-output-%RANDOM%.txt) for apps. Then, the following app would hit the issue when it tries to write the same file which has already been opened for writing by another app. We should make sure to generate unique output file for SparkLauncher on Windows to avoid this issue. ## How was this patch tested? Tested manually on Windows. Closes #25076 from Ngone51/SPARK-28302. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 925f620570a022ff8229bfde076e7dde6bf242df) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 09 July 2019, 06:50:57 UTC
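The actual fix changes the Windows launcher script, but the underlying idea is simply a collision-free per-launch file name; a hedged JVM-side sketch:
```scala
import java.nio.file.{Files, Path}

// Let the JDK pick a unique name for the launcher output file, so concurrent
// launches cannot collide even when started within the same millisecond.
def newLauncherOutputFile(dir: Path): Path =
  Files.createTempFile(dir, "spark-class-launcher-output-", ".txt")
```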
8d7a724 [SPARK-28308][CORE] CalendarInterval sub-second part should be padded before parsing The sub-second part of the interval should be padded before parsing. Currently, Spark gives a correct value only when there is 9 digits below `.`. ``` spark-sql> select interval '0 0:0:0.123456789' day to second; interval 123 milliseconds 456 microseconds spark-sql> select interval '0 0:0:0.12345678' day to second; interval 12 milliseconds 345 microseconds spark-sql> select interval '0 0:0:0.1234' day to second; interval 1 microseconds ``` Pass the Jenkins with the fixed test cases. Closes #25079 from dongjoon-hyun/SPARK-28308. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a5ff9221fc23cc758db228493d451f542591eff7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 072e0eb8881be7df7d5c81efa472b448b9d67e95) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 July 2019, 02:57:15 UTC
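A small sketch of the padding idea (not the actual `CalendarInterval` code): pad the fractional part to nine digits before converting, so shorter fractions keep their magnitude:
```scala
// ".123456789" -> 123456 microseconds; ".1234" -> 123400 microseconds.
def fractionToMicros(fraction: String): Long = {
  val nanos = (fraction + "000000000").take(9).toLong  // pad to 9 digits (nanoseconds)
  nanos / 1000
}
```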
d1875f6 [SPARK-28261][CORE] Fix client reuse test There is the following code in [TransportClientFactory#createClient](https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java#L150) ``` int clientIndex = rand.nextInt(numConnectionsPerPeer); TransportClient cachedClient = clientPool.clients[clientIndex]; ``` which chooses a client from its pool randomly. If we are unlucky, we might not get the max number of connections out, but fewer than that. To prove that I've tried out the following test: ```java @Test public void testRandom() { Random rand = new Random(); Set<Integer> clients = Collections.synchronizedSet(new HashSet<>()); long iterCounter = 0; while (true) { iterCounter++; int maxConnections = 4; clients.clear(); for (int i = 0; i < maxConnections * 10; i++) { int clientIndex = rand.nextInt(maxConnections); clients.add(clientIndex); } if (clients.size() != maxConnections) { System.err.println("Unexpected clients size (iterCounter=" + iterCounter + "): " + clients.size() + ", maxConnections: " + maxConnections); } if (iterCounter % 100000 == 0) { System.out.println("IterCounter: " + iterCounter); } } } ``` Result: ``` Unexpected clients size (iterCounter=22388): 3, maxConnections: 4 Unexpected clients size (iterCounter=36244): 3, maxConnections: 4 Unexpected clients size (iterCounter=85798): 3, maxConnections: 4 IterCounter: 100000 Unexpected clients size (iterCounter=97108): 3, maxConnections: 4 Unexpected clients size (iterCounter=119121): 3, maxConnections: 4 Unexpected clients size (iterCounter=129948): 3, maxConnections: 4 Unexpected clients size (iterCounter=173736): 3, maxConnections: 4 Unexpected clients size (iterCounter=178138): 3, maxConnections: 4 Unexpected clients size (iterCounter=195108): 3, maxConnections: 4 IterCounter: 200000 Unexpected clients size (iterCounter=209006): 3, maxConnections: 4 Unexpected clients size (iterCounter=217105): 3, maxConnections: 4 Unexpected clients size (iterCounter=222456): 3, maxConnections: 4 Unexpected clients size (iterCounter=226899): 3, maxConnections: 4 Unexpected clients size (iterCounter=229101): 3, maxConnections: 4 Unexpected clients size (iterCounter=253549): 3, maxConnections: 4 Unexpected clients size (iterCounter=277550): 3, maxConnections: 4 Unexpected clients size (iterCounter=289637): 3, maxConnections: 4 ... ``` In this PR I've adapted the test code not to have this flakiness. Additional (not committed) test + existing unit tests in a loop. Closes #25075 from gaborgsomogyi/SPARK-28261. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit e11a55827e7475aab77e8a4ea0baed7c14059908) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 July 2019, 18:14:58 UTC
a3d7c4f [MINOR] Add requestHeaderSize debug log ## What changes were proposed in this pull request? `requestHeaderSize` is added in https://github.com/apache/spark/pull/23090 and applies to Spark + History server UI as well. Without debug log it's hard to find out on which side what configuration is used. In this PR I've added a log message which prints out the value. ## How was this patch tested? Manually checked log files. Closes #25045 from gaborgsomogyi/SPARK-26118. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 0b6c2c259c1ed109a824b678c9ccbd9fd767d2fe) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 July 2019, 16:35:01 UTC
4fdbd87 [SPARK-28160][CORE] Fix a bug where the callback function may hang when an unchecked exception is missed This is very similar to #23590. `ByteBuffer.allocate` may throw `OutOfMemoryError` when the response is large but not enough memory is available. However, when this happens, `TransportClient.sendRpcSync` will just hang forever if the timeout is set to unlimited. This PR catches `Throwable` and uses the error to complete the `SettableFuture`. I tested in my IDE by setting the value of size to -1 to verify the result. Without this patch, it won't finish until the timeout (and may hang forever if the timeout is set to MAX_INT), or the expected `IllegalArgumentException` will be caught. ```java @Override public void onSuccess(ByteBuffer response) { try { int size = response.remaining(); ByteBuffer copy = ByteBuffer.allocate(size); // set size to -1 at runtime when debugging copy.put(response); // flip "copy" to make it readable copy.flip(); result.set(copy); } catch (Throwable t) { result.setException(t); } } ``` Closes #24964 from LantaoJin/SPARK-28160. Lead-authored-by: LantaoJin <jinlantao@gmail.com> Co-authored-by: lajin <lajin@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 0e421000e0ea2c090b6fab0201a6046afceec132) Signed-off-by: Sean Owen <sean.owen@databricks.com> 30 June 2019, 20:19:14 UTC
960375f [SPARK-28157][CORE][2.3] Make SHS clear KVStore `LogInfo`s for the blacklisted entries ## What changes were proposed in this pull request? In Spark 2.4.0/2.3.2/2.2.3, [SPARK-24948](https://issues.apache.org/jira/browse/SPARK-24948) delegated access permission checks to the file system and maintains a blacklist of all event log files that failed once at reading. The blacklisted log files are released back after `CLEAN_INTERVAL_S` seconds. However, the released files whose sizes don't change are ignored forever due to the `info.fileSize < entry.getLen()` condition (previously [here](https://github.com/apache/spark/commit/3c96937c7b1d7a010b630f4b98fd22dafc37808b#diff-a7befb99e7bd7e3ab5c46c2568aa5b3eR454) and now at [shouldReloadLog](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L571)), which always returns `false` when the size is the same as the existing value in `KVStore`. This is recovered only via an SHS restart. This PR aims to remove the existing entry from `KVStore` when it goes to the blacklist. ## How was this patch tested? Pass the Jenkins with the updated test case. Closes #24977 from dongjoon-hyun/SPARK-28157-2.3. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com> 27 June 2019, 16:58:39 UTC
a054a00 [SPARK-28164] Fix usage description of `start-slave.sh` ## What changes were proposed in this pull request? Updated the usage message in sbin/start-slave.sh; the <masterURL> argument was moved to the first position. ## How was this patch tested? Tested locally by starting a master, starting a slave with `./start-slave.sh spark://<IP>:<PORT> -c 1`, and opening a Spark shell with `./spark-shell --master spark://<IP>:<PORT>`. Closes #24974 from shivusondur/jira28164. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit bd232b98b470a609472a4ea8df912f8fad06ba06) Signed-off-by: Sean Owen <sean.owen@databricks.com> 26 June 2019, 17:43:26 UTC
3a0b0c1 Revert "[SPARK-28093][SPARK-28109][SQL][2.3] Fix TRIM/LTRIM/RTRIM function parameter order issue" ## What changes were proposed in this pull request? This reverts commit 1201a0a7. The parameter orders might be different in different vendors. We can change it in 3.x to make it more consistent with the other vendors. However, it could break the existing customers' workloads if they already use trim. ## How was this patch tested? N/A Closes #24944 from wangyum/SPARK-28093-REVERT-2.3. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 June 2019, 16:05:31 UTC
35cb2ff [SPARK-27018][CORE] Fix incorrect removal of checkpointed file in PeriodicCheckpointer ## What changes were proposed in this pull request? Remove the oldest checkpointed file only if the next checkpoint exists. I think this patch needs back-porting. ## How was this patch tested? Existing tests, plus a local check in spark-shell with the following snippet: ``` import org.apache.spark.ml.linalg.Vectors import org.apache.spark.ml.classification.GBTClassifier case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int) sc.setCheckpointDir("/checkpoints") val trainingData = sc.parallelize(1 to 2426874, 256).map(x => Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF val classifier = new GBTClassifier() .setLabelCol("label") .setFeaturesCol("features") .setProbabilityCol("probability") .setMaxIter(100) .setMaxDepth(10) .setCheckpointInterval(2) classifier.fit(trainingData) ``` Closes #24870 from zhengruifeng/ck_update. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 6064368415636e8668d8db835a218872f6846d98) Signed-off-by: Sean Owen <sean.owen@databricks.com> 24 June 2019, 14:35:20 UTC
1201a0a [SPARK-28093][SPARK-28109][SQL][2.3] Fix TRIM/LTRIM/RTRIM function parameter order issue ## What changes were proposed in this pull request? This pr backport #24902 and #24911 to branch-2.3. ## How was this patch tested? unit tests Closes #24908 from wangyum/SPARK-28093-branch-2.3. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 June 2019, 01:40:23 UTC
3fcbdb8 [MINOR][DOC] Fix python variance() documentation ## What changes were proposed in this pull request? The Python documentation incorrectly says that `variance()` acts as `var_pop` whereas it acts like `var_samp` here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.variance It was not the case in Spark 1.6 doc but it is in Spark 2.0 doc: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html The Scala documentation is correct: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#variance-org.apache.spark.sql.Column- The alias is set on this line: https://github.com/apache/spark/blob/v2.4.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L786 ## How was this patch tested? Using variance() in pyspark 2.4.3 returns: ``` >>> spark.createDataFrame([(1, ), (2, ), (3, )], "a: int").select(variance("a")).show() +-----------+ |var_samp(a)| +-----------+ | 1.0| +-----------+ ``` Closes #24895 from tools4origins/patch-1. Authored-by: tools4origins <tools4origins@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 25c5d5788311ba74f4aac2f97054b4e9ef7e295c) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 June 2019, 15:12:14 UTC
220f29a [SPARK-28081][ML] Handle large vocab counts in word2vec ## What changes were proposed in this pull request? The word2vec logic fails if a corpus has a word with count > 1e9. We should be able to handle very large counts better here by counting with longs. This takes over https://github.com/apache/spark/pull/24814 ## How was this patch tested? Existing tests. Closes #24893 from srowen/SPARK-28081. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit e96dd82f12f2b6d93860e23f4f98a86c3faf57c5) Signed-off-by: Sean Owen <sean.owen@databricks.com> 19 June 2019, 01:29:11 UTC
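A toy illustration of the Int-overflow hazard the change guards against (not the word2vec code itself):
```scala
// Summing Int counts overflows once the total exceeds Int.MaxValue (~2.1e9);
// accumulating into a Long keeps very large vocabulary counts correct.
val counts = Seq(1500000000, 1500000000)
val badTotal: Int   = counts.sum               // overflows to a negative number
val goodTotal: Long = counts.map(_.toLong).sum // 3000000000
```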
e6b5a5c [SPARK-24898][DOC] Adding spark.checkpoint.compress to the docs ## What changes were proposed in this pull request? Adding spark.checkpoint.compress configuration parameter to the documentation ![](https://user-images.githubusercontent.com/3538013/59580409-a7013080-90ee-11e9-9b2c-3d29015f597e.png) ## How was this patch tested? Checked locally for jeykyll html docs. Also validated the html for any issues. Closes #24883 from sandeepvja/SPARK-24898. Authored-by: Mellacheruvu Sandeep <mellacheruvu.sandeep@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b7b445255370e29d6b420b02389b022a1c65942e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 June 2019, 05:55:50 UTC
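For reference, a minimal sketch of enabling the newly documented setting (checkpoint compression is off by default):
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enable compression of RDD checkpoints before creating the context.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("checkpoint-compress-demo")
  .set("spark.checkpoint.compress", "true")
val sc = new SparkContext(conf)
sc.setCheckpointDir("/tmp/checkpoints")
```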
15ddc19 [SPARK-21882][CORE] OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function ## What changes were proposed in this pull request? (Continuation of https://github.com/apache/spark/pull/19118 ; see for details) ## How was this patch tested? Existing tests. Closes #24863 from srowen/SPARK-21882.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit b508eab9858b94f14b29e023812448e3d0c97712) Signed-off-by: Sean Owen <sean.owen@databricks.com> 14 June 2019, 17:45:45 UTC
9c9102c [SPARK-27798][SQL][BRANCH-2.3] ConvertToLocalRelation should tolerate expression reusing output object ## What changes were proposed in this pull request? The original issue SPARK-27798 was reported on master branch. When using `from_avro` to deserialize avro data to catalyst StructType format, if `ConvertToLocalRelation` is applied at the time, `from_avro` produces only the last value (overriding previous values). The cause is `AvroDeserializer` reuses output row for StructType. But `ConvertToLocalRelation` doesn't tolerate expression reusing output object. This is to backport the fix of `ConvertToLocalRelation` to branch 2.3. ## How was this patch tested? Added test. Closes #24823 from viirya/SPARK-27798-2.3. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 June 2019, 19:08:31 UTC
30735db [MINOR][BRANCH-2.4] Avoid hardcoded py4j-0.10.7-src.zip in Scala ## What changes were proposed in this pull request? This PR targets to deduplicate hardcoded `py4j-0.10.7-src.zip` in order to make py4j upgrade easier. ## How was this patch tested? N/A Closes #24772 from HyukjinKwon/backport-minor-py4j. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 67151350c16569a61726b42c2c3401a9ef29e061) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 June 2019, 15:02:12 UTC
c4ed161 [SPARK-27907][SQL] HiveUDAF should return NULL in case of 0 rows ## What changes were proposed in this pull request? When query returns zero rows, the HiveUDAFFunction throws NPE ## CASE 1: create table abc(a int) select histogram_numeric(a,2) from abc // NPE ``` Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 0, localhost, executor driver): java.lang.NullPointerException at org.apache.spark.sql.hive.HiveUDAFFunction.eval(hiveUDFs.scala:471) at org.apache.spark.sql.hive.HiveUDAFFunction.eval(hiveUDFs.scala:315) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.eval(interfaces.scala:543) at org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$5(AggregationIterator.scala:231) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:122) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` ## CASE 2: create table abc(a int) insert into abc values (1) select histogram_numeric(a,2) from abc where a=3 // NPE ``` Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 5, localhost, executor driver): java.lang.NullPointerException at org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:477) at org.apache.spark.sql.hive.HiveUDAFFunction.serialize(hiveUDFs.scala:315) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.serializeAggregateBufferInPlace(interfaces.scala:570) at org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateResultProjection$6(AggregationIterator.scala:254) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.outputForEmptyGroupingKeyWithoutInput(ObjectAggregationIterator.scala:97) at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:132) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:839) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:839) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:327) at org.apache.spark.rdd.RDD.iterator(RDD.scala:291) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:122) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1350) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` Hence add a check not avoid NPE ## How was this patch tested? Added new UT case Closes #24762 from ajithme/hiveudaf. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 3806887afb36266aee8749ac95ea0e9016fbf6ff) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 June 2019, 18:33:23 UTC
1d35fff [SPARK-27869][CORE] Redact sensitive information in System Properties from UI Currently system properties are not redacted. This PR fixes that, so that any credentials passed as System properties are redacted as well. Manual test. Run the following and see the UI. ``` bin/spark-shell --conf 'spark.driver.extraJavaOptions=-DMYSECRET=app' ``` Closes #24733 from aaruna/27869. Authored-by: Aaruna <aaruna.godthi@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bfa7f112e3258926befe9eaa9a489d3e0c4e2a0a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 May 2019, 17:49:03 UTC
ba224f8 [SPARK-27441][SQL][TEST] Add read/write tests to Hive serde tables ## What changes were proposed in this pull request? The versions between Hive, Parquet and ORC after the built-in Hive upgraded to 2.3.5 for Hadoop 3.2:

- built-in Hive is 1.2.1.spark2:

|   | ORC | Parquet |
| -- | -- | -- |
| Spark datasource table | 1.5.5 | 1.10.1 |
| Spark hive table | Hive built-in | 1.6.0 |
| Apache Hive 1.2.1 | Hive built-in | 1.6.0 |

- built-in Hive is 2.3.5:

|   | ORC | Parquet |
| -- | -- | -- |
| Spark datasource table | 1.5.5 | 1.10.1 |
| Spark hive table | 1.5.5 | [1.10.1](https://github.com/apache/spark/pull/24346) |
| Apache Hive 2.3.5 | 1.3.4 | 1.8.1 |

We should add a test for Hive Serde table. This pr adds tests to test read/write of all supported data types using Parquet and ORC. ## How was this patch tested? unit tests Closes #24345 from wangyum/SPARK-27441. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 193304b51bee164c1d355b75be039f762118bdba) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 May 2019, 15:43:03 UTC
e56a9b1 [SPARK-25139][SPARK-18406][CORE][BRANCH-2.3] Avoid NonFatals killing the Executor in PythonRunner ## What changes were proposed in this pull request? Python uses a prefetch approach to read the result from upstream and serve it in another thread, so if the child operator doesn't consume all the data, the Task cleanup may happen before the Python-side read process finishes. This in turn creates a race condition: the block read locks are freed during Task cleanup, and then the reader tries to release the read lock it holds and finds it has already been released, in which case we hit an AssertionError. We should catch the AssertionError in PythonRunner and prevent it from killing the Executor. ## How was this patch tested? Hard to write a unit test case for this; manually verified with a failed job. Closes #24670 from rezasafi/branch-2.3. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 May 2019, 15:38:19 UTC
db33bd2 [SPARK-27800][SQL][HOTFIX][FOLLOWUP] Fix wrong answer on BitwiseXor test cases This PR is a follow up of https://github.com/apache/spark/pull/24669 to fix the wrong answers used in test cases. Closes #24674 from dongjoon-hyun/SPARK-27800. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a24cdc00bfa4eb011022aedec22a345a0e0e981d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 May 2019, 10:12:41 UTC
a73cb04 [SPARK-27800][SQL][DOC] Fix wrong answer of example for BitwiseXor ## What changes were proposed in this pull request? Fix example for bitwise xor function. 3 ^ 5 should be 6 rather than 2. - See https://spark.apache.org/docs/latest/api/sql/index.html#_14 ## How was this patch tested? manual tests Closes #24669 from alex-lx/master. Authored-by: Liu Xiao <hhdxlx@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bf617996aa95dd51be15da64ba9254f98697560c) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 May 2019, 04:52:50 UTC
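A quick check of the corrected answer:
```scala
// 3 = 0b011, 5 = 0b101; XOR keeps the bits that differ, giving 0b110 = 6.
assert((3 ^ 5) == 6)
// The SQL expression agrees: spark.sql("SELECT 3 ^ 5") returns 6.
```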
533d603 [MINOR][DOCS] Fix Spark hive example. ## What changes were proposed in this pull request? Documentation has an error, https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#hive-tables. The example: ```scala scala> val dataDir = "/tmp/parquet_data" dataDir: String = /tmp/parquet_data scala> spark.range(10).write.parquet(dataDir) scala> sql(s"CREATE EXTERNAL TABLE hive_ints(key int) STORED AS PARQUET LOCATION '$dataDir'") res6: org.apache.spark.sql.DataFrame = [] scala> sql("SELECT * FROM hive_ints").show() +----+ | key| +----+ |null| |null| |null| |null| |null| |null| |null| |null| |null| |null| +----+ ``` Range does not emit `key`, but `id` instead. Closes #24657 from ScrapCodes/fix_hive_example. Lead-authored-by: Prashant Sharma <prashant@apache.org> Co-authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 5f4b50513cd34cd3dcf7f72972bfcd1f51031723) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 May 2019, 09:24:36 UTC
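One way to make the documented example line up, as a hedged sketch for spark-shell (the table name `hive_bigints` and the rename are illustrative, not necessarily the wording the docs adopted):
```scala
// spark.range(10) produces a single BIGINT column named `id`, so rename it (or
// declare the external table with a matching column) before creating the table.
val dataDir = "/tmp/parquet_data"
spark.range(10).withColumnRenamed("id", "key").write.mode("overwrite").parquet(dataDir)
sql(s"CREATE EXTERNAL TABLE hive_bigints(key bigint) STORED AS PARQUET LOCATION '$dataDir'")
sql("SELECT * FROM hive_bigints").show()  // shows 0..9 instead of nulls
```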
f5310be [MINOR][EXAMPLES] Don't use internal Spark logging in user examples ## What changes were proposed in this pull request? Don't use internal Spark logging in user examples, because users shouldn't / can't use it directly anyway. These examples already use println in some cases. Note that the usage in StreamingExamples is on purpose. ## How was this patch tested? N/A Closes #24649 from srowen/ExampleLog. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit db24b04cad421ed508413d397c6beec01f723aee) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 May 2019, 15:45:29 UTC
89095f6 [SPARK-27771][SQL] Add SQL description for grouping functions (cube, rollup, grouping and grouping_id) ## What changes were proposed in this pull request? Both look added as of 2.0 (see SPARK-12541 and SPARK-12706). I referred existing docs and examples in other API docs. ## How was this patch tested? Manually built the documentation and, by running examples, by running `DESCRIBE FUNCTION EXTENDED`. Closes #24642 from HyukjinKwon/SPARK-27771. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 2431ab0999dbb322dcefeb9b1671d935945dc29a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 May 2019, 02:27:18 UTC
7bdcc77 [SPARK-27735][SS] Parsing interval string should be case-insensitive in SS Some APIs in Structured Streaming require the user to specify an interval. Right now these APIs don't accept upper-case strings. This PR adds a new method `fromCaseInsensitiveString` to `CalendarInterval` to support parsing upper-case strings, and fixes all APIs that need to parse an interval string. Tested with a new unit test. Closes #24619 from zsxwing/SPARK-27735. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6a317c8f014557dfd60931bd1eac3f545520d939) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 May 2019, 21:10:57 UTC
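The essence of case-insensitive parsing is just normalizing the input before delegating to the existing parser; a minimal sketch of that idea (not the actual `CalendarInterval` code) is:

```scala
import java.util.Locale

// Normalize case so "1 DAY", "1 Day" and "1 day" all parse the same way;
// a real implementation would hand the normalized string to the existing parser.
def normalizeInterval(s: String): String = {
  require(s != null && s.trim.nonEmpty, "interval string cannot be null or blank")
  s.trim.toLowerCase(Locale.ROOT)
}

assert(normalizeInterval("1 DAY") == normalizeInterval("1 day"))
```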
306ebb2 [MINOR][SS] Remove duplicate 'add' in comment of `StructuredSessionization`. ## What changes were proposed in this pull request? The `StructuredSessionization` comment contains a duplicated 'add', which should be removed. ## How was this patch tested? Existing UTs. Closes #24589 from beliefer/remove-duplicate-add-in-comment. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 7dd2dd5dc5d4210bef88b75b7f5b06266dc4ce1c) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 15 May 2019, 07:15:56 UTC
fd17726 [SPARK-27347][MESOS] Fix supervised driver retry logic for outdated tasks ## What changes were proposed in this pull request? This patch fixes a bug where `--supervised` Spark jobs would retry multiple times whenever an agent would crash, come back, and re-register even when those jobs had already relaunched on a different agent. That is: ``` - supervised driver is running on agent1 - agent1 crashes - driver is relaunched on another agent as `<task-id>-retry-1` - agent1 comes back online and re-registers with scheduler - spark relaunches the same job as `<task-id>-retry-2` - now there are two jobs running simultaneously ``` This is because when an agent would come back and re-register it would send a status update `TASK_FAILED` for its old driver-task. Previous logic would indiscriminately remove the `submissionId` from Zookeeper's `launchedDrivers` node and add it to `retryList` node. Then, when a new offer came in, it would relaunch another `-retry-` task even though one was previously running. For example logs, scroll to bottom ## How was this patch tested? - Added a unit test to simulate behavior described above - Tested manually on a DC/OS cluster by ``` - launching a --supervised spark job - dcos node ssh <to the agent with the running spark-driver> - systemctl stop dcos-mesos-slave - docker kill <driver-container-id> - [ wait until spark job is relaunched ] - systemctl start dcos-mesos-slave - [ observe spark driver is not relaunched as `-retry-2` ] ``` Log snippets included below. Notice the `-retry-1` task is running when status update for the old task comes in afterward: ``` 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: ... [offers] ... 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001" ... 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_STARTING message='' 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_RUNNING message='' ... 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed out' reason=REASON_SLAVE_REMOVED ... 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-1" ... 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message='' 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message='' ... 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable agent re-reregistered' ... 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with driver-20190115192138-0001 in status update ... 
19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-2" ... 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message='' 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message='' ``` Closes #24276 from samvantran/SPARK-27347-duplicate-retries. Authored-by: Sam Tran <stran@mesosphere.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bcd3b61c4be98565352491a108e6394670a0f413) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 May 2019, 17:54:10 UTC
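The gist of the fix is to ignore a failure report that refers to a task attempt which is no longer the current one for that submission; a rough, self-contained sketch (illustrative names only, not the real MesosClusterScheduler fields) looks like this:

```scala
// Track which task id is currently considered launched for each supervised driver.
final case class LaunchedDriver(submissionId: String, currentTaskId: String)

def shouldRetry(launched: Map[String, LaunchedDriver],
                submissionId: String,
                failedTaskId: String): Boolean =
  launched.get(submissionId).exists(_.currentTaskId == failedTaskId)

val launched = Map(
  "driver-20190115192138-0001" ->
    LaunchedDriver("driver-20190115192138-0001", "driver-20190115192138-0001-retry-1"))

// Stale TASK_FAILED from the recovered agent's old task: do not schedule another retry.
assert(!shouldRetry(launched, "driver-20190115192138-0001", "driver-20190115192138-0001"))
// Failure of the task we actually consider running: a retry is legitimate.
assert(shouldRetry(launched, "driver-20190115192138-0001", "driver-20190115192138-0001-retry-1"))
```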
0d5533c [SPARK-27673][SQL] Add `since` info to random, regex, null expressions We should add since info to all expressions. SPARK-7886 Rand / Randn https://github.com/apache/spark/commit/af3746ce0d724dc624658a2187bde188ab26d084 RLike, Like (I manually checked that it exists from 1.0.0) SPARK-8262 Split SPARK-8256 RegExpReplace SPARK-8255 RegExpExtract https://github.com/apache/spark/commit/9aadcffabd226557174f3ff566927f873c71672e Coalesce / IsNull / IsNotNull (I manually checked that it exists from 1.0.0) SPARK-14541 IfNull / NullIf / Nvl / Nvl2 SPARK-9080 IsNaN SPARK-9168 NaNvl N/A Closes #24579 from HyukjinKwon/SPARK-27673. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c71f217de1e0b2265f585369aa556ed26db98589) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 May 2019, 16:32:49 UTC
0aa1e97 [SPARK-27672][SQL] Add `since` info to string expressions This PR adds since information to the all string expressions below: SPARK-8241 ConcatWs SPARK-16276 Elt SPARK-1995 Upper / Lower SPARK-20750 StringReplace SPARK-8266 StringTranslate SPARK-8244 FindInSet SPARK-8253 StringTrimLeft SPARK-8260 StringTrimRight SPARK-8267 StringTrim SPARK-8247 StringInstr SPARK-8264 SubstringIndex SPARK-8249 StringLocate SPARK-8252 StringLPad SPARK-8259 StringRPad SPARK-16281 ParseUrl SPARK-9154 FormatString SPARK-8269 Initcap SPARK-8257 StringRepeat SPARK-8261 StringSpace SPARK-8263 Substring SPARK-21007 Right SPARK-21007 Left SPARK-8248 Length SPARK-20749 BitLength SPARK-20749 OctetLength SPARK-8270 Levenshtein SPARK-8271 SoundEx SPARK-8238 Ascii SPARK-20748 Chr SPARK-8239 Base64 SPARK-8268 UnBase64 SPARK-8242 Decode SPARK-8243 Encode SPARK-8245 format_number SPARK-16285 Sentences N/A Closes #24578 from HyukjinKwon/SPARK-27672. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 3442fcaa9bbe2e9306ef33a655fb6d1fe75ceb47) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 May 2019, 16:12:43 UTC
564dbf6 [MINOR][DOCS] Fix invalid documentation for StreamingQueryManager Class ## What changes were proposed in this pull request? When following the example for using `spark.streams().awaitAnyTermination()` a valid pyspark code will output the following error: ``` Traceback (most recent call last): File "pyspark_app.py", line 182, in <module> spark.streams().awaitAnyTermination() TypeError: 'StreamingQueryManager' object is not callable ``` Docs URL: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries This changes the documentation line to properly call the method under the StreamingQueryManager Class https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.streaming.StreamingQueryManager ## How was this patch tested? After changing the syntax, error no longer occurs and pyspark application works This is only docs change Closes #24547 from asaf400/patch-1. Authored-by: Asaf Levy <asaf400@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 09422f5139cc13abaf506453819c2bb91e174ae3) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 08 May 2019, 14:46:15 UTC
8eac465 [SPARK-27624][CORE] Fix CalenderInterval to show an empty interval correctly ## What changes were proposed in this pull request? If the interval is `0`, it doesn't show both the value `0` and the unit at all. For example, this happens in the explain plans and Spark Web UI on `EventTimeWatermark` diagram. **BEFORE** ```scala scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain == Physical Plan == EventTimeWatermark ts#0: timestamp, interval 1 microseconds +- StreamingRelation FileSource[/tmp/t], [ts#0] scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain == Physical Plan == EventTimeWatermark ts#3: timestamp, interval +- StreamingRelation FileSource[/tmp/t], [ts#3] ``` **AFTER** ```scala scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain == Physical Plan == EventTimeWatermark ts#0: timestamp, interval 1 microseconds +- StreamingRelation FileSource[/tmp/t], [ts#0] scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain == Physical Plan == EventTimeWatermark ts#3: timestamp, interval 0 microseconds +- StreamingRelation FileSource[/tmp/t], [ts#3] ``` ## How was this patch tested? Pass the Jenkins with the updated test case. Closes #24516 from dongjoon-hyun/SPARK-27624. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 614a5cc600b0ac01c5d03b1dc5fdf996ef18ac0e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 May 2019, 18:09:32 UTC
3c5fca1 [SPARK-27577][MLLIB] Correct thresholds downsampled in BinaryClassificationMetrics ## What changes were proposed in this pull request? Choose the last record in chunks when calculating metrics with downsampling in `BinaryClassificationMetrics`. ## How was this patch tested? A new unit test is added to verify thresholds from downsampled records. Closes #24470 from shishaochen/spark-mllib-binary-metrics. Authored-by: Shaochen Shi <shishaochen@bytedance.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit d5308cd86fff1e4bf9c24e0dd73d8d2c92737c4f) Signed-off-by: Sean Owen <sean.owen@databricks.com> 07 May 2019, 13:44:56 UTC
85fc2f2 [SPARK-24935][SQL][2.3] fix Hive UDAF with two aggregation buffers ## What changes were proposed in this pull request? backport https://github.com/apache/spark/pull/24144 and https://github.com/apache/spark/pull/24459 to 2.3. ## How was this patch tested? existing tests Closes #24539 from cloud-fan/backport. Lead-authored-by: pgandhi <pgandhi@verizonmedia.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 May 2019, 23:06:57 UTC
52daf49 [SPARK-27621][ML] Linear Regression - validate training related params such as loss only during fitting phase ## What changes were proposed in this pull request? When transform(...) method is called on a LinearRegressionModel created directly with the coefficients and intercepts, the following exception is encountered. ``` java.util.NoSuchElementException: Failed to find a default value for loss at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at org.apache.spark.ml.param.Params$class.$(params.scala:786) at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111) at org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637) at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305) ``` This is because validateAndTransformSchema() is called both during training and scoring phases, but the checks against the training related params like loss should really be performed during training phase only, I think, please correct me if I'm missing anything :) This issue was first reported for mleap (https://github.com/combust/mleap/issues/455) because basically when we serialize the Spark transformers for mleap, we only serialize the params that are relevant for scoring. We do have the option to de-serialize the serialized transformers back into Spark for scoring again, but in that case, we no longer have all the training params. ## How was this patch tested? Added a unit test to check this scenario. Please let me know if there's anything additional required, this is the first PR that I've raised in this project. Closes #24509 from ancasarb/linear_regression_params_fix. Authored-by: asarb <asarb@expedia.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 4241a72c654f13b6b4ceafb27daceb7bb553add6) Signed-off-by: Sean Owen <sean.owen@databricks.com> 03 May 2019, 23:18:19 UTC
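A minimal sketch of the validation split (illustrative names, not the actual `LinearRegressionParams` code): training-only checks such as the `loss` param sit behind a `fitting` flag, so a model constructed directly from coefficients can still run `transform()`.

```scala
// Illustrative stand-in for the ML params involved.
final case class RegressionParams(loss: Option[String], featuresCol: String)

def validateAndTransformSchema(params: RegressionParams, fitting: Boolean): Unit = {
  require(params.featuresCol.nonEmpty, "features column must be set")
  if (fitting) {
    // Only the training path needs an explicit or default loss.
    require(params.loss.isDefined, "loss must be set when fitting")
  }
}

// Scoring a hand-built model: no loss param available, and that is now fine.
validateAndTransformSchema(RegressionParams(loss = None, featuresCol = "features"), fitting = false)
```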
6071653 [SPARK-27626][K8S] Fix `docker-image-tool.sh` to be robust in non-bash shell env Although we use shebang `#!/usr/bin/env bash`, `minikube docker-env` returns invalid commands in `non-bash` environment and causes failures at `eval` because it only recognizes the default shell. We had better add `--shell bash` option explicitly in our `bash` script. ```bash $ bash -c 'eval $(minikube docker-env)' bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] bash: line 0: set: -g: invalid option set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...] $ bash -c 'eval $(minikube docker-env --shell bash)' ``` Manual. Run the script with non-bash shell environment. ``` bin/docker-image-tool.sh -m -t testing build ``` Closes #24517 from dongjoon-hyun/SPARK-27626. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6c2d351f5466d42c4d227f5627bd3709c266b5ce) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 May 2019, 17:15:58 UTC
a956e9c [SPARK-27563][SQL][TEST] automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite We can get the latest downloadable Spark versions from https://dist.apache.org/repos/dist/release/spark/ manually. Closes #24454 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 April 2019, 08:47:44 UTC
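One plausible way for a test to discover released versions from that page instead of hardcoding them (the fetching code and regex below are assumptions for illustration, not the actual HiveExternalCatalogVersionsSuite implementation):

```scala
import scala.io.Source

def releasedSparkVersions(): Seq[String] = {
  val url = "https://dist.apache.org/repos/dist/release/spark/"
  val src = Source.fromURL(url)
  try {
    // Directory listings look like "spark-2.3.3/"; pull out the version numbers.
    val versionDir = """spark-(\d+\.\d+\.\d+)/""".r
    versionDir.findAllMatchIn(src.mkString).map(_.group(1)).toSeq.distinct
  } finally {
    src.close()
  }
}
```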
257abc4 [SPARK-27496][CORE] Fatal errors should also be sent back to the sender ## What changes were proposed in this pull request? When a fatal error (such as StackOverflowError) is thrown from "receiveAndReply", we should try our best to notify the sender. Otherwise, the sender will hang until timeout. In addition, when a MessageLoop dies unexpectedly, it should resubmit a new one so that the Dispatcher keeps working. ## How was this patch tested? New unit tests. Closes #24396 from zsxwing/SPARK-27496. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 009059e3c261a73d605bc49aee4aecb0eb0e8267) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 April 2019, 00:07:27 UTC
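Conceptually, the change is to catch `Throwable` rather than only non-fatal errors around the reply handler, so the sender gets a failure instead of waiting for its ask timeout; a small sketch under assumed names (not the real Dispatcher/RpcEndpoint code):

```scala
// Illustrative reply context; the real one is Spark's RpcCallContext.
trait ReplyContext {
  def reply(msg: Any): Unit
  def sendFailure(e: Throwable): Unit
}

def handleAndReply(handler: Any => Any)(msg: Any, ctx: ReplyContext): Unit = {
  try {
    ctx.reply(handler(msg))
  } catch {
    case t: Throwable =>
      // Best-effort notification so the sender does not hang until timeout,
      // then rethrow so a fatal error is not silently swallowed.
      ctx.sendFailure(t)
      throw t
  }
}
```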
a85ab12 [SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x ## What changes were proposed in this pull request? have jenkins test against python3.6 (instead of 3.4). ## How was this patch tested? extensive testing on both the centos and ubuntu jenkins workers revealed that 2.3 probably doesn't like python 3.6... :( NOTE: this is just for branch-2.3 PLEASE DO NOT MERGE Author: shane knapp <incomplete@gmail.com> Closes #24380 from shaneknapp/update-python-executable-2.3. 19 April 2019, 16:45:40 UTC
233646c [MINOR][TEST] Expand spark-submit test to allow python2/3 executable ## What changes were proposed in this pull request? This backports a tiny part of another change: https://github.com/apache/spark/commit/4bdfda92a1c570d7a1142ee30eb41e37661bc240#diff-3c792ce7265b69b448a984caf629c96bR161 ... which just works around the possibility that the local python interpreter is 'python3' or 'python2' when running the spark-submit tests. I'd like to backport to 2.3 too. This otherwise prevents this test from passing on my mac, though I have a custom install with brew. But may affect others. ## How was this patch tested? Existing tests. Closes #24407 from srowen/Python23check. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 7f64963c9d951585f585f98667f09e9ddbe63731) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 April 2019, 21:30:40 UTC
c1e5824 [SPARK-27479][BUILD] Hide API docs for org.apache.spark.util.kvstore ## What changes were proposed in this pull request? The API docs should not include the "org.apache.spark.util.kvstore" package because they are internal private APIs. See the doc link: https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/kvstore/LevelDB.html ## How was this patch tested? N/A Closes #24386 from gatorsmile/rmDoc. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 61feb1635217ef1d4ebceebc1e7c8829c5c11994) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 April 2019, 16:30:18 UTC
e3bdb5b Revert "[SPARK-23433][SPARK-25250][CORE][BRANCH-2.3] Later created TaskSet should learn about the finished partitions" This reverts commit a1ca5663725c278b6e3785042348819a25496fe4. 14 April 2019, 09:02:02 UTC
6934629 [SPARK-27358][UI] Update jquery to 1.12.x to pick up security fixes Update jquery -> 1.12.4, datatables -> 1.10.18, mustache -> 2.3.12. Add missing mustache license I manually tested the UI locally with the javascript console open and didn't observe any problems or JS errors. The only 'risky' change seems to be mustache, but on reading its release notes, don't think the changes from 0.8.1 to 2.x would affect Spark's simple usage. Closes #24288 from srowen/SPARK-27358. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 23bde447976d0bb33cae67124bac476994634f04) Signed-off-by: Sean Owen <sean.owen@databricks.com> 05 April 2019, 18:04:01 UTC
06df899 [MINOR][DOC] Fix html tag broken in configuration.md ## What changes were proposed in this pull request? This patch fixes wrong HTML tag in configuration.md which breaks the table tag. This is originally reported in dev mailing list: https://lists.apache.org/thread.html/744bdc83b3935776c8d91bf48fdf80d9a3fed3858391e60e343206f9%3Cdev.spark.apache.org%3E ## How was this patch tested? This change is one-liner and pretty obvious so I guess we may be able to skip testing. Closes #24304 from HeartSaVioR/MINOR-configuration-doc-html-tag-error. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a840b99daf97de06c9b1b66efed0567244ec4a01) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2019, 15:42:37 UTC
9446da1 [SPARK-27216][CORE][BACKPORT-2.3] Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue ## What changes were proposed in this pull request? Back-port of #24264 to branch-2.3. HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks, but RoaringBitmap cannot be serialized/deserialized correctly with the unsafe KryoSerializer. This is a bug in RoaringBitmap 0.5.11 and is fixed in the latest version. ## How was this patch tested? Added a UT Closes #24291 from LantaoJin/SPARK-27216_BACKPORT-2.3. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 04 April 2019, 23:23:52 UTC
b95a58e [SPARK-27338][CORE][FOLLOWUP] remove trailing space ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it. ## How was this patch tested? N/A Closes #24289 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 April 2019, 06:58:30 UTC
049a888 [SPARK-27338][CORE] Fix deadlock in UnsafeExternalSorter.SpillableIterator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager ## What changes were proposed in this pull request? `UnsafeExternalSorter.SpillableIterator#loadNext()` takes a lock on the `UnsafeExternalSorter` and, once the `lastPage` is consumed, calls `freePage`, which needs to take a lock on `TaskMemoryManager`. At the same time, another MemoryConsumer using the `UnsafeExternalSorter` as part of sorting can call `allocatePage`, which needs the lock on `TaskMemoryManager` and can trigger a spill, which in turn requires the lock on `UnsafeExternalSorter` again, causing a deadlock. This is a classic deadlock situation, similar to SPARK-26265. To fix it, we can move the `freePage` call in `loadNext` outside of the `synchronized` block, similar to the fix in SPARK-26265. ## How was this patch tested? Manual tests were done; will also try to add a test. Closes #24265 from venkata91/deadlock-sorter. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@qubole.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6c4552c65045cfe82ed95212ee7cff684e44288b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 April 2019, 02:04:07 UTC
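A minimal sketch of the lock-ordering pattern described above (illustrative classes, not the real sorter code): record which page to free while holding the iterator's lock, but call the memory manager only after releasing it, so the iterator lock is never held while acquiring the TaskMemoryManager lock.

```scala
class MemoryManagerSketch {
  def freePage(page: AnyRef): Unit = synchronized { /* release the page */ }
}

class SpillableIteratorSketch(memoryManager: MemoryManagerSketch) {
  private var lastPage: AnyRef = new Object

  def loadNext(): Unit = {
    var pageToFree: AnyRef = null
    synchronized {
      // Decide what to free and advance the iterator while holding our own lock...
      if (lastPage != null) { pageToFree = lastPage; lastPage = null }
    }
    // ...but take the memory manager's lock only after our lock is released.
    if (pageToFree != null) memoryManager.freePage(pageToFree)
  }
}
```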
96c2c3b [SPARK-26998][CORE] Remove SSL configuration from executors ## What changes were proposed in this pull request? Different SSL passwords shown up as command line argument on executor side in standalone mode: * keyStorePassword * keyPassword * trustStorePassword In this PR I've removed SSL configurations from executors. ## How was this patch tested? Existing + additional unit tests. Additionally tested with standalone mode and checked the command line arguments: ``` [gaborsomogyi:~/spark] SPARK-26998(+4/-0,3)+ ± jps 94803 CoarseGrainedExecutorBackend 94818 Jps 90149 RemoteMavenServer 91925 Nailgun 94793 SparkSubmit 94680 Worker 94556 Master 398 [gaborsomogyi:~/spark] SPARK-26998(+4/-1,3)+ ± ps -ef | egrep "94556|94680|94793|94803" 502 94556 1 0 2:02PM ttys007 0:07.39 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host gsomogyi-MBP.local --port 7077 --webui-port 8080 --properties-file conf/spark-defaults.conf 502 94680 1 0 2:02PM ttys007 0:07.27 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 --properties-file conf/spark-defaults.conf spark://gsomogyi-MBP.local:7077 502 94793 94782 0 2:02PM ttys007 0:35.52 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://gsomogyi-MBP.local:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell 502 94803 94680 0 2:03PM ttys007 0:05.20 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1024M -Dspark.ssl.ui.port=0 -Dspark.driver.port=60902 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler172.30.65.186:60902 --executor-id 0 --hostname 172.30.65.186 --cores 8 --app-id app-20190326140311-0000 --worker-url spark://Worker172.30.65.186:60899 502 94910 57352 0 2:05PM ttys008 0:00.00 egrep 94556|94680|94793|94803 ``` Closes #24170 from gaborgsomogyi/SPARK-26998. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 57aff93886ac7d02b88294672ce0d2495b0942b8) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 02 April 2019, 16:23:24 UTC
7c0475a [SPARK-27244][CORE][TEST][FOLLOWUP] toDebugString redacts sensitive information ## What changes were proposed in this pull request? This PR is a FollowUp of https://github.com/apache/spark/pull/24196. It improves the test case by using the parameters that are being used in the actual scenarios. ## How was this patch tested? N/A Closes #24257 from gatorsmile/followupSPARK-27244. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 92b6f86f6d25abbc2abbf374e77c0b70cd1779c7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 March 2019, 05:59:20 UTC
69b8a04 [MINOR][R] fix R project description update as per this NOTE when running CRAN check ``` The Title field should be in title case, current version then in title case: ‘R Front end for 'Apache Spark'’ ‘R Front End for 'Apache Spark'’ ``` Closes #24255 from felixcheung/rdesc. Authored-by: Felix Cheung <felixcheung_m@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit fa0f791d4d9f083a45ab631a2e9f88a6b749e416) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 March 2019, 04:03:09 UTC
b57fef9 [SPARK-27301][DSTREAM] Shorten the FileSystem cached life cycle to the cleanup method inner scope ## What changes were proposed in this pull request? The cached FileSystem's token will expire if no tokens explicitly are add into it. ```scala 19/03/28 13:40:16 INFO storage.BlockManager: Removing RDD 83189 19/03/28 13:40:16 INFO rdd.MapPartitionsRDD: Removing RDD 82860 from persistence list 19/03/28 13:40:16 INFO spark.ContextCleaner: Cleaned shuffle 6005 19/03/28 13:40:16 INFO storage.BlockManager: Removing RDD 82860 19/03/28 13:40:16 INFO scheduler.ReceivedBlockTracker: Deleting batches: 19/03/28 13:40:16 INFO scheduler.InputInfoTracker: remove old batch metadata: 1553750250000 ms 19/03/28 13:40:17 WARN security.UserGroupInformation: PriviledgedActionException as:ursHADOOP.HZ.NETEASE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN security.UserGroupInformation: PriviledgedActionException as:ursHADOOP.HZ.NETEASE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-1396157959_1] for 53 seconds. Will retry shortly ... 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 at org.apache.hadoop.ipc.Client.call(Client.java:1468) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy11.renewLease(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571) at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy12.renewLease(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:878) at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442) at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298) at java.lang.Thread.run(Thread.java:748) ``` This PR shorten the FileSystem cached life cycle to the cleanup method inner scope in case of token expiry. ## How was this patch tested? existing ut Closes #24235 from yaooqinn/SPARK-27301. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit f4c73b7c685b901dd69950e4929c65e3b8dd3a55) Signed-off-by: Sean Owen <sean.owen@databricks.com> 30 March 2019, 07:36:45 UTC
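A rough sketch of scoping the FileSystem to the cleanup call (illustrative, not the actual DStream checkpointing code; disabling the FS cache here is an assumption so the short-lived instance can be closed safely):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def cleanupOldCheckpoints(dir: Path, hadoopConf: Configuration): Unit = {
  val conf = new Configuration(hadoopConf)
  // Bypass the global FileSystem cache so this instance can be closed when we are done.
  conf.setBoolean(s"fs.${dir.toUri.getScheme}.impl.disable.cache", true)
  val fs = FileSystem.get(dir.toUri, conf)
  try {
    // delete expired checkpoint files under `dir` here...
  } finally {
    fs.close()
  }
}
```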
0975fe9 [SPARK-27244][CORE] Redact Passwords While Using Option logConf=true When logConf is set to true, config keys that contain passwords were printed in cleartext in the driver log. This change uses the redact method already present in Utils to redact all the passwords based on the redaction pattern in SparkConf before printing the conf to the driver log, ensuring that sensitive information like passwords is not printed in clear text. This patch was tested through `SparkConfSuite` and then the entire unit test suite through sbt. Closes #24196 from ninadingole/SPARK-27244. Authored-by: Ninad Ingole <robert.wallis@example.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit dbc7ce18b934fbfd0743b1348fc1265778f07027) Signed-off-by: Sean Owen <sean.owen@databricks.com> 29 March 2019, 19:24:16 UTC
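The redaction itself is just key-pattern matching before logging; a small stand-alone sketch of the idea (the regex below is an assumption; Spark's actual code goes through `Utils.redact` and the configured redaction pattern):

```scala
val sensitiveKey = "(?i)secret|password|token".r

def redactConf(conf: Seq[(String, String)]): Seq[(String, String)] = conf.map {
  case (k, _) if sensitiveKey.findFirstIn(k).isDefined => (k, "*********(redacted)")
  case kv => kv
}

redactConf(Seq("spark.ssl.keyPassword" -> "hunter2", "spark.app.name" -> "demo"))
  .foreach { case (k, v) => println(s"$k=$v") }
```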
530ec52 [SPARK-27275][CORE] Fix potential corruption in EncryptedMessage.transferTo (2.4) ## What changes were proposed in this pull request? Backport https://github.com/apache/spark/pull/24211 to 2.4 ## How was this patch tested? Jenkins Closes #24229 from zsxwing/SPARK-27275-2.4. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 298e4fa6f8054c54e246f91b70d62174ccdb9413) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 March 2019, 18:13:54 UTC
4a54107 [SPARK-26961][CORE] Enable parallel classloading capability ## What changes were proposed in this pull request? As per https://docs.oracle.com/javase/8/docs/api/java/lang/ClassLoader.html ``Class loaders that support concurrent loading of classes are known as parallel capable class loaders and are required to register themselves at their class initialization time by invoking the ClassLoader.registerAsParallelCapable method. Note that the ClassLoader class is registered as parallel capable by default. However, its subclasses still need to register themselves if they are parallel capable. `` i.e we can have finer class loading locks by registering classloaders as parallel capable. (Refer to deadlock due to macro lock https://issues.apache.org/jira/browse/SPARK-26961). All the classloaders we have are wrapper of URLClassLoader which by itself is parallel capable. But this cannot be achieved by scala code due to static registration Refer https://github.com/scala/bug/issues/11429 ## How was this patch tested? All Existing UT must pass Closes #24126 from ajithme/driverlock. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit b61dce23d2ee7ca95770bc7c390029aae8c65f7e) Signed-off-by: Sean Owen <sean.owen@databricks.com> 26 March 2019, 00:08:26 UTC
ac683b7 [SPARK-27160][SQL] Fix DecimalType when building orc filters A DecimalType Literal should not be cast to Long. E.g. for `df.filter("x < 3.14")`, assuming df (x in DecimalType) reads from an ORC table and uses the native ORC reader with predicate push down enabled, we will push down the `x < 3.14` predicate to the ORC reader via a SearchArgument. OrcFilters constructs the SearchArgument but did not handle the DecimalType correctly: the previous implementation would construct `x < 3` from `x < 3.14`. ``` $ sbt > sql/testOnly *OrcFilterSuite > sql/testOnly *OrcQuerySuite -- -z "27160" ``` Closes #24092 from sadhen/spark27160. Authored-by: Darcy Shen <sadhen@zoho.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f3ba73a5f54cc233424cee4fdfd3a61674b2b48e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 March 2019, 17:47:42 UTC
978b68a [SPARK-26606][CORE] Handle driver options properly when submitting to standalone cluster mode via legacy Client This patch fixes the issue that ClientEndpoint in a standalone cluster doesn't recognize driver options that are passed via SparkConf instead of system properties. When `Client` is executed via the CLI they should be provided as system properties, but with `spark-submit` they can be provided via SparkConf. (SparkSubmit will call `ClientApp.start` with a SparkConf which would contain these options.) Manually tested via the following steps: 1) set up a standalone cluster (launch master and worker via `./sbin/start-all.sh`) 2) submit one of the example apps in standalone cluster mode ``` ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master "spark://localhost:7077" --conf "spark.driver.extraJavaOptions=-Dfoo=BAR" --deploy-mode "cluster" --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 examples/jars/spark-examples*.jar 10 ``` 3) check whether `foo=BAR` appears in the system properties in the Spark UI (screenshot: https://user-images.githubusercontent.com/1317309/54728501-97db1700-4bc1-11e9-89da-078445c71e9b.png) Closes #24163 from HeartSaVioR/SPARK-26606. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 8a9eb05137cd4c665f39a54c30d46c0c4eb7d20b) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 22 March 2019, 22:21:53 UTC
7bb2b42 [SPARK-27112][CORE] Create a resource ordering between threads to resolve the deadlocks encountered when trying to kill executors either due to dynamic allocation or blacklisting Closes #24072 from pgandhi999/SPARK-27112-2. Authored-by: pgandhi <pgandhi@verizonmedia.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> ## What changes were proposed in this pull request? There are two deadlocks resulting from the interplay between three different threads: the **task-result-getter thread**, the **spark-dynamic-executor-allocation thread**, and the **dispatcher-event-loop thread (makeOffers())**. The fix enforces an ordering constraint by acquiring the lock on `TaskSchedulerImpl` before acquiring the lock on `CoarseGrainedSchedulerBackend` in both the `makeOffers()` and `killExecutors()` methods. This ensures resource ordering between the threads and thus fixes the deadlocks. ## How was this patch tested? Manual tests Closes #24134 from pgandhi999/branch-2.4-SPARK-27112. Authored-by: pgandhi <pgandhi@verizonmedia.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> (cherry picked from commit 95e73b328ac883be2ced9099f20c8878e498e297) Signed-off-by: Imran Rashid <irashid@cloudera.com> 19 March 2019, 21:23:19 UTC
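The ordering constraint can be pictured with a tiny sketch (illustrative lock objects, not the actual scheduler classes): every path that needs both locks takes them in the same order, which removes the circular wait.

```scala
object TaskSchedulerImplLock
object SchedulerBackendLock

// Both code paths acquire TaskSchedulerImpl's lock first, then the backend's lock.
def makeOffers(body: => Unit): Unit =
  TaskSchedulerImplLock.synchronized { SchedulerBackendLock.synchronized { body } }

def killExecutors(body: => Unit): Unit =
  TaskSchedulerImplLock.synchronized { SchedulerBackendLock.synchronized { body } }
```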
959a7ec [MINOR][CORE] Use https for bintray spark-packages repository ## What changes were proposed in this pull request? This patch changes the schema of url from http to https for bintray spark-packages repository. Looks like we already changed the schema of repository url for pom.xml but missed inside the code. ## How was this patch tested? Manually ran the `--package` via `./bin/spark-shell --verbose --packages "RedisLabs:spark-redis:0.3.2"` ``` ... Ivy Default Cache set to: /Users/jlim/.ivy2/cache The jars for the packages stored in: /Users/jlim/.ivy2/jars :: loading settings :: url = jar:file:/Users/jlim/WorkArea/ScalaProjects/spark/dist/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml RedisLabs#spark-redis added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent-2fee2e18-7832-4a4d-9e97-7b3d0fef766d;1.0 confs: [default] found RedisLabs#spark-redis;0.3.2 in spark-packages found redis.clients#jedis;2.7.2 in central found org.apache.commons#commons-pool2;2.3 in central downloading https://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar ... [SUCCESSFUL ] RedisLabs#spark-redis;0.3.2!spark-redis.jar (824ms) downloading https://repo1.maven.org/maven2/redis/clients/jedis/2.7.2/jedis-2.7.2.jar ... [SUCCESSFUL ] redis.clients#jedis;2.7.2!jedis.jar (576ms) downloading https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.3/commons-pool2-2.3.jar ... [SUCCESSFUL ] org.apache.commons#commons-pool2;2.3!commons-pool2.jar (150ms) :: resolution report :: resolve 4586ms :: artifacts dl 1555ms :: modules in use: RedisLabs#spark-redis;0.3.2 from spark-packages in [default] org.apache.commons#commons-pool2;2.3 from central in [default] redis.clients#jedis;2.7.2 from central in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 3 | 3 | 3 | 0 || 3 | 3 | --------------------------------------------------------------------- ``` Closes #24061 from HeartSaVioR/MINOR-use-https-to-bintray-repository. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit f57af2286f85bf67706e14fecfbfd9ef034c2927) Signed-off-by: Sean Owen <sean.owen@databricks.com> 12 March 2019, 23:02:49 UTC
d0290ea [SPARK-26927][CORE] Ensure executor is active when processing events in dynamic allocation manager. There is a race condition in the `ExecutorAllocationManager`: the `SparkListenerExecutorRemoved` event can be posted before the `SparkListenerTaskStart` event, which leads to an incorrect `executorIds` set. Then, when some executor idles, real executors will be removed even though the actual executor number equals `minNumExecutors`, because `newExecutorTotal` is computed incorrectly (it may be greater than `minNumExecutors`). This can finally leave zero available executors while a wrong, positive number of executorIds is kept in memory. What's more, even the `SparkListenerTaskEnd` event cannot release the fake `executorIds`, because a later idle event for the fake executors cannot cause their real removal: they are already removed and do not exist in the `executorDataMap` of `CoarseGrainedSchedulerBackend`, so the `onExecutorRemoved` method will never be called again. For details see https://issues.apache.org/jira/browse/SPARK-26927. This PR fixes the problem. Tested with existing and added UTs. Closes #23842 from liupc/Fix-race-condition-that-casues-dyanmic-allocation-not-working. Lead-authored-by: Liupengcheng <liupengcheng@xiaomi.com> Co-authored-by: liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit d5cfe08fdc7ad07e948f329c0bdeeca5c2574a18) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 12 March 2019, 21:27:07 UTC
4d1d0a4 [SPARK-27111][SS] Fix a race that a continuous query may fail with InterruptedException Before a Kafka consumer gets assigned with partitions, its offset will contain 0 partitions. However, runContinuous will still run and launch a Spark job having 0 partitions. In this case, there is a race that epoch may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` will get interrupted unintentionally. To handle this case, this PR has the following changes: - Clean up the resources in `queryExecutionThread.runUninterruptibly`. This may increase the waiting time of `stop` but should be minor because the operations here are very fast (just sending an RPC message in the same process and stopping a very simple thread). - Clear the interrupted status at the end so that it won't impact the `runContinuous` call. We may clear the interrupted status set by `stop`, but it doesn't affect the query termination because `runActivatedStream` will check `state` and exit accordingly. I also updated the clean up codes to make sure exceptions thrown from `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` won't stop the clean up. Jenkins Closes #24034 from zsxwing/SPARK-27111. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> (cherry picked from commit 6e1c0827ece1cdc615196e60cb11c76b917b8eeb) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 09 March 2019, 22:55:46 UTC
b6d5b0a [SPARK-27080][SQL] bug fix: mergeWithMetastoreSchema with uniform lower case comparison When reading parquet files and merging the metastore schema with the file schema, we should compare field names using uniform case. The current implementation uses lowercase comparison but misses it in one place; this patch fixes that. Tested with a unit test. Closes #24001 from codeborui/mergeSchemaBugFix. Authored-by: CodeGod <> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a29df5fa02111f57965be2ab5e208f5c815265fe) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 March 2019, 13:33:03 UTC
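The comparison itself boils down to keying both schemas by lower-cased field name; a small sketch of that matching (not the actual `mergeWithMetastoreSchema` code):

```scala
import java.util.Locale
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// True if every metastore field has a case-insensitive match in the file schema.
def fieldsMatch(metastore: StructType, file: StructType): Boolean = {
  val fileNames = file.map(_.name.toLowerCase(Locale.ROOT)).toSet
  metastore.forall(f => fileNames.contains(f.name.toLowerCase(Locale.ROOT)))
}

val metastoreSchema = StructType(Seq(StructField("id", LongType)))
val fileSchema = StructType(Seq(StructField("ID", LongType)))
assert(fieldsMatch(metastoreSchema, fileSchema))  // "id" vs "ID" should still match
```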
c45f8da [SPARK-26604][CORE][BACKPORT-2.4] Clean up channel registration for StreamManager ## What changes were proposed in this pull request? This is mostly a clean backport of https://github.com/apache/spark/pull/23521 to branch-2.4 ## How was this patch tested? I've tested this with a hack in `TransportRequestHandler` to force `ChunkFetchRequest` to get dropped. Then making a number of `ExternalShuffleClient.fetchChunk` requests (which `OpenBlocks` then `ChunkFetchRequest`) and closing out of my test harness. A heap dump later reveals that the `StreamState` references are unreachable. I haven't run this through the unit test suite, but doing that now. Wanted to get this up as I think folks are waiting for it for 2.4.1 Closes #24013 from abellina/SPARK-26604_cherry_pick_2_4. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Alessandro Bellina <abellina@yahoo-inc.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 216eeec2bd319f1d6a1228c9bc8d8a579d5e6571) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 08 March 2019, 03:48:42 UTC
a1ca566 [SPARK-23433][SPARK-25250][CORE][BRANCH-2.3] Later created TaskSet should learn about the finished partitions ## What changes were proposed in this pull request? This is an optional solution for #22806. #21131 first implemented that a previously completed task from a zombie TaskSetManager could also mark the partition as succeeded in the active TaskSetManager, based on the assumption that an active TaskSetManager always exists for that stage when this happens. But that's not always true, as an active TaskSetManager may not have been created yet when a previous task succeeds, and this is the reason why #22806 hit the issue. This PR extends #21131's behavior by adding stageIdToFinishedPartitions to TaskSchedulerImpl, which records the finished partitions whenever a task (from a zombie or active TSM) succeeds. Thus, a later created active TaskSetManager can also learn about the finished partitions by looking into stageIdToFinishedPartitions and won't launch any duplicate tasks. ## How was this patch tested? Added tests. Closes #24007 from Ngone51/dev-23433-25250-branch-2.3. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> 07 March 2019, 18:30:37 UTC
dfde0c6 [SPARK-25863][SPARK-21871][SQL] Check if code size statistics is empty or not in updateAndGetCompilationStats ## What changes were proposed in this pull request? `CodeGenerator.updateAndGetCompilationStats` throws an unsupported exception for empty code size statistics. This pr added code to check if it is empty or not. ## How was this patch tested? Pass Jenkins. Closes #23947 from maropu/SPARK-21871-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 07 March 2019, 08:36:38 UTC
877b8db [SPARK-27065][CORE] avoid more than one active task set managers for a stage ## What changes were proposed in this pull request? This is another attempt to fix the more-than-one-active-task-set-managers bug. https://github.com/apache/spark/pull/17208 is the first attempt. It marks the TSM as zombie before sending a task completion event to DAGScheduler. This is necessary, because when the DAGScheduler gets the task completion event, and it's for the last partition, then the stage is finished. However, if it's a shuffle stage and it has missing map outputs, DAGScheduler will resubmit it(see the [code](https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1416-L1422)) and create a new TSM for this stage. This leads to more than one active TSM of a stage and fail. This fix has a hole: Let's say a stage has 10 partitions and 2 task set managers: TSM1(zombie) and TSM2(active). TSM1 has a running task for partition 10 and it completes. TSM2 finishes tasks for partitions 1-9, and thinks he is still active because he hasn't finished partition 10 yet. However, DAGScheduler gets task completion events for all the 10 partitions and thinks the stage is finished. Then the same problem occurs: DAGScheduler may resubmit the stage and cause more than one actice TSM error. https://github.com/apache/spark/pull/21131 fixed this hole by notifying all the task set managers when a task finishes. For the above case, TSM2 will know that partition 10 is already completed, so he can mark himself as zombie after partitions 1-9 are completed. However, #21131 still has a hole: TSM2 may be created after the task from TSM1 is completed. Then TSM2 can't get notified about the task completion, and leads to the more than one active TSM error. #22806 and #23871 are created to fix this hole. However the fix is complicated and there are still ongoing discussions. This PR proposes a simple fix, which can be easy to backport: mark all existing task set managers as zombie when trying to create a new task set manager. After this PR, #21131 is still necessary, to avoid launching unnecessary tasks and fix [SPARK-25250](https://issues.apache.org/jira/browse/SPARK-25250 ). #22806 and #23871 are its followups to fix the hole. ## How was this patch tested? existing tests. Closes #23927 from cloud-fan/scheduler. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> (cherry picked from commit cb20fbc43e7f54af1ed30b9eb6d76ca50b4eb750) Signed-off-by: Imran Rashid <irashid@cloudera.com> 06 March 2019, 18:01:48 UTC
8b70980 [SPARK-24669][SQL] Invalidate tables in case of DROP DATABASE CASCADE ## What changes were proposed in this pull request? Before dropping a database, refresh the tables of that database so as to invalidate all cached entries associated with those tables, just as we already do when dropping a single table. A UT is added. Closes #23905 from Udbhav30/SPARK-24669. Authored-by: Udbhav30 <u.agrawal30@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 9bddf7180e9e76e1cabc580eee23962dd66f84c3) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 March 2019, 17:17:04 UTC
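At the user-facing level the behaviour amounts to refreshing each table of the database before the drop; a hedged sketch using the public Catalog API (the real fix lives inside the session catalog, so this is only an approximation of the idea):

```scala
import org.apache.spark.sql.SparkSession

def dropDatabaseCascade(spark: SparkSession, db: String): Unit = {
  // Refresh every table first so cached entries for them are invalidated.
  spark.catalog.listTables(db).collect().foreach { t =>
    spark.catalog.refreshTable(s"$db.${t.name}")
  }
  spark.sql(s"DROP DATABASE $db CASCADE")
}
```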
c326628 [MINOR][DOCS] Clarify that Spark apps should mark Spark as a 'provided' dependency, not package it ## What changes were proposed in this pull request? Spark apps do not need to package Spark. In fact it can cause problems in some cases. Our examples should show depending on Spark as a 'provided' dependency. Packaging Spark makes the app much bigger by tens of megabytes. It can also bring in conflicting dependencies that wouldn't otherwise be a problem. https://issues.apache.org/jira/browse/SPARK-26146 was what reminded me of this. ## How was this patch tested? Doc build Closes #23938 from srowen/Provided. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 39092236819da097e9c8a3b2fa975105f08ae5b9) Signed-off-by: Sean Owen <sean.owen@databricks.com> 05 March 2019, 14:28:22 UTC
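For an sbt build, the advice translates to marking the Spark artifacts as provided; a hedged example (version numbers are illustrative):

```scala
// In build.sbt: the cluster supplies Spark at run time, so don't package it.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.4" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.3.4" % "provided"
)
```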
3ece965 [MINOR][BUILD] Update all checkstyle dtd to use "https://checkstyle.org" ## What changes were proposed in this pull request? Below build failed with Java checkstyle test, but instead of violation it shows FileNotFound on dtd file. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102751/ Looks like the link of dtd file is dead `http://www.puppycrawl.com/dtds/configuration_1_3.dtd`. This patch updates the dtd link to "https://checkstyle.org/dtds/" given checkstyle repository also updated the URL path. https://github.com/checkstyle/checkstyle/issues/5601 ## How was this patch tested? Checked the new links. Closes #23887 from HeartSaVioR/java-checkstyle-dtd-change-url. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit c5de804093540509929f6de211dbbe644b33e6db) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 February 2019, 19:26:36 UTC
ae1b44c [SPARK-26950][SQL][TEST] Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values ## What changes were proposed in this pull request? Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, but there exists more NaN values with different binary presentations. ```scala scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array res1: Array[Byte] = Array(127, -64, 0, 0) scala> val x = java.lang.Float.intBitsToFloat(-6966608) x: Float = NaN scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array res2: Array[Byte] = Array(-1, -107, -78, -80) ``` Since users can have these values, `RandomDataGenerator` generates these NaN values. However, this causes `checkEvaluationWithUnsafeProjection` failures due to the difference between `UnsafeRow` binary presentation. The following is the UT failure instance. This PR aims to fix this UT flakiness. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/ ## How was this patch tested? Pass the Jenkins with the newly added test cases. Closes #23851 from dongjoon-hyun/SPARK-26950. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ffef3d40741b0be321421aa52a6e17a26d89f541) Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ef67be363be6d6b6954b55ef1c243a0672b84abb) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 February 2019, 21:46:37 UTC
36db45d [R][BACKPORT-2.3] update package description ## What changes were proposed in this pull request? #23852 doesn't port cleanly to 2.3. we need this in branch-2.4 and branch-2.3 Closes #23861 from felixcheung/2.3rdesc. Authored-by: Felix Cheung <felixcheung_m@hotmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 February 2019, 02:12:38 UTC
6691c04 [R][BACKPORT-2.4] update package description The original change doesn't port cleanly to 2.4; we need this in branch-2.4 and branch-2.3. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #23860 from felixcheung/2.4rdesc. (cherry picked from commit d8576301fd1d33675a9542791e58e7963081ce04) Signed-off-by: Felix Cheung <felixcheung@apache.org> 21 February 2019, 16:45:11 UTC
41df43f [SPARK-26873][SQL] Use a consistent timestamp to build Hadoop Job IDs. ## What changes were proposed in this pull request? Backport SPARK-26873 (#23777) to branch-2.3. ## How was this patch tested? Existing tests to cover regressions. Closes #23832 from rdblue/SPARK-26873-branch-2.3. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 19 February 2019, 18:06:10 UTC