514a7e6 [SPARK-20929][ML] LinearSVC should use its own threshold param ## What changes were proposed in this pull request? LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability. This PR changes the param in the Scala, Python and R APIs. ## How was this patch tested? New unit test to make sure the threshold can be set to any Double value. Author: Joseph K. Bradley <joseph@databricks.com> Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup. (cherry picked from commit cc67bd573264c9046c4a034927ed8deb2a732110) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 20 June 2017, 06:04:27 UTC
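A minimal usage sketch of the dedicated param described above, assuming a SparkSession `spark`; the tiny training set below is made up purely for illustration:

```scala
import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.linalg.Vectors

// Hypothetical toy training data with "label" and "features" columns.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0)),
  (1.0, Vectors.dense(0.1, 1.2)),
  (0.0, Vectors.dense(2.2, 0.9))
)).toDF("label", "features")

val svc = new LinearSVC()
  .setMaxIter(10)
  .setRegParam(0.1)
  .setThreshold(0.0) // LinearSVC's own threshold: applies to rawPrediction, not probability

val model = svc.fit(training)
model.transform(training).select("rawPrediction", "prediction").show()
```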
8bf7f1e [SPARK-21133][CORE] Fix HighlyCompressedMapStatus#writeExternal throws NPE ## What changes were proposed in this pull request? Fix HighlyCompressedMapStatus#writeExternal NPE: ``` 17/06/18 15:00:27 ERROR Utils: Exception encountered java.lang.NullPointerException at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303) at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619) at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310) at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619) at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NullPointerException at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303) ... 17 more 17/06/18 15:00:27 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 10.17.47.20:50188 (the log then repeats the same NullPointerException stack trace) ``` ## How was this patch tested? manual tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18343 from wangyum/SPARK-21133. (cherry picked from commit 9b57cd8d5c594731a7b3c90ce59bcddb05193d79) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 June 2017, 01:22:53 UTC
cf10fa8 [SPARK-21138][YARN] Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different ## What changes were proposed in this pull request? When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows: ``` spark.hadoop.fs.defaultFS hdfs://tl-nn-tdw.tencent-distribute.com:54310 spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark ``` the staging dir cannot be deleted; it fails with the following message: ``` java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310 ``` ## How was this patch tested? Existing tests Author: sharkdtu <sharkdtu@tencent.com> Closes #18352 from sharkdtu/master. (cherry picked from commit 3d4d11a80fe8953d48d8bfac2ce112e37d38dc90) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 19 June 2017, 21:55:08 UTC
e329bea [MINOR][BUILD] Fix Java linter errors This PR cleans up a few Java linter errors for Apache Spark 2.2 release. ```bash $ dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` We can check the result at Travis CI, [here](https://travis-ci.org/dongjoon-hyun/spark/builds/244297894). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18345 from dongjoon-hyun/fix_lint_java_2. (cherry picked from commit ecc5631351e81bbee4befb213f3053a4f31532a7) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 June 2017, 19:33:57 UTC
7b50736 [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table ## What changes were proposed in this pull request? The description for several options of File Source for structured streaming appeared in the File Sink description instead. This pull request has two commits: The first includes changes to the version as it appeared in spark 2.1 and the second handled an additional option added for spark 2.2 ## How was this patch tested? Built the documentation by SKIP_API=1 jekyll build and visually inspected the structured streaming programming guide. The original documentation was written by tdas and lw-lin Author: assafmendelson <assaf.mendelson@gmail.com> Closes #18342 from assafmendelson/spark-21123. (cherry picked from commit 66a792cd88c63cc0a1d20cbe14ac5699afbb3662) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 19 June 2017, 17:59:06 UTC
f7fcdec [SPARK-19688][STREAMING] Not to read `spark.yarn.credentials.file` from checkpoint. ## What changes were proposed in this pull request? Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint. ## How was this patch tested? Manual tested with 1.6.3 and 2.1.1. I didn't test this with master because of some compile problems, but I think it will be the same result. ## Notice This should be merged into maintenance branches too. jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008) Author: saturday_s <shi.indetail@gmail.com> Closes #18230 from saturday-shi/SPARK-21008. (cherry picked from commit e92ffe6f1771e3fe9ea2e62ba552c1b5cf255368) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 19 June 2017, 17:24:45 UTC
fab070c [SPARK-21132][SQL] DISTINCT modifier of function arguments should not be silently ignored ### What changes were proposed in this pull request? We should not silently ignore `DISTINCT` when they are not supported in the function arguments. This PR is to block these cases and issue the error messages. ### How was this patch tested? Added test cases for both regular functions and window functions Author: Xiao Li <gatorsmile@gmail.com> Closes #18340 from gatorsmile/firstCount. (cherry picked from commit 9413b84b5a99e264816c61f72905b392c2f9cd35) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 June 2017, 07:51:35 UTC
d3c79b7 [SPARK-21090][CORE] Optimize the unified memory manager code ## What changes were proposed in this pull request? 1. In `acquireStorageMemory`, when the memory mode is OFF_HEAP, `maxOffHeapMemory` should be replaced with `maxOffHeapStorageMemory`; after this PR it behaves the same as the ON_HEAP memory mode. A request between `maxOffHeapStorageMemory` and `maxOffHeapMemory` is guaranteed to fail, so if the requested memory is greater than `maxOffHeapStorageMemory` (even if it is not greater than `maxOffHeapMemory`) we should fail fast. 2. When borrowing memory from execution, changing `numBytes` to `numBytes - storagePool.memoryFree` is more reasonable, because we only need to acquire `(numBytes - storagePool.memoryFree)`; borrowing the full `numBytes` from execution is unnecessary. ## How was this patch tested? added unit test case Author: liuxian <liu.xian3@zte.com.cn> Closes #18296 from 10110346/wip-lx-0614. (cherry picked from commit 112bd9bfc5b9729f6f86518998b5d80c5e79fe5e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 June 2017, 03:52:19 UTC
c0d4acc [MINOR][R] Add knitr and rmarkdown packages/improve output for version info in AppVeyor tests ## What changes were proposed in this pull request? This PR proposes three things as below: **Install packages per documentation** - this does not affect the tests itself (but CRAN which we are not doing via AppVeyor) up to my knowledge. This adds `knitr` and `rmarkdown` per https://github.com/apache/spark/blob/45824fb608930eb461e7df53bb678c9534c183a9/R/WINDOWS.md#unit-tests (please see https://github.com/apache/spark/commit/45824fb608930eb461e7df53bb678c9534c183a9) **Improve logs/shorten logs** - actually, long logs can be a problem on AppVeyor (e.g., see https://github.com/apache/spark/pull/17873) `R -e ...` repeats printing R information for each invocation as below: ``` R version 3.3.1 (2016-06-21) -- "Bug in Your Hair" Copyright (C) 2016 The R Foundation for Statistical Computing Platform: i386-w64-mingw32/i386 (32-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. ``` It looks reducing the call might be slightly better and print out the versions together looks more readable. Before: ``` # R information ... > packageVersion('testthat') [1] '1.0.2' > > # R information ... > packageVersion('e1071') [1] '1.6.8' > > ... 3 more times ``` After: ``` # R information ... > packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival') [1] ‘1.16’ [1] ‘1.6’ [1] ‘1.0.2’ [1] ‘1.6.8’ [1] ‘2.41.3’ ``` **Add`appveyor.yml`/`dev/appveyor-install-dependencies.ps1` for triggering the test** Changing this file might break the test, e.g., https://github.com/apache/spark/pull/16927 ## How was this patch tested? Before (please see https://ci.appveyor.com/project/HyukjinKwon/spark/build/169-master) After (please see the AppVeyor build in this PR): Author: hyukjinkwon <gurwls223@gmail.com> Closes #18336 from HyukjinKwon/minor-add-knitr-and-rmarkdown. (cherry picked from commit 75a6d05853fea13f88e3c941b1959b24e4640824) Signed-off-by: Sean Owen <sowen@cloudera.com> 18 June 2017, 07:43:55 UTC
8747f8e [SPARK-21126] The configuration named "spark.core.connection.auth.wait.timeout" isn't used anywhere in Spark [https://issues.apache.org/jira/browse/SPARK-21126](https://issues.apache.org/jira/browse/SPARK-21126) The configuration named "spark.core.connection.auth.wait.timeout" isn't used anywhere in Spark, so I think it should be removed from configuration.md. Author: liuzhaokun <liu.zhaokun@zte.com.cn> Closes #18333 from liu-zhaokun/new3. (cherry picked from commit 0d8604bb849b3370cc21966cdd773238f3a29f84) Signed-off-by: Sean Owen <sowen@cloudera.com> 18 June 2017, 07:32:41 UTC
d3deeb3 [MINOR][DOCS] Improve Running R Tests docs ## What changes were proposed in this pull request? Update Running R Tests dependence packages to: ```bash R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')" ``` ## How was this patch tested? manual tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18271 from wangyum/building-spark. (cherry picked from commit 45824fb608930eb461e7df53bb678c9534c183a9) Signed-off-by: Sean Owen <sowen@cloudera.com> 16 June 2017, 10:04:03 UTC
653e6f1 [SPARK-12552][FOLLOWUP] Fix flaky test for "o.a.s.deploy.master.MasterSuite.master correctly recover the application" ## What changes were proposed in this pull request? Due to asynchronous RPC event processing, the test "correctly recover the application" could potentially fail. An example failure can be seen here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78126/testReport/org.apache.spark.deploy.master/MasterSuite/master_correctly_recover_the_application/. This patch fixes the flaky test. ## How was this patch tested? Existing UT. CC cloud-fan jiangxb1987, please help to review, thanks! Author: jerryshao <sshao@hortonworks.com> Closes #18321 from jerryshao/SPARK-12552-followup. (cherry picked from commit 2837b14cdc42f096dce07e383caa30c7469c5d6b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 June 2017, 06:24:25 UTC
9909be3 [SPARK-21072][SQL] TreeNode.mapChildren should only apply to the children nodes. ## What changes were proposed in this pull request? As the name and comments of `TreeNode.mapChildren` indicate, the function should only be applied to the current node's children. So the code below should check whether a value is actually a child node. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342 ## How was this patch tested? Existing tests. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18284 from ConeyLiu/treenode. (cherry picked from commit 87ab0cec65b50584a627037b9d1b6fdecaee725c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 June 2017, 04:10:23 UTC
a585c87 [SPARK-21111][TEST][2.2] Fix the test failure of describe.sql ## What changes were proposed in this pull request? Test failed in `describe.sql`. We need to fix the related bug introduced in (https://github.com/apache/spark/pull/17649) in the follow-up PR to master. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18316 from gatorsmile/fix. 16 June 2017, 01:25:39 UTC
76ee41f [SPARK-16251][SPARK-20200][CORE][TEST] Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message ## What changes were proposed in this pull request? Currently we don't wait to confirm the removal of the block from the slave's BlockManager, if the removal takes too much time, we will fail the assertion in this test case. The failure can be easily reproduced if we sleep for a while before we remove the block in BlockManagerSlaveEndpoint.receiveAndReply(). ## How was this patch tested? N/A Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18314 from jiangxb1987/LocalCheckpointSuite. (cherry picked from commit 7dc3e697c74864a4e3cca7342762f1427058b3c3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 June 2017, 16:07:08 UTC
b5504f6 [SPARK-20980][DOCS] update doc to reflect multiLine change ## What changes were proposed in this pull request? doc only change ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #18312 from felixcheung/sqljsonwholefiledoc. (cherry picked from commit 1bf55e396c7b995a276df61d9a4eb8e60bcee334) Signed-off-by: Felix Cheung <felixcheung@apache.org> 15 June 2017, 06:08:18 UTC
af4f89c [SPARK-20980][SQL] Rename `wholeFile` to `multiLine` for both CSV and JSON The current option name `wholeFile` is misleading for CSV users. Currently, it is not representing a record per file. Actually, one file could have multiple records. Thus, we should rename it. Now, the proposal is `multiLine`. N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #18202 from gatorsmile/renameCVSOption. (cherry picked from commit 2051428173d8cd548702eb1a2e1c1ca76b8f2fd5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 June 2017, 05:20:36 UTC
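A minimal sketch of the renamed option in use, assuming a SparkSession `spark`; the input paths are hypothetical:

```scala
// Read records that span multiple lines; this option was previously called "wholeFile".
val json = spark.read.option("multiLine", "true").json("/path/to/records.json")
val csv = spark.read
  .option("multiLine", "true")
  .option("header", "true")
  .csv("/path/to/records.csv")
```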
e02e063 Revert "[SPARK-20941][SQL] Fix SubqueryExec Reuse" This reverts commit 6a4e023b250a86887475958093f1d3bdcbb49a03. 14 June 2017, 18:35:14 UTC
3dda682 [SPARK-21089][SQL] Fix DESC EXTENDED/FORMATTED to Show Table Properties Since both table properties and storage properties share the same key values, table properties are not shown in the output of DESC EXTENDED/FORMATTED when the storage properties are not empty. This PR is to fix the above issue by renaming them to different keys. Added test cases. Author: Xiao Li <gatorsmile@gmail.com> Closes #18294 from gatorsmile/tableProperties. (cherry picked from commit df766a471426625fe86c8845f6261e0fc087772d) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 14 June 2017, 18:19:28 UTC
6265119 [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0 ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/18244 to 2.2 --- The precision and scale of decimal values are wrong when the input is BigDecimal between -1.0 and 1.0. The BigDecimal's precision is the digit count starts from the leftmost nonzero digit based on the [JAVA's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal decision follows the database decimal standard, which is the total number of digits, including both to the left and the right of the decimal point. Thus, this PR is to fix the issue by doing the conversion. Before this PR, the following queries failed: ```SQL select 1 > 0.0001 select floor(0.0001) select ceil(0.0001) ``` ### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #18297 from gatorsmile/backport18244. 14 June 2017, 11:18:28 UTC
9bdc835 [SPARK-21085][SQL] Failed to read the partitioned table created by Spark 2.1 ### What changes were proposed in this pull request? Before the PR, Spark is unable to read the partitioned table created by Spark 2.1 when the table schema does not put the partitioning column at the end of the schema. [assert(partitionFields.map(_.name) == partitionColumnNames)](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L234-L236) When reading the table metadata from the metastore, we also need to reorder the columns. ### How was this patch tested? Added test cases to check both Hive-serde and data source tables. Author: gatorsmile <gatorsmile@gmail.com> Closes #18295 from gatorsmile/reorderReadSchema. (cherry picked from commit 0c88e8d37224713199ca5661c2cd57f5918dcb9a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 June 2017, 08:28:21 UTC
42cc830 [SPARK-20986][SQL] Reset table's statistics after PruneFileSourcePartitions rule. ## What changes were proposed in this pull request? After the PruneFileSourcePartitions rule runs, the table's statistics need to be reset, because PruneFileSourcePartitions can filter out unnecessary partitions and the statistics change accordingly. ## How was this patch tested? add unit test. Author: lianhuiwang <lianhuiwang09@gmail.com> Closes #18205 from lianhuiwang/SPARK-20986. (cherry picked from commit 8b5b2e272f48f7ddf8aeece0205cb4a5853c364e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 June 2017, 01:58:16 UTC
53212c3 [SPARK-12552][CORE] Correctly count the driver resource when recovering from failure for Master Currently in Standalone HA mode, the driver's resource usage is not correctly counted by the Master when recovering from failure, which leads to unexpected behaviors such as negative values in the UI. This patch fixes that by also counting the driver's resource usage. It also changes the recovered app's state to `RUNNING` once fully recovered; previously it would always stay WAITING even when fully recovered. andrewor14 please help to review, thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #10506 from jerryshao/SPARK-12552. (cherry picked from commit 9eb095243b0bd8a2627db6dc36daa4561ef3c08b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 June 2017, 00:12:53 UTC
220943d [SPARK-20979][SS] Add RateSource to generate values for tests and benchmark ## What changes were proposed in this pull request? This PR adds RateSource for Structured Streaming so that the user can use it to generate data for tests and benchmark easily. This source generates increment long values with timestamps. Each generated row has two columns: a timestamp column for the generated time and an auto increment long column starting with 0L. It supports the following options: - `rowsPerSecond` (e.g. 100, default: 1): How many rows should be generated per second. - `rampUpTime` (e.g. 5s, default: 0s): How long to ramp up before the generating speed becomes `rowsPerSecond`. Using finer granularities than seconds will be truncated to integer seconds. - `numPartitions` (e.g. 10, default: Spark's default parallelism): The partition number for the generated rows. The source will try its best to reach `rowsPerSecond`, but the query may be resource constrained, and `numPartitions` can be tweaked to help reach the desired speed. Here is a simple example that prints 10 rows per seconds: ``` spark.readStream .format("rate") .option("rowsPerSecond", "10") .load() .writeStream .format("console") .start() ``` The idea came from marmbrus and he did the initial work. ## How was this patch tested? The added tests. Author: Shixiong Zhu <shixiong@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #18199 from zsxwing/rate. 13 June 2017, 23:36:32 UTC
2bc2c15 [SPARK-21064][CORE][TEST] Fix the default value bug in NettyBlockTransferServiceSuite ## What changes were proposed in this pull request? The default value for `spark.port.maxRetries` is 100, but we use 10 in the suite file. So we change it to 100 to avoid test failure. ## How was this patch tested? No test Author: DjvuLee <lihu@bytedance.com> Closes #18280 from djvulee/NettyTestBug. (cherry picked from commit b36ce2a2469ff923a3367a530d4a14899ecf9238) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 June 2017, 14:56:11 UTC
039c465 [SPARK-21060][WEB-UI] The CSS style of the paging controls on the executor page is wrong. ## What changes were proposed in this pull request? The CSS style of the paging controls on the executor page is wrong; it differs from the paging style of the history server UI, **but their styles should be consistent**. There are three reasons: 1. 'Previous', 'Next' and the page numbers should be rendered as buttons. 2. When you are on the first page, 'Previous' and '1' should be gray and not clickable. ![1](https://user-images.githubusercontent.com/26266482/27026667-1fe745ee-4f91-11e7-8b34-150819d22bd3.png) 3. When you are on the last page, 'Next' and the maximum page number should be gray and not clickable. ![2](https://user-images.githubusercontent.com/26266482/27026811-9d8d6fa0-4f91-11e7-8b51-7816c3feb381.png) before fix: ![fix_before](https://user-images.githubusercontent.com/26266482/27026428-47ec5c56-4f90-11e7-9dd5-d52c22d7bd36.png) after fix: ![fix_after](https://user-images.githubusercontent.com/26266482/27026439-50d17072-4f90-11e7-8405-6f81da5ab32c.png) The style of the history server UI: ![history](https://user-images.githubusercontent.com/26266482/27026528-9c90f780-4f90-11e7-91e6-90d32651fe03.png) ## How was this patch tested? manual tests Author: guoxiaolong <guo.xiaolong1@zte.com.cn> Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn> Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn> Closes #18275 from guoxiaolongzte/SPARK-21060. (cherry picked from commit b7304f25590fb1bc65cba6440d59a3322fd09947) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 June 2017, 14:38:19 UTC
24836be [SPARK-20920][SQL] ForkJoinPool pools are leaked when writing hive tables with many partitions ## What changes were proposed in this pull request? Don't leave thread pool running from AlterTableRecoverPartitionsCommand DDL command ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #18216 from srowen/SPARK-20920. (cherry picked from commit 7b7c85ede398996aafffb126440e5f0c67f67210) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 June 2017, 09:48:15 UTC
dae1a98 [TEST][SPARKR][CORE] Fix broken SparkSubmitSuite ## What changes were proposed in this pull request? Fix the test file path. This was broken in #18264 and went undetected since R-only changes don't build core, and the subsequent post-commit build with the change passed (again because it wasn't building core); AppVeyor builds everything but doesn't run the Scala suites ... ## How was this patch tested? jenkins srowen gatorsmile Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #18283 from felixcheung/rsubmitsuite. (cherry picked from commit 278ba7a2c62b2cbb7bcfe79ce10d35ab57bb1950) Signed-off-by: Felix Cheung <felixcheung@apache.org> 13 June 2017, 05:09:05 UTC
48a843b [SPARK-21050][ML] Word2vec persistence overflow bug fix ## What changes were proposed in this pull request? The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib version), so it is very easy to hit an overflow when calculating the number of partitions for ML persistence. This modifies the calculations to use Long. ## How was this patch tested? New unit test. I verified that the test fails before this patch. Author: Joseph K. Bradley <joseph@databricks.com> Closes #18265 from jkbradley/word2vec-save-fix. (cherry picked from commit ff318c0d2f283c3f46491f229f82d93714da40c7) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 12 June 2017, 21:28:07 UTC
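An illustration (not the patch itself) of why an Int-based size calculation can overflow; the vocabulary and vector sizes below are made-up values:

```scala
val vocabSize = 10 * 1000 * 1000  // hypothetical vocabulary size
val vectorSize = 1000             // hypothetical vector size

val asInt = vocabSize * vectorSize           // Int arithmetic silently wraps around
val asLong = vocabSize.toLong * vectorSize   // Long arithmetic gives 10000000000 as expected
println(s"Int: $asInt, Long: $asLong")
```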
580ecfd [SPARK-21059][SQL] LikeSimplification can NPE on null pattern ## What changes were proposed in this pull request? This patch fixes a bug that can cause NullPointerException in LikeSimplification, when the pattern for like is null. ## How was this patch tested? Added a new unit test case in LikeSimplificationSuite. Author: Reynold Xin <rxin@databricks.com> Closes #18273 from rxin/SPARK-21059. (cherry picked from commit b1436c7496161da1f4fa6950a06de62c9007c40c) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 12 June 2017, 21:08:21 UTC
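A minimal reproduction sketch, assuming a SparkSession `spark`; with the guard in place, a null LIKE pattern evaluates to NULL instead of triggering a NullPointerException in the optimizer:

```scala
// The LIKE pattern is NULL here; previously the LikeSimplification rule could NPE on it.
spark.sql("SELECT 'abc' LIKE CAST(NULL AS STRING)").show()
```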
a6b7875 [SPARK-20345][SQL] Fix STS error handling logic on HiveSQLException ## What changes were proposed in this pull request? [SPARK-5100](https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143) added Spark Thrift Server(STS) UI and the following logic to handle exceptions on case `Throwable`. ```scala HiveThriftServer2.listener.onStatementError( statementId, e.getMessage, SparkUtils.exceptionString(e)) ``` However, there occurred a missed case after implementing [SPARK-6964](https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792)'s `Support Cancellation in the Thrift Server` by adding case `HiveSQLException` before case `Throwable`. ```scala case e: HiveSQLException => if (getStatus().getState() == OperationState.CANCELED) { return } else { setState(OperationState.ERROR) throw e } // Actually do need to catch Throwable as some failures don't inherit from Exception and // HiveServer will silently swallow them. case e: Throwable => val currentState = getStatus().getState() logError(s"Error executing query, currentState $currentState, ", e) setState(OperationState.ERROR) HiveThriftServer2.listener.onStatementError( statementId, e.getMessage, SparkUtils.exceptionString(e)) throw new HiveSQLException(e.toString) ``` Logically, we had better add `HiveThriftServer2.listener.onStatementError` on case `HiveSQLException`, too. ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #17643 from dongjoon-hyun/SPARK-20345. (cherry picked from commit 32818d9b378f98556cff529af8645382cd5c6d16) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 12 June 2017, 21:05:13 UTC
92f7c8f [SPARK-17914][SQL] Fix parsing of timestamp strings with nanoseconds The PR contains a tiny change to fix the way Spark parses string literals into timestamps. Currently, some timestamps that contain nanoseconds are corrupted during the conversion from internal UTF8Strings into the internal representation of timestamps. Consider the following example: ``` spark.sql("SELECT cast('2015-01-02 00:00:00.000000001' as TIMESTAMP)").show(false) +------------------------------------------------+ |CAST(2015-01-02 00:00:00.000000001 AS TIMESTAMP)| +------------------------------------------------+ |2015-01-02 00:00:00.000001 | +------------------------------------------------+ ``` The fix was tested with existing tests. Also, there is a new test to cover cases that did not work previously. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18252 from aokolnychyi/spark-17914. (cherry picked from commit ca4e960aec1a5f8cdc1e1344a25840d2670de391) Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 12 June 2017, 20:06:26 UTC
e677394 [SPARK-21041][SQL] SparkSession.range should be consistent with SparkContext.range ## What changes were proposed in this pull request? This PR fixes the inconsistency in `SparkSession.range`. **BEFORE** ```scala scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806) ``` **AFTER** ```scala scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect res2: Array[Long] = Array() ``` ## How was this patch tested? Pass the Jenkins with newly added test cases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18257 from dongjoon-hyun/SPARK-21041. (cherry picked from commit a92e095e705155ea10c8311f7856b964d654626a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 June 2017, 12:58:54 UTC
a4d78e4 [DOCS] Fix error: ambiguous reference to overloaded definition ## What changes were proposed in this pull request? `df.groupBy.count()` should be `df.groupBy().count()` , otherwise there is an error : ambiguous reference to overloaded definition, both method groupBy in class Dataset of type (col1: String, cols: String*) and method groupBy in class Dataset of type (cols: org.apache.spark.sql.Column*) ## How was this patch tested? ```scala val df = spark.readStream.schema(...).json(...) val dfCounts = df.groupBy().count() ``` Author: Ziyue Huang <zyhuang94@gmail.com> Closes #18272 from ZiyueHuang/master. (cherry picked from commit e6eb02df1540764ef2a4f0edb45c48df8de18c13) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 June 2017, 09:59:41 UTC
26003de [SPARK-20877][SPARKR][FOLLOWUP] clean up after test move clean up after big test move unit tests, jenkins Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #18267 from felixcheung/rtestset2. (cherry picked from commit 9f4ff9552470fb97ca38bb56bbf43be49a9a316c) Signed-off-by: Felix Cheung <felixcheung@apache.org> 11 June 2017, 10:13:56 UTC
0b0be47 [SPARK-20877][SPARKR] refactor tests to basic tests only for CRAN ## What changes were proposed in this pull request? Move all existing tests to non-installed directory so that it will never run by installing SparkR package For a follow-up PR: - remove all skip_on_cran() calls in tests - clean up test timer - improve or change basic tests that do run on CRAN (if anyone has suggestion) It looks like `R CMD build pkg` will still put pkg\tests (ie. the full tests) into the source package but `R CMD INSTALL` on such source package does not install these tests (and so `R CMD check` does not run them) ## How was this patch tested? - [x] unit tests, Jenkins - [x] AppVeyor - [x] make a source package, install it, `R CMD check` it - verify the full tests are not installed or run Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #18264 from felixcheung/rtestset. (cherry picked from commit dc4c351837879dab26ad8fb471dc51c06832a9e4) Signed-off-by: Felix Cheung <felixcheung@apache.org> 11 June 2017, 07:00:45 UTC
815a082 [SPARK-21042][SQL] Document Dataset.union is resolution by position ## What changes were proposed in this pull request? Document Dataset.union is resolution by position, not by name, since this has been a confusing point for a lot of users. ## How was this patch tested? N/A - doc only change. Author: Reynold Xin <rxin@databricks.com> Closes #18256 from rxin/SPARK-21042. (cherry picked from commit b78e3849b20d0d09b7146efd7ce8f203ef67b890) Signed-off-by: Reynold Xin <rxin@databricks.com> 10 June 2017, 01:29:39 UTC
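A minimal sketch of the documented behavior, assuming a SparkSession `spark`:

```scala
import spark.implicits._

val left = Seq((1, 2)).toDF("a", "b")
val right = Seq((3, 4)).toDF("b", "a")

// Columns are matched by position, not by name: the result keeps left's column
// names ("a", "b") and contains the rows (1, 2) and (3, 4) unchanged.
left.union(right).show()
```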
869af5b Fix bug in JavaRegressionMetricsExample: the original code can't reach the last element of the "parts" array, so v[v.length-1] always equals 0 ## What changes were proposed in this pull request? Change the loop range from (1 to parts.length-1) to (1 to parts.length). ## How was this patch tested? Debugged it in Eclipse. Author: junzhi lu <452756565@qq.com> Closes #18237 from masterwugui/patch-1. (cherry picked from commit 6491cbf065254e28bca61c9ef55b84f4009ac36c) Signed-off-by: Sean Owen <sowen@cloudera.com> 09 June 2017, 09:49:13 UTC
714153c Fixed broken link ## What changes were proposed in this pull request? I fixed some incorrect formatting on a link in the docs ## How was this patch tested? I looked at the markdown preview before and after, and the link was fixed Before: <img width="593" alt="screen shot 2017-06-08 at 6 37 32 pm" src="https://user-images.githubusercontent.com/17733030/26956272-a62cd558-4c79-11e7-862f-9d0e0184b18a.png"> After: <img width="587" alt="screen shot 2017-06-08 at 6 37 44 pm" src="https://user-images.githubusercontent.com/17733030/26956276-b1135ef6-4c79-11e7-8028-84d19c392fda.png"> Author: Corey Woodfield <coreywoodfield@gmail.com> Closes #18246 from coreywoodfield/master. (cherry picked from commit 033839559eab280760e3c5687940d8c62cc9c048) Signed-off-by: Sean Owen <sowen@cloudera.com> 09 June 2017, 09:24:58 UTC
3f6812c [SPARK-20954][SQL][BRANCH-2.2][EXTENDED] DESCRIBE ` result should be compatible with previous Spark ## What changes were proposed in this pull request? After [SPARK-20067](https://issues.apache.org/jira/browse/SPARK-20067), `DESCRIBE` and `DESCRIBE EXTENDED` shows the following result. This is incompatible with Spark 2.1.1. This PR removes the column header line in case of those commands. **MASTER** and **BRANCH-2.2** ```scala scala> sql("desc t").show(false) +----------+---------+-------+ |col_name |data_type|comment| +----------+---------+-------+ |# col_name|data_type|comment| |a |int |null | +----------+---------+-------+ ``` **SPARK 2.1.1** and **this PR** ```scala scala> sql("desc t").show(false) +--------+---------+-------+ |col_name|data_type|comment| +--------+---------+-------+ |a |int |null | +--------+---------+-------+ ``` ## How was this patch tested? Pass the Jenkins with the updated test suites. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18245 from dongjoon-hyun/SPARK-20954-BRANCH-2.2. 09 June 2017, 04:02:51 UTC
02cf178 [SPARK-19185][DSTREAM] Make Kafka consumer cache configurable ## What changes were proposed in this pull request? Add a new property `spark.streaming.kafka.consumer.cache.enabled` that allows users to enable or disable the cache for Kafka consumers. This property can be especially handy in cases where issues like SPARK-19185 get hit, for which there isn't a solution committed yet. By default, the cache is still on, so this change doesn't change any out-of-box behavior. ## How was this patch tested? Running unit tests Author: Mark Grover <mark@apache.org> Author: Mark Grover <grover.markgrover@gmail.com> Closes #18234 from markgrover/spark-19185. (cherry picked from commit 55b8cfe6e6a6759d65bf219ff570fd6154197ec4) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 08 June 2017, 16:55:54 UTC
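A minimal sketch of turning the cache off with the new property; the app name and the decision to disable are example values only:

```scala
import org.apache.spark.SparkConf

// Disable the Kafka consumer cache for the DStream integration (the default is "true").
val conf = new SparkConf()
  .setAppName("kafka-stream-example") // hypothetical app name
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")
```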
2f5eaa9 [SPARK-20914][DOCS] Javadoc contains code that is invalid ## What changes were proposed in this pull request? Fix Java, Scala Dataset examples in scaladoc, which didn't compile. ## How was this patch tested? Existing compilation/test Author: Sean Owen <sowen@cloudera.com> Closes #18215 from srowen/SPARK-20914. (cherry picked from commit 847efe12656756f9ad6a4dc14bd183ac1a0760a6) Signed-off-by: Sean Owen <sowen@cloudera.com> 08 June 2017, 09:56:30 UTC
9a4341b [MINOR][DOC] Update deprecation notes on Python/Hadoop/Scala. ## What changes were proposed in this pull request? We had better update the deprecation notes about Python 2.6, Hadoop (before 2.6.5) and Scala 2.10 in [2.2.0-RC4](http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/) documentation. Since this is a doc only update, I think we can update the doc during publishing. **BEFORE (2.2.0-RC4)** ![before](https://cloud.githubusercontent.com/assets/9700541/26799758/aea0dc06-49eb-11e7-8ca3-ed8ce1cc6147.png) **AFTER** ![after](https://cloud.githubusercontent.com/assets/9700541/26799761/b3fef818-49eb-11e7-83c5-334f0e4768ed.png) ## How was this patch tested? Manual. ``` SKIP_API=1 jekyll build ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18207 from dongjoon-hyun/minor_doc_deprecation. (cherry picked from commit 3218505a0b4fc37026a71a262c931f06a60c7bf6) Signed-off-by: Sean Owen <sowen@cloudera.com> 07 June 2017, 07:50:45 UTC
3f93d07 [SPARK-20854][TESTS] Removing duplicate test case ## What changes were proposed in this pull request? Removed a duplicate case in "SPARK-20854: select hint syntax with expressions" ## How was this patch tested? Existing tests. Author: Bogdan Raducanu <bogdan@databricks.com> Closes #18217 from bogdanrdc/SPARK-20854-2. (cherry picked from commit cb83ca1433c865cb0aef973df2b872a83671acfd) Signed-off-by: Reynold Xin <rxin@databricks.com> 07 June 2017, 05:51:18 UTC
421d8ec [SPARK-20957][SS][TESTS] Fix o.a.s.sql.streaming.StreamingQueryManagerSuite listing ## What changes were proposed in this pull request? When stopping StreamingQuery, StreamExecution will set `streamDeathCause` then notify StreamingQueryManager to remove this query. So it's possible that when `q2.exception.isDefined` returns `true`, StreamingQueryManager's active list still has `q2`. This PR just puts the checks into `eventually` to fix the flaky test. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18180 from zsxwing/SPARK-20957. (cherry picked from commit bc537e40ade0658aae7c6b5ddafb4cc038bdae2b) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 05 June 2017, 21:37:59 UTC
1388fdd [SPARK-20926][SQL] Removing exposures to guava library caused by directly accessing SessionCatalog's tableRelationCache There could be test failures because DataStorageStrategy, HiveMetastoreCatalog and also HiveSchemaInferenceSuite were exposed to the Guava library by directly accessing SessionCatalog's tableRelationCache. These failures occur when Guava shading is in place. ## What changes were proposed in this pull request? This change removes those Guava exposures by introducing new methods in SessionCatalog and also changing DataStorageStrategy, HiveMetastoreCatalog and HiveSchemaInferenceSuite so that they use those proxy methods. ## How was this patch tested? Unit tests passed after applying these changes. Author: Reza Safi <rezasafi@cloudera.com> Closes #18148 from rezasafi/branch-2.2. 05 June 2017, 20:19:01 UTC
acd4481 [SPARK-20790][MLLIB] Remove extraneous logging in test ## What changes were proposed in this pull request? Remove extraneous logging. ## How was this patch tested? Unit tests pass. Author: David Eis <deis@bloomberg.net> Closes #18188 from davideis/fix-test. (cherry picked from commit 96e6ba6c2aaddd885ba6a842fbfcd73c5537f99e) Signed-off-by: Sean Owen <sowen@cloudera.com> 03 June 2017, 08:48:19 UTC
c8bbab6 [SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes ## What changes were proposed in this pull request? REPL module depends on SQL module, so we should run REPL tests if SQL module has code changes. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18191 from cloud-fan/test. (cherry picked from commit 864d94fe879a32de324da65a844e62a0260b222d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 June 2017, 05:00:04 UTC
478874e Preparing development version 2.2.1-SNAPSHOT 03 June 2017, 00:20:42 UTC
377cfa8 Preparing Spark release v2.2.0-rc4 03 June 2017, 00:20:38 UTC
b560c97 Revert "[SPARK-20946][SQL] simplify the config setting logic in SparkSession.getOrCreate" This reverts commit e11d90bf8deb553fd41b8837e3856c11486c2503. 02 June 2017, 22:37:38 UTC
6c628e7 [MINOR][SQL] Update the description of spark.sql.files.ignoreCorruptFiles and spark.sql.columnNameOfCorruptRecord ### What changes were proposed in this pull request? 1. The description of `spark.sql.files.ignoreCorruptFiles` is not accurate. When the file does not exist, we will issue the error message. ``` org.apache.spark.sql.AnalysisException: Path does not exist: file:/nonexist/path; ``` 2. `spark.sql.columnNameOfCorruptRecord` also affects the CSV format. The current description only mentions JSON format. ### How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #18184 from gatorsmile/updateMessage. (cherry picked from commit 2a780ac7fe21df7c336885f8e814c1b866e04285) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 June 2017, 19:58:37 UTC
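A minimal sketch of both settings described above, assuming a SparkSession `spark`; the values are examples only:

```scala
// Skip files that cannot be read (a non-existent path still raises AnalysisException).
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
// Name of the column that holds malformed records; applies to JSON and CSV.
spark.conf.set("spark.sql.columnNameOfCorruptRecord", "_corrupt_record")
```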
0c42279 Preparing development version 2.2.0-SNAPSHOT 02 June 2017, 19:07:58 UTC
cc5dbd5 Preparing Spark release v2.2.0-rc3 02 June 2017, 19:07:53 UTC
9a4a8e1 [SPARK-19236][SQL][BACKPORT-2.2] Added createOrReplaceGlobalTempView method ### What changes were proposed in this pull request? This PR is to backport two PRs for adding the `createOrReplaceGlobalTempView` method https://github.com/apache/spark/pull/18147 https://github.com/apache/spark/pull/16598 --- Added the createOrReplaceGlobalTempView method for dataset API ### How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #18167 from gatorsmile/Backport18147. 02 June 2017, 18:57:22 UTC
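A minimal usage sketch, assuming a SparkSession `spark`; global temp views live in the reserved `global_temp` database and are visible across sessions:

```scala
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.createOrReplaceGlobalTempView("people")

// Query the view from the current session and from a new session.
spark.sql("SELECT * FROM global_temp.people").show()
spark.newSession().sql("SELECT * FROM global_temp.people").show()
```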
7f35f5b [SPARK-20955][CORE] Intern "executorId" to reduce the memory usage ## What changes were proposed in this pull request? In [this line](https://github.com/apache/spark/blob/f7cf2096fdecb8edab61c8973c07c6fc877ee32d/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L128), it uses the `executorId` string received from executors and finally it will go into `TaskUIData`. As deserializing the `executorId` string will always create a new instance, we have a lot of duplicated string instances. This PR does a String interning for TaskUIData to reduce the memory usage. ## How was this patch tested? Manually test using `bin/spark-shell --master local-cluster[6,1,1024]`. Test codes: ``` for (_ <- 1 to 10) { sc.makeRDD(1 to 1000, 1000).count() } Thread.sleep(2000) val l = sc.getClass.getMethod("jobProgressListener").invoke(sc).asInstanceOf[org.apache.spark.ui.jobs.JobProgressListener] org.apache.spark.util.SizeEstimator.estimate(l.stageIdToData) ``` This PR reduces the size of `stageIdToData` from 3487280 to 3009744 (86.3%) in the above case. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18177 from zsxwing/SPARK-20955. (cherry picked from commit 16186cdcbce1a2ec8f839c550e6b571bf5dc2692) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 02 June 2017, 17:33:28 UTC
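An illustration of what interning buys in general (not the code added by this patch): equal strings coming out of deserialization are distinct objects until they are interned.

```scala
// Two equal but distinct String instances, as you would get from deserialization.
val a = new String("executor-1")
val b = new String("executor-1")
assert(!(a eq b))                 // different objects
assert(a.intern() eq b.intern())  // interning collapses them to one shared instance
```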
f36c3ee [SPARK-20946][SQL] simplify the config setting logic in SparkSession.getOrCreate ## What changes were proposed in this pull request? The current conf setting logic is a little complex and has duplication, this PR simplifies it. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #18172 from cloud-fan/session. (cherry picked from commit e11d90bf8deb553fd41b8837e3856c11486c2503) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 June 2017, 17:05:12 UTC
ae00d49 [SPARK-20967][SQL] SharedState.externalCatalog is not really lazy ## What changes were proposed in this pull request? `SharedState.externalCatalog` is marked as a `lazy val` but actually it's not lazy. We access `externalCatalog` while initializing `SharedState` and thus eliminate the effort of `lazy val`. When creating `ExternalCatalog` we will try to connect to the metastore and may throw an error, so it makes sense to make it a `lazy val` in `SharedState`. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #18187 from cloud-fan/minor. (cherry picked from commit d1b80ab9220d83e5fdaf33c513cc811dd17d0de1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 June 2017, 16:58:10 UTC
25cc800 [SPARK-20942][WEB-UI] The title (tooltip) style of the fields in the history server web UI is wrong. ## What changes were proposed in this pull request? 1. The field title style is wrong. before the fix: ![before](https://cloud.githubusercontent.com/assets/26266482/26661987/a7bed018-46b3-11e7-8a54-a5152d2df0f4.png) after the fix: ![fix](https://cloud.githubusercontent.com/assets/26266482/26662000/ba6cc814-46b3-11e7-8f33-cfd4cc2c60fe.png) ![fix1](https://cloud.githubusercontent.com/assets/26266482/26662080/3c732e3e-46b4-11e7-8768-20b5a6aeadcb.png) executor-page style: ![executor_page](https://cloud.githubusercontent.com/assets/26266482/26662384/167cbd10-46b6-11e7-9e07-bf391dbc6e08.png) 2. In the title text, 'the application' should be changed to 'this application'. 3. Code analysis: `$('#history-summary [data-toggle="tooltip"]').tooltip();` refers to the id 'history-summary', which does not exist; the page only contains the id 'history-summary-table'. ## How was this patch tested? manual tests Author: guoxiaolong <guo.xiaolong1@zte.com.cn> Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn> Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn> Closes #18170 from guoxiaolongzte/SPARK-20942. (cherry picked from commit 625cebfde632361122e0db3452c4cc38147f696f) Signed-off-by: Sean Owen <sowen@cloudera.com> 02 June 2017, 13:38:11 UTC
bb3d900 [SPARK-20854][SQL] Extend hint syntax to support expressions SQL hint syntax: * support expressions such as strings, numbers, etc. instead of only identifiers as it is currently. * support multiple hints, which was missing compared to the DataFrame syntax. DataFrame API: * support any parameters in DataFrame.hint instead of just strings Existing tests. New tests in PlanParserSuite. New suite DataFrameHintSuite. Author: Bogdan Raducanu <bogdan@databricks.com> Closes #18086 from bogdanrdc/SPARK-20854. (cherry picked from commit 2134196a9c0aca82bc3e203c09e776a8bd064d65) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 June 2017, 22:56:26 UTC
4cba3b5 [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher. Blindly deserializing classes using Java serialization opens the code up to issues in other libraries, since just deserializing data from a stream may end up executing code (think readObject()). Since the launcher protocol is pretty self-contained, there's just a handful of classes it legitimately needs to deserialize, and they're in just two packages, so add a filter that throws errors if classes from any other package show up in the stream. This also maintains backwards compatibility (the updated launcher code can still communicate with the backend code in older Spark releases). Tested with new and existing unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18166 from vanzin/SPARK-20922. (cherry picked from commit 8efc6e986554ae66eab93cd64a9035d716adbab0) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 June 2017, 21:44:42 UTC
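A minimal sketch of the general technique (not the actual launcher code): an ObjectInputStream that rejects classes outside an allowed set of packages.

```scala
import java.io.{InputStream, InvalidClassException, ObjectInputStream, ObjectStreamClass}

// Reject any class whose name does not start with one of the allowed package prefixes.
class FilteredObjectInputStream(in: InputStream, allowedPrefixes: Seq[String])
    extends ObjectInputStream(in) {

  override def resolveClass(desc: ObjectStreamClass): Class[_] = {
    val name = desc.getName
    if (!allowedPrefixes.exists(p => name.startsWith(p))) {
      throw new InvalidClassException(name, "class not allowed by the deserialization filter")
    }
    super.resolveClass(desc)
  }
}
```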
b81a702 [SPARK-20365][YARN] Remove local scheme when add path to ClassPath. In Spark on YARN, when configuring "spark.yarn.jars" with local jars (jars started with "local" scheme), we will get inaccurate classpath for AM and containers. This is because we don't remove "local" scheme when concatenating classpath. It is OK to run because classpath is separated with ":" and java treat "local" as a separate jar. But we could improve it to remove the scheme. Updated `ClientSuite` to check "local" is not in the classpath. cc jerryshao Author: Li Yichao <lyc@zhihu.com> Author: Li Yichao <liyichao.good@gmail.com> Closes #18129 from liyichao/SPARK-20365. (cherry picked from commit 640afa49aa349c7ebe35d365eec3ef9bb7710b1d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 June 2017, 21:40:32 UTC
6a4e023 [SPARK-20941][SQL] Fix SubqueryExec Reuse Before this PR, Subquery reuse does not work. Below are three issues: - Subquery reuse does not work. - It is sharing the same `SQLConf` (`spark.sql.exchange.reuse`) with the one for Exchange Reuse. - No test case covers the rule Subquery reuse. This PR is to fix the above three issues. - Ignored the physical operator `SubqueryExec` when comparing two plans. - Added a dedicated conf `spark.sql.subqueries.reuse` for controlling Subquery Reuse - Added a test case for verifying the behavior N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #18169 from gatorsmile/subqueryReuse. (cherry picked from commit f7cf2096fdecb8edab61c8973c07c6fc877ee32d) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 01 June 2017, 16:57:26 UTC
4ab7b82 [MINOR][SQL] Fix a few function description error. ## What changes were proposed in this pull request? Fix a few function description error. ## How was this patch tested? manual tests ![descissues](https://cloud.githubusercontent.com/assets/5399861/26619392/d547736c-4610-11e7-85d7-aeeb09c02cc8.gif) Author: Yuming Wang <wgyumg@gmail.com> Closes #18157 from wangyum/DescIssues. (cherry picked from commit c8045f8b482e347eccf2583e0952e1d8bcb6cb96) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 01 June 2017, 06:17:25 UTC
14fda6f [SPARK-20244][CORE] Handle incorrect bytesRead metrics when using PySpark ## What changes were proposed in this pull request? Hadoop FileSystem's statistics are based on thread-local variables; this is fine if the RDD computation chain runs in a single thread. But if a child RDD creates another thread to consume the iterator obtained from Hadoop RDDs, the bytesRead computation will be wrong, because the iterator's `next()` and `close()` may then run in different threads. This can happen when using PySpark with PythonRDD. So this patch builds a map to track the `bytesRead` per thread and adds them together. This method will be used in three RDDs, `HadoopRDD`, `NewHadoopRDD` and `FileScanRDD`. I assume `FileScanRDD` cannot be called directly, so I only fixed `HadoopRDD` and `NewHadoopRDD`. ## How was this patch tested? Unit test and local cluster verification. Author: jerryshao <sshao@hortonworks.com> Closes #17617 from jerryshao/SPARK-20244. (cherry picked from commit 5854f77ce1d3b9491e2a6bd1f352459da294e369) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 June 2017, 05:35:02 UTC
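A minimal sketch of the idea (not the actual change): track bytes read per thread and sum them, so reads performed by a separate consumer thread are still counted.

```scala
import scala.collection.mutable

object BytesReadTracker {
  private val bytesReadByThread = mutable.Map.empty[Long, Long]

  // Called from whichever thread actually reads the data.
  def add(bytes: Long): Unit = synchronized {
    val tid = Thread.currentThread().getId
    bytesReadByThread(tid) = bytesReadByThread.getOrElse(tid, 0L) + bytes
  }

  // Total bytes read across all threads that touched this input.
  def total: Long = synchronized(bytesReadByThread.values.sum)
}
```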
a607a26 [SPARK-20940][CORE] Replace IllegalAccessError with IllegalStateException ## What changes were proposed in this pull request? `IllegalAccessError` is a fatal error (a subclass of LinkageError) and its meaning is `Thrown if an application attempts to access or modify a field, or to call a method that it does not have access to`. Throwing a fatal error for AccumulatorV2 is not necessary and is pretty bad because it usually will just kill executors or SparkContext ([SPARK-20666](https://issues.apache.org/jira/browse/SPARK-20666) is an example of killing SparkContext due to `IllegalAccessError`). I think the correct type of exception in AccumulatorV2 should be `IllegalStateException`. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18168 from zsxwing/SPARK-20940. (cherry picked from commit 24db35826a81960f08e3eb68556b0f51781144e1) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 01 June 2017, 00:26:26 UTC
f59f9a3 [SPARK-20876][SQL][BACKPORT-2.2] If the input parameter of ceil or floor is a float, the result is not what we expected ## What changes were proposed in this pull request? This PR is to backport #18103 to Spark 2.2 ## How was this patch tested? unit test Author: liuxian <liu.xian3@zte.com.cn> Closes #18155 from 10110346/wip-lx-0531. 31 May 2017, 18:43:36 UTC
3686c2e [SPARK-20790][MLLIB] Correctly handle negative values for implicit feedback in ALS ## What changes were proposed in this pull request? Revert the handling of negative values in ALS with implicit feedback, so that the confidence is the absolute value of the rating and the preference is 0 for negative ratings. This was the original behavior. ## How was this patch tested? This patch was tested with the existing unit tests and an added unit test to ensure that negative ratings are not ignored. mengxr Author: David Eis <deis@bloomberg.net> Closes #18022 from davideis/bugfix/negative-rating. (cherry picked from commit d52f636228e833db89045bc7a0c17b72da13f138) Signed-off-by: Sean Owen <sowen@cloudera.com> 31 May 2017, 12:53:05 UTC
3cad66e [SPARK-20877][SPARKR][WIP] add timestamps to test runs to investigate how long they run Jenkins, AppVeyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #18104 from felixcheung/rtimetest. (cherry picked from commit 382fefd1879e4670f3e9e8841ec243e3eb11c578) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 31 May 2017, 05:35:44 UTC
287440d [SPARK-20275][UI] Do not display "Completed" column for in-progress applications ## What changes were proposed in this pull request? The current HistoryServer displays the completed date of an in-progress application as `1969-12-31 23:59:59`, which is not meaningful. Instead of showing this incorrect completed date, this patch hides the column for in-progress applications. The column is only hidden rather than removed because the data is fetched through the REST API, whose response looks like the example below, where `endTime` matches `endTimeEpoch`. Rather than changing the REST API and breaking backward compatibility, the simpler solution is to make the column invisible. ``` [ { "id" : "local-1491805439678", "name" : "Spark shell", "attempts" : [ { "startTime" : "2017-04-10T06:23:57.574GMT", "endTime" : "1969-12-31T23:59:59.999GMT", "lastUpdated" : "2017-04-10T06:23:57.574GMT", "duration" : 0, "sparkUser" : "", "completed" : false, "startTimeEpoch" : 1491805437574, "endTimeEpoch" : -1, "lastUpdatedEpoch" : 1491805437574 } ] } ] ``` Here is the UI before the change: <img width="1317" alt="screen shot 2017-04-10 at 3 45 57 pm" src="https://cloud.githubusercontent.com/assets/850797/24851938/17d46cc0-1e08-11e7-84c7-90120e171b41.png"> And after: <img width="1281" alt="screen shot 2017-04-10 at 4 02 35 pm" src="https://cloud.githubusercontent.com/assets/850797/24851945/1fe9da58-1e08-11e7-8d0d-9262324f9074.png"> ## How was this patch tested? Manual verification. Author: jerryshao <sshao@hortonworks.com> Closes #17588 from jerryshao/SPARK-20275. (cherry picked from commit 52ed9b289d169219f7257795cbedc56565a39c71) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 May 2017, 03:24:50 UTC
5fdc7d8 [SPARK-20924][SQL] Unable to call a function registered in a non-current database ### What changes were proposed in this pull request? We are unable to call a function registered in a database other than the current one. ```Scala sql("CREATE DATABASE dAtABaSe1") sql(s"CREATE FUNCTION dAtABaSe1.test_avg AS '${classOf[GenericUDAFAverage].getName}'") sql("SELECT dAtABaSe1.test_avg(1)") ``` The above code returns an error: ``` Undefined function: 'dAtABaSe1.test_avg'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7 ``` This PR fixes the above issue. ### How was this patch tested? Added test cases. Author: Xiao Li <gatorsmile@gmail.com> Closes #18146 from gatorsmile/qualifiedFunction. (cherry picked from commit 4bb6a53ebd06de3de97139a2dbc7c85fc3aa3e66) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 May 2017, 21:06:26 UTC
f6730a7 [SPARK-19968][SS] Use a cached instance of `KafkaProducer` instead of creating one every batch. ## What changes were proposed in this pull request? In summary, the cost of recreating a KafkaProducer for every batch written is high: it starts many threads, opens connections, and then closes them. A KafkaProducer instance is documented to be thread safe in the Kafka docs, and reusing a single instance across multiple writer threads is encouraged. With this patch, I measured roughly a 10x improvement in latency. ### Times that addBatch took in ms, without this patch: ![with-out_patch](https://cloud.githubusercontent.com/assets/992952/23994612/a9de4a42-0a6b-11e7-9d5b-7ae18775bee4.png) ### Times that addBatch took in ms, with this patch: ![with_patch](https://cloud.githubusercontent.com/assets/992952/23994616/ad8c11ec-0a6b-11e7-8634-2266ebb5033f.png) ## How was this patch tested? Distributed benchmarks comparing runs with and without this patch, plus relevant unit tests. Author: Prashant Sharma <prashsh1@in.ibm.com> Closes #17308 from ScrapCodes/cached-kafka-producer. (cherry picked from commit 96a4d1d0827fc3fba83f174510b061684f0d00f7) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 30 May 2017, 01:12:10 UTC
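The caching idea can be sketched as below, under the assumption stated in the entry that `KafkaProducer` is thread safe: keep one producer per distinct configuration and reuse it across batches. This is illustrative and not the `CachedKafkaProducer` added by the patch:

```scala
import java.{util => ju}
import java.util.concurrent.ConcurrentHashMap
import org.apache.kafka.clients.producer.KafkaProducer

object ProducerCache {
  // One shared producer per distinct Kafka parameter map.
  private val cache =
    new ConcurrentHashMap[ju.Map[String, Object], KafkaProducer[Array[Byte], Array[Byte]]]()

  // Returns a shared producer for the given config, creating it only once.
  def getOrCreate(params: ju.Map[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] =
    synchronized {
      val existing = cache.get(params)
      if (existing != null) {
        existing
      } else {
        val producer = new KafkaProducer[Array[Byte], Array[Byte]](params)
        cache.put(params, producer)
        producer
      }
    }
}
```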
3b79e4c [SPARK-8184][SQL] Add additional function description for weekofyear ## What changes were proposed in this pull request? Add additional function description for weekofyear. ## How was this patch tested? manual tests ![weekofyear](https://cloud.githubusercontent.com/assets/5399861/26525752/08a1c278-4394-11e7-8988-7cbf82c3a999.gif) Author: Yuming Wang <wgyumg@gmail.com> Closes #18132 from wangyum/SPARK-8184. (cherry picked from commit 1c7db00c74ec6a91c7eefbdba85cbf41fbe8634a) Signed-off-by: Reynold Xin <rxin@databricks.com> 29 May 2017, 23:10:29 UTC
26640a2 [SPARK-20907][TEST] Use testQuietly for test suites that generate long log output ## What changes were proposed in this pull request? Suppress console output by using `testQuietly` in test suites ## How was this patch tested? Tested by `"SPARK-19372: Filter can be executed w/o generated code due to JVM code size limit"` in `DataFrameSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18135 from kiszk/SPARK-20907. (cherry picked from commit c9749068ecf8e0acabdfeeceeedff0f1f73293b7) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 29 May 2017, 19:17:22 UTC
dc51be1 [SPARK-20908][SQL] Cache Manager: Hint should be ignored in plan matching ### What changes were proposed in this pull request? In Cache manager, the plan matching should ignore Hint. ```Scala val df1 = spark.range(10).join(broadcast(spark.range(10))) df1.cache() spark.range(10).join(spark.range(10)).explain() ``` The output plan of the above query shows that the second query is not using the cached data of the first query. ``` BroadcastNestedLoopJoin BuildRight, Inner :- *Range (0, 10, step=1, splits=2) +- BroadcastExchange IdentityBroadcastMode +- *Range (0, 10, step=1, splits=2) ``` After the fix, the plan becomes ``` InMemoryTableScan [id#20L, id#23L] +- InMemoryRelation [id#20L, id#23L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- BroadcastNestedLoopJoin BuildRight, Inner :- *Range (0, 10, step=1, splits=2) +- BroadcastExchange IdentityBroadcastMode +- *Range (0, 10, step=1, splits=2) ``` ### How was this patch tested? Added a test. Author: Xiao Li <gatorsmile@gmail.com> Closes #18131 from gatorsmile/HintCache. (cherry picked from commit 06c155c90dc784b07002f33d98dcfe9be1e38002) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 28 May 2017, 04:33:12 UTC
25e87d8 [SPARK-20897][SQL] Cached self-join should not fail ## What changes were proposed in this pull request? In the failing test case we have a `SortMergeJoinExec` for a self-join, which means there is a `ReusedExchange` node in the query plan. It works fine without caching, but throws an exception in `SortMergeJoinExec.outputPartitioning` if we cache it. The root cause is that `ReusedExchange` doesn't propagate the output partitioning from its child, so in `SortMergeJoinExec.outputPartitioning` we create a `PartitioningCollection` with a hash partitioning and an unknown partitioning, and fail. This bug is usually harmless, because inserting the `ReusedExchange` is the last step in preparing the physical plan and `SortMergeJoinExec.outputPartitioning` is not called again afterwards. However, if the dataframe is cached, its physical plan becomes `InMemoryTableScanExec`, which contains another physical plan representing the cached query; that plan has gone through the entire planning phase and may contain a `ReusedExchange`. The planner then calls `InMemoryTableScanExec.outputPartitioning`, which in turn calls `SortMergeJoinExec.outputPartitioning` and triggers this bug. ## How was this patch tested? A new regression test. Author: Wenchen Fan <wenchen@databricks.com> Closes #18121 from cloud-fan/bug. (cherry picked from commit 08ede46b897b7e52cfe8231ffc21d9515122cf49) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 27 May 2017, 23:16:59 UTC
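A hedged reproduction sketch of the cached self-join scenario above; column names are made up and `spark` is an existing SparkSession, as in the other snippets in this log:

```scala
val df = spark.range(100).toDF("k")
// Self-join: when it is planned as a sort-merge join, the physical plan can
// contain a ReusedExchange node.
val joined = df.as("a").join(df.as("b"), "k")
joined.cache()
// Before the fix, planning the cached query could fail inside
// SortMergeJoinExec.outputPartitioning; with the fix this simply runs.
joined.count()
```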
f2408bd [SPARK-20843][CORE] Add a config to set driver terminate timeout ## What changes were proposed in this pull request? Add a `worker` configuration to set how long to wait before forcibly killing driver. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18126 from zsxwing/SPARK-20843. (cherry picked from commit 6c1dbd6fc8d49acf7c1c902d2ebf89ed5e788a4e) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 27 May 2017, 05:25:44 UTC
39f7665 [SPARK-19659][CORE][FOLLOW-UP] Fetch big blocks to disk when shuffle-read ## What changes were proposed in this pull request? This PR includes some minor improvement for the comments and tests in https://github.com/apache/spark/pull/16989 ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18117 from cloud-fan/follow. (cherry picked from commit 1d62f8aca82601506c44b6fd852f4faf3602d7e2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 May 2017, 02:58:08 UTC
fc799d7 [SPARK-10643][CORE] Make spark-submit download remote files to local in client mode ## What changes were proposed in this pull request? This PR makes spark-submit script download remote files to local file system for local/standalone client mode. ## How was this patch tested? - Unit tests - Manual tests by adding s3a jar and testing against file on s3. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Yu Peng <loneknightpy@gmail.com> Closes #18078 from loneknightpy/download-jar-in-spark-submit. (cherry picked from commit 4af37812915763ac3bfd91a600a7f00a4b84d29a) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 26 May 2017, 23:29:25 UTC
30922de [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide ## What changes were proposed in this pull request? - Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`. - Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide - Remove bucketing from Unsupported Hive Functionalities. ## How was this patch tested? Manual tests, docs build. Author: zero323 <zero323@users.noreply.github.com> Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING. (cherry picked from commit ae33abf71b353c638487948b775e966c7127cd46) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 26 May 2017, 22:01:23 UTC
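An illustrative Scala use of the three `DataFrameWriter` features the new guide section covers; the dataframe, column and table names here are made up, and `spark` is an existing SparkSession:

```scala
// Build a small example dataset to write out.
val usersDF = spark.range(100).selectExpr("id AS user_id", "'US' AS country")

usersDF.write
  .partitionBy("country")           // one output directory per country value
  .bucketBy(42, "user_id")          // hash rows into 42 buckets per partition
  .sortBy("user_id")                // keep rows sorted within each bucket
  .saveAsTable("users_bucketed")    // bucketing and sorting require saveAsTable
```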
2b59ed4 [SPARK-20844] Remove experimental from Structured Streaming APIs Now that Structured Streaming has been out for several Spark releases and has large production use cases, the `Experimental` label is no longer appropriate. I've left `InterfaceStability.Evolving`, however, as I think we may make a few changes to the pluggable Source & Sink API in Spark 2.3. Author: Michael Armbrust <michael@databricks.com> Closes #18065 from marmbrus/streamingGA. 26 May 2017, 20:34:33 UTC
92837ae [SPARK-19372][SQL] Fix throwing a Java exception at df.filter() due to 64KB bytecode size limit ## What changes were proposed in this pull request? When an expression for `df.filter()` has many nodes (e.g. 400), the generated Java code exceeds the 64KB bytecode size limit for a single method. This produces a Java exception and the execution fails. This PR keeps the query running by falling back to `Expression.eval()` with code generation disabled whenever such an exception is caught. ## How was this patch tested? Added a test suite to `DataFrameSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #17087 from kiszk/SPARK-19372. 26 May 2017, 18:15:21 UTC
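A hedged illustration of the kind of filter that hits the 64KB limit: a predicate built from several hundred OR'ed comparisons. Names are illustrative and `spark` is an existing SparkSession:

```scala
import org.apache.spark.sql.functions.col

val df = spark.range(1000).toDF("value")
// Roughly 400 expression nodes in a single predicate.
val bigPredicate = (1 to 400).map(i => col("value") === i).reduce(_ || _)
// With the fix, if the generated method exceeds the JVM's 64KB bytecode limit,
// evaluation falls back to Expression.eval() instead of failing the query.
df.filter(bigPredicate).count()
```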
f99456b [SPARK-20393][WEB UI] Strengthen Spark to prevent XSS vulnerabilities ## What changes were proposed in this pull request? Add stripXSS and stripXSSMap to Spark Core's UIUtils, and call these functions wherever getParameter is invoked on an HttpServletRequest. ## How was this patch tested? Unit tests, IBM Security AppScan Standard no longer showing vulnerabilities, manual verification of WebUI pages. Author: NICHOLAS T. MARION <nmarion@us.ibm.com> Closes #17686 from n-marion/xss-fix. (cherry picked from commit b512233a457092b0e2a39d0b42cb021abc69d375) Signed-off-by: Sean Owen <sowen@cloudera.com> 26 May 2017, 18:11:18 UTC
fafe283 [SPARK-20868][CORE] UnsafeShuffleWriter should verify the position after FileChannel.transferTo ## What changes were proposed in this pull request? Long ago we fixed a [bug](https://issues.apache.org/jira/browse/SPARK-3948) in the shuffle writer related to `FileChannel.transferTo`. We were not very confident about that fix, so we added a position check after the write to try to discover the bug earlier. However, this check is missing in the new `UnsafeShuffleWriter`; this PR adds it. https://issues.apache.org/jira/browse/SPARK-18105 may be related to that `FileChannel.transferTo` bug; hopefully we can find the root cause after adding this position check. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18091 from cloud-fan/shuffle. (cherry picked from commit d9ad78908f6189719cec69d34557f1a750d2e6af) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 May 2017, 07:01:42 UTC
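A minimal sketch of the position check described above, assuming a helper that copies between two `FileChannel`s with `transferTo`; this is illustrative, not the `UnsafeShuffleWriter` code:

```scala
import java.nio.channels.FileChannel

def transferWithCheck(in: FileChannel, out: FileChannel, fromPos: Long, bytes: Long): Unit = {
  val initialPos = out.position()
  var copied = 0L
  while (copied < bytes) {
    copied += in.transferTo(fromPos + copied, bytes - copied, out)
  }
  // Verify the destination position actually advanced by the number of bytes we
  // believe were written; this is the kind of check the PR adds to surface the
  // old FileChannel.transferTo bug early.
  val expectedPos = initialPos + bytes
  require(out.position() == expectedPos,
    s"Position ${out.position()} after transferTo does not match expected $expectedPos")
}
```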
289dd17 [SPARK-20888][SQL][DOCS] Document change of default setting of spark.sql.hive.caseSensitiveInferenceMode (Link to Jira: https://issues.apache.org/jira/browse/SPARK-20888) ## What changes were proposed in this pull request? Document the change of the default setting of the spark.sql.hive.caseSensitiveInferenceMode configuration key from NEVER_INFER to INFER_AND_SAVE in the Spark SQL 2.1 to 2.2 migration notes. Author: Michael Allman <michael@videoamp.com> Closes #18112 from mallman/spark-20888-document_infer_and_save. (cherry picked from commit c1e7989c4ffd83c51f5c97998b4ff6fe8dd83cf4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 May 2017, 01:26:16 UTC
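If the pre-2.2 behavior is wanted back, the migration note boils down to setting the key named above; a minimal sketch, assuming the key can be set on a running session (it may instead need to go into spark-defaults.conf depending on the Spark version):

```scala
// Restore the old (2.1) default described above.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
```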
7a21de9 [SPARK-20874][EXAMPLES] Add Structured Streaming Kafka Source to examples project ## What changes were proposed in this pull request? Add Structured Streaming Kafka Source to the `examples` project so that people can run `bin/run-example StructuredKafkaWordCount ...`. ## How was this patch tested? manually tested it. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18101 from zsxwing/add-missing-example-dep. (cherry picked from commit 98c3852986a2cb5f2d249d6c8ef602be283bd90e) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 25 May 2017, 17:49:23 UTC
5ae1c65 [SPARK-19707][SPARK-18922][TESTS][SQL][CORE] Fix test failures/the invalid path check for sc.addJar on Windows ## What changes were proposed in this pull request? This PR proposes two things: - A follow up for SPARK-19707 (Improving the invalid path check for sc.addJar on Windows as well). ``` org.apache.spark.SparkContextSuite: - add jar with invalid path *** FAILED *** (32 milliseconds) 2 was not equal to 1 (SparkContextSuite.scala:309) ... ``` - Fix path vs URI related test failures on Windows. ``` org.apache.spark.storage.LocalDirsSuite: - SPARK_LOCAL_DIRS override also affects driver *** FAILED *** (0 milliseconds) new java.io.File("/NONEXISTENT_PATH").exists() was true (LocalDirsSuite.scala:50) ... - Utils.getLocalDir() throws an exception if any temporary directory cannot be retrieved *** FAILED *** (15 milliseconds) Expected exception java.io.IOException to be thrown, but no exception was thrown. (LocalDirsSuite.scala:64) ... ``` ``` org.apache.spark.sql.hive.HiveSchemaInferenceSuite: - orc: schema should be inferred and saved when INFER_AND_SAVE is specified *** FAILED *** (203 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-dae61ab3-a851-4dd3-bf4e-be97c501f254 ... - parquet: schema should be inferred and saved when INFER_AND_SAVE is specified *** FAILED *** (203 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-fa3aff89-a66e-4376-9a37-2a9b87596939 ... - orc: schema should be inferred but not stored when INFER_ONLY is specified *** FAILED *** (141 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-fb464e59-b049-481b-9c75-f53295c9fc2c ... - parquet: schema should be inferred but not stored when INFER_ONLY is specified *** FAILED *** (125 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-9487568e-80a4-42b3-b0a5-d95314c4ccbc ... - orc: schema should not be inferred when NEVER_INFER is specified *** FAILED *** (156 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-0d2dfa45-1b0f-4958-a8be-1074ed0135a ... - parquet: schema should not be inferred when NEVER_INFER is specified *** FAILED *** (547 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-6d95d64e-613e-4a59-a0f6-d198c5aa51ee ... ``` ``` org.apache.spark.sql.execution.command.DDLSuite: - create temporary view using *** FAILED *** (15 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-3881d9ca-561b-488d-90b9-97587472b853 mp; ... - insert data to a data source table which has a non-existing location should succeed *** FAILED *** (109 milliseconds) file:/C:projectsspark%09arget%09mpspark-4cad3d19-6085-4b75-b407-fe5e9d21df54 did not equal file:///C:/projects/spark/target/tmp/spark-4cad3d19-6085-4b75-b407-fe5e9d21df54 (DDLSuite.scala:1869) ... - insert into a data source table with a non-existing partition location should succeed *** FAILED *** (94 milliseconds) file:/C:projectsspark%09arget%09mpspark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d did not equal file:///C:/projects/spark/target/tmp/spark-4b52e7de-e3aa-42fd-95d4-6d4d58d1d95d (DDLSuite.scala:1910) ... 
- read data from a data source table which has a non-existing location should succeed *** FAILED *** (93 milliseconds) file:/C:projectsspark%09arget%09mpspark-f8c281e2-08c2-4f73-abbf-f3865b702c34 did not equal file:///C:/projects/spark/target/tmp/spark-f8c281e2-08c2-4f73-abbf-f3865b702c34 (DDLSuite.scala:1937) ... - read data from a data source table with non-existing partition location should succeed *** FAILED *** (110 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... - create datasource table with a non-existing location *** FAILED *** (94 milliseconds) file:/C:projectsspark%09arget%09mpspark-387316ae-070c-4e78-9b78-19ebf7b29ec8 did not equal file:///C:/projects/spark/target/tmp/spark-387316ae-070c-4e78-9b78-19ebf7b29ec8 (DDLSuite.scala:1982) ... - CTAS for external data source table with a non-existing location *** FAILED *** (16 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... - CTAS for external data source table with a existed location *** FAILED *** (15 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... - data source table:partition column name containing a b *** FAILED *** (125 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... - data source table:partition column name containing a:b *** FAILED *** (143 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... - data source table:partition column name containing a%b *** FAILED *** (109 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... - data source table:partition column name containing a,b *** FAILED *** (109 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... - location uri contains a b for datasource table *** FAILED *** (94 milliseconds) file:/C:projectsspark%09arget%09mpspark-5739cda9-b702-4e14-932c-42e8c4174480a%20b did not equal file:///C:/projects/spark/target/tmp/spark-5739cda9-b702-4e14-932c-42e8c4174480/a%20b (DDLSuite.scala:2084) ... - location uri contains a:b for datasource table *** FAILED *** (78 milliseconds) file:/C:projectsspark%09arget%09mpspark-9bdd227c-840f-4f08-b7c5-4036638f098da:b did not equal file:///C:/projects/spark/target/tmp/spark-9bdd227c-840f-4f08-b7c5-4036638f098d/a:b (DDLSuite.scala:2084) ... - location uri contains a%b for datasource table *** FAILED *** (78 milliseconds) file:/C:projectsspark%09arget%09mpspark-62bb5f1d-fa20-460a-b534-cb2e172a3640a%25b did not equal file:///C:/projects/spark/target/tmp/spark-62bb5f1d-fa20-460a-b534-cb2e172a3640/a%25b (DDLSuite.scala:2084) ... - location uri contains a b for database *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... - location uri contains a:b for database *** FAILED *** (15 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... - location uri contains a%b for database *** FAILED *** (0 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... 
``` ``` org.apache.spark.sql.hive.execution.HiveDDLSuite: - create hive table with a non-existing location *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... - CTAS for external hive table with a non-existing location *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... - CTAS for external hive table with a existed location *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... - partition column name of parquet table containing a b *** FAILED *** (156 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... - partition column name of parquet table containing a:b *** FAILED *** (94 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... - partition column name of parquet table containing a%b *** FAILED *** (125 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... - partition column name of parquet table containing a,b *** FAILED *** (110 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... - partition column name of hive table containing a b *** FAILED *** (15 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... - partition column name of hive table containing a:b *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... - partition column name of hive table containing a%b *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... - partition column name of hive table containing a,b *** FAILED *** (0 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... - hive table: location uri contains a b *** FAILED *** (0 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... - hive table: location uri contains a:b *** FAILED *** (0 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... - hive table: location uri contains a%b *** FAILED *** (0 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ... 
``` ``` org.apache.spark.sql.sources.PathOptionSuite: - path option also exist for write path *** FAILED *** (94 milliseconds) file:/C:projectsspark%09arget%09mpspark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc did not equal file:///C:/projects/spark/target/tmp/spark-2870b281-7ac0-43d6-b6b6-134e01ab6fdc (PathOptionSuite.scala:98) ... ``` ``` org.apache.spark.sql.CachedTableSuite: - SPARK-19765: UNCACHE TABLE should un-cache all cached plans that refer to this table *** FAILED *** (110 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ... ``` ``` org.apache.spark.sql.execution.DataSourceScanExecRedactionSuite: - treeString is redacted *** FAILED *** (250 milliseconds) "file:/C:/projects/spark/target/tmp/spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28" did not contain "C:\projects\spark\target\tmp\spark-3ecc1fa4-3e76-489c-95f4-f0b0500eae28" (DataSourceScanExecRedactionSuite.scala:46) ... ``` ## How was this patch tested? Tested via AppVeyor for each and checked it passed once each. These should be retested via AppVeyor in this PR. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17987 from HyukjinKwon/windows-20170515. (cherry picked from commit e9f983df275c138626af35fd263a7abedf69297f) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 May 2017, 16:10:38 UTC
022a495 [SPARK-20741][SPARK SUBMIT] Added cleanup of JARs archive generated by SparkSubmit ## What changes were proposed in this pull request? Deleted generated JARs archive after distribution to HDFS ## How was this patch tested? Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Lior Regev <lioregev@gmail.com> Closes #17986 from liorregev/master. (cherry picked from commit 7306d556903c832984c7f34f1e8fe738a4b2343c) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 May 2017, 16:08:41 UTC
e01f1f2 [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth. ## What changes were proposed in this pull request? Expose numPartitions (expert) param of PySpark FPGrowth. ## How was this patch tested? + [x] Pass all unit tests. Author: Yan Facai (颜发才) <facai.yan@gmail.com> Closes #18058 from facaiy/ENH/pyspark_fpg_add_num_partition. (cherry picked from commit 139da116f130ed21481d3e9bdee5df4b8d7760ac) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 25 May 2017, 13:40:52 UTC
9cbf39f [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth. ## What changes were proposed in this pull request? Follow-up for #17218, some minor fix for PySpark ```FPGrowth```. ## How was this patch tested? Existing UT. Author: Yanbo Liang <ybliang8@gmail.com> Closes #18089 from yanboliang/spark-19281. (cherry picked from commit 913a6bfe4b0eb6b80a03b858ab4b2767194103de) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 25 May 2017, 12:15:38 UTC
8896c4e [SPARK-19659] Fetch big blocks to disk when shuffle-read. ## What changes were proposed in this pull request? Currently a whole block is fetched into memory (off-heap by default) during shuffle read. A block is defined by (shuffleId, mapId, reduceId), so it can be large in skew situations. If an OOM happens during shuffle read, the job is killed and users are advised to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting that parameter and allocating more memory can resolve the OOM, but the approach is not well suited to production environments, especially data warehouses. When using Spark SQL as the data engine in a warehouse, users want a unified memory parameter with little wasted resource (memory that is allocated but not used), especially when migrating to Spark from another engine (e.g. Hive); tuning the parameter for thousands of SQL queries one by one is very time consuming. Since skew is not always easy to predict, when it does happen it makes sense to fetch remote blocks to disk for shuffle read rather than kill the job with an OOM. This PR proposes fetching big blocks to disk (also mentioned in SPARK-3019): 1. Track the average block size and the outliers (those larger than 2*avgSize) in MapStatus; 2. Request memory from `MemoryManager` before fetching blocks and release it back to `MemoryManager` when the `ManagedBuffer` is released; 3. Fetch remote blocks to disk when acquiring memory from `MemoryManager` fails, otherwise fetch them to memory. This improves memory control for shuffle blocks and helps avoid OOM in scenarios such as: 1. a single huge block; 2. the sizes of many blocks being underestimated in `MapStatus`, so the actual footprint is much larger than estimated. ## How was this patch tested? Added unit tests in `MapStatusSuite` and `ShuffleBlockFetcherIteratorSuite`. Author: jinxing <jinxing6042@126.com> Closes #16989 from jinxing64/SPARK-19659. (cherry picked from commit 3f94e64aa8fd806ae1fa0156d846ce96afacddd3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 May 2017, 08:11:51 UTC
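A hedged sketch of item 1 above, tracking the average block size plus the outliers larger than twice the average; the 2x threshold comes from the description, while the encoding is illustrative rather than the real `MapStatus` format:

```scala
// Returns the average size and a map of reduceId -> exact size for outlier blocks.
def summarizeBlockSizes(sizes: Array[Long]): (Long, Map[Int, Long]) = {
  val nonEmpty = sizes.filter(_ > 0)
  val avgSize = if (nonEmpty.isEmpty) 0L else nonEmpty.sum / nonEmpty.length
  val outliers = sizes.zipWithIndex.collect {
    // Keep exact sizes only for blocks larger than 2 * avgSize; the rest can be
    // approximated by avgSize to keep the map status compact.
    case (size, reduceId) if avgSize > 0 && size > 2 * avgSize => reduceId -> size
  }.toMap
  (avgSize, outliers)
}
```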
b52a06d [SPARK-20250][CORE] Improper OOM error when a task has been killed while spilling data ## What changes were proposed in this pull request? Currently, when a task is calling spill() but receives a kill request from the driver (e.g. a speculative task), the `TaskMemoryManager` throws an `OOM` exception, and we don't catch fatal exceptions when the error is caused by `Thread.interrupt`. So for `ClosedByInterruptException`, we should throw `RuntimeException` instead of `OutOfMemoryError`. https://issues.apache.org/jira/browse/SPARK-20250?jql=project%20%3D%20SPARK ## How was this patch tested? Existing unit tests. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18090 from ConeyLiu/SPARK-20250. (cherry picked from commit 731462a04f8e33ac507ad19b4270c783a012a33e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 May 2017, 07:48:16 UTC
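A hedged sketch of the exception translation described above: an interrupt-induced failure during spill surfaces as a task failure rather than an `OutOfMemoryError`. The helper name is illustrative, not the TaskMemoryManager code:

```scala
import java.nio.channels.ClosedByInterruptException

def spillSafely(doSpill: () => Unit): Unit = {
  try {
    doSpill()
  } catch {
    case e: ClosedByInterruptException =>
      // The task was killed/interrupted; do not report this as an OOM.
      throw new RuntimeException("Task interrupted while spilling", e)
  }
}
```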
e0aa239 [SPARK-20848][SQL][FOLLOW-UP] Shutdown the pool after reading parquet files ## What changes were proposed in this pull request? This is a follow-up to #18073, taking a safer approach to shutting down the pool to prevent possible issues. It also uses `ThreadUtils.newForkJoinPool` instead, to set a better thread name. ## How was this patch tested? Manually tested. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18100 from viirya/SPARK-20848-followup. (cherry picked from commit 6b68d61cf31748a088778dfdd66491b2f89a3c7b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 May 2017, 01:55:59 UTC
3f82d65 [SPARK-20403][SQL] Modify the instructions of some functions ## What changes were proposed in this pull request? 1. Add the usage description of the 'cast' function shown by the 'show functions' and 'desc function cast' commands in spark-sql. 2. Modify the usage descriptions of functions such as boolean, tinyint, smallint, int, bigint, float, double, decimal, date, timestamp, binary and string. ## How was this patch tested? Before modification: spark-sql> desc function boolean; Function: boolean Class: org.apache.spark.sql.catalyst.expressions.Cast Usage: boolean(expr AS type) - Casts the value `expr` to the target data type `type`. After modification: spark-sql> desc function boolean; Function: boolean Class: org.apache.spark.sql.catalyst.expressions.Cast Usage: boolean(expr) - Casts the value `expr` to the target data type `boolean`. spark-sql> desc function cast Function: cast Class: org.apache.spark.sql.catalyst.expressions.Cast Usage: cast(expr AS type) - Casts the value `expr` to the target data type `type`. Author: liuxian <liu.xian3@zte.com.cn> Closes #17698 from 10110346/wip_lx_0418. (cherry picked from commit 197f9018a4641c8fc0725905ebfb535b61bed791) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 25 May 2017, 00:32:18 UTC
ae65d30 [SPARK-16202][SQL][DOC] Follow-up to Correct The Description of CreatableRelationProvider's createRelation ## What changes were proposed in this pull request? Follow-up to SPARK-16202: 1. Remove the duplication of the meaning of `SaveMode` (as one was in fact missing that had proven that the duplication may be incomplete in the future again) 2. Use standard scaladoc tags /cc gatorsmile rxin yhuai (as they were involved previously) ## How was this patch tested? local build Author: Jacek Laskowski <jacek@japila.pl> Closes #18026 from jaceklaskowski/CreatableRelationProvider-SPARK-16202. (cherry picked from commit 5f8ff2fc9a859ceeaa8f1d03060fdbb30951e706) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 25 May 2017, 00:24:33 UTC
2405afc [SPARK-20872][SQL] ShuffleExchange.nodeName should handle null coordinator ## What changes were proposed in this pull request? A one-liner change in `ShuffleExchange.nodeName` to cover the case when `coordinator` is `null`, so that the match expression is exhaustive. Please refer to [SPARK-20872](https://issues.apache.org/jira/browse/SPARK-20872) for a description of the symptoms. TL;DR is that inspecting a `ShuffleExchange` (directly or transitively) on the Executor side can hit a case where the `coordinator` field of a `ShuffleExchange` is null, and thus will trigger a `MatchError` in `ShuffleExchange.nodeName()`'s inexhaustive match expression. Also changed two other match conditions in `ShuffleExchange` on the `coordinator` field to be consistent. ## How was this patch tested? Manually tested this change with a case where the `coordinator` is null to make sure `ShuffleExchange.nodeName` doesn't throw a `MatchError` any more. Author: Kris Mok <kris.mok@databricks.com> Closes #18095 from rednaxelafx/shuffleexchange-nodename. (cherry picked from commit c0b3e45e3b46a5235b748cb85ad200c9ec1bb426) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 25 May 2017, 00:19:46 UTC
b7a2a16 [SPARK-20867][SQL] Move hints from Statistics into HintInfo class ## What changes were proposed in this pull request? This is a follow-up to SPARK-20857 to move the broadcast hint from Statistics into a new HintInfo class, so we can be more flexible in adding new hints in the future. ## How was this patch tested? Updated test cases to reflect the change. Author: Reynold Xin <rxin@databricks.com> Closes #18087 from rxin/SPARK-20867. (cherry picked from commit a64746677bf09ef67e3fd538355a6ee9b5ce8cf4) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 24 May 2017, 20:57:28 UTC
c59ad42 [SPARK-20848][SQL] Shutdown the pool after reading parquet files ## What changes were proposed in this pull request? From the JIRA: on each call to spark.read.parquet, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state and never stopped, which leads to unbounded growth in the number of threads. We should shut down the pool after reading parquet files. ## How was this patch tested? Added a test to ParquetFileFormatSuite. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18073 from viirya/SPARK-20848. (cherry picked from commit f72ad303f05a6d99513ea3b121375726b177199c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 May 2017, 16:36:01 UTC
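A minimal "create, use, always shut down" sketch of the fix described above; this is illustrative only, since the actual patch works inside `ParquetFileFormat` and, per the follow-up entry, uses `ThreadUtils.newForkJoinPool`:

```scala
import java.util.concurrent.{Callable, ForkJoinPool}

val inputs = Seq("part-0.parquet", "part-1.parquet")   // hypothetical file names
val pool = new ForkJoinPool(8)
try {
  val futures = inputs.map { path =>
    pool.submit(new Callable[String] {
      override def call(): String = s"footer of $path"  // placeholder for footer reading
    })
  }
  futures.foreach(f => println(f.get()))
} finally {
  pool.shutdown()  // prevents the leaked WAITING thread described in the JIRA
}
```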
83aeac9 [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`, we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using python3 and numpy 1.12. Author: Bago Amirbekian <bago@databricks.com> Closes #18081 from MrBago/BF-py3floatbug. (cherry picked from commit bc66a77bbe2120cc21bd8da25194efca4cde13c3) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 24 May 2017, 14:56:28 UTC
1d10724 [SPARK-20631][FOLLOW-UP] Fix incorrect tests. ## What changes were proposed in this pull request? - Fix incorrect tests for `_check_thresholds`. - Move test to `ParamTests`. ## How was this patch tested? Unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #18085 from zero323/SPARK-20631-FOLLOW-UP. (cherry picked from commit 1816eb3bef930407dc9e083de08f5105725c55d1) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 24 May 2017, 11:58:40 UTC