https://github.com/apache/spark

5ed89ce [SPARK-25089][R] removing lintr checks for 2.0 ## What changes were proposed in this pull request? Since 2.0 will be EOLed some time in the not-too-distant future, and we'll be moving the builds from CentOS to Ubuntu, I think it's fine to disable R linting rather than going down the rabbit hole of trying to fix this stuff. ## How was this patch tested? The build system will test this. Closes #22074 from shaneknapp/removing-lintr-2.0. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> 10 August 2018, 23:07:18 UTC
dccd8c7 fix compilation failure caused by SPARK-24257 24 May 2018, 04:46:33 UTC
5fd0809 [SPARK-24257][SQL] LongToUnsafeRowMap may calculate the new size incorrectly LongToUnsafeRowMap has a mistake when growing its page array: it blindly grows to `oldSize * 2`, while the new record may be larger than `oldSize * 2`. We may then get a malformed UnsafeRow when querying this map: its actual data is smaller than its declared size, and the data is corrupted. Author: sychen <sychen@ctrip.com> Closes #21311 from cxzl25/fix_LongToUnsafeRowMap_page_size. (cherry picked from commit 888340151f737bb68d0e419b1e949f11469881f9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 May 2018, 03:25:05 UTC
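A minimal sketch of the sizing rule the entry above describes, with hypothetical names rather than Spark's actual code: the page must grow to at least the size the incoming record needs, not blindly to double the old size.

```scala
object PageGrowthSketch {
  // Hypothetical helper: pick the next page size so the new record always fits.
  def nextPageSize(oldSize: Long, requiredSize: Long): Long =
    math.max(oldSize * 2, requiredSize)

  def main(args: Array[String]): Unit = {
    println(nextPageSize(64, 40))   // 128: doubling is enough
    println(nextPageSize(64, 300))  // 300: doubling alone would truncate the record
  }
}
```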
a42dd00 [SPARK-23697][CORE] LegacyAccumulatorWrapper should define isZero correctly ## What changes were proposed in this pull request? It's possible that Accumulators of Spark 1.x may no longer work with Spark 2.x. This is because `LegacyAccumulatorWrapper.isZero` may return the wrong answer if `AccumulableParam` doesn't define equals/hashCode. This PR fixes this by using a reference equality check in `LegacyAccumulatorWrapper.isZero`. ## How was this patch tested? a new test Author: Wenchen Fan <wenchen@databricks.com> Closes #21229 from cloud-fan/accumulator. (cherry picked from commit 4d5de4d303a773b1c18c350072344bd7efca9fc4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 May 2018, 11:22:20 UTC
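A small illustration of the failure mode described above, assuming a user-defined accumulator value type that omits equals/hashCode; under that assumption a value-equality "is zero" check is unreliable, while comparing against the captured zero instance by reference is not. Names are illustrative, not Spark's.

```scala
class Metrics(var count: Long)  // user type with no equals/hashCode defined

object IsZeroSketch {
  def main(args: Array[String]): Unit = {
    val zero = new Metrics(0)
    val freshCopy = new Metrics(0)
    println(freshCopy == zero)  // false: falls back to reference equality, so a copy of zero looks non-zero
    println(zero == zero)       // true: only the exact zero instance compares equal
  }
}
```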
d51c6aa [SPARK-23438][DSTREAMS] Fix DStreams data loss with WAL when driver crashes There is a race condition introduced in SPARK-11141 which could cause data loss. The problem is that the ReceivedBlockTracker.insertAllocatedBatch function assumes that all blocks from streamIdToUnallocatedBlockQueues are allocated to the batch, and clears the queue. In this PR only the allocated blocks are removed from the queue, which prevents data loss. Tested with an additional unit test plus manual testing. Author: Gabor Somogyi <gabor.g.somogyi@gmail.com> Closes #20620 from gaborgsomogyi/SPARK-23438. (cherry picked from commit b308182f233b8840dfe0e6b5736d2f2746f40757) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 26 February 2018, 16:57:45 UTC
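A hedged sketch of the idea in the fix above, using hypothetical types: dequeue only the blocks that were actually allocated to the batch, so anything that arrived afterwards stays in the queue instead of being lost when the queue is cleared wholesale.

```scala
import scala.collection.mutable

object DequeueAllocatedSketch {
  // Remove only the allocated blocks; anything that arrived after allocation stays queued.
  def removeAllocated[T](queue: mutable.Queue[T], allocated: Seq[T]): Unit =
    allocated.foreach(block => queue.dequeueFirst(_ == block))

  def main(args: Array[String]): Unit = {
    val queue = mutable.Queue("b1", "b2", "b3")
    removeAllocated(queue, Seq("b1", "b2"))  // "b3" arrived after allocation and must survive
    println(queue)                            // Queue(b3)
  }
}
```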
076c2f6 [SPARK-22327][SPARKR][TEST][BACKPORT-2.0] check for version warning ## What changes were proposed in this pull request? backporting to 2.0 (since it's the first branch "older than" 2.1.2) ## How was this patch tested? manually Jenkins, AppVeyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19550 from felixcheung/rcranversioncheck20. 31 October 2017, 04:45:49 UTC
28c0151 [SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint ## What changes were proposed in this pull request? Fix java lint ## How was this patch tested? Run `./dev/lint-java` Author: Andrew Ash <andrew@andrewash.com> Closes #19574 from ash211/aash/fix-java-lint. (cherry picked from commit 5433be44caecaeef45ed1fdae10b223c698a9d14) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 October 2017, 21:41:41 UTC
8e71b19 [SPARK-21991][LAUNCHER] Fix race condition in LauncherServer#acceptConnections ## What changes were proposed in this pull request? This patch changes the order in which _acceptConnections_ starts the client thread and schedules the client timeout action, ensuring that the latter has been scheduled before the former gets a chance to cancel it. ## How was this patch tested? Due to the non-deterministic nature of the patch, I wasn't able to add a new test for this issue. Author: Andrea zito <andrea.zito@u-hopper.com> Closes #19217 from nivox/SPARK-21991. (cherry picked from commit 6ea8a56ca26a7e02e6574f5f763bb91059119a80) 25 October 2017, 17:13:09 UTC
a96fbd8 [SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator Backport of https://github.com/apache/spark/pull/18752 (https://issues.apache.org/jira/browse/SPARK-21551) (cherry picked from commit 9d3c6640f56e3e4fd195d3ad8cead09df67a72c7) Author: peay <peay@protonmail.com> Closes #19514 from FRosner/branch-2.0. 21 October 2017, 08:53:09 UTC
ac23491 [SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should stop SparkContext. ## What changes were proposed in this pull request? `pyspark.sql.tests.SQLTests2` doesn't stop the newly created Spark context in the test, and it might affect the following tests. This PR makes `pyspark.sql.tests.SQLTests2` stop the `SparkContext`. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19158 from ueshin/issues/SPARK-21950. (cherry picked from commit 57bc1e9eb452284cbed090dbd5008eb2062f1b36) Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 08 September 2017, 05:27:11 UTC
bf1f30d [SPARK-21826][SQL][2.1][2.0] outer broadcast hash join should not throw NPE backport https://github.com/apache/spark/pull/19036 to branch 2.1 and 2.0 Author: Wenchen Fan <wenchen@databricks.com> Closes #19040 from cloud-fan/bug. (cherry picked from commit 576975356357ead203e452d0d794794349ba4578) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 24 August 2017, 17:39:40 UTC
9f670ce [SPARK-21306][ML] For branch 2.0, OneVsRest should support setWeightCol The PR is related to #18554, and is modified for branch 2.0. ## What changes were proposed in this pull request? Add a `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait. ## How was this patch tested? Added a unit test. Author: Yan Facai (颜发才) <facai.yan@gmail.com> Closes #18764 from facaiy/BUG/branch-2.0_OneVsRest_support_setWeightCol. 08 August 2017, 03:18:15 UTC
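A hedged usage sketch for the setter added above; the training DataFrame, its "weight" column, and the choice of LogisticRegression are assumptions for illustration. As the entry notes, the weight column is silently ignored when the underlying classifier does not support weights.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest, OneVsRestModel}
import org.apache.spark.sql.DataFrame

object WeightedOvrSketch {
  // Assumes `train` has "label", "features" and a per-row "weight" column.
  def fitWeighted(train: DataFrame): OneVsRestModel =
    new OneVsRest()
      .setClassifier(new LogisticRegression())
      .setWeightCol("weight")  // ignored if the classifier lacks a weightCol param
      .fit(train)
}
```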
c27a01a [SPARK-21522][CORE] Fix flakiness in LauncherServerSuite. Handle the case where the server closes the socket before the full message has been written by the client. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18727 from vanzin/SPARK-21522. (cherry picked from commit b133501800b43fa5c538a4e5ad597c9dc7d8378e) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 August 2017, 17:06:36 UTC
f8ae2bd Revert "[SPARK-21306][ML] OneVsRest should support setWeightCol" This reverts commit ccb82722450c20c9cdea2b2c68783943213a5aa1. 28 July 2017, 11:45:14 UTC
ccb8272 [SPARK-21306][ML] OneVsRest should support setWeightCol ## What changes were proposed in this pull request? Add a `setWeightCol` method for OneVsRest. `weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait. ## How was this patch tested? Added a unit test. Author: Yan Facai (颜发才) <facai.yan@gmail.com> Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol. (cherry picked from commit a5a3189974ea4628e9489eb50099a5432174e80c) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 28 July 2017, 02:20:27 UTC
d7b9d62 [SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions ## What changes were proposed in this pull request? This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below: ``` val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil) val sc = spark.sparkContext val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12))) val df = spark.createDataFrame(rdd, inputSchema) // Works correctly since no nested decimal expression is involved // Expected result type: (26, 6) * (26, 6) = (38, 12) df.select($"col" * $"col").explain(true) df.select($"col" * $"col").printSchema() // Gives a wrong result since there is a nested decimal expression that should be visited first // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18) df.select($"col" * $"col" * $"col").explain(true) df.select($"col" * $"col" * $"col").printSchema() ``` The example above gives the following output: ``` // Correct result without sub-expressions == Parsed Logical Plan == 'Project [('col * 'col) AS (col * col)#4] +- LogicalRDD [col#1] == Analyzed Logical Plan == (col * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Physical Plan == *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4] +- Scan ExistingRDD[col#1] // Schema root |-- (col * col): decimal(38,12) (nullable = true) // Incorrect result with sub-expressions == Parsed Logical Plan == 'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Analyzed Logical Plan == ((col * col) * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Physical Plan == *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- Scan ExistingRDD[col#1] // Schema root |-- ((col * col) * col): decimal(38,12) (nullable = true) ``` ## How was this patch tested? This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18583 from aokolnychyi/spark-21332. (cherry picked from commit 0be5fb41a6b7ef4da9ba36f3604ac646cb6d4ae3) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 18 July 2017, 04:08:57 UTC
e4f57f2 [SPARK-21426][2.0][SQL][TEST] Fix test failure due to missing literal representation ## What changes were proposed in this pull request? SPARK 2.0 does not support hex literal. Thus, the test case failed after backporting https://github.com/apache/spark/pull/18571 ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18643 from gatorsmile/fixTestFailure2.0. 16 July 2017, 16:50:36 UTC
1afded0 [SPARK-21344][SQL] BinaryType comparison does signed byte array comparison ## What changes were proposed in this pull request? This PR fixes a wrong comparison for `BinaryType`. It enables unsigned comparison and unsigned prefix generation for arrays for `BinaryType`; the previous implementation used signed operations. ## How was this patch tested? Added a test suite in `OrderingSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18571 from kiszk/SPARK-21344. (cherry picked from commit ac5d5d795909061a17e056696cf0ef87d9e65dd1) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 15 July 2017, 03:17:08 UTC
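A tiny demonstration of the signed-vs-unsigned difference at the heart of the entry above, using plain JDK calls rather than Spark's comparator: with signed bytes, 0x80 sorts before 0x7F, while the unsigned view sorts it after.

```scala
object UnsignedByteCompareSketch {
  // Unsigned comparison: treat each byte as a value in [0, 255].
  def compareUnsigned(a: Byte, b: Byte): Int =
    java.lang.Integer.compare(a & 0xff, b & 0xff)

  def main(args: Array[String]): Unit = {
    val (high, low) = (0x80.toByte, 0x7f.toByte)
    println(java.lang.Byte.compare(high, low)) // negative: signed view says 0x80 (-128) < 0x7F (127)
    println(compareUnsigned(high, low))        // positive: unsigned view says 0x80 (128) > 0x7F (127)
  }
}
```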
4229e16 [SPARK-21282][TEST][2.0] Fix test failure in 2.0 ### What changes were proposed in this pull request? There is a test failure after backporting a fix from 2.2 to 2.0, because the automatically generated column names are different between 2.2 and 2.0 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/ This PR is to re-generate the result file. ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18506 from gatorsmile/fixFailure. 03 July 2017, 05:28:51 UTC
44a97f7 [SPARK-21138][YARN] Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different ## What changes were proposed in this pull request? When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows: ``` spark.hadoop.fs.defaultFS hdfs://tl-nn-tdw.tencent-distribute.com:54310 spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark ``` The staging dir can not be deleted, it will prompt following message: ``` java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310 ``` ## How was this patch tested? Existing tests Author: sharkdtu <sharkdtu@tencent.com> Closes #18352 from sharkdtu/master. (cherry picked from commit 3d4d11a80fe8953d48d8bfac2ce112e37d38dc90) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 19 June 2017, 22:06:42 UTC
333924e [SPARK-19688][STREAMING] Do not read `spark.yarn.credentials.file` from checkpoint. ## What changes were proposed in this pull request? Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint. ## How was this patch tested? Manually tested with 1.6.3 and 2.1.1. I didn't test this with master because of some compile problems, but I think it will be the same result. ## Notice This should be merged into maintenance branches too. jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008) Author: saturday_s <shi.indetail@gmail.com> Closes #18230 from saturday-shi/SPARK-21008. (cherry picked from commit e92ffe6f1771e3fe9ea2e62ba552c1b5cf255368) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 19 June 2017, 17:25:38 UTC
7efd475 [SPARK-16251][SPARK-20200][CORE][TEST] Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message ## What changes were proposed in this pull request? Currently we don't wait to confirm the removal of the block from the slave's BlockManager; if the removal takes too much time, we will fail the assertion in this test case. The failure can be easily reproduced if we sleep for a while before we remove the block in BlockManagerSlaveEndpoint.receiveAndReply(). ## How was this patch tested? N/A Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18314 from jiangxb1987/LocalCheckpointSuite. (cherry picked from commit 7dc3e697c74864a4e3cca7342762f1427058b3c3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 June 2017, 16:07:58 UTC
0239b16 [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0 ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/18244 to 2.2 --- The precision and scale of decimal values are wrong when the input is a BigDecimal between -1.0 and 1.0. The BigDecimal's precision is the digit count starting from the leftmost nonzero digit based on the [JAVA's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal definition follows the database decimal standard, which is the total number of digits, including both to the left and the right of the decimal point. Thus, this PR is to fix the issue by doing the conversion. Before this PR, the following queries failed: ```SQL select 1 > 0.0001 select floor(0.0001) select ceil(0.0001) ``` ### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #18297 from gatorsmile/backport18244. (cherry picked from commit 626511953b87747e933e4f64b9fcd4c4776a5c4e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 June 2017, 11:18:58 UTC
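A short check of the definition gap described above, assuming the JDK's precision counting is the point of contention: java.math.BigDecimal counts digits from the leftmost non-zero digit, so a SQL-style decimal needs its precision derived from the scale for values between -1.0 and 1.0.

```scala
object DecimalPrecisionSketch {
  def main(args: Array[String]): Unit = {
    val bd = new java.math.BigDecimal("0.0001")
    println(bd.precision)                      // 1: JDK counts from the leftmost non-zero digit
    println(bd.scale)                          // 4: digits to the right of the decimal point
    // A SQL-style precision must cover every digit, so it can never be below the scale.
    println(math.max(bd.precision, bd.scale))  // 4: the precision a database decimal would need
  }
}
```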
0f35988 [SPARK-20974][BUILD] we should run REPL tests if SQL module has code changes ## What changes were proposed in this pull request? REPL module depends on SQL module, so we should run REPL tests if SQL module has code changes. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18191 from cloud-fan/test. (cherry picked from commit 864d94fe879a32de324da65a844e62a0260b222d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 June 2017, 05:00:32 UTC
9952b53 [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18178 from vanzin/SPARK-20922-hotfix. (cherry picked from commit 0b25a7d93359e348e11b2e8698990a53436b3c5d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 June 2017, 23:45:41 UTC
f7cbf90 [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher. Blindly deserializing classes using Java serialization opens the code up to issues in other libraries, since just deserializing data from a stream may end up executing code (think readObject()). Since the launcher protocol is pretty self-contained, there's just a handful of classes it legitimately needs to deserialize, and they're in just two packages, so add a filter that throws errors if classes from any other package show up in the stream. This also maintains backwards compatibility (the updated launcher code can still communicate with the backend code in older Spark releases). Tested with new and existing unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18166 from vanzin/SPARK-20922. (cherry picked from commit 8efc6e986554ae66eab93cd64a9035d716adbab0) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 June 2017, 21:45:09 UTC
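A generic sketch of the allow-list pattern the entry above describes, written against plain java.io; the class name and package list here are hypothetical, not the launcher's actual ones. The idea is simply to reject any class outside the expected packages while the stream resolves classes.

```scala
import java.io.{InputStream, InvalidClassException, ObjectInputStream, ObjectStreamClass}

// Rejects deserialization of any class whose name falls outside the allowed packages.
class FilteringObjectInputStream(in: InputStream, allowedPackages: Seq[String])
  extends ObjectInputStream(in) {

  override def resolveClass(desc: ObjectStreamClass): Class[_] = {
    val name = desc.getName
    // java.lang is exempted so boxed primitives and strings still deserialize.
    if (!name.startsWith("java.lang.") && !allowedPackages.exists(p => name.startsWith(p))) {
      throw new InvalidClassException(name, "class not allowed in this stream")
    }
    super.resolveClass(desc)
  }
}
```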
cd870c0 [SPARK-20940][CORE] Replace IllegalAccessError with IllegalStateException ## What changes were proposed in this pull request? `IllegalAccessError` is a fatal error (a subclass of LinkageError) and its meaning is `Thrown if an application attempts to access or modify a field, or to call a method that it does not have access to`. Throwing a fatal error for AccumulatorV2 is not necessary and is pretty bad because it usually will just kill executors or SparkContext ([SPARK-20666](https://issues.apache.org/jira/browse/SPARK-20666) is an example of killing SparkContext due to `IllegalAccessError`). I think the correct type of exception in AccumulatorV2 should be `IllegalStateException`. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18168 from zsxwing/SPARK-20940. (cherry picked from commit 24db35826a81960f08e3eb68556b0f51781144e1) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 01 June 2017, 00:26:49 UTC
9846a3c [SPARK-20868][CORE] UnsafeShuffleWriter should verify the position after FileChannel.transferTo ## What changes were proposed in this pull request? Long ago we fixed a [bug](https://issues.apache.org/jira/browse/SPARK-3948) in the shuffle writer about `FileChannel.transferTo`. We were not very confident about that fix, so we added a position check after the writing, to try to discover the bug earlier. However, this check is missing in the new `UnsafeShuffleWriter`; this PR adds it. https://issues.apache.org/jira/browse/SPARK-18105 may be related to that `FileChannel.transferTo` bug; hopefully we can find out the root cause after adding this position check. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18091 from cloud-fan/shuffle. (cherry picked from commit d9ad78908f6189719cec69d34557f1a750d2e6af) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 May 2017, 07:02:45 UTC
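A hedged sketch of the position-check idea, assuming a copy loop built directly on FileChannel.transferTo (everything except the JDK calls is made up for illustration): after the loop, the destination's position must have advanced by exactly the number of bytes we believe were written.

```scala
import java.nio.channels.FileChannel

object TransferToCheckSketch {
  // Copies `length` bytes from the start of `src` into `dst`, then verifies the destination
  // channel really advanced by that many bytes, guarding against a silent short transfer.
  def copyChecked(src: FileChannel, dst: FileChannel, length: Long): Unit = {
    val startPos = dst.position()
    var copied = 0L
    while (copied < length) {
      copied += src.transferTo(copied, length - copied, dst)
    }
    val expected = startPos + length
    assert(dst.position() == expected,
      s"Destination position ${dst.position()} != expected $expected; data may be corrupted")
  }
}
```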
ef0ebdd [SPARK-20250][CORE] Improper OOM error when a task is killed while spilling data Currently, when a task is calling spill() but receives a kill request from the driver (e.g., a speculative task), the `TaskMemoryManager` will throw an `OOM` exception. And we don't catch fatal exceptions when an error is caused by `Thread.interrupt`. So for `ClosedByInterruptException`, we should throw `RuntimeException` instead of `OutOfMemoryError`. https://issues.apache.org/jira/browse/SPARK-20250?jql=project%20%3D%20SPARK Existing unit tests. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18090 from ConeyLiu/SPARK-20250. (cherry picked from commit 731462a04f8e33ac507ad19b4270c783a012a33e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 May 2017, 07:52:31 UTC
79fbfbb [SPARK-18406][CORE][BACKPORT-2.0] Race between end-of-task and completion iterator read lock release This is a backport PR of #18076 to 2.0 and 2.1. ## What changes were proposed in this pull request? When a TaskContext is not propagated properly to all child threads for the task, just like the reported cases in this issue, we fail to get the TID from the TaskContext, which makes us unable to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method. ## How was this patch tested? Added a new failing regression test case in `RDDSuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18096 from jiangxb1987/completion-iterator-2.0. 24 May 2017, 21:34:17 UTC
72e1f83 [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`, we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using python3 and numpy 1.12. Author: Bago Amirbekian <bago@databricks.com> Closes #18081 from MrBago/BF-py3floatbug. (cherry picked from commit bc66a77bbe2120cc21bd8da25194efca4cde13c3) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 24 May 2017, 15:00:01 UTC
4dd34d0 [SPARK-20756][YARN] yarn-shuffle jar references unshaded guava and contains scala classes ## What changes were proposed in this pull request? This change ensures that all references to guava from within the yarn shuffle jar point to the shaded guava class already provided in the jar. Also, it explicitly excludes scala classes from being added to the jar. ## How was this patch tested? Ran unit tests on the module and they passed. javap now returns the expected result - a reference to the shaded guava under `org/spark_project` (previously this referred to `com.google...`): ``` javap -cp common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar -c org/apache/spark/network/yarn/YarnShuffleService | grep Lists 57: invokestatic #138 // Method org/spark_project/guava/collect/Lists.newArrayList:()Ljava/util/ArrayList; ``` Guava is still shaded in the jar: ``` jar -tf common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar | grep guava | head META-INF/maven/com.google.guava/ META-INF/maven/com.google.guava/guava/ META-INF/maven/com.google.guava/guava/pom.properties META-INF/maven/com.google.guava/guava/pom.xml org/spark_project/guava/ org/spark_project/guava/annotations/ org/spark_project/guava/annotations/Beta.class org/spark_project/guava/annotations/GwtCompatible.class org/spark_project/guava/annotations/GwtIncompatible.class org/spark_project/guava/annotations/VisibleForTesting.class ``` (not sure if the above META-INF/* is a problem or not) I took this jar, deployed it on a yarn cluster with shuffle service enabled, and made sure the YARN node managers came up. An application with a shuffle was run and it succeeded. Author: Mark Grover <mark@apache.org> Closes #17990 from markgrover/spark-20756. (cherry picked from commit 36309110046a89d749a7c9746eaa16997de26922) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 22 May 2017, 17:11:27 UTC
9b145c6 [SPARK-17424] Fix unsound substitution bug in ScalaReflection. ## What changes were proposed in this pull request? This method gets a type's primary constructor and fills in type parameters with concrete types. For example, `MapPartitions[T, U] -> MapPartitions[Int, String]`. This substitution fails when the actual type args are empty because they are still unknown. Instead, when there are no resolved types to substitute, this returns the original args with unresolved type parameters. ## How was this patch tested? This doesn't affect substitutions where the type args are determined. This fixes our case where the actual type args are empty and our job runs successfully. Author: Ryan Blue <blue@apache.org> Closes #15062 from rdblue/SPARK-17424-fix-unsound-reflect-substitution. (cherry picked from commit b23693390781a99ff9248ea07a22e68884ffc747) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2017, 12:39:43 UTC
b2d0ed2 [SPARK-20665][SQL] "Bround" and "Round" functions return NULL spark-sql>select bround(12.3, 2); spark-sql>NULL For this case, the expected result is 12.3, but it is null. So, when the second parameter is bigger than the decimal's scale, the result is not what we expected. The "round" function has the same problem. This PR solves the problem for both of them. Unit test cases in MathExpressionsSuite and MathFunctionsSuite. Author: liuxian <liu.xian3@zte.com.cn> Closes #17906 from 10110346/wip_lx_0509. (cherry picked from commit 2b36eb696f6c738e1328582630755aaac4293460) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2017, 03:43:21 UTC
d86dae8 [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params ## What changes were proposed in this pull request? - Replace `getParam` calls with `getOrDefault` calls. - Fix exception message to avoid unintended `TypeError`. - Add unit tests ## How was this patch tested? New unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17891 from zero323/SPARK-20631. (cherry picked from commit 804949c6bf00b8e26c39d48bbcc4d0470ee84e47) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 10 May 2017, 09:00:22 UTC
4665997 [SPARK-20558][CORE] clear InheritableThreadLocal variables in SparkContext when stopping it ## What changes were proposed in this pull request? To better understand this problem, let's take a look at an example first: ``` object Main { def main(args: Array[String]): Unit = { var t = new Test new Thread(new Runnable { override def run() = {} }).start() println("first thread finished") t.a = null t = new Test new Thread(new Runnable { override def run() = {} }).start() } } class Test { var a = new InheritableThreadLocal[String] { override protected def childValue(parent: String): String = { println("parent value is: " + parent) parent } } a.set("hello") } ``` The result is: ``` parent value is: hello first thread finished parent value is: hello parent value is: hello ``` Once an `InheritableThreadLocal` has been set value, child threads will inherit its value as long as it has not been GCed, so setting the variable which holds the `InheritableThreadLocal` to `null` doesn't work as we expected. In `SparkContext`, we have an `InheritableThreadLocal` for local properties, we should clear it when stopping `SparkContext`, or all the future child threads will still inherit it and copy the properties and waste memory. This is the root cause of https://issues.apache.org/jira/browse/SPARK-20548 , which creates/stops `SparkContext` many times and finally have a lot of `InheritableThreadLocal` alive, and cause OOM when starting new threads in the internal thread pools. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #17833 from cloud-fan/core. (cherry picked from commit b946f3160eb7953fb30edf1f097ea87be75b33e7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 May 2017, 02:09:47 UTC
068500a [SPARK-20239][CORE][2.1-BACKPORT] Improve HistoryServer's ACL mechanism The current SHS (Spark History Server) has two different ACLs: * ACL of the base URL, controlled by "spark.acls.enabled" or "spark.ui.acls.enabled". With this enabled, only users configured in "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user who started the SHS, can list all the applications; otherwise none can be listed. This also affects the REST APIs that list the summary of all apps and of one app. * Per-application ACL, controlled by "spark.history.ui.acls.enabled". With this enabled, only the history admin user and the user/group who ran the app can access its details. With these two ACLs, we may encounter several unexpected behaviors: 1. If the base URL's ACL (`spark.acls.enable`) is enabled but user "A" has no view permission, user "A" cannot see the app list but can still access the details of its own app. 2. If the base URL's ACL (`spark.acls.enable`) is disabled, then user "A" can download any application's event log, even if it was not run by user "A". 3. Changes to the Live UI's ACL affect the History UI's ACL, since they share the same conf file. These unexpected behaviors arise mainly because we have two different ACLs; ideally we should have only one to manage everything. So to improve SHS's ACL mechanism, this PR proposes to: 1. Disable "spark.acls.enable" and only use "spark.history.ui.acls.enable" for the history server. 2. Check permission for the event-log download REST API. With this PR: 1. An admin user can see/download the list of all applications, as well as application details. 2. A normal user can see the list of all applications, but can only download and check the details of applications accessible to them. New UTs are added, also verified in a real cluster. CC tgravescs vanzin please help to review, this PR changes the semantics you did previously. Thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #17755 from jerryshao/SPARK-20239-2.1-backport. (cherry picked from commit 359382c038d5836e95ee3ca871f3d1da5bc08148) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 April 2017, 22:21:24 UTC
ddf6dd8 [SPARK-20451] Filter out nested mapType datatypes from sort order in randomSplit ## What changes were proposed in this pull request? In `randomSplit`, it is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized, which could result in overlapping splits. To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapTypes` cannot be sorted, this patch explicitly prunes them out from the sort order. Additionally, if the resulting sort order is empty, this patch then materializes the dataset to guarantee determinism. ## How was this patch tested? Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test for dataframes with mapTypes and nested mapTypes. Author: Sameer Agarwal <sameerag@cs.berkeley.edu> Closes #17751 from sameeragarwal/randomsplit2. (cherry picked from commit 31345fde82ada1f8bb12807b250b04726a1f6aa6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 April 2017, 05:07:13 UTC
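A rough sketch of the pruning step described above, assuming the check only needs to look at top-level column types (the real rule also has to handle maps nested inside other types): drop MapType columns from the candidate sort order and report whether anything sortable remains.

```scala
import org.apache.spark.sql.types.{MapType, StructField, StructType}

object SortableColumnsSketch {
  // Columns usable for a deterministic per-partition sort; MapType has no ordering.
  def sortableColumns(schema: StructType): Seq[StructField] =
    schema.fields.toSeq.filterNot(_.dataType.isInstanceOf[MapType])

  // If nothing sortable remains, the caller falls back to materializing the dataset.
  def needsMaterialization(schema: StructType): Boolean = sortableColumns(schema).isEmpty
}
```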
84be4c8 [SPARK-19019][PYTHON][BRANCH-2.0] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 ## What changes were proposed in this pull request? This PR proposes to backports https://github.com/apache/spark/pull/16429 to branch-2.0 so that Python 3.6.0 works with Spark 2.0.x. ## How was this patch tested? Manually, via ``` ./run-tests --python-executables=python3.6 ``` ``` Finished test(python3.6): pyspark.tests (124s) Finished test(python3.6): pyspark.accumulators (4s) Finished test(python3.6): pyspark.broadcast (4s) Finished test(python3.6): pyspark.conf (3s) Finished test(python3.6): pyspark.context (15s) Finished test(python3.6): pyspark.ml.classification (24s) Finished test(python3.6): pyspark.sql.tests (190s) Finished test(python3.6): pyspark.mllib.tests (190s) Finished test(python3.6): pyspark.ml.clustering (14s) Finished test(python3.6): pyspark.ml.linalg.__init__ (0s) Finished test(python3.6): pyspark.ml.recommendation (18s) Finished test(python3.6): pyspark.ml.feature (28s) Finished test(python3.6): pyspark.ml.evaluation (28s) Finished test(python3.6): pyspark.ml.regression (21s) Finished test(python3.6): pyspark.ml.tuning (17s) Finished test(python3.6): pyspark.streaming.tests (239s) Finished test(python3.6): pyspark.mllib.evaluation (15s) Finished test(python3.6): pyspark.mllib.classification (24s) Finished test(python3.6): pyspark.mllib.clustering (37s) Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s) Finished test(python3.6): pyspark.mllib.fpm (19s) Finished test(python3.6): pyspark.mllib.feature (19s) Finished test(python3.6): pyspark.mllib.random (8s) Finished test(python3.6): pyspark.ml.tests (76s) Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s) Finished test(python3.6): pyspark.mllib.recommendation (21s) Finished test(python3.6): pyspark.mllib.linalg.distributed (27s) Finished test(python3.6): pyspark.mllib.regression (22s) Finished test(python3.6): pyspark.mllib.stat._statistics (11s) Finished test(python3.6): pyspark.mllib.tree (16s) Finished test(python3.6): pyspark.profiler (8s) Finished test(python3.6): pyspark.shuffle (1s) Finished test(python3.6): pyspark.mllib.util (17s) Finished test(python3.6): pyspark.serializers (12s) Finished test(python3.6): pyspark.rdd (18s) Finished test(python3.6): pyspark.sql.conf (4s) Finished test(python3.6): pyspark.sql.catalog (14s) Finished test(python3.6): pyspark.sql.column (13s) Finished test(python3.6): pyspark.sql.context (15s) Finished test(python3.6): pyspark.sql.group (26s) Finished test(python3.6): pyspark.sql.dataframe (31s) Finished test(python3.6): pyspark.sql.functions (32s) Finished test(python3.6): pyspark.sql.types (5s) Finished test(python3.6): pyspark.sql.streaming (11s) Finished test(python3.6): pyspark.sql.window (5s) Finished test(python3.6): pyspark.streaming.util (0s) Finished test(python3.6): pyspark.sql.session (15s) Finished test(python3.6): pyspark.sql.readwriter (34s) Tests passed in 376 seconds ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #17374 from HyukjinKwon/SPARK-19019-backport. 17 April 2017, 17:03:42 UTC
24f6ef2 [SPARK-20291][SQL][BACKPORT] NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, DoubleType) ## What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/17606 `NaNvl(float value, null)` will be converted into `NaNvl(float value, Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), Cast(null, DoubleType))`. This causes a mismatch in the output type when the input type is float. Adding an extra rule in TypeCoercion resolves this issue. ## How was this patch tested? Unit tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: DB Tsai <dbt@netflix.com> Author: DB Tsai <dbtsai@dbtsai.com> Closes #17618 from dbtsai/branch-2.0. 12 April 2017, 16:08:37 UTC
123a758 [MINOR][SQL] Fix the @since tag when backporting SPARK-18555 from 2.2 branch into 2.0 branch ## What changes were proposed in this pull request? Fix the since tag when backporting critical bugs (SPARK-18555) from 2.2 branch into 2.0 branch. ## How was this patch tested? N/A Please review http://spark.apache.org/contributing.html before opening a pull request. Author: DB Tsai <dbtsai@dbtsai.com> Closes #17601 from dbtsai/branch-2.0. 11 April 2017, 04:04:18 UTC
aec3752 [SPARK-20270][SQL] na.fill should not change the values in long or integer when the default value is in double ## What changes were proposed in this pull request? This bug was partially addressed in SPARK-18555 https://github.com/apache/spark/pull/15994, but the root cause isn't completely solved. This bug is pretty critical since it changes the member id in Long in our application if the member id can not be represented by Double losslessly when the member id is very big. Here is an example how this happens, with ``` Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), (9123146099426677101L, null), (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2), ``` the logical plan will be ``` == Analyzed Logical Plan == a: bigint, b: double Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS b#241] +- Project [_1#229L AS a#232L, _2#230 AS b#233] +- LocalRelation [_1#229L, _2#230] ``` Note that even the value is not null, Spark will cast the Long into Double first. Then if it's not null, Spark will cast it back to Long which results in losing precision. The behavior should be that the original value should not be changed if it's not null, but Spark will change the value which is wrong. With the PR, the logical plan will be ``` == Analyzed Logical Plan == a: bigint, b: double Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241] +- Project [_1#229L AS a#232L, _2#230 AS b#233] +- LocalRelation [_1#229L, _2#230] ``` which behaves correctly without changing the original Long values and also avoids extra cost of unnecessary casting. ## How was this patch tested? unit test added. +cc srowen rxin cloud-fan gatorsmile Thanks. Author: DB Tsai <dbt@netflix.com> Closes #17577 from dbtsai/fixnafill. (cherry picked from commit 1a0bc41659eef317dcac18df35c26857216a4314) Signed-off-by: DB Tsai <dbtsai@dbtsai.com> 11 April 2017, 00:14:30 UTC
735e203 [SPARK-18555][SQL] DataFrameNaFunctions.fill messes up original values in long integers ## What changes were proposed in this pull request? When DataSet.na.fill(0) is used on a DataSet which has a long value column, it will change the original long value. The reason is that the type of the fill function's param is Double, and the numeric columns are always cast to double (`fillCol[Double](f, value)`). ``` def fill(value: Double, cols: Seq[String]): DataFrame = { val columnEquals = df.sparkSession.sessionState.analyzer.resolver val projections = df.schema.fields.map { f => // Only fill if the column is part of the cols list. if (f.dataType.isInstanceOf[NumericType] && cols.exists(col => columnEquals(f.name, col))) { fillCol[Double](f, value) } else { df.col(f.name) } } df.select(projections : _*) } ``` For example: ``` scala> val df = Seq[(Long, Long)]((1, 2), (-1, -2), (9123146099426677101L, 9123146560113991650L)).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint] scala> df.show +-------------------+-------------------+ | a| b| +-------------------+-------------------+ | 1| 2| | -1| -2| |9123146099426677101|9123146560113991650| +-------------------+-------------------+ scala> df.na.fill(0).show +-------------------+-------------------+ | a| b| +-------------------+-------------------+ | 1| 2| | -1| -2| |9123146099426676736|9123146560113991680| +-------------------+-------------------+ ``` The original values changed [which is not the result we expected]: ``` 9123146099426677101 -> 9123146099426676736 9123146560113991650 -> 9123146560113991680 ``` ## How was this patch tested? unit test added. Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)> Closes #15994 from windpiger/nafillMissupOriginalValue. (cherry picked from commit 508de38c9928d160cf70e8e7d69ddb1dca5c1a64) Signed-off-by: DB Tsai <dbtsai@dbtsai.com> 11 April 2017, 00:11:29 UTC
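The round trip that loses precision can be reproduced without Spark at all; the sketch below uses the same value quoted in the entry above and shows that the Long → Double → Long conversion is not lossless for values near 2^63.

```scala
object LongDoubleRoundTripSketch {
  def main(args: Array[String]): Unit = {
    val original = 9123146099426677101L
    val roundTripped = original.toDouble.toLong
    println(roundTripped)               // 9123146099426676736
    println(roundTripped == original)   // false: the cast through Double changed the value
  }
}
```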
87be965 [SPARK-20285][TESTS] Increase the pyspark streaming test timeout to 30 seconds ## What changes were proposed in this pull request? Saw the following failure locally: ``` Traceback (most recent call last): File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, in test_cogroup self._test_func(input, func, expected, sort=True, input2=input2) File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, in _test_func self.assertEqual(expected, result) AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != [] First list contains 3 additional elements. First extra element 0: [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))] + [] - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))], - [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))], - [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]] ``` It also happened on Jenkins: http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120 It's because when the machine is overloaded, the timeout is not enough. This PR just increases the timeout to 30 seconds. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #17597 from zsxwing/SPARK-20285. (cherry picked from commit f9a50ba2d1bfa3f55199df031e71154611ba51f6) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 10 April 2017, 21:07:08 UTC
a0b499f [SPARK-20246][SQL] should not push predicate down through aggregate with non-deterministic expressions ## What changes were proposed in this pull request? Similar to `Project`, when `Aggregate` has non-deterministic expressions, we should not push predicate down through it, as it will change the number of input rows and thus change the evaluation result of non-deterministic expressions in `Aggregate`. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #17562 from cloud-fan/filter. (cherry picked from commit 7577e9c356b580d744e1fc27c645fce41bdf9cf0) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 08 April 2017, 03:54:52 UTC
9016e17 [SPARK-20214][ML] Make sure converted csc matrix has sorted indices ## What changes were proposed in this pull request? `_convert_to_vector` converts a scipy sparse matrix to csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that: from scipy.sparse import lil_matrix lil = lil_matrix((4, 1)) lil[1, 0] = 1 lil[3, 0] = 2 _convert_to_vector(lil.todok()) File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector return SparseVector(l.shape[0], csc.indices, csc.data) File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__ % (self.indices[i], self.indices[i + 1])) TypeError: Indices 3 and 1 are not strictly increasing A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices: >>> from scipy.sparse import lil_matrix >>> lil = lil_matrix((4, 1)) >>> lil[1, 0] = 1 >>> lil[3, 0] = 2 >>> dok = lil.todok() >>> csc = dok.tocsc() >>> csc.has_sorted_indices 0 >>> csc.indices array([3, 1], dtype=int32) I checked the source codes of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #17532 from viirya/make-sure-sorted-indices. (cherry picked from commit 12206058e8780e202c208b92774df3773eff36ae) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 06 April 2017, 00:47:59 UTC
15ea5ea [SPARK-20223][SQL] Fix typo in tpcds q77.sql ## What changes were proposed in this pull request? Fix typo in tpcds q77.sql ## How was this patch tested? N/A Author: wangzhenhua <wangzhenhua@huawei.com> Closes #17538 from wzhfy/typoQ77. (cherry picked from commit a2d8d767d933321426a4eb9df1583e017722d7d6) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 05 April 2017, 17:22:10 UTC
90eb373 [SPARK-19959][SQL] Fix to throw NullPointerException in df[java.lang.Long].collect ## What changes were proposed in this pull request? This PR fixes a `NullPointerException` in the code generated by Catalyst. When we run the following code, we get the following `NullPointerException`. This is because there are no null checks for `inputadapter_value` while `java.lang.Long inputadapter_value` at Line 30 may be `null`. This happens when the type of a DataFrame column is a nullable boxed primitive type such as `java.lang.Long` and whole-stage codegen is used. While the physical plan keeps `nullable=true` in `input[0, java.lang.Long, true].longValue`, `BoundReference.doGenCode` ignores `nullable=true`. Thus, null-check code will not be generated and a `NullPointerException` will occur. This PR checks the nullability and correctly generates a null check if needed. ```java sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect ``` ```java Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:393) ... ``` Generated code without this PR ```java /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private scala.collection.Iterator[] inputs; /* 008 */ private scala.collection.Iterator inputadapter_input; /* 009 */ private UnsafeRow serializefromobject_result; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder; /* 011 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter; /* 012 */ /* 013 */ public GeneratedIterator(Object[] references) { /* 014 */ this.references = references; /* 015 */ } /* 016 */ /* 017 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 018 */ partitionIndex = index; /* 019 */ this.inputs = inputs; /* 020 */ inputadapter_input = inputs[0]; /* 021 */ serializefromobject_result = new UnsafeRow(1); /* 022 */ this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0); /* 023 */ this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1); /* 024 */ /* 025 */ } /* 026 */ /* 027 */ protected void processNext() throws java.io.IOException { /* 028 */ while (inputadapter_input.hasNext() && !stopEarly()) { /* 029 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); /* 030 */ java.lang.Long inputadapter_value = (java.lang.Long)inputadapter_row.get(0, null); /* 031 */ /* 032 */ boolean serializefromobject_isNull = true; /* 033 */ long serializefromobject_value = -1L; /* 034 */ if (!false) { /* 035 */ serializefromobject_isNull = false; /* 036 */ if (!serializefromobject_isNull) { /* 037 */ serializefromobject_value = inputadapter_value.longValue(); /* 038 */ } /* 039 */ /* 040 */ } /* 041 */ serializefromobject_rowWriter.zeroOutNullBytes(); /* 042 */ /* 043 */ if (serializefromobject_isNull) { /* 044 */ serializefromobject_rowWriter.setNullAt(0); /* 045 */ } else { /* 046 */ serializefromobject_rowWriter.write(0, serializefromobject_value); /* 047 */ } /* 048 */ append(serializefromobject_result); /* 049 */ if (shouldStop()) return; /* 050 */ } /* 051 */ } /* 052 */ } ``` Generated code with this PR ```java /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private scala.collection.Iterator[] inputs; /* 008 */ private scala.collection.Iterator inputadapter_input; /* 009 */ private UnsafeRow serializefromobject_result; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder; /* 011 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter; /* 012 */ /* 013 */ public GeneratedIterator(Object[] references) { /* 014 */ this.references = references; /* 015 */ } /* 016 */ /* 017 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 018 */ partitionIndex = index; /* 019 */ this.inputs = inputs; /* 020 */ inputadapter_input = inputs[0]; /* 021 */ serializefromobject_result = new UnsafeRow(1); /* 022 */ this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0); /* 023 */ this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1); /* 024 */ /* 025 */ } /* 026 */ /* 027 */ protected void processNext() throws java.io.IOException { /* 028 */ while (inputadapter_input.hasNext() && !stopEarly()) { /* 029 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); /* 030 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0); /* 031 */ java.lang.Long inputadapter_value = inputadapter_isNull ? null : ((java.lang.Long)inputadapter_row.get(0, null)); /* 032 */ /* 033 */ boolean serializefromobject_isNull = true; /* 034 */ long serializefromobject_value = -1L; /* 035 */ if (!inputadapter_isNull) { /* 036 */ serializefromobject_isNull = false; /* 037 */ if (!serializefromobject_isNull) { /* 038 */ serializefromobject_value = inputadapter_value.longValue(); /* 039 */ } /* 040 */ /* 041 */ } /* 042 */ serializefromobject_rowWriter.zeroOutNullBytes(); /* 043 */ /* 044 */ if (serializefromobject_isNull) { /* 045 */ serializefromobject_rowWriter.setNullAt(0); /* 046 */ } else { /* 047 */ serializefromobject_rowWriter.write(0, serializefromobject_value); /* 048 */ } /* 049 */ append(serializefromobject_result); /* 050 */ if (shouldStop()) return; /* 051 */ } /* 052 */ } /* 053 */ } ``` ## How was this patch tested? Added new test suites in `DataFrameSuites` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #17302 from kiszk/SPARK-19959. (cherry picked from commit bb823ca4b479a00030c4919c2d857d254b2a44d8) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 March 2017, 04:58:57 UTC
b45940e [SPARK-17204][CORE] Fix replicated off heap storage (Jira: https://issues.apache.org/jira/browse/SPARK-17204) There are a couple of bugs in the `BlockManager` with respect to support for replicated off-heap storage. First, the locally-stored off-heap byte buffer is disposed of when it is replicated. It should not be. Second, the replica byte buffers are stored as heap byte buffers instead of direct byte buffers even when the storage level memory mode is off-heap. This PR addresses both of these problems. `BlockManagerReplicationSuite` was enhanced to fill in the coverage gaps. It now fails if either of the bugs in this PR exist. Author: Michael Allman <michael@videoamp.com> Closes #17390 from mallman/spark-17204-replicated_off_heap_storage-2.0_backport. 24 March 2017, 04:52:10 UTC
72a0ee3 [SPARK-19994][HOTFIX][BRANCH-2.0] Change InnerLike to Inner ## What changes were proposed in this pull request? InnerLike => Inner ## How was this patch tested? Existing tests. Author: wangzhenhua <wangzhenhua@huawei.com> Closes #17376 from wzhfy/hotFixWrongOrdering. 21 March 2017, 13:38:28 UTC
3983b3d [SPARK-19994][SQL] Wrong outputOrdering for right/full outer smj ## What changes were proposed in this pull request? For right outer join, values of the left key will be filled with nulls if it can't match the value of the right key, so `nullOrdering` of the left key can't be guaranteed. We should output right key order instead of left key order. For full outer join, neither left key nor right key guarantees `nullOrdering`. We should not output any ordering. In tests, besides adding three test cases for left/right/full outer sort merge join, this patch also reorganizes code in `PlannerSuite` by putting together tests for `Sort`, and also extracts common logic in Sort tests into a method. ## How was this patch tested? Corresponding test cases are added. Author: wangzhenhua <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes #17331 from wzhfy/wrongOrdering. (cherry picked from commit 965a5abcff3adccc10a53b0d97d06c43934df1a2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 March 2017, 06:38:47 UTC
6ee7d5b [SPARK-19986][TESTS] Make pyspark.streaming.tests.CheckpointTests more stable ## What changes were proposed in this pull request? Sometimes, CheckpointTests will hang on a busy machine because the streaming jobs are too slow and cannot catch up. I observed the scheduled delay was keeping increasing for dozens of seconds locally. This PR increases the batch interval from 0.5 seconds to 2 seconds to generate less Spark jobs. It should make `pyspark.streaming.tests.CheckpointTests` more stable. I also replaced `sleep` with `awaitTerminationOrTimeout` so that if the streaming job fails, it will also fail the test. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #17323 from zsxwing/SPARK-19986. (cherry picked from commit 376d782164437573880f0ad58cecae1cb5f212f2) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 17 March 2017, 18:12:50 UTC
fd5149a Hot fix for compilation error caused by PR #17236 15 March 2017, 13:00:39 UTC
e8426cb [SPARK-19893][SQL] should not run DataFrame set operations with map type In Spark SQL, map type can't be used in equality tests/comparisons, and `Intersect`/`Except`/`Distinct` need an equality test for all columns, so we should not allow map type in `Intersect`/`Except`/`Distinct`. new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #17236 from cloud-fan/map. (cherry picked from commit fb9beda54622e0c3190c6504fc468fa4e50eeb45) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 March 2017, 00:33:26 UTC
c561e6c [SPARK-19481] [REPL] [MAVEN] Avoid to leak SparkContext in Signaling.cancelOnInterrupt ## What changes were proposed in this pull request? `Signaling.cancelOnInterrupt` leaks a SparkContext per call and it makes ReplSuite unstable. This PR adds `SparkContext.getActive` to allow `Signaling.cancelOnInterrupt` to get the active `SparkContext` to avoid the leak. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16825 from zsxwing/SPARK-19481. 08 March 2017, 20:49:53 UTC
da3dfaf [SPARK-18055][SQL] Use correct mirror in ExpressionEncoder Previously, we were using the mirror of the passed-in `TypeTag` when reflecting to build an encoder. This fails when the outer class is built in (i.e. `Seq`'s default mirror is based on the root classloader) but inner classes (i.e. `A` in `Seq[A]`) are defined in the REPL or a library. This patch changes us to always reflect based on a mirror created using the context classloader. Author: Michael Armbrust <michael@databricks.com> Closes #17201 from marmbrus/replSeqEncoder. (cherry picked from commit 314e48a3584bad4b486b046bbf0159d64ba857bc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 March 2017, 09:34:25 UTC
e699028 [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe ## What changes were proposed in this pull request? The `keyword_only` decorator in PySpark is not thread-safe. It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`. If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten. See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code. This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition. It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize. ## How was this patch tested? Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances. Author: Bryan Cutler <cutlerb@gmail.com> Closes #17195 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348-2_0. 08 March 2017, 04:46:39 UTC
0cc992c [SPARK-16845][SQL][BRANCH-2.0] `GeneratedClass$SpecificOrdering` grows beyond 64 KB ## What changes were proposed in this pull request? This is a backport pr of #15480 into `branch-2.0`. ## How was this patch tested? Existing tests. Author: Liwei Lin <lwlin7@gmail.com> Closes #17157 from ueshin/issues/SPARK-16845_2.0. 06 March 2017, 21:11:29 UTC
c7e7b04 [SPARK-19822][TEST] CheckpointSuite.testCheckpointedOperation: should not filter checkpointFilesOfLatestTime with the PATH string. ## What changes were proposed in this pull request? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73800/testReport/ ``` sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 617 times over 10.003740484 seconds. Last failure message: 8 did not equal 2. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.DStreamCheckpointTester$class.generateOutput(CheckpointSuite .scala:172) at org.apache.spark.streaming.CheckpointSuite.generateOutput(CheckpointSuite.scala:211) ``` the check condition is: ``` val checkpointFilesOfLatestTime = Checkpoint.getCheckpointFiles(checkpointDir).filter { _.toString.contains(clock.getTimeMillis.toString) } // Checkpoint files are written twice for every batch interval. So assert that both // are written to make sure that both of them have been written. assert(checkpointFilesOfLatestTime.size === 2) ``` the path string may contain the `clock.getTimeMillis.toString`, like `3500` : ``` file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-500 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1000 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-1500 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2000 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-2500 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3000 file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500.bk file:/root/dev/spark/assembly/CheckpointSuite/spark-20035007-9891-4fb6-91c1-cc15b7ccaf15/checkpoint-3500 ▲▲▲▲ ``` so we should only check the filename, but not the whole path. ## How was this patch tested? Jenkins. Author: uncleGen <hustyugm@gmail.com> Closes #17167 from uncleGen/flaky-CheckpointSuite. (cherry picked from commit 207067ead6db6dc87b0d144a658e2564e3280a89) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 06 March 2017, 02:17:50 UTC
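A tiny sketch of the corrected filter, assuming plain string paths for illustration: match the batch time against the file name only, so a timestamp-like substring earlier in the directory path (as in the `spark-20035007-…` example above) cannot produce false positives.

```scala
object CheckpointFilterSketch {
  // True only when the final path component mentions the batch time.
  def matchesBatchTime(path: String, timeMillis: Long): Boolean =
    path.substring(path.lastIndexOf('/') + 1).contains(timeMillis.toString)

  def main(args: Array[String]): Unit = {
    val p = "file:/tmp/spark-20035007-9891/checkpoint-1000"
    println(matchesBatchTime(p, 3500)) // false: "3500" only appears in the directory name
    println(matchesBatchTime(p, 1000)) // true: the checkpoint file itself matches
  }
}
```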
7380188 [SPARK-19779][SS] Delete needless tmp file after restart structured streaming job ## What changes were proposed in this pull request? [SPARK-19779](https://issues.apache.org/jira/browse/SPARK-19779) The PR (https://github.com/apache/spark/pull/17012) fixed restarting a Structured Streaming application that uses HDFS as the file system, but it left a problem: a temporary delta file is still kept in HDFS, and Structured Streaming never deletes this tmp file generated during the restart of the streaming job. ## How was this patch tested? unit tests Author: guifeng <guifengleaf@gmail.com> Closes #17124 from gf53520/SPARK-19779. (cherry picked from commit e24f21b5f8365ed25346e986748b393e0b4be25c) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 03 March 2017, 05:19:49 UTC
491b47a [SPARK-19750][UI][BRANCH-2.1] Fix redirect issue from http to https ## What changes were proposed in this pull request? If the Spark UI port (4040) is not set, it will choose port number 0, which makes the HTTPS port also choose 0. The Spark 2.1 code uses this HTTPS port (0) for the redirect, so when a redirect is triggered it points to a wrong URL, like: ``` /tmp/temp$ wget http://172.27.25.134:55015 --2017-02-23 12:13:54-- http://172.27.25.134:55015/ Connecting to 172.27.25.134:55015... connected. HTTP request sent, awaiting response... 302 Found Location: https://172.27.25.134:0/ [following] --2017-02-23 12:13:54-- https://172.27.25.134:0/ Connecting to 172.27.25.134:0... failed: Can't assign requested address. Retrying. --2017-02-23 12:13:55-- (try: 2) https://172.27.25.134:0/ Connecting to 172.27.25.134:0... failed: Can't assign requested address. Retrying. --2017-02-23 12:13:57-- (try: 3) https://172.27.25.134:0/ Connecting to 172.27.25.134:0... failed: Can't assign requested address. Retrying. --2017-02-23 12:14:00-- (try: 4) https://172.27.25.134:0/ Connecting to 172.27.25.134:0... failed: Can't assign requested address. Retrying. ``` So instead of redirecting to port 0, we should pick the actual bound port. This issue only exists in Spark 2.1 and earlier, and can be reproduced in yarn cluster mode. ## How was this patch tested? The current redirect UT doesn't verify this issue, so it is extended to do the correct verification. Author: jerryshao <sshao@hortonworks.com> Closes #17083 from jerryshao/SPARK-19750. (cherry picked from commit 3a7591ad5315308d24c0e444ce304ff78aef2304) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 03 March 2017, 01:19:04 UTC
e30fe1c [SPARK-19766][SQL][BRANCH-2.0] Constant alias columns in INNER JOIN should not be folded by FoldablePropagation rule This PR is the fix for branch-2.0. Refer to #17099. gatorsmile Author: Stan Zhai <zhaishidan@haizhi.com> Closes #17131 from stanzhai/fix-inner-join-2.0. 02 March 2017, 12:24:43 UTC
c9c45d9 [SPARK-19769][DOCS] Update quickstart instructions ## What changes were proposed in this pull request? This change addresses the renaming of the `simple.sbt` build file to `build.sbt`. Newer versions of the sbt tool no longer find the file under its old name and look for `build.sbt` instead. The quickstart instructions for self-contained applications are updated with this change. ## How was this patch tested? As this is a relatively minor change of a few words, the markdown was checked for syntax and spelling. The site was built with `SKIP_API=1 jekyll serve` for testing purposes. Author: Michael McCune <msm@redhat.com> Closes #17101 from elmiko/spark-19769. (cherry picked from commit bf5987cbe6c9f4a1a91d912ed3a9098111632d1a) Signed-off-by: Sean Owen <sowen@cloudera.com> 28 February 2017, 23:07:41 UTC
dcfb05c [SPARK-19677][SS] Committing a delta file atop an existing one should not fail on HDFS ## What changes were proposed in this pull request? HDFSBackedStateStoreProvider fails to rename files on HDFS but not on the local filesystem. According to the [implementation notes](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/filesystem.html) of `rename()`, the behavior of the local filesystem and HDFS varies: > Destination exists and is a file > Renaming a file atop an existing file is specified as failing, raising an exception. > - Local FileSystem : the rename succeeds; the destination file is replaced by the source file. > - HDFS : The rename fails, no exception is raised. Instead the method call simply returns false. This patch ensures that `rename()` isn't called if the destination file already exists. It's still semantically correct because Structured Streaming requires that rerunning a batch should generate the same output. ## How was this patch tested? This patch was tested by running `StateStoreSuite`. Author: Roberto Agostino Vitillo <ra.vitillo@gmail.com> Closes #17012 from vitillo/fix_rename. 28 February 2017, 18:50:51 UTC
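A hedged sketch of the guard described above, using the Hadoop `FileSystem` API directly; the method and variable names are illustrative rather than the provider's actual code:

```scala
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// Commit a temporary delta file to its final name without relying on
// rename-over-existing semantics, which differ between the local FS and HDFS.
def commitDeltaFile(fs: FileSystem, tempFile: Path, finalFile: Path): Unit = {
  if (fs.exists(finalFile)) {
    // A re-run of the same batch must produce the same output, so the file that
    // is already there can be kept and the temporary one discarded.
    fs.delete(tempFile, false)
  } else if (!fs.rename(tempFile, finalFile)) {
    throw new IOException(s"Failed to rename $tempFile to $finalFile")
  }
}
```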
a6af60f [SPARK-19038][YARN] Avoid overwriting keytab configuration in yarn-client ## What changes were proposed in this pull request? yarn#client resets the `spark.yarn.keytab` configuration to point to the keytab's location in the distributed cache, so if the user still uses the old `SparkConf` to create a `SparkSession` with Hive enabled, it will read the keytab from the distributed-cache path. This is OK for yarn cluster mode, but in yarn client mode, where the driver runs outside the container, fetching the keytab fails. So we should avoid resetting this configuration in `yarn#client` and only overwrite it for the AM, so that `spark.yarn.keytab` yields the correct keytab path whether running in client mode (keytab in the local fs) or cluster mode (keytab in the distributed cache). ## How was this patch tested? Verified on a secure cluster. Author: jerryshao <sshao@hortonworks.com> Closes #16923 from jerryshao/SPARK-19038. (cherry picked from commit a920a4369434c84274866a09f61e402232c3b47c) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 24 February 2017, 17:32:32 UTC
8cdd121 [SPARK-19652][UI] Do auth checks for REST API access (branch-2.0). The REST API has a security filter that performs auth checks based on the UI root's security manager. That works fine when the UI root is the app's UI, but not when it's the history server. In the SHS case, all users would be allowed to see all applications through the REST API, even if the UI itself wouldn't be available to them. This change adds auth checks for each app access through the API too, so that only authorized users can see the app's data. The change also modifies the existing security filter to use `HttpServletRequest.getRemoteUser()`, which is used in other places. That is not necessarily the same as the principal's name; for example, when using Hadoop's SPNEGO auth filter, the remote user strips the realm information, which then matches the user name registered as the owner of the application. I also renamed the UIRootFromServletContext trait to a more generic name since I'm using it to store more context information now. Tested manually with an authentication filter enabled. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #17029 from vanzin/SPARK-19652_2.0. 22 February 2017, 23:48:56 UTC
ddd432d [SPARK-19646][CORE][STREAMING] binaryRecords replicates records in scala API Use `BytesWritable.copyBytes`, not `getBytes`, because `getBytes` returns the underlying array, which may be reused when repeated reads don't need a different size, as is the case with the binaryRecords APIs. Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #16974 from srowen/SPARK-19646. (cherry picked from commit d0ecca6075d86bedebf8bc2278085a2cd6cb0a43) Signed-off-by: Sean Owen <sowen@cloudera.com> 20 February 2017, 17:19:14 UTC
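A small sketch of the difference, assuming records arrive as Hadoop `BytesWritable` values (as with the binaryRecords reader); the function name is illustrative:

```scala
import org.apache.hadoop.io.BytesWritable

// getBytes returns the (possibly padded, possibly reused) backing array, so keeping
// it can make many logical records alias the same buffer. copyBytes returns a fresh
// array truncated to getLength, which is safe to retain per record.
def materializeRecord(record: BytesWritable): Array[Byte] = record.copyBytes()
```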
5c3e56f [SPARK-19500] [SQL] Fix off-by-one bug in BytesToBytesMap ## What changes were proposed in this pull request? Radix sort requires that half of the array be free (as temporary space), so we use 0.5 as the scale factor to make sure that BytesToBytesMap never holds more items than 1/2 of its capacity. It turned out this did not hold: the current implementation of append() could leave 1 more item than the threshold (1/2 of capacity) in the array, which breaks the requirement of radix sort (failing the assert in 2.2, or failing to insert into InMemorySorter in 2.1). This PR fixes the off-by-one bug in BytesToBytesMap. It also fixes a bug, introduced by #15722, where the array would never grow again after a single failed grow (staying at its initial capacity). ## How was this patch tested? Added regression test. Author: Davies Liu <davies@databricks.com> Closes #16844 from davies/off_by_one. (cherry picked from commit 3d0c3af0a76757c20e429c38efa4f14a15c9097a) Signed-off-by: Davies Liu <davies.liu@gmail.com> 17 February 2017, 17:38:34 UTC
2926812 [SPARK-19501][YARN] Reduce the number of HDFS RPCs during YARN deployment ## What changes were proposed in this pull request? As discussed in [JIRA](https://issues.apache.org/jira/browse/SPARK-19501), this patch addresses the problem where too many HDFS RPCs are made when there are many URIs specified in `spark.yarn.jars`, potentially adding hundreds of RTTs to YARN before the application launches. This becomes significant when submitting the application to a non-local YARN cluster (where the RTT may be in order of 100ms, for example). For each URI specified, the current implementation makes at least two HDFS RPCs, for: - [Calling `getFileStatus()` before uploading each file to the distributed cache in `ClientDistributedCacheManager.addResource()`](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71). - [Resolving any symbolic links in each of the file URI](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L377-L379), which repeatedly makes HDFS RPCs until the all symlinks are resolved. (see [`FileContext.resolve(Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java#L2189-L2195), [`FSLinkResolver.resolve(FileContext, Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSLinkResolver.java#L79-L112), and [`AbstractFileSystem.resolvePath()`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/AbstractFileSystem.java#L464-L468).) The first `getFileStatus` RPC can be removed, using `statCache` populated with the file statuses retrieved with [the previous `globStatus` call](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L531). The second one can be largely reduced by caching the symlink resolution results in a mutable.HashMap. This patch adds a local variable in `yarn.Client.prepareLocalResources()` and passes it as an additional parameter to `yarn.Client.copyFileToRemote`. [The symlink resolution code was added in 2013](https://github.com/apache/spark/commit/a35472e1dd2ea1b5a0b1fb6b382f5a98f5aeba5a#diff-b050df3f55b82065803d6e83453b9706R187) and has not changed since. I am assuming that this is still required, but otherwise we can remove using `symlinkCache` and symlink resolution altogether. ## How was this patch tested? This patch is based off 8e8afb3, currently the latest YARN patch on master. All tests except a few in spark-hive passed with `./dev/run-tests` on my machine, using JDK 1.8.0_112 on macOS 10.12.3; also tested myself with this modified version of SPARK 2.2.0-SNAPSHOT which performed a normal deployment and execution on a YARN cluster without errors. Author: Jong Wook Kim <jongwook@nyu.edu> Closes #16916 from jongwook/SPARK-19501. (cherry picked from commit ab9872db1f9c0f289541ec5756d1a142d85545ce) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 14 February 2017, 19:42:40 UTC
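A hedged sketch of the first optimization, caching `getFileStatus` results keyed by URI so each distinct resource costs at most one RPC; the names here are illustrative, not the actual `yarn.Client` code:

```scala
import java.net.URI
import scala.collection.mutable
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Reuse a status that was already fetched (e.g. by an earlier globStatus call)
// instead of issuing another getFileStatus RPC for the same URI.
def cachedFileStatus(
    fs: FileSystem,
    uri: URI,
    statCache: mutable.Map[URI, FileStatus]): FileStatus = {
  statCache.getOrElseUpdate(uri, fs.getFileStatus(new Path(uri)))
}
```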
f50c437 [SPARK-19529] TransportClientFactory.createClient() shouldn't call awaitUninterruptibly() This patch replaces a single `awaitUninterruptibly()` call with a plain `await()` call in Spark's `network-common` library in order to fix a bug which may cause tasks to be uncancellable. In Spark's Netty RPC layer, `TransportClientFactory.createClient()` calls `awaitUninterruptibly()` on a Netty future while waiting for a connection to be established. This creates problem when a Spark task is interrupted while blocking in this call (which can happen in the event of a slow connection which will eventually time out). This has bad impacts on task cancellation when `interruptOnCancel = true`. As an example of the impact of this problem, I experienced significant numbers of uncancellable "zombie tasks" on a production cluster where several tasks were blocked trying to connect to a dead shuffle server and then continued running as zombies after I cancelled the associated Spark stage. The zombie tasks ran for several minutes with the following stack: ``` java.lang.Object.wait(Native Method) java.lang.Object.wait(Object.java:460) io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301) org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224) org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) => holding Monitor(java.lang.Object1849476028}) org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105) org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120) org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114) org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169) org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala: 350) org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286) org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120) org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45) org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) org.apache.spark.rdd.RDD.iterator(RDD.scala:287) [...] ``` As far as I can tell, `awaitUninterruptibly()` might have been used in order to avoid having to declare that methods throw `InterruptedException` (this code is written in Java, hence the need to use checked exceptions). This patch simply replaces this with a regular, interruptible `await()` call,. This required several interface changes to declare a new checked exception (these are internal interfaces, though, and this change doesn't significantly impact binary compatibility). An alternative approach would be to wrap `InterruptedException` into `IOException` in order to avoid having to change interfaces. 
The problem with this approach is that the `network-shuffle` project's `RetryingBlockFetcher` code treats `IOExceptions` as transient failures when deciding whether to retry fetches, so throwing a wrapped `IOException` might cause an interrupted shuffle fetch to be retried, further prolonging the lifetime of a cancelled zombie task. Note that there are three other `awaitUninterruptibly()` calls in the codebase, but those calls have a hard 10 second timeout and are waiting on a `close()` operation which is expected to complete near instantaneously, so the impact of uninterruptibility there is much smaller. Tested manually. Author: Josh Rosen <joshrosen@databricks.com> Closes #16866 from JoshRosen/SPARK-19529. (cherry picked from commit 1c4d10b10c78d138b55e381ec6828e04fef70d6f) Signed-off-by: Cheng Lian <lian@databricks.com> 13 February 2017, 20:57:29 UTC
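For illustration, a minimal sketch of an interruptible wait on a Netty `ChannelFuture`, under the assumption that the caller is prepared to handle `InterruptedException`; this is not the actual `TransportClientFactory` code:

```scala
import java.io.IOException
import io.netty.channel.ChannelFuture

// await() (unlike awaitUninterruptibly()) lets a task-kill interrupt propagate,
// so a cancelled task is not left blocked on a connection to a dead shuffle server.
@throws[InterruptedException]
def waitForConnect(cf: ChannelFuture, timeoutMs: Long): Unit = {
  if (!cf.await(timeoutMs)) {
    throw new IOException(s"Connecting timed out ($timeoutMs ms)")
  }
  if (!cf.isSuccess) {
    throw new IOException("Failed to connect", cf.cause())
  }
}
```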
23050c8 [SPARK-17897][SQL][BACKPORT-2.0] Fixed IsNotNull Constraint Inference Rule ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/16067 to Spark 2.0 ---- The `constraints` of an operator is the expressions that evaluate to `true` for all the rows produced. That means, the expression result should be neither `false` nor `unknown` (NULL). Thus, we can conclude that `IsNotNull` on all the constraints, which are generated by its own predicates or propagated from the children. The constraint can be a complex expression. For better usage of these constraints, we try to push down `IsNotNull` to the lowest-level expressions (i.e., `Attribute`). `IsNotNull` can be pushed through an expression when it is null intolerant. (When the input is NULL, the null-intolerant expression always evaluates to NULL.) Below is the existing code we have for `IsNotNull` pushdown. ```Scala private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match { case a: Attribute => Seq(a) case _: NullIntolerant | IsNotNull(_: NullIntolerant) => expr.children.flatMap(scanNullIntolerantExpr) case _ => Seq.empty[Attribute] } ``` **`IsNotNull` itself is not null-intolerant.** It converts `null` to `false`. If the expression does not include any `Not`-like expression, it works; otherwise, it could generate a wrong result. This PR is to fix the above function by removing the `IsNotNull` from the inference. After the fix, when a constraint has a `IsNotNull` expression, we infer new attribute-specific `IsNotNull` constraints if and only if `IsNotNull` appears in the root. Without the fix, the following test case will return empty. ```Scala val data = Seq[java.lang.Integer](1, null).toDF("key") data.filter("not key is not null").show() ``` Before the fix, the optimized plan is like ``` == Optimized Logical Plan == Project [value#1 AS key#3] +- Filter (isnotnull(value#1) && NOT isnotnull(value#1)) +- LocalRelation [value#1] ``` After the fix, the optimized plan is like ``` == Optimized Logical Plan == Project [value#1 AS key#3] +- Filter NOT isnotnull(value#1) +- LocalRelation [value#1] ``` ### How was this patch tested? Added a test Author: Xiao Li <gatorsmile@gmail.com> Closes #16894 from gatorsmile/isNotNull2.0. 12 February 2017, 01:20:44 UTC
00803cd [SPARK-19509][SQL] Grouping Sets do not respect nullable grouping columns ## What changes were proposed in this pull request? The analyzer currently does not check if a column used in grouping sets is actually nullable itself. This can cause the nullability of the column to be incorrect, which can cause null pointer exceptions down the line. This PR fixes that by also considering the nullability of the column. This is only a problem for Spark 2.1 and below. The latest master uses a different approach. Closes https://github.com/apache/spark/pull/16874 ## How was this patch tested? Added a regression test to `SQLQueryTestSuite.grouping_set`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #16873 from hvanhovell/SPARK-19509. (cherry picked from commit a3d5300a030fb5f1c275e671603e0745b6466735) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 09 February 2017, 20:01:45 UTC
8bf6422 [SPARK-19472][SQL] Parser should not mistake CASE WHEN(...) for a function call ## What changes were proposed in this pull request? The SQL parser can mistake a `WHEN (...)` used in `CASE` for a function call. This happens in cases like the following: ```sql select case when (1) + case when 1 > 0 then 1 else 0 end = 2 then 1 else 0 end from tb ``` This PR fixes this by re-organizing the case related parsing rules. ## How was this patch tested? Added a regression test to the `ExpressionParserSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #16821 from hvanhovell/SPARK-19472. (cherry picked from commit cb2677b86039a75fcd8a4e567ab06055f054a19a) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 06 February 2017, 20:28:52 UTC
b41294b [SPARK-19333][SPARKR] Add Apache License headers to R files ## What changes were proposed in this pull request? add header ## How was this patch tested? Manual run to check vignettes html is created properly Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16709 from felixcheung/rfilelicense. (cherry picked from commit 385d73848b0d274467b633c7615e03b370f4a634) Signed-off-by: Felix Cheung <felixcheung@apache.org> 27 January 2017, 18:31:59 UTC
93d5887 [SPARK-19220][UI] Make redirection to HTTPS apply to all URIs. (branch-2.0) The redirect handler was installed only for the root of the server; any other context ended up being served directly through the HTTP port. Since every sub page (e.g. application UIs in the history server) is a separate servlet context, this meant that everything but the root was accessible via HTTP still. The change adds separate names to each connector, and binds contexts to specific connectors so that content is only served through the HTTPS connector when it's enabled. In that case, the only thing that binds to the HTTP connector is the redirect handler. Tested with new unit tests and by checking a live history server. (cherry picked from commit 59502bbcf6e64e5b5e3dda080441054afaf58c53) Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #16717 from vanzin/SPARK-19220_2.0. 27 January 2017, 18:16:09 UTC
48a8dc8 [SPARK-14804][SPARK][GRAPHX] Fix checkpointing of VertexRDD/EdgeRDD ## What changes were proposed in this pull request? EdgeRDD/VertexRDD override checkpoint() and isCheckpointed() to forward these to the internal partitionRDD. So when checkpoint() is called on them, it's the partitionRDD that actually gets checkpointed. However, since isCheckpointed() is also overridden to call partitionRDD.isCheckpointed, EdgeRDD/VertexRDD.isCheckpointed returns true even though this RDD is actually not checkpointed. This would have been fine except that the RDD's internal logic for computing the RDD depends on isCheckpointed(). So for VertexRDD/EdgeRDD, since isCheckpointed is true, Spark tries to read checkpoint data of the VertexRDD/EdgeRDD when computing them, even though they are not actually checkpointed. Through a crazy sequence of call forwarding, it reads checkpoint data of partitionsRDD and tries to cast it to types in Vertex/EdgeRDD. This leads to ClassCastException. The minimal fix that does not change any public behavior is to modify the RDD internals to not use the public, overridable API for internal logic. ## How was this patch tested? New unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #15396 from tdas/SPARK-14804. (cherry picked from commit 47d5d0ddb06c7d2c86515d9556c41dc80081f560) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 26 January 2017, 01:18:02 UTC
00a4807 [SPARK-18750][YARN] Follow up: move test to correct directory in 2.1 branch. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #16704 from vanzin/SPARK-18750_2.1. (cherry picked from commit 97d3353ef16a6e6edc93d8177b08442a03e19eee) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 January 2017, 22:23:48 UTC
2d9e8d5 [SPARK-18750][YARN] Avoid using "mapValues" when allocating containers. That method is prone to stack overflows when the input map is really large; instead, use plain "map". Also includes a unit test that was tested and caused stack overflows without the fix. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #16667 from vanzin/SPARK-18750. (cherry picked from commit 76db394f2baedc2c7b7a52c05314a64ec9068263) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 January 2017, 18:51:11 UTC
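A small illustration of the distinction (not the allocator code itself): in Scala 2.11/2.12, `mapValues` returns a lazy, re-wrapping view, so feeding its result back through `mapValues` repeatedly builds an ever-deeper chain, whereas a plain `map` materializes a new `Map` each time.

```scala
// Eager copy: safe to apply repeatedly over large maps.
def doubleEager(m: Map[String, Int]): Map[String, Int] =
  m.map { case (host, n) => host -> (n * 2) }

// Lazy view (Scala 2.11/2.12 semantics): each call wraps the previous map, so
// applying it over and over builds a deep chain of wrappers, which is one way
// the stack overflow described above can arise when the chain is finally evaluated.
def doubleLazy(m: Map[String, Int]): Map[String, Int] =
  m.mapValues(_ * 2)
```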
886f737 [SPARK-19155][ML] MLlib GeneralizedLinearRegression family and link should be case insensitive ## What changes were proposed in this pull request? MLlib ```GeneralizedLinearRegression``` ```family``` and ```link``` should be case insensitive. This is consistent with some other MLlib params such as [```featureSubsetStrategy```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala#L415). ## How was this patch tested? Updated the corresponding tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16516 from yanboliang/spark-19133. (cherry picked from commit 3dcad9fab17297f9966026f29fefb5c726965a13) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 22 January 2017, 05:16:42 UTC
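A hedged sketch of how case-insensitive handling of such string params can be done; the supported-value list below is illustrative, not the full set the estimator accepts:

```scala
import java.util.Locale

val supportedFamilies = Set("gaussian", "binomial", "poisson", "gamma")

// Normalize user input once, so "Gamma", "GAMMA" and "gamma" are all accepted.
def normalizeFamily(value: String): String = {
  val lower = value.toLowerCase(Locale.ROOT)
  require(supportedFamilies.contains(lower), s"Unsupported family: $value")
  lower
}
```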
4c2065d [SPARK-19314][SS][CATALYST] Do not allow sort before aggregation in Structured Streaming plan ## What changes were proposed in this pull request? Sort in a streaming plan should be allowed only after an aggregation in complete mode. Currently it is incorrectly allowed when present anywhere in the plan. It gives unpredictable, potentially incorrect results. ## How was this patch tested? New test Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16662 from tdas/SPARK-19314. (cherry picked from commit 552e5f08841828e55f5924f1686825626da8bcd0) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 20 January 2017, 22:05:23 UTC
9fc053c [SPARK-16968][SQL][BACKPORT-2.0] Add additional options in jdbc when creating a new table ### What changes were proposed in this pull request? This PR is to backport the PRs https://github.com/apache/spark/pull/14559 and https://github.com/apache/spark/pull/14683 --- In the PR, we just allow the user to add additional options when create a new table in JDBC writer. The options can be table_options or partition_options. E.g., "CREATE TABLE t (name string) ENGINE=InnoDB DEFAULT CHARSET=utf8" Here is the usage example: ``` df.write.option("createTableOptions", "ENGINE=InnoDB DEFAULT CHARSET=utf8").jdbc(...) ``` ### How was this patch tested? Added a test case. Author: gatorsmile <gatorsmile@gmail.com> Closes #16634 from gatorsmile/backportSPARK-16968. 19 January 2017, 04:58:31 UTC
ee4e8fa [SPARK-17237][SPARK-17458][SQL][BACKPORT-2.0] Preserve aliases that are given for pivot aggregations ## What changes were proposed in this pull request? This PR preserves aliases that are given for pivot aggregations to solve the issue reported in `SPARK-17237`. Pivoting adds backticks (e.g. 3_count(\`c\`)) to column names and, in some cases, this causes analysis exceptions like: ``` scala> val df = Seq((2, 3, 4), (3, 4, 5)).toDF("a", "x", "y") scala> df.groupBy("a").pivot("x").agg(count("y"), avg("y")).na.fill(0) org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_count(`y`)`; at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:134) at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:144) ... ``` So, this PR also removes these backticks from column names. ## How was this patch tested? Added a test in `DataFrameAggregateSuite`. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #16565 from maropu/SPARK-17237-3. 15 January 2017, 07:39:31 UTC
08385b7 [SPARK-19180] [SQL] the offset of short should be 2 in OffHeapColumn ## What changes were proposed in this pull request? the offset of short is 4 in OffHeapColumnVector's putShorts, but actually it should be 2. ## How was this patch tested? unit test Author: Yucai Yu <yucai.yu@intel.com> Closes #16555 from yucai/offheap_short. (cherry picked from commit ad0dadaa251b031a480fc2080f792a54ed7dfc5f) Signed-off-by: Davies Liu <davies.liu@gmail.com> 13 January 2017, 21:41:22 UTC
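The arithmetic being fixed is simply the element width: a short occupies 2 bytes, so row `i` lives at byte offset `2 * i`. An illustrative sketch using a plain direct `ByteBuffer` (the real column vector writes off-heap memory directly, not a ByteBuffer):

```scala
import java.nio.ByteBuffer

val buf = ByteBuffer.allocateDirect(2 * 1024) // room for 1024 shorts

// A short is 2 bytes wide, so the byte offset of element i is 2 * i (not 4 * i).
def putShortAt(rowId: Int, value: Short): Unit = buf.putShort(2 * rowId, value)
def getShortAt(rowId: Int): Short = buf.getShort(2 * rowId)
```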
f56819f [SPARK-19178][SQL] convert string of large numbers to int should return null ## What changes were proposed in this pull request? When we convert a string to an integral type, we first convert that string to `decimal(20, 0)`, so that a string in decimal format can be turned into a truncated integral, e.g. `CAST('1.2' AS int)` returns `1`. However, this brings problems when we convert a string containing large numbers to an integral type, e.g. `CAST('1234567890123' AS int)` returns `1912276171`, while Hive returns null as expected. This is a long-standing bug (it seems to have been there since the first day of Spark SQL); this PR fixes it by adding native support for converting `UTF8String` to integral types. ## How was this patch tested? new regression tests Author: Wenchen Fan <wenchen@databricks.com> Closes #16550 from cloud-fan/string-to-int. (cherry picked from commit 6b34e745bb8bdcf5a8bb78359fa39bbe8c6563cc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 January 2017, 10:48:21 UTC
449231c [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error The change makes SQLContext reuse the active SparkSession during construction if the sparkContext supplied is the same as the currently active SparkContext. Without this change, a new SparkSession is instantiated, which results in a Derby error when attempting to create a DataFrame using a new SQLContext object, even though the SparkContext supplied to the new SQLContext is the same as the currently active one. Refer to https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro. Existing unit tests and a new unit test added to pyspark-sql: /python/run-tests --python-executables=python --modules=pyspark-sql Author: Vinayak <vijoshi5@in.ibm.com> Author: Vinayak Joshi <vijoshi@users.noreply.github.com> Closes #16119 from vijoshi/SPARK-18687_master. (cherry picked from commit 285a7798e267311730b0163d37d726a81465468a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 January 2017, 10:36:35 UTC
be527dd Fix missing close-parens for In filter's toString Otherwise the opening parenthesis isn't closed in query plan descriptions of batch scans: PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ... Author: Andrew Ash <andrew@andrewash.com> Closes #16558 from ash211/patch-9. (cherry picked from commit b040cef2ed0ed46c3dfb483a117200c9dac074ca) Signed-off-by: Reynold Xin <rxin@databricks.com> 13 January 2017, 07:14:25 UTC
55d2a11 [SPARK-19055][SQL][PYSPARK] Fix SparkSession initialization when SparkContext is stopped ## What changes were proposed in this pull request? During SparkSession initialization, we store the created SparkSession instance in a class variable _instantiatedContext, so that SparkSession.builder.getOrCreate() can later retrieve the existing instance. However, when the active SparkContext is stopped and we create another new SparkContext to use, the existing SparkSession is still associated with the stopped SparkContext, so operations with this existing SparkSession will fail. We need to detect such a case in SparkSession and renew the class variable _instantiatedContext if needed. ## How was this patch tested? New test added in PySpark. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #16454 from viirya/fix-pyspark-sparksession. (cherry picked from commit c6c37b8af714c8ddc8c77ac943a379f703558f27) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 January 2017, 12:54:41 UTC
3566e40 [SPARK-18969][SQL] Support grouping by nondeterministic expressions ## What changes were proposed in this pull request? Currently nondeterministic expressions are allowed in `Aggregate` (see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`; this PR fixes it. close https://github.com/apache/spark/pull/16379 There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the two `rand()` calls are different (we generate a random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue. ## How was this patch tested? a new test suite Author: Wenchen Fan <wenchen@databricks.com> Closes #16404 from cloud-fan/groupby. (cherry picked from commit 871d266649ddfed38c64dfda7158d8bb58d4b979) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 January 2017, 12:25:44 UTC
c94288b [SPARK-18857][SQL] Don't use `Iterator.duplicate` for `incrementalCollect` in Thrift Server ## What changes were proposed in this pull request? To support `FETCH_FIRST`, SPARK-16563 used Scala `Iterator.duplicate`. However, Scala `Iterator.duplicate` uses a **queue to buffer all items between both iterators**; this causes GC pressure and hangs for queries with a large number of rows. We should not use it, especially for `spark.sql.thriftServer.incrementalCollect`. https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/Iterator.scala#L1262-L1300 ## How was this patch tested? Pass the existing tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16440 from dongjoon-hyun/SPARK-18857. (cherry picked from commit a2c6adcc5d2702d2f0e9b239517353335e5f911e) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 January 2017, 10:45:26 UTC
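A small demonstration of the buffering behaviour that makes `Iterator.duplicate` unsuitable here: whatever one copy consumes is queued in memory until the other copy catches up, so a large incremental result set is effectively fully buffered.

```scala
val (fast, slow) = Iterator.range(0, 1000000).duplicate

// Draining `fast` forces all one million elements into the internal queue,
// because `slow` has not consumed them yet.
fast.foreach(_ => ())

// `slow` is then served entirely from that buffered queue.
println(slow.take(3).toList) // List(0, 1, 2)
```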
ec2fe92 [SPARK-19133][SPARKR][ML][BACKPORT-2.0] fix glm for Gamma, clarify glm family supported ## What changes were proposed in this pull request? Backport to 2.0 (cherry picking from 2.1 didn't work) ## How was this patch tested? unit test Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16543 from felixcheung/rgammabackport20. 12 January 2017, 04:01:11 UTC
6fe676c [SPARK-18997][CORE] Recommended upgrade libthrift to 0.9.3 ## What changes were proposed in this pull request? Updates to libthrift 0.9.3 to address a CVE. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #16530 from srowen/SPARK-18997. (cherry picked from commit 856bae6af64982ae0221948c58ff564887e54a70) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 10 January 2017, 20:40:57 UTC
e70c419 [SPARK-18941][SQL][DOC] Add a new behavior document on `CREATE/DROP TABLE` with `LOCATION` ## What changes were proposed in this pull request? This PR adds a new behavior change description on `CREATE TABLE ... LOCATION` at `sql-programming-guide.md` clearly under `Upgrading From Spark SQL 1.6 to 2.0`. This change is introduced at Apache Spark 2.0.0 as [SPARK-15276](https://issues.apache.org/jira/browse/SPARK-15276). ## How was this patch tested? ``` SKIP_API=1 jekyll build ``` **Newly Added Description** <img width="913" alt="new" src="https://cloud.githubusercontent.com/assets/9700541/21743606/7efe2b12-d4ba-11e6-8a0d-551222718ea2.png"> Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16400 from dongjoon-hyun/SPARK-18941. (cherry picked from commit 923e594844a7ad406195b91877f0fb374d5a454b) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 08 January 2017, 02:55:52 UTC
56998f3 [SPARK-19110][ML][MLLIB] DistributedLDAModel returns different logPrior for original and loaded model ## What changes were proposed in this pull request? While adding DistributedLDAModel training summary for SparkR, I found that the logPrior for original and loaded model is different. For example, in the test("read/write DistributedLDAModel"), I add the test: val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior assert(logPrior === logPrior2) The test fails: -4.394180878889078 did not equal -4.294290536919573 The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns different `logPrior`. Please refer to #16464 for details. ## How was this patch tested? Add a new unit test for testing logPrior. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16491 from wangmiao1981/ldabug. (cherry picked from commit 036b50347c56a3541c526b1270093163b9b79e45) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 07 January 2017, 19:10:06 UTC
93549ff [SPARK-18877][SQL][BACKPORT-2.0] CSVInferSchema.inferField` on DecimalType should find a common type with `typeSoFar` ## What changes were proposed in this pull request? CSV type inferencing causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a **partition**. Specifically, `inferRowType`, the **seqOp** of **aggregate**, returns the last decimal type. This PR fixes it to use `findTightestCommonType`. **decimal.csv** ``` 9.03E+12 1.19E+11 ``` **BEFORE** ```scala scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema root |-- _c0: decimal(3,-9) (nullable = true) scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show 16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3 ``` **AFTER** ```scala scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema root |-- _c0: decimal(4,-9) (nullable = true) scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show +---------+ | _c0| +---------+ |9.030E+12| | 1.19E+11| +---------+ ``` ## How was this patch tested? Pass the newly add test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16472 from dongjoon-hyun/SPARK-18877-BACKPORT-20. 05 January 2017, 05:34:31 UTC
5ed2f1c [SPARK-18993][BUILD] Unable to build/compile Spark in IntelliJ due to missing Scala deps in spark-tags ## What changes were proposed in this pull request? This adds back a direct dependency on Scala library classes from spark-tags because its Scala annotations need them. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #16418 from srowen/SPARK-18993. (cherry picked from commit d7bce3bd31ec193274718042dc017706989d7563) Signed-off-by: Sean Owen <sowen@cloudera.com> 28 December 2016, 12:17:53 UTC
f124d35 [SPARK-18237][SPARK-18703][SPARK-18675][SQL][BACKPORT-2.0] CTAS for hive serde table should work for all hive versions AND Drop Staging Directories and Data Files ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/15744, https://github.com/apache/spark/pull/16104 and https://github.com/apache/spark/pull/16134. ---------- [[SPARK-18237][HIVE] hive.exec.stagingdir have no effect ](https://github.com/apache/spark/pull/15744) hive.exec.stagingdir have no effect in spark2.0.1, Hive confs in hive-site.xml will be loaded in hadoopConf, so we should use hadoopConf in InsertIntoHiveTable instead of SessionState.conf ---------- [[SPARK-18675][SQL] CTAS for hive serde table should work for all hive versions](https://github.com/apache/spark/pull/16104) Before hive 1.1, when inserting into a table, hive will create the staging directory under a common scratch directory. After the writing is finished, hive will simply empty the table directory and move the staging directory to it. After hive 1.1, hive will create the staging directory under the table directory, and when moving staging directory to table directory, hive will still empty the table directory, but will exclude the staging directory there. In `InsertIntoHiveTable`, we simply copy the code from hive 1.2, which means we will always create the staging directory under the table directory, no matter what the hive version is. This causes problems if the hive version is prior to 1.1, because the staging directory will be removed by hive when hive is trying to empty the table directory. This PR copies the code from hive 0.13, so that we have 2 branches to create staging directory. If hive version is prior to 1.1, we'll go to the old style branch(i.e. create the staging directory under a common scratch directory), else, go to the new style branch(i.e. 
create the staging directory under the table directory) ---------- [[SPARK-18703] [SQL] Drop Staging Directories and Data Files After each Insertion/CTAS of Hive serde Tables](https://github.com/apache/spark/pull/16134) Below are the files/directories generated for three inserts againsts a Hive table: ``` /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1 /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000 /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/._SUCCESS.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/.part-00000.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/_SUCCESS /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/part-00000 /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1 /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000 /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/._SUCCESS.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/.part-00000.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/_SUCCESS /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/part-00000 /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1 /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000 /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/._SUCCESS.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/.part-00000.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/_SUCCESS 
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/part-00000 /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000 ``` The first 18 files are temporary. We do not drop them until JVM termination. If the JVM does not terminate appropriately, these temporary files/directories will not be dropped. Only the last two files are needed, as shown below. ``` /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc /private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000 ``` The temporary files/directories can accumulate quickly when we issue many inserts, since each insert generates at least six files. This can eat a lot of space and slow down JVM termination, and when the JVM does not terminate appropriately, the files might not be dropped at all. This PR is to drop the created staging files and temporary data files after each insert/CTAS. ### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #16399 from gatorsmile/backport18703&18675ToSpark2.0. 26 December 2016, 07:00:22 UTC
30e6d46 [SPARK-17807][CORE] split test-tags into test-JAR Remove spark-tag's compile-scope dependency (and, indirectly, spark-core's compile-scope transitive-dependency) on scalatest by splitting test-oriented tags into spark-tags' test JAR. Alternative to #16303. Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #16311 from ryan-williams/tt. (cherry picked from commit afd9bc1d8a85adf88c412d8bc75e46e7ecb4bcdd) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 23 December 2016, 00:23:26 UTC
2d72160 [SPARK-18972][CORE] Fix the netty thread names for RPC ## What changes were proposed in this pull request? Right now the name of threads created by Netty for Spark RPC are `shuffle-client-**` and `shuffle-server-**`. It's pretty confusing. This PR just uses the module name in TransportConf to set the thread name. In addition, it also includes the following minor fixes: - TransportChannelHandler.channelActive and channelInactive should call the corresponding super methods. - Make ShuffleBlockFetcherIterator throw NoSuchElementException if it has no more elements. Otherwise, if the caller calls `next` without `hasNext`, it will just hang. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16380 from zsxwing/SPARK-18972. (cherry picked from commit f252cb5d161e064d39cc1ed1d9299307a0636174) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 23 December 2016, 00:23:09 UTC
542be40 [SPARK-18031][TESTS] Fix flaky test ExecutorAllocationManagerSuite.basic functionality ## What changes were proposed in this pull request? The failure is because in `test("basic functionality")`, it doesn't block until `ExecutorAllocationManager.manageAllocation` is called. This PR just adds StreamManualClock to allow the tests to block on expected wait time to make the test deterministic. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16321 from zsxwing/SPARK-18031. 22 December 2016, 18:30:04 UTC
080ac37 [SPARK-18528][SQL] Fix a bug to initialise an iterator of aggregation buffer ## What changes were proposed in this pull request? This PR fixes a `NullPointerException` caused by the following `limit + aggregate` query; ``` scala> val df = Seq(("a", 1), ("b", 2), ("c", 1), ("d", 5)).toDF("id", "value") scala> df.limit(2).groupBy("id").count().show WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) ``` The root culprit is that [`$doAgg()`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L596) skips the initialization of [the buffer iterator](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L603); `BaseLimitExec` sets `stopEarly=true` and `$doAgg()` exits in the middle without the initialization. ## How was this patch tested? Added a test to check that no exception happens for limit + aggregates in `DataFrameAggregateSuite.scala`. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #15980 from maropu/SPARK-18528. (cherry picked from commit b41ec997786e2be42a8a2a182212a610d08b221b) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 22 December 2016, 00:54:00 UTC