https://github.com/apache/spark

7277713 Preparing Spark release v1.5.0-rc2 25 August 2015, 22:56:37 UTC
ab7d46d [SPARK-10215] [SQL] Fix precision of division (follow the rule in Hive) Follow the rule in Hive for decimal division. see https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java#L113 cc chenghao-intel Author: Davies Liu <davies@databricks.com> Closes #8415 from davies/decimal_div2. (cherry picked from commit 7467b52ed07f174d93dfc4cb544dc4b69a2c2826) Signed-off-by: Yin Huai <yhuai@databricks.com> 25 August 2015, 22:20:42 UTC
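The Hive rule referenced in the entry above derives the result type of a decimal division from the operand types. A minimal sketch of that derivation, recalled from Hive's `GenericUDFOPDivide` and offered as an assumption rather than a quote of the Spark patch:
```scala
// Result precision/scale for decimal(p1, s1) / decimal(p2, s2), Hive-style (sketch).
// 38 is DecimalType's maximum precision; the real rule also adjusts the scale when capping.
def divideResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val scale = math.max(6, s1 + p2 + 1)   // keep at least 6 fractional digits
  val precision = p1 - s1 + s2 + scale
  (math.min(38, precision), scale)
}

divideResultType(10, 2, 5, 0)            // => (16, 8)
```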
8925896 [SPARK-10245] [SQL] Fix decimal literals with precision < scale In BigDecimal or java.math.BigDecimal, the precision can be smaller than the scale; for example, BigDecimal("0.001") has precision = 1 and scale = 3. But DecimalType requires that the precision be at least the scale, so we should use the maximum of precision and scale when inferring the schema from a decimal literal. Author: Davies Liu <davies@databricks.com> Closes #8428 from davies/smaller_decimal. (cherry picked from commit ec89bd840a6862751999d612f586a962cae63f6d) Signed-off-by: Yin Huai <yhuai@databricks.com> 25 August 2015, 21:55:45 UTC
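A small sketch of the inference described above (illustrative, not the patched Spark code):
```scala
// BigDecimal("0.001") has precision = 1 but scale = 3, while DecimalType needs
// precision >= scale, so take the maximum of the two when inferring the type.
val d = BigDecimal("0.001")
val precision = math.max(d.precision, d.scale)   // 3
val scale = d.scale                              // 3
// infer DecimalType(3, 3) instead of the invalid DecimalType(1, 3)
```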
6f05b7a [SPARK-10239] [SPARK-10244] [MLLIB] update since versions in mllib.pmml and mllib.util Same as #8421 but for `mllib.pmml` and `mllib.util`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8430 from mengxr/SPARK-10239 and squashes the following commits: a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util (cherry picked from commit 00ae4be97f7b205432db2967ba6d506286ef2ca6) Signed-off-by: DB Tsai <dbt@netflix.com> 25 August 2015, 21:11:50 UTC
055387c [SPARK-9797] [MLLIB] [DOC] StreamingLinearRegressionWithSGD.setConvergenceTol default value Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc Author: Feynman Liang <fliang@databricks.com> Closes #8424 from feynmanliang/SPARK-9797. (cherry picked from commit 9205907876cf65695e56c2a94bedd83df3675c03) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 25 August 2015, 20:23:25 UTC
186326d [SPARK-10237] [MLLIB] update since versions in mllib.fpm Same as #8421 but for `mllib.fpm`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8429 from mengxr/SPARK-10237. (cherry picked from commit c619c7552f22d28cfa321ce671fc9ca854dd655f) Signed-off-by: Xiangrui Meng <meng@databricks.com> 25 August 2015, 20:22:45 UTC
95e44b4 [SPARK-9800] Adds docs for GradientDescent$.runMiniBatchSGD alias * Adds doc for the alias of runMiniBatchSGD, documenting the default value for convergenceTol * Cleans up a note in the code Author: Feynman Liang <fliang@databricks.com> Closes #8425 from feynmanliang/SPARK-9800. (cherry picked from commit c0e9ff1588b4d9313cc6ec6e00e5c7663eb67910) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 25 August 2015, 20:21:16 UTC
5a32ed7 [SPARK-10231] [MLLIB] update @Since annotation for mllib.classification Update `Since` annotation in `mllib.classification`: 1. add version to classes, objects, constructors, and public variables declared in constructors 2. correct some versions 3. remove `Since` on `toString` MechCoder dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8421 from mengxr/SPARK-10231 and squashes the following commits: b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification (cherry picked from commit 16a2be1a84c0a274a60c0a584faaf58b55d4942b) Signed-off-by: DB Tsai <dbt@netflix.com> 25 August 2015, 19:16:41 UTC
c740f5d [SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentration See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770) CC jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8422 from feynmanliang/SPARK-10230. (cherry picked from commit 881208a8e849facf54166bdd69d3634407f952e7) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 25 August 2015, 18:58:55 UTC
742c82e [SPARK-8531] [ML] Update ML user guide for MinMaxScaler jira: https://issues.apache.org/jira/browse/SPARK-8531 Update ML user guide for MinMaxScaler Author: Yuhao Yang <hhbyyh@gmail.com> Author: unknown <yuhaoyan@yuhaoyan-MOBL1.ccr.corp.intel.com> Closes #7211 from hhbyyh/minmaxdoc. (cherry picked from commit b37f0cc1b4c064d6f09edb161250fa8b783de52a) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 25 August 2015, 17:54:12 UTC
0402f12 [SPARK-10198] [SQL] Turn off partition verification by default Author: Michael Armbrust <michael@databricks.com> Closes #8404 from marmbrus/turnOffPartitionVerification. (cherry picked from commit 5c08c86bfa43462fb2ca5f7c5980ddfb44dd57f8) Signed-off-by: Michael Armbrust <michael@databricks.com> 25 August 2015, 17:23:08 UTC
bdcc8e6 Fixed a typo in DAGScheduler. Author: ehnalis <zoltan.zvara@gmail.com> Closes #8308 from ehnalis/master. (cherry picked from commit 7f1e507bf7e82bff323c5dec3c1ee044687c4173) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 August 2015, 11:30:18 UTC
5d68405 [DOC] add missing parameters in SparkContext.scala for scala doc Author: Zhang, Liye <liye.zhang@intel.com> Closes #8412 from liyezhang556520/minorDoc. (cherry picked from commit 5c14890159a5711072bf395f662b2433a389edf9) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 August 2015, 10:49:07 UTC
73f1dd1 [SPARK-10197] [SQL] Add null check in wrapperFor (inside HiveInspectors). https://issues.apache.org/jira/browse/SPARK-10197 Author: Yin Huai <yhuai@databricks.com> Closes #8407 from yhuai/ORCSPARK-10197. (cherry picked from commit 0e6368ffaec1965d0c7f89420e04a974675c7f6e) Signed-off-by: Cheng Lian <lian@databricks.com> 25 August 2015, 08:20:13 UTC
a0f22cf [SPARK-10195] [SQL] Data sources Filter should not expose internal types Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third-parties. This issue caused incompatibilities when upgrading our `spark-redshift` library to work against Spark 1.5.0. To avoid these issues in the future we should only expose public types through these Filter objects. This patch accomplishes this by using CatalystTypeConverters to add the appropriate conversions. Author: Josh Rosen <joshrosen@databricks.com> Closes #8403 from JoshRosen/datasources-internal-vs-external-types. (cherry picked from commit 7bc9a8c6249300ded31ea931c463d0a8f798e193) Signed-off-by: Reynold Xin <rxin@databricks.com> 25 August 2015, 08:06:51 UTC
e5cea56 [SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive We misunderstood the Julian day and nanoseconds-of-day fields in parquet timestamps (TimestampType) written by Hive/Impala: the two overlap, so they cannot simply be added together. To avoid confusing rounding during the conversion, we use `2440588` as the Julian Day of the epoch of the unix timestamp (which strictly should be 2440587.5). Author: Davies Liu <davies@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #8400 from davies/timestamp_parquet. (cherry picked from commit 2f493f7e3924b769160a16f73cccbebf21973b91) Signed-off-by: Cheng Lian <lian@databricks.com> 25 August 2015, 08:00:58 UTC
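A hedged sketch of the conversion this entry describes; the helper below is illustrative, not the exact Spark code:
```scala
// Parquet INT96 timestamps from Hive/Impala carry (julianDay, nanosOfDay). Treating
// 2440588 as the Julian Day of 1970-01-01 lets the two parts be combined without
// fractional-day rounding.
val JulianDayOfEpoch = 2440588L
val MicrosPerDay = 24L * 60 * 60 * 1000 * 1000
def julianToMicros(julianDay: Int, nanosOfDay: Long): Long =
  (julianDay - JulianDayOfEpoch) * MicrosPerDay + nanosOfDay / 1000
```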
2032d66 [SPARK-10210] [STREAMING] Filter out non-existent blocks before creating BlockRDD When the write-ahead log is not enabled, a recovered streaming driver still tries to run jobs using pre-failure block ids, and fails because those blocks no longer exist in memory (and cannot be recovered, since the receiver WAL is not enabled). This occurs because the ReceivedBlockTracker's driver-side WAL recovers that past block information, and ReceiverInputDStream creates BlockRDDs even if those blocks do not exist. The solution in this PR is to filter out block ids that do not exist before creating the BlockRDD. In addition, it adds unit tests to verify other logic in ReceiverInputDStream. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8405 from tdas/SPARK-10210. (cherry picked from commit 1fc37581a52530bac5d555dbf14927a5780c3b75) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 25 August 2015, 07:36:01 UTC
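A minimal sketch of the fix's idea; `blockExists` is a hypothetical stand-in for the driver-side block lookup:
```scala
// Keep only the recovered block ids that still exist before building the BlockRDD.
val recoveredBlockIds = Seq("input-0-1", "input-0-2", "input-0-3")
def blockExists(id: String): Boolean = id != "input-0-2"   // pretend this block was lost
val validBlockIds = recoveredBlockIds.filter(blockExists)  // Seq("input-0-1", "input-0-3")
// the BlockRDD is then created from validBlockIds instead of every recovered id
```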
4841ebb [SPARK-6196] [BUILD] Remove MapR profiles in favor of hadoop-provided Follow up to https://github.com/apache/spark/pull/7047 pwendell mentioned that MapR should use `hadoop-provided` now, and indeed the new build script does not produce `mapr3`/`mapr4` artifacts anymore. Hence the action seems to be to remove the profiles, which are now not used. CC trystanleftwich Author: Sean Owen <sowen@cloudera.com> Closes #8338 from srowen/SPARK-6196. (cherry picked from commit 57b960bf3706728513f9e089455a533f0244312e) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 August 2015, 07:32:31 UTC
76d920f [SPARK-10214] [SPARKR] [DOCS] Improve SparkR Column, DataFrame API docs cc: shivaram ## Summary - Add name tags to each methods in DataFrame.R and column.R - Replace `rdname column` with `rdname {each_func}`. i.e. alias method : `rdname column` => `rdname alias` ## Generated PDF File https://drive.google.com/file/d/0B9biIZIU47lLNHN2aFpnQXlSeGs/view?usp=sharing ## JIRA [[SPARK-10214] Improve SparkR Column, DataFrame API docs - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10214) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8414 from yu-iskw/SPARK-10214. (cherry picked from commit d4549fe58fa0d781e0e891bceff893420cb1d598) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 25 August 2015, 07:28:58 UTC
b7c4ff1 [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions. I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class. Author: Josh Rosen <joshrosen@databricks.com> Closes #7631 from JoshRosen/SPARK-9293. (cherry picked from commit 82268f07abfa658869df2354ae72f8d6ddd119e8) Signed-off-by: Michael Armbrust <michael@databricks.com> 25 August 2015, 07:04:23 UTC
95a14e9 [SPARK-10136] [SQL] A more robust fix for SPARK-10136 PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with the parquet-format spec, especially the details of the `LIST` backwards-compatibility rules. Let me try to give an explanation here. The structure of the problematic Parquet schema generated by parquet-avro is something like this: ``` message m { <repetition> group f (LIST) { // Level 1 repeated group array (LIST) { // Level 2 repeated <primitive-type> array; // Level 3 } } } ``` (The schema generated by parquet-thrift is structurally similar; just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.) This structure consists of two nested legacy 2-level `LIST`-like structures: 1. The repeated group type at level 2 is the element type of the outer array defined at level 1. This group should map to a `CatalystArrayConverter.ElementConverter` when building converters. 2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2. This group should also map to a `CatalystArrayConverter.ElementConverter`. The root cause of SPARK-10136 is that the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to the parquet-format spec, the repeated primitive at level 3 is left as a so-called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, so a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it. According to the parquet-format spec, unannotated repeated types shouldn't appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated types to appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.) As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec: > If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. (The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.) This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2]. This PR delivers a more robust fix by adding this rule to the latter method. Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3]. [1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305 [2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463 [3]: https://issues.apache.org/jira/browse/PARQUET-364 Author: Cheng Lian <lian@databricks.com> Closes #8361 from liancheng/spark-10136/proper-version. (cherry picked from commit bf03fe68d62f33dda70dff45c3bda1f57b032dfc) Signed-off-by: Cheng Lian <lian@databricks.com> 25 August 2015, 06:58:57 UTC
0b425ed [SPARK-10196] [SQL] Correctly saving decimals in internal rows to JSON. https://issues.apache.org/jira/browse/SPARK-10196 Author: Yin Huai <yhuai@databricks.com> Closes #8408 from yhuai/DecimalJsonSPARK-10196. (cherry picked from commit df7041d02d3fd44b08a859f5d77bf6fb726895f0) Signed-off-by: Davies Liu <davies.liu@gmail.com> 25 August 2015, 06:38:42 UTC
bb1357f [SPARK-10137] [STREAMING] Avoid restarting receivers if scheduleReceivers returns balanced results This PR fixes the following cases for `ReceiverSchedulingPolicy`. 1) Assume there are 4 executors: host1, host2, host3, host4, and 5 receivers: r1, r2, r3, r4, r5. Then `ReceiverSchedulingPolicy.scheduleReceivers` will return (r1 -> host1, r2 -> host2, r3 -> host3, r4 -> host4, r5 -> host1). Let's assume r1 starts first on `host1`, as `scheduleReceivers` suggested, and tries to register with ReceiverTracker. But the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will return (host2, host3, host4) according to the current executor weights (host1 -> 1.0, host2 -> 0.5, host3 -> 0.5, host4 -> 0.5), so ReceiverTracker will reject `r1`. This is unexpected since r1 is starting exactly where `scheduleReceivers` suggested. This case can be fixed by ignoring the information of the receiver that is rescheduling in `receiverTrackingInfoMap`. 2) Assume there are 3 executors (host1, host2, host3), each executor has 3 cores, and there are 3 receivers: r1, r2, r3. Assume r1 is running on host1. Now r2 is restarting; the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will always return (host1, host2, host3). So it's possible that r2 will be scheduled to host1 by TaskScheduler. r3 is similar. In the end, it's possible that there are 3 receivers running on host1, while host2 and host3 are idle. This issue can be fixed by returning only the executors that have the minimum weight rather than returning at least 3 executors. Author: zsxwing <zsxwing@gmail.com> Closes #8340 from zsxwing/fix-receiver-scheduling. (cherry picked from commit f023aa2fcc1d1dbb82aee568be0a8f2457c309ae) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 25 August 2015, 06:35:02 UTC
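A small sketch of the second fix (illustrative data, not the real `ReceiverSchedulingPolicy` API): return only the executors with the minimum weight instead of padding the candidate list.
```scala
val executorWeights = Map("host1" -> 1.0, "host2" -> 0.5, "host3" -> 0.5)
val minWeight = executorWeights.values.min
val candidates = executorWeights.collect { case (host, w) if w == minWeight => host }.toSeq
// => Seq("host2", "host3"); host1 is excluded, so a restarting receiver cannot pile onto it
```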
88991dc [SPARK-9786] [STREAMING] [KAFKA] fix backpressure so it works with default maxRatePerPartition setting of 0 Author: cody koeninger <cody@koeninger.org> Closes #8413 from koeninger/backpressure-testing-master. (cherry picked from commit d9c25dec87e6da7d66a47ff94e7eefa008081b9d) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 25 August 2015, 06:26:27 UTC
2239a20 [SPARK-10178] [SQL] HiveComparisionTest should print out dependent tables In `HiveComparisionTest`s it is possible to fail a query of the form `SELECT * FROM dest1`, where `dest1` is the query that is actually computing the incorrect results. To aid debugging this patch improves the harness to also print these query plans and their results. Author: Michael Armbrust <michael@databricks.com> Closes #8388 from marmbrus/generatedTables. (cherry picked from commit 5175ca0c85b10045d12c3fb57b1e52278a413ecf) Signed-off-by: Reynold Xin <rxin@databricks.com> 25 August 2015, 06:15:34 UTC
c99f416 [SPARK-10121] [SQL] Thrift server always use the latest class loader provided by the conf of executionHive's state https://issues.apache.org/jira/browse/SPARK-10121 Looks like the problem is that if we add a jar through another thread, the thread handling the JDBC session will not get the latest classloader. Author: Yin Huai <yhuai@databricks.com> Closes #8368 from yhuai/SPARK-10121. (cherry picked from commit a0c0aae1defe5e1e57704065631d201f8e3f6bac) Signed-off-by: Cheng Lian <lian@databricks.com> 25 August 2015, 04:50:44 UTC
2f7e4b4 [SQL] [MINOR] [DOC] Clarify docs for inferring DataFrame from RDD of Products * Makes `SQLImplicits.rddToDataFrameHolder` scaladoc consistent with `SQLContext.createDataFrame[A <: Product](rdd: RDD[A])` since the former is essentially a wrapper for the latter * Clarifies `createDataFrame[A <: Product]` scaladoc to apply for any `RDD[Product]`, not just case classes Author: Feynman Liang <fliang@databricks.com> Closes #8406 from feynmanliang/sql-doc-fixes. (cherry picked from commit 642c43c81c835139e3f35dfd6a215d668a474203) Signed-off-by: Reynold Xin <rxin@databricks.com> 25 August 2015, 02:45:48 UTC
ec5d09c [SPARK-10118] [SPARKR] [DOCS] Improve SparkR API docs for 1.5 release cc: shivaram ## Summary - Modify `rdname` of expression functions, e.g. `ascii`: `rdname functions` => `rdname ascii` - Replace the dynamic function definitions with static ones for the sake of their documentation. ## Generated PDF File https://drive.google.com/file/d/0B9biIZIU47lLX2t6ZjRoRnBTSEU/view?usp=sharing ## JIRA [[SPARK-10118] Improve SparkR API docs for 1.5 release - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10118) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8386 from yu-iskw/SPARK-10118. (cherry picked from commit 6511bf559b736d8e23ae398901c8d78938e66869) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 25 August 2015, 01:17:58 UTC
228e429 [SPARK-10165] [SQL] Await child resolution in ResolveFunctions Currently, we eagerly attempt to resolve functions, even before their children are resolved. However, this is not valid in cases where we need to know the types of the input arguments (e.g., when resolving Hive UDFs). As a fix, this PR delays function resolution until the function's children are resolved. This change also necessitates a change to the way we resolve aggregate expressions that are not in aggregate operators (e.g., in `HAVING` or `ORDER BY` clauses). Specifically, we can no longer assume that these misplaced functions will be resolved, which is what previously allowed us to differentiate aggregate functions from normal functions. To compensate for this change, we now attempt to resolve these unresolved expressions in the context of the aggregate operator, before checking to see if any aggregate expressions are present. Author: Michael Armbrust <michael@databricks.com> Closes #8371 from marmbrus/hiveUDFResolution. (cherry picked from commit 2bf338c626e9d97ccc033cfadae8b36a82c66fd1) Signed-off-by: Michael Armbrust <michael@databricks.com> 25 August 2015, 01:11:04 UTC
8ca8bdd HOTFIX: Adding missing 1.4.1 ec2 version 25 August 2015, 00:22:09 UTC
a4bad5f [SPARK-10190] Fix NPE in CatalystTypeConverters Decimal toScala converter This adds a missing null check to the Decimal `toScala` converter in `CatalystTypeConverters`, fixing an NPE. Author: Josh Rosen <joshrosen@databricks.com> Closes #8401 from JoshRosen/SPARK-10190. (cherry picked from commit d7b4c095271c36fcc7f9ded267ecf5ec66fac803) Signed-off-by: Reynold Xin <rxin@databricks.com> 24 August 2015, 23:17:52 UTC
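The shape of the missing null guard, sketched with plain Java/Scala decimals (the real converter works on Spark's internal `Decimal` type inside `CatalystTypeConverters`):
```scala
// Convert to a Scala BigDecimal, passing nulls through instead of dereferencing them.
def decimalToScala(d: java.math.BigDecimal): BigDecimal =
  if (d == null) null else BigDecimal(d)
```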
aadb9de [SPARK-10061] [DOC] ML ensemble docs User guide for spark.ml GBTs and Random Forests. The examples are copied from the decision tree guide and modified to run. I caught some issues I had somehow missed in the tree guide as well. I have run all examples, including Java ones. (Of course, I thought I had previously as well...) CC: mengxr manishamde yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8369 from jkbradley/ml-ensemble-docs. (cherry picked from commit 13db11cb08eb90eb0ea3402c9fe0270aa282f247) Signed-off-by: Xiangrui Meng <meng@databricks.com> 24 August 2015, 22:39:01 UTC
9223443 [SPARK-9758] [TEST] [SQL] Compilation issue for hive test / wrong package? Move `test.org.apache.spark.sql.hive` package tests to the apparently intended `org.apache.spark.sql.hive`, as they don't intend to test behavior from outside org.apache.spark.* Alternate take, per discussion at https://github.com/apache/spark/pull/8051 I think this is what vanzin and I had in mind, but also CC rxin to cross-check, as this does indeed depend on whether these tests were accidentally in this package or not. Testing from a `test.org.apache.spark` package is legitimate but didn't seem to be the intent here. Author: Sean Owen <sowen@cloudera.com> Closes #8307 from srowen/SPARK-9758. (cherry picked from commit cb2d2e15844d7ae34b5dd7028b55e11586ed93fa) Signed-off-by: Sean Owen <sowen@cloudera.com> 24 August 2015, 21:35:31 UTC
d36f351 [SPARK-8580] [SQL] Refactors ParquetHiveCompatibilitySuite and adds more test cases This PR refactors `ParquetHiveCompatibilitySuite` so that it's easier to add new test cases. Hit two bugs, SPARK-10177 and HIVE-11625, while working on this, added test cases for them and marked as ignored for now. SPARK-10177 will be addressed in a separate PR. Author: Cheng Lian <lian@databricks.com> Closes #8392 from liancheng/spark-8580/parquet-hive-compat-tests. (cherry picked from commit a2f4cdceba32aaa0df59df335ca0ce1ac73fc6c2) Signed-off-by: Davies Liu <davies.liu@gmail.com> 24 August 2015, 21:11:30 UTC
831f78e [SPARK-10144] [UI] Actually show peak execution memory by default The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. The result is that the memory is not displayed by default. Author: Andrew Or <andrew@databricks.com> Closes #8345 from andrewor14/show-memory-default. (cherry picked from commit 662bb9667669cb07cf6d2ccee0d8e76bb561cd89) Signed-off-by: Yin Huai <yhuai@databricks.com> 24 August 2015, 21:11:03 UTC
43dcf95 [SPARK-7710] [SPARK-7998] [DOCS] Docs for DataFrameStatFunctions This PR contains examples on how to use some of the Stat Functions available for DataFrames under `df.stat`. rxin Author: Burak Yavuz <brkyvz@gmail.com> Closes #8378 from brkyvz/update-sql-docs. (cherry picked from commit 9ce0c7ad333f4a3c01207e5e9ed42bcafb99d894) Signed-off-by: Reynold Xin <rxin@databricks.com> 24 August 2015, 20:48:09 UTC
d003373 [SPARK-9791] [PACKAGE] Change private class to private[package] class to prevent unnecessary classes from showing up in the docs In addition, some random cleanup of import ordering. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8387 from tdas/SPARK-9791 and squashes the following commits: 67f3ee9 [Tathagata Das] Change private class to private[package] class to prevent them from showing up in the docs (cherry picked from commit 7478c8b66d6a2b1179f20c38b49e27e37b0caec3) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 24 August 2015, 19:40:23 UTC
36bc50c [SPARK-10168] [STREAMING] Fix the issue that maven publishes wrong artifact jars This PR removed the `outputFile` configuration from pom.xml and updated `tests.py` to search jars for both sbt build and maven build. I ran ` mvn -Pkinesis-asl -DskipTests clean install` locally, and verified the jars in my local repository were correct. I also checked Python tests for maven build, and it passed all tests. Author: zsxwing <zsxwing@gmail.com> Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits: e0b5818 [zsxwing] Fix the sbt build c697627 [zsxwing] Add the jar pathes to the exception message be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars (cherry picked from commit 4e0395ddb764d092b5b38447af49e196e590e0f0) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 24 August 2015, 19:38:10 UTC
b40059d [SPARK-10142] [STREAMING] Made python checkpoint recovery handle non-local checkpoint paths and existing SparkContexts The current code only checks checkpoint files in the local filesystem, and always tries to create a new Python SparkContext (even if one already exists). The solution is to do the following: 1. Use the same code path as Java to check whether a valid checkpoint exists 2. Create a new Python SparkContext only if there is no active one. There is no test for this path, as it is hard to test with distributed filesystem paths in a local unit test. I am going to test it with a distributed file system manually to verify that this patch works. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8366 from tdas/SPARK-10142 and squashes the following commits: 3afa666 [Tathagata Das] Added tests 2dd4ae5 [Tathagata Das] Added the check to not create a context if one already exists 9bf151b [Tathagata Das] Made python checkpoint recovery use java to find the checkpoint files (cherry picked from commit 053d94fcf32268369b5a40837271f15d6af41aa4) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 24 August 2015, 02:24:42 UTC
00f812d [SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bug GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests. This PR adds a unit test which checks this. It failed previously but works with this fix. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8370 from jkbradley/gmm-fix. (cherry picked from commit b963c19a803c5a26c9b65655d40ca6621acf8bd4) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 24 August 2015, 01:34:15 UTC
1c5a828 [SPARK-10148] [STREAMING] Display active and inactive receiver numbers in Streaming page Added the active and inactive receiver numbers in the summary section of Streaming page. <img width="1074" alt="screen shot 2015-08-21 at 2 08 54 pm" src="https://cloud.githubusercontent.com/assets/1000778/9402437/ff2806a2-480f-11e5-8f8e-efdf8e5d514d.png"> Author: zsxwing <zsxwing@gmail.com> Closes #8351 from zsxwing/receiver-number. (cherry picked from commit c6df5f66d9a8b9760f2cd46fcd930f977650c9c5) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 24 August 2015, 00:52:46 UTC
595f92f Update streaming-programming-guide.md Update `See the Scala example` to `See the Java example`. Author: Keiji Yoshida <yoshida.keiji.84@gmail.com> Closes #8376 from yosssi/patch-1. (cherry picked from commit 623c675fde7a3a39957a62c7af26a54f4b01f8ce) Signed-off-by: Sean Owen <sowen@cloudera.com> 23 August 2015, 10:04:44 UTC
5f03b7a Update programming-guide.md Update `lineLengths.persist();` to `lineLengths.persist(StorageLevel.MEMORY_ONLY());` because `JavaRDD#persist` needs a parameter of `StorageLevel`. Author: Keiji Yoshida <yoshida.keiji.84@gmail.com> Closes #8372 from yosssi/patch-1. (cherry picked from commit 46fcb9e0dbb2b28110f68a3d9f6c0c47bfd197b1) Signed-off-by: Reynold Xin <rxin@databricks.com> 22 August 2015, 09:38:18 UTC
fbf9a6e [SPARK-9893] User guide with Java test suite for VectorSlicer Add a user guide for `VectorSlicer`, with a Java test suite and a Python version of VectorSlicer. Note that the Python version does not yet support selecting by name. Author: Xusen Yin <yinxusen@gmail.com> Closes #8267 from yinxusen/SPARK-9893. (cherry picked from commit 630a994e6a9785d1704f8e7fb604f32f5dea24f8) Signed-off-by: Xiangrui Meng <meng@databricks.com> 21 August 2015, 23:30:19 UTC
cb61c7b [SPARK-10163] [ML] Allow single-category features for GBT models Removed categorical feature info validation since no longer needed This is needed to make the ML user guide examples work (in another current PR). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8367 from jkbradley/gbt-single-cat. (cherry picked from commit f01c4220d2b791f470fa6596ffe11baa51517fbe) Signed-off-by: Xiangrui Meng <meng@databricks.com> 21 August 2015, 23:28:07 UTC
914da35 Preparing development version 1.5.0-SNAPSHOT 21 August 2015, 21:56:50 UTC
e256928 Preparing Spark release v1.5.0-rc2 21 August 2015, 21:56:43 UTC
f65759e Version update for Spark 1.5.0 and add CHANGES.txt file. Author: Reynold Xin <rxin@databricks.com> Closes #8365 from rxin/1.5-update. 21 August 2015, 21:54:45 UTC
14c8c0c [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary. https://issues.apache.org/jira/browse/SPARK-10143 With this PR, we will set the min split size to parquet's block size (row group size) set in the conf if the min split size is smaller. So, we can avoid having too many tasks, and even useless tasks, when reading parquet data. I tested it locally. The table I have is 343MB and is in my local FS. Because I did not set any min/max split size, the default split size was 32MB and the map stage had 11 tasks. But there were only three tasks that actually read data. With my PR, there were only three tasks in the map stage. Here is the difference. Without this PR: ![image](https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png) With this PR: ![image](https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png) Even if the block size setting does match the actual block size of the parquet file, I think it is still generally good to use parquet's block size setting if the min split size is smaller than this block size. Tested it on a cluster using ``` val count = sqlContext.table("""store_sales""").groupBy().count().queryExecution.executedPlan(3).execute().count ``` Basically, it reads zero columns of table `store_sales`. My table has 1824 parquet files with sizes from 80MB to 280MB (1 to 3 row groups). Without this patch, on a 16-worker cluster, the job had 5023 tasks and spent 102s. With this patch, the job had 2893 tasks and spent 64s. It is still not as good as using one mapper per file (1824 tasks and 42s), but it is much better than our master. Author: Yin Huai <yhuai@databricks.com> Closes #8346 from yhuai/parquetMinSplit. (cherry picked from commit e3355090d4030daffed5efb0959bf1d724c13c13) Signed-off-by: Yin Huai <yhuai@databricks.com> 21 August 2015, 21:30:12 UTC
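An illustrative sketch of the rule this entry describes (hypothetical values, not the actual configuration plumbing): never let the minimum split size drop below Parquet's configured row-group size.
```scala
val parquetBlockSize   = 128L * 1024 * 1024   // parquet.block.size, e.g. 128 MB row groups
val configuredMinSplit =  32L * 1024 * 1024   // e.g. a 32 MB default min split size
val effectiveMinSplit  = math.max(configuredMinSplit, parquetBlockSize)
// splits now span at least one row group, avoiding map tasks that read no data
```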
e7db876 [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc with Since annotation Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8352 from MechCoder/since. (cherry picked from commit f5b028ed2f1ad6de43c8b50ebf480e1b6c047035) Signed-off-by: Xiangrui Meng <meng@databricks.com> 21 August 2015, 21:19:33 UTC
4e72839 [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in PySpark-Streaming transform function Details of the bug and explanations can be seen in [SPARK-10122](https://issues.apache.org/jira/browse/SPARK-10122). tdas , please help to review. Author: jerryshao <sshao@hortonworks.com> Closes #8347 from jerryshao/SPARK-10122 and squashes the following commits: 4039b16 [jerryshao] Fix getOffsetRanges in transform() bug 21 August 2015, 20:17:48 UTC
817c38a [SPARK-10130] [SQL] type coercion for IF should have children resolved first Type coercion for IF should have its children resolved first; otherwise we can hit an unresolved exception. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #8331 from adrian-wang/spark10130. (cherry picked from commit 3c462f5d87a9654c5a68fd658a40f5062029fd9a) Signed-off-by: Michael Armbrust <michael@databricks.com> 21 August 2015, 19:22:08 UTC
e5e6017 [SPARK-9846] [DOCS] User guide for Multilayer Perceptron Classifier Added user guide for multilayer perceptron classifier: - Simplified description of the multilayer perceptron classifier - Example code for Scala and Java Author: Alexander Ulanov <nashb@yandex.ru> Closes #8262 from avulanov/SPARK-9846-mlpc-docs. (cherry picked from commit dcfe0c5cde953b31c5bfeb6e41d1fc9b333241eb) Signed-off-by: Xiangrui Meng <meng@databricks.com> 21 August 2015, 03:02:34 UTC
04ef52a [SPARK-10140] [DOC] add target fields to @Since so constructor parameters and public fields can be annotated. rxin MechCoder Author: Xiangrui Meng <meng@databricks.com> Closes #8344 from mengxr/SPARK-10140.2. (cherry picked from commit cdd9a2bb10e20556003843a0f7aaa33acd55f6d2) Signed-off-by: Xiangrui Meng <meng@databricks.com> 21 August 2015, 03:01:27 UTC
988e838 Preparing development version 1.5.1-SNAPSHOT 20 August 2015, 23:24:12 UTC
4c56ad7 Preparing Spark release v1.5.0-rc1 20 August 2015, 23:24:07 UTC
175c1d9 Preparing development version 1.5.0-SNAPSHOT 20 August 2015, 22:33:10 UTC
d837d51 Preparing Spark release v1.5.0-rc1 20 August 2015, 22:33:04 UTC
2beea65 [SPARK-9245] [MLLIB] LDA topic assignments For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token. CC: rotationsymmetry mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8329 from jkbradley/lda-topic-assignments. (cherry picked from commit eaafe139f881d6105996373c9b11f2ccd91b5b3e) Signed-off-by: Xiangrui Meng <meng@databricks.com> 20 August 2015, 22:01:37 UTC
560ec12 [SPARK-10108] Add since tags to mllib.feature Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8309 from MechCoder/tags_feature. (cherry picked from commit 7cfc0750e14f2c1b3847e4720cc02150253525a9) Signed-off-by: Xiangrui Meng <meng@databricks.com> 20 August 2015, 21:59:55 UTC
2e0d2a9 [SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add Java test suite Otherwise, setters do not return self type. jkbradley avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8342 from mengxr/SPARK-10138. (cherry picked from commit 2a3d98aae285aba39786e9809f96de412a130f39) Signed-off-by: Xiangrui Meng <meng@databricks.com> 20 August 2015, 21:47:11 UTC
eac31ab Preparing development version 1.5.0-SNAPSHOT 20 August 2015, 19:43:13 UTC
99eeac8 Preparing Spark release v1.5.0-rc1 20 August 2015, 19:43:08 UTC
6026f4f [SPARK-10126] [PROJECT INFRA] Fix typo in release-build.sh which broke snapshot publishing for Scala 2.11 The current `release-build.sh` has a typo which breaks snapshot publication for Scala 2.11. We should change the Scala version to 2.11 and clean before building a 2.11 snapshot. Author: Josh Rosen <joshrosen@databricks.com> Closes #8325 from JoshRosen/fix-2.11-snapshots. (cherry picked from commit 12de348332108f8c0c5bdad1d4cfac89b952b0f8) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 20 August 2015, 18:31:21 UTC
a1785e3 Preparing development version 1.5.0-SNAPSHOT 20 August 2015, 18:06:41 UTC
19b92c8 Preparing Spark release v1.5.0-rc1 20 August 2015, 18:06:31 UTC
2f47e09 [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array I caught SPARK-10136 while adding more test cases to `ParquetAvroCompatibilitySuite`. Actual bug fix code lies in `CatalystRowConverter.scala`. Author: Cheng Lian <lian@databricks.com> Closes #8341 from liancheng/spark-10136/parquet-avro-nested-primitive-array. (cherry picked from commit 85f9a61357994da5023b08b0a8a2eb09388ce7f8) Signed-off-by: Michael Armbrust <michael@databricks.com> 20 August 2015, 18:02:02 UTC
a7027e6 [SPARK-9982] [SPARKR] SparkR DataFrame fail to return data of Decimal type Author: Alex Shkurenko <ashkurenko@enova.com> Closes #8239 from ashkurenko/master. (cherry picked from commit 39e91fe2fd43044cc734d55625a3c03284b69f09) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 20 August 2015, 17:16:57 UTC
257e9d7 [MINOR] [SQL] Fix sphinx warnings in PySpark SQL Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8171 from MechCoder/sql_sphinx. (cherry picked from commit 52c60537a274af5414f6b0340a4bd7488ef35280) Signed-off-by: Xiangrui Meng <meng@databricks.com> 20 August 2015, 17:05:39 UTC
5be5175 [SPARK-10100] [SQL] Eliminate hash table lookup if there is no grouping key in aggregation. This improves performance by ~20-30% in one of my local tests and should fix the performance regression from 1.4 to 1.5 on ss_max. Author: Reynold Xin <rxin@databricks.com> Closes #8332 from rxin/SPARK-10100. (cherry picked from commit b4f4e91c395cb69ced61d9ff1492d1b814f96828) Signed-off-by: Yin Huai <yhuai@databricks.com> 20 August 2015, 14:53:40 UTC
675e224 [SPARK-10092] [SQL] Backports #8324 to branch-1.5 Author: Yin Huai <yhuai@databricks.com> Closes #8336 from liancheng/spark-10092/for-branch-1.5. 20 August 2015, 10:43:24 UTC
71aa547 [SPARK-10128] [STREAMING] Used correct classloader to deserialize WAL data Recovering Kinesis sequence numbers from the WAL leads to a ClassNotFoundException because the ObjectInputStream does not use the correct classloader, so the SequenceNumberRanges class (in the streaming-kinesis-asl package, added through spark-submit) cannot be found while deserializing. The solution is to use `Thread.currentThread().getContextClassLoader` while deserializing. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8328 from tdas/SPARK-10128 and squashes the following commits: f19b1c2 [Tathagata Das] Used correct classloader to deserialize WAL data (cherry picked from commit b762f9920f7587d3c08493c49dd2fede62110b88) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 20 August 2015, 04:16:17 UTC
63922fa [SPARK-10125] [STREAMING] Fix a potential deadlock in JobGenerator.stop Because `lazy val` uses `this` lock, if JobGenerator.stop and JobGenerator.doCheckpoint (JobGenerator.shouldCheckpoint has not yet been initialized) run at the same time, it may hang. Here are the stack traces for the deadlock: ```Java "pool-1-thread-1-ScalaTest-running-StreamingListenerSuite" #11 prio=5 os_prio=31 tid=0x00007fd35d094800 nid=0x5703 in Object.wait() [0x000000012ecaf000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1245) - locked <0x00000007b5d8d7f8> (a org.apache.spark.util.EventLoop$$anon$1) at java.lang.Thread.join(Thread.java:1319) at org.apache.spark.util.EventLoop.stop(EventLoop.scala:81) at org.apache.spark.streaming.scheduler.JobGenerator.stop(JobGenerator.scala:155) - locked <0x00000007b5d8cea0> (a org.apache.spark.streaming.scheduler.JobGenerator) at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:95) - locked <0x00000007b5d8ced8> (a org.apache.spark.streaming.scheduler.JobScheduler) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:687) "JobGenerator" #67 daemon prio=5 os_prio=31 tid=0x00007fd35c3b9800 nid=0x9f03 waiting for monitor entry [0x0000000139e4a000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.spark.streaming.scheduler.JobGenerator.shouldCheckpoint$lzycompute(JobGenerator.scala:63) - waiting to lock <0x00000007b5d8cea0> (a org.apache.spark.streaming.scheduler.JobGenerator) at org.apache.spark.streaming.scheduler.JobGenerator.shouldCheckpoint(JobGenerator.scala:63) at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:290) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:182) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83) at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) ``` I can use this patch to produce this deadlock: https://github.com/zsxwing/spark/commit/8a88f28d1331003a65fabef48ae3d22a7c21f05f And a timeout build in Jenkins due to this deadlock: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1654/ This PR initializes `checkpointWriter` before `eventLoop` uses it to avoid this deadlock. Author: zsxwing <zsxwing@gmail.com> Closes #8326 from zsxwing/SPARK-10125. 20 August 2015, 02:44:33 UTC
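A self-contained sketch of the deadlock pattern in the entry above (illustrative class, not the real `JobGenerator`): Scala initializes a `lazy val` under the instance monitor, so a `synchronized` stop() that joins the worker thread can deadlock with a worker that is still forcing the lazy val.
```scala
class Generator {
  lazy val shouldCheckpoint: Boolean = true        // first access locks `this`

  private val worker = new Thread(new Runnable {
    def run(): Unit = if (shouldCheckpoint) ()     // may block waiting for `this`'s monitor
  })

  def start(): Unit = worker.start()
  def stop(): Unit = synchronized {                // holds `this`'s monitor...
    worker.join()                                  // ...while waiting for the worker: deadlock
  }
}
// The PR avoids this by initializing the checkpoint writer before the event loop can touch it.
```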
a3ed2c3 [SPARK-10124] [MESOS] Fix removing queued driver in mesos cluster mode. Currently Spark applications can be queued to the Mesos cluster dispatcher, but when multiple jobs are in the queue we don't handle removing jobs from the buffer correctly while iterating over it, which causes a NullPointerException. This patch copies the buffer before iterating over it, so exceptions aren't thrown when jobs are removed. Author: Timothy Chen <tnachen@gmail.com> Closes #8322 from tnachen/fix_cluster_mode. (cherry picked from commit 73431d8afb41b93888d2642a1ce2d011f03fb740) Signed-off-by: Andrew Or <andrew@databricks.com> 20 August 2015, 02:43:34 UTC
16414da [SPARK-9812] [STREAMING] Fix Python 3 compatibility issue in PySpark Streaming and some docs This PR includes the following fixes: 1. Use `range` instead of `xrange` in `queue_stream.py` to support Python 3. 2. Fix the issue that `utf8_decoder` will return `bytes` rather than `str` when receiving an empty `bytes` in Python 3. 3. Fix the commands in docs so that the user can copy them directly to the command line. The previous commands was broken in the middle of a path, so when copying to the command line, the path would be split to two parts by the extra spaces, which forces the user to fix it manually. Author: zsxwing <zsxwing@gmail.com> Closes #8315 from zsxwing/SPARK-9812. (cherry picked from commit 1f29d502e7ecd6faa185d70dc714f9ea3922fb6d) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 20 August 2015, 01:36:10 UTC
321cb99 [SPARK-9242] [SQL] Audit UDAF interface. A few minor changes: 1. Improved documentation 2. Rename apply(distinct....) to distinct. 3. Changed MutableAggregationBuffer from a trait to an abstract class. 4. Renamed returnDataType to dataType to be more consistent with other expressions. And unrelated to UDAFs: 1. Renamed file names in expressions to use suffix "Expressions" to be more consistent. 2. Moved regexp related expressions out to its own file. 3. Renamed StringComparison => StringPredicate. Author: Reynold Xin <rxin@databricks.com> Closes #8321 from rxin/SPARK-9242. (cherry picked from commit 2f2686a73f5a2a53ca5b1023e0d7e0e6c9be5896) Signed-off-by: Reynold Xin <rxin@databricks.com> 20 August 2015, 00:35:48 UTC
56a37b0 [SPARK-9895] User Guide for RFormula Feature Transformer mengxr Author: Eric Liang <ekl@databricks.com> Closes #8293 from ericl/docs-2. (cherry picked from commit 8e0a072f78b4902d5f7ccc6b15232ed202a117f9) Signed-off-by: Xiangrui Meng <meng@databricks.com> 19 August 2015, 22:43:15 UTC
5c749c8 [SPARK-6489] [SQL] add column pruning for Generate This PR takes over https://github.com/apache/spark/pull/5358 Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8268 from cloud-fan/6489. (cherry picked from commit b0dbaec4f942a47afde3490b9339ad3bd187024d) Signed-off-by: Michael Armbrust <michael@databricks.com> 19 August 2015, 22:05:25 UTC
a59475f [SPARK-10119] [CORE] Fix isDynamicAllocationEnabled when config is explicitly disabled. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8316 from vanzin/SPARK-10119. (cherry picked from commit e0dd1309ac248375f429639801923570f14de18d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 19 August 2015, 21:33:45 UTC
1494d58 [SPARK-10083] [SQL] CaseWhen should support type coercion of DecimalType and FractionalType create t1 (a decimal(7, 2), b long); select case when 1=1 then a else 1.0 end from t1; select case when 1=1 then a else b end from t1; Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #8270 from adrian-wang/casewhenfractional. (cherry picked from commit 373a376c04320aab228b5c385e2b788809877d3e) Signed-off-by: Michael Armbrust <michael@databricks.com> 19 August 2015, 21:32:43 UTC
b32a31d [SPARK-9899] [SQL] Disables customized output committer when speculation is on Speculation hates direct output committer, as there are multiple corner cases that may cause data corruption and/or data loss. Please see this [PR comment] [1] for more details. [1]: https://github.com/apache/spark/pull/8191#issuecomment-131598385 Author: Cheng Lian <lian@databricks.com> Closes #8317 from liancheng/spark-9899/speculation-hates-direct-output-committer. (cherry picked from commit f3ff4c41d2e32bd0f2419d1c9c68fcd0c2593e41) Signed-off-by: Michael Armbrust <michael@databricks.com> Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala 19 August 2015, 21:26:11 UTC
d9dfd43 [SPARK-10090] [SQL] fix decimal scale of division We should round the result of decimal multiplication/division to the expected precision/scale, and also check for overflow. Author: Davies Liu <davies@databricks.com> Closes #8287 from davies/decimal_division. (cherry picked from commit 1f4c4fe6dfd8cc52b5fddfd67a31a77edbb1a036) Signed-off-by: Michael Armbrust <michael@databricks.com> 19 August 2015, 21:04:09 UTC
77269fc [SPARK-9627] [SQL] Stops using Scala runtime reflection in DictionaryEncoding `DictionaryEncoding` uses Scala runtime reflection to avoid boxing costs while building the dictionary array. However, this code path may hit [SI-6240] [1] and throw an exception. [1]: https://issues.scala-lang.org/browse/SI-6240 Author: Cheng Lian <lian@databricks.com> Closes #8306 from liancheng/spark-9627/in-memory-cache-scala-reflection. (cherry picked from commit 21bdbe9fe69be47be562de24216a469e5ee64c7b) Signed-off-by: Michael Armbrust <michael@databricks.com> 19 August 2015, 20:58:03 UTC
afaed7e [SPARK-10073] [SQL] Python withColumn should replace the old column DataFrame.withColumn in Python should be consistent with the Scala one (replacing the existing column that has the same name). cc marmbrus Author: Davies Liu <davies@databricks.com> Closes #8300 from davies/with_column. (cherry picked from commit 08887369c890e0dd87eb8b34e8c32bb03307bf24) Signed-off-by: Michael Armbrust <michael@databricks.com> 19 August 2015, 20:56:54 UTC
829c33a [SPARK-10087] [CORE] [BRANCH-1.5] Disable spark.shuffle.reduceLocality.enabled by default. https://issues.apache.org/jira/browse/SPARK-10087 In some cases, when spark.shuffle.reduceLocality.enabled is enabled, we are scheduling all reducers to the same executor (the cluster has plenty of resources). Changing spark.shuffle.reduceLocality.enabled to false resolve the problem. Comments of https://github.com/apache/spark/pull/8280 provide more details of the symptom of this issue. This PR changes the default setting of `spark.shuffle.reduceLocality.enabled` to `false` for branch 1.5. Author: Yin Huai <yhuai@databricks.com> Closes #8296 from yhuai/setNumPartitionsCorrectly-branch1.5. 19 August 2015, 20:43:46 UTC
1038f67 [SPARK-10107] [SQL] fix NPE in format_number Author: Davies Liu <davies@databricks.com> Closes #8305 from davies/format_number. (cherry picked from commit e05da5cb5ea253e6372f648fc8203204f2a8df8d) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 August 2015, 20:43:20 UTC
8c0a5a2 [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659). MechCoder Closes #8256 Author: Xiangrui Meng <meng@databricks.com> Author: Xiaoqing Wang <spark445@126.com> Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8288 from mengxr/SPARK-8918. (cherry picked from commit 5b62bef8cbf73f910513ef3b1f557aa94b384854) Signed-off-by: Xiangrui Meng <meng@databricks.com> 19 August 2015, 20:17:34 UTC
ba36925 [SPARK-10106] [SPARKR] Add `ifelse` Column function to SparkR ### JIRA [[SPARK-10106] Add `ifelse` Column function to SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10106) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8303 from yu-iskw/SPARK-10106. (cherry picked from commit d898c33f774b9a3db2fb6aa8f0cb2c2ac6004b58) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 19 August 2015, 19:39:44 UTC
f25c324 [SPARK-10097] Adds `shouldMaximize` flag to `ml.evaluation.Evaluator` Previously, users of `Evaluator` (`CrossValidator` and `TrainValidationSplit`) would always maximize the metric, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user. This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, telling users of `Evaluator` whether the chosen metric should be maximized or minimized. CC jkbradley Author: Feynman Liang <fliang@databricks.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #8290 from feynmanliang/SPARK-10097. (cherry picked from commit 28a98464ea65aa7b35e24fca5ddaa60c2e5d53ee) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 19 August 2015, 18:35:17 UTC
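A tiny illustration of how the flag is meant to be read (illustrative values, not the `ml.evaluation` API verbatim):
```scala
val metrics = Seq(0.82, 0.79, 0.91)   // e.g. areaUnderROC for each candidate model
val isLargerBetter = true             // would be false for a loss metric such as RMSE
val best = if (isLargerBetter) metrics.max else metrics.min   // 0.91
```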
a8e8808 [SPARK-9856] [SPARKR] Add expression functions into SparkR whose params are complicated I added lots of Column functions into SparkR. I also added `rand(seed: Int)` and `randn(seed: Int)` in Scala, since we need such APIs for R's integer type. ### JIRA [[SPARK-9856] Add expression functions into SparkR whose params are complicated - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9856) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8264 from yu-iskw/SPARK-9856-3. (cherry picked from commit 2fcb9cb9552dac1d78dcca5d4d5032b4fa6c985c) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 19 August 2015, 17:41:22 UTC
bebe63d [SPARK-10084] [MLLIB] [DOC] Add Python example for mllib FP-growth user guide 1. Add a Python example to the mllib FP-growth user guide. 2. Correct mistakes in the Scala and Java examples. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8279 from yanboliang/spark-10084. (cherry picked from commit 802b5b8791fc2c892810981b2479a04175aa3dcd) Signed-off-by: Xiangrui Meng <meng@databricks.com> 19 August 2015, 15:53:42 UTC
f8dc427 [SPARK-10060] [ML] [DOC] spark.ml DecisionTree user guide New user guide section ml-decision-tree.md, including code examples. I have run all examples, including the Java ones. CC: manishamde yanboliang mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8244 from jkbradley/ml-dt-docs. (cherry picked from commit 39e4ebd521defdb68a0787bcd3bde6bc855f5198) Signed-off-by: Xiangrui Meng <meng@databricks.com> 19 August 2015, 14:38:39 UTC
522b0b6 [SPARK-8949] Print warnings when using preferred locations feature Add warnings according to SPARK-8949 in `SparkContext` - warnings in scaladoc - log warnings when the preferred locations feature is used through `SparkContext`'s constructor However, I didn't find any documentation reference for this feature. Please point me to one if you know of any. Author: Han JU <ju.han.felix@gmail.com> Closes #7874 from darkjh/SPARK-8949. (cherry picked from commit 3d16a545007922ee6fa36e5f5c3959406cb46484) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 August 2015, 12:04:24 UTC
5553f02 [SPARK-9977] [DOCS] Update documentation for StringIndexer Using `StringIndexer`, we obtain the indexed label in a new column. So a downstream estimator should use this new column through the pipeline if it wants to use the string-indexed label. I think it is better to make this explicit in the documentation. Author: lewuathe <lewuathe@me.com> Closes #8205 from Lewuathe/SPARK-9977. (cherry picked from commit ba2a07e2b6c5a39597b64041cd5bf342ef9631f5) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 August 2015, 08:54:11 UTC
e56bcc6 [DOCS] [SQL] [PYSPARK] Fix typo in ntile function Fix typo in ntile function. Author: Moussa Taifi <moutai10@gmail.com> Closes #8261 from moutai/patch-2. (cherry picked from commit 865a3df3d578c0442c97d749c81f554b560da406) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 August 2015, 08:42:50 UTC
561390d [SPARK-10070] [DOCS] Remove Guava dependencies in user guides `Lists.newArrayList` -> `Arrays.asList` CC jkbradley feynmanliang Anybody into replacing usages of `Lists.newArrayList` in the examples / source code too? this method isn't useful in Java 7 and beyond. Author: Sean Owen <sowen@cloudera.com> Closes #8272 from srowen/SPARK-10070. (cherry picked from commit f141efeafb42b14b5fcfd9aa8c5275162042349f) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 August 2015, 08:41:19 UTC
417852f Fix Broken Link Link was broken because it included tick marks. Author: Bill Chambers <wchambers@ischool.berkeley.edu> Closes #8302 from anabranch/patch-1. (cherry picked from commit b23c4d3ffc36e47c057360c611d8ab1a13877699) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 August 2015, 07:05:12 UTC
392bd19 [SPARK-9967] [SPARK-10099] [STREAMING] Renamed conf spark.streaming.backpressure.{enable-->enabled} and fixed deprecated annotations Small changes - Renamed conf spark.streaming.backpressure.{enable --> enabled} - Change Java Deprecated annotations to Scala deprecated annotation with more information. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8299 from tdas/SPARK-9967. (cherry picked from commit bc9a0e03235865d2ec33372f6400dec8c770778a) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 19 August 2015, 06:38:13 UTC
3ceee55 [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine): ```scala val numItems = 100000 val s = Seq.fill(numItems)(1) for (i <- 0 until numItems) s(i) ``` It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput. This patch fixes this by replacing `Seq` with `Array`. Author: Josh Rosen <joshrosen@databricks.com> Closes #8178 from JoshRosen/dagscheduler-perf. (cherry picked from commit 010b03ed52f35fd4d426d522f8a9927ddc579209) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 August 2015, 05:30:20 UTC
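For contrast, a sketch of the fix's effect: an `Array` gives O(1) indexed access, so the same loop becomes linear.
```scala
val numItems = 100000
val a = Array.fill(numItems)(1)
for (i <- 0 until numItems) a(i)   // constant-time lookups instead of walking a List
```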
4163926 [SPARK-9508] GraphX Pregel docs update with new Pregel code SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be modified accordingly since it lists the old Pregel code Author: Alexander Ulanov <nashb@yandex.ru> Closes #7831 from avulanov/SPARK-9508-pregel-doc2. (cherry picked from commit 1c843e284818004f16c3f1101c33b510f80722e3) Signed-off-by: Reynold Xin <rxin@databricks.com> 19 August 2015, 05:13:57 UTC