Revision history - refs/tags/v1.5.0-rc3 - origin: https://github.com/apache/spark

Revision	Author	Author Date	Message	Commit Date
908e37b	Patrick Wendell	31 August 2015, 22:57:42 UTC	Preparing Spark release v1.5.0-rc3	31 August 2015, 22:57:42 UTC
1c752b8	Davies Liu	31 August 2015, 22:55:22 UTC	[SPARK-10341] [SQL] fix memory starving in unsafe SMJ In SMJ, the first ExternalSorter could consume all the memory before spilling, so the second cannot even acquire its first page. Until we have a better memory allocator, SMJ should call prepare() before calling any compute() of its children. cc rxin JoshRosen Author: Davies Liu <davies@databricks.com> Closes #8511 from davies/smj_memory. (cherry picked from commit 540bdee93103a73736d282b95db6a8cda8f6a2b1) Signed-off-by: Reynold Xin <rxin@databricks.com>	31 August 2015, 22:55:29 UTC
33ce274	zsxwing	31 August 2015, 19:19:11 UTC	[SPARK-10369] [STREAMING] Don't remove ReceiverTrackingInfo when deregisterReceivering since we may reuse it later `deregisterReceiver` should not remove `ReceiverTrackingInfo`. Otherwise, it will throw `java.util.NoSuchElementException: key not found` when restarting it. Author: zsxwing <zsxwing@gmail.com> Closes #8538 from zsxwing/SPARK-10369. (cherry picked from commit 4a5fe091658b1d06f427e404a11a84fc84f953c5) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	31 August 2015, 19:19:48 UTC
bf5b2f2	Xiangrui Meng	31 August 2015, 06:20:03 UTC	[SPARK-10354] [MLLIB] fix some apparent memory issues in k-means|| initialization * do not cache first cost RDD * change following cost RDD cache level to MEMORY_AND_DISK * remove Vector wrapper to save an object per instance Further improvements will be addressed in SPARK-10329 cc: yu-iskw HuJiayin Author: Xiangrui Meng <meng@databricks.com> Closes #8526 from mengxr/SPARK-10354. (cherry picked from commit f0f563a3c43fc9683e6920890cce44611c0c5f4b) Signed-off-by: Xiangrui Meng <meng@databricks.com>	31 August 2015, 06:20:14 UTC
42a81a6	Burak Yavuz	30 August 2015, 19:21:15 UTC	[SPARK-10353] [MLLIB] BLAS gemm not scaling when beta = 0.0 for some subset of matrix multiplications mengxr jkbradley rxin It would be great if this fix made it into RC3! Author: Burak Yavuz <brkyvz@gmail.com> Closes #8525 from brkyvz/blas-scaling. (cherry picked from commit 8d2ab75d3b71b632f2394f2453af32f417cb45e5) Signed-off-by: Xiangrui Meng <meng@databricks.com>	30 August 2015, 19:21:22 UTC
1d40136	Xiangrui Meng	30 August 2015, 06:57:09 UTC	[SPARK-10331] [MLLIB] Update example code in ml-guide * The example code was added in 1.2, before `createDataFrame`. This PR switches to `createDataFrame`. Java code still uses JavaBean. * assume `sqlContext` is available * fix some minor issues from previous code review jkbradley srowen feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8518 from mengxr/SPARK-10331. (cherry picked from commit ca69fc8efda8a3e5442ffa16692a2b1eb86b7673) Signed-off-by: Xiangrui Meng <meng@databricks.com>	30 August 2015, 06:57:17 UTC
8071f6e	Xiangrui Meng	30 August 2015, 06:26:23 UTC	[SPARK-10348] [MLLIB] updates ml-guide * replace `ML Dataset` by `DataFrame` to unify the abstraction * ML algorithms -> pipeline components to describe the main concept * remove Scala API doc links from the main guide * `Section Title` -> `Section title` to be consistent with other section titles in MLlib guide * modified lines break at 100 chars or periods jkbradley feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8517 from mengxr/SPARK-10348. (cherry picked from commit 905fbe498bdd29116468628e6a2a553c1fd57165) Signed-off-by: Xiangrui Meng <meng@databricks.com>	30 August 2015, 06:26:32 UTC
3a61e10	Yin Huai	29 August 2015, 23:39:40 UTC	[SPARK-10339] [SPARK-10334] [SPARK-10301] [SQL] Partitioned table scan can OOM driver and throw a better error message when users need to enable parquet schema merging This fixes the problem that scanning a partitioned table puts the driver under high memory pressure and can take down the cluster. Also, with this fix, we will be able to correctly show the query plan of a query consuming partitioned tables. https://issues.apache.org/jira/browse/SPARK-10339 https://issues.apache.org/jira/browse/SPARK-10334 Finally, this PR squeezes in a "quick fix" for SPARK-10301. It is not a real fix; it just throws a better error message to let users know what to do. Author: Yin Huai <yhuai@databricks.com> Closes #8515 from yhuai/partitionedTableScan. (cherry picked from commit 097a7e36e0bf7290b1879331375bacc905583bd3) Signed-off-by: Michael Armbrust <michael@databricks.com>	29 August 2015, 23:39:58 UTC
d178e1e	Josh Rosen	29 August 2015, 20:36:25 UTC	[SPARK-10330] Use SparkHadoopUtil TaskAttemptContext reflection methods in more places SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places. Author: Josh Rosen <joshrosen@databricks.com> Closes #8499 from JoshRosen/use-hadoop-reflection-in-more-places. (cherry picked from commit 6a6f3c91ee1f63dd464eb03d156d02c1a5887d88) Signed-off-by: Michael Armbrust <michael@databricks.com>	29 August 2015, 20:37:46 UTC
7c65078	wangwei	29 August 2015, 20:29:50 UTC	[SPARK-10226] [SQL] Fix exclamation mark issue in SparkSQL When I tested the latest version of Spark with an exclamation mark, I got some errors. I then reset the Spark version and found that commit "a2409d1c8e8ddec04b529ac6f6a12b5993f0eeda" introduced the bug. With the jline version changing from 0.9.94 to 2.12 in that commit, the exclamation mark is treated as a special character in ConsoleReader. Author: wangwei <wangwei82@huawei.com> Closes #8420 from small-wang/jline-SPARK-10226. (cherry picked from commit 277148b285748e863f2b9fdf6cf12963977f91ca) Signed-off-by: Michael Armbrust <michael@databricks.com>	29 August 2015, 20:30:01 UTC
a49ad67	Michael Armbrust	29 August 2015, 20:26:01 UTC	[SPARK-10344] [SQL] Add tests for extraStrategies Actually using this API requires access to a lot of classes that we might make private by accident. I've added some tests to prevent this. Author: Michael Armbrust <michael@databricks.com> Closes #8516 from marmbrus/extraStrategiesTests. (cherry picked from commit 5c3d16a9b91bb9a458d3ba141f7bef525cf3d285) Signed-off-by: Yin Huai <yhuai@databricks.com>	29 August 2015, 20:26:12 UTC
d17316f	GuoQiang Li	29 August 2015, 20:20:22 UTC	[SPARK-10350] [DOC] [SQL] Removed duplicated option description from SQL guide Author: GuoQiang Li <witgo@qq.com> Closes #8520 from witgo/SPARK-10350. (cherry picked from commit 5369be806848f43cb87c76504258c4e7de930c90) Signed-off-by: Michael Armbrust <michael@databricks.com>	29 August 2015, 20:21:04 UTC
69d8565	martinzapletal	29 August 2015, 04:03:48 UTC	[SPARK-9910] [ML] User guide for train validation split Author: martinzapletal <zapletal-martin@email.cz> Closes #8377 from zapletal-martin/SPARK-9910. (cherry picked from commit e8ea5bafee9ca734edf62021145d0c2d5491cba8) Signed-off-by: Xiangrui Meng <meng@databricks.com>	29 August 2015, 04:03:59 UTC
b7aab1d	felixcheung	29 August 2015, 01:35:01 UTC	[SPARK-9803] [SPARKR] Add subset and transform + tests Add subset and transform Also reorganize `[` & `[[` to subset instead of select Note: for transform, transform is very similar to mutate. Spark doesn't seem to replace existing column with the name in mutate (ie. `mutate(df, age = df$age + 2)` - returned DataFrame has 2 columns with the same name 'age'), so therefore not doing that for now in transform. Though it is clearly stated it should replace column with matching name (should I open a JIRA for mutate/transform?) Author: felixcheung <felixcheung_m@hotmail.com> Closes #8503 from felixcheung/rsubset_transform. (cherry picked from commit 2a4e00ca4d4e7a148b4ff8ce0ad1c6d517cee55f) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>	29 August 2015, 01:35:13 UTC
df4a2e6	Marcelo Vanzin	28 August 2015, 22:57:27 UTC	[SPARK-10326] [YARN] Fix app submission on windows. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8493 from vanzin/SPARK-10326.	28 August 2015, 22:57:27 UTC
02e10d2	Davies Liu	28 August 2015, 21:38:20 UTC	[SPARK-10323] [SQL] fix nullability of In/InSet/ArrayContain After this PR, In/InSet/ArrayContain will return null if value is null, instead of false. They also will return null even if there is a null in the set/array. Author: Davies Liu <davies@databricks.com> Closes #8492 from davies/fix_in. (cherry picked from commit bb7f35239385ec74b5ee69631b5480fbcee253e4) Signed-off-by: Davies Liu <davies.liu@gmail.com>	28 August 2015, 21:38:29 UTC
7f01480	Xiangrui Meng	28 August 2015, 20:53:31 UTC	[SPARK-9671] [MLLIB] re-org user guide and add migration guide This PR updates the MLlib user guide and adds migration guide for 1.4->1.5. * merge migration guide for `spark.mllib` and `spark.ml` packages * remove dependency section from `spark.ml` guide * move the paragraph about `spark.mllib` and `spark.ml` to the top and recommend `spark.ml` * move Sam's talk to footnote to make the section focus on dependencies Minor changes to code examples and other wording will be in a separate PR. jkbradley srowen feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8498 from mengxr/SPARK-9671. (cherry picked from commit 88032ecaf0455886aed7a66b30af80dae7f6cff7) Signed-off-by: Xiangrui Meng <meng@databricks.com>	28 August 2015, 20:53:38 UTC
9c58f64	Shuo Xiang	28 August 2015, 20:09:13 UTC	[SPARK-10336][example] fix not being able to set intercept in LR example `fitIntercept` is a command line option but not set in the main program. dbtsai Author: Shuo Xiang <sxiang@pinterest.com> Closes #8510 from coderxiang/intercept and squashes the following commits: 57c9b7d [Shuo Xiang] fix not being able to set intercept in LR example (cherry picked from commit 45723214e694b9a440723e9504c562e6393709f3) Signed-off-by: DB Tsai <dbt@netflix.com>	28 August 2015, 20:09:24 UTC
ccda27a	Josh Rosen	28 August 2015, 18:51:42 UTC	[SPARK-10325] Override hashCode() for public Row This commit fixes an issue where the public SQL `Row` class did not override `hashCode`, causing it to violate the hashCode() + equals() contract. To fix this, I simply ported the `hashCode` implementation from the 1.4.x version of `Row`. Author: Josh Rosen <joshrosen@databricks.com> Closes #8500 from JoshRosen/SPARK-10325 and squashes the following commits: 51ffea1 [Josh Rosen] Override hashCode() for public Row. (cherry picked from commit d3f87dc39480f075170817bbd00142967a938078) Signed-off-by: Michael Armbrust <michael@databricks.com> Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala	28 August 2015, 19:05:37 UTC
0abbc18	Luciano Resende	28 August 2015, 16:13:21 UTC	[SPARK-8952] [SPARKR] - Wrap normalizePath calls with suppressWarnings This is based on davies comment on SPARK-8952 which suggests to only call normalizePath() when path starts with '~' Author: Luciano Resende <lresende@apache.org> Closes #8343 from lresende/SPARK-8952. (cherry picked from commit 499e8e154bdcc9d7b2f685b159e0ddb4eae48fe4) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>	28 August 2015, 16:13:54 UTC
0cd49ba	Yuhao Yang	28 August 2015, 15:00:44 UTC	[SPARK-9890] [DOC] [ML] User guide for CountVectorizer jira: https://issues.apache.org/jira/browse/SPARK-9890 document with Scala and java examples Author: Yuhao Yang <hhbyyh@gmail.com> Closes #8487 from hhbyyh/cvDoc. (cherry picked from commit e2a843090cb031f6aa774f6d9c031a7f0f732ee1) Signed-off-by: Xiangrui Meng <meng@databricks.com>	28 August 2015, 15:01:00 UTC
3ccd2e6	Dharmesh Kakadia	28 August 2015, 08:38:35 UTC	typo in comment Author: Dharmesh Kakadia <dharmeshkakadia@users.noreply.github.com> Closes #8497 from dharmeshkakadia/patch-2. (cherry picked from commit 71a077f6c16c8816eae13341f645ba50d997f63d) Signed-off-by: Sean Owen <sowen@cloudera.com>	28 August 2015, 08:38:45 UTC
e23ffd6	Keiji Yoshida	28 August 2015, 08:36:50 UTC	Fix DynamodDB/DynamoDB typo in Kinesis Integration doc Fix DynamodDB/DynamoDB typo in Kinesis Integration doc Author: Keiji Yoshida <yoshida.keiji.84@gmail.com> Closes #8501 from yosssi/patch-1. (cherry picked from commit 18294cd8710427076caa86bfac596de67089d57e) Signed-off-by: Sean Owen <sowen@cloudera.com>	28 August 2015, 08:37:01 UTC
8eff069	Sean Owen	28 August 2015, 08:32:23 UTC	[SPARK-10295] [CORE] Dynamic allocation in Mesos does not release when RDDs are cached Remove obsolete warning about dynamic allocation not working with cached RDDs See discussion in https://issues.apache.org/jira/browse/SPARK-10295 Author: Sean Owen <sowen@cloudera.com> Closes #8489 from srowen/SPARK-10295. (cherry picked from commit cc39803062119c1d14611dc227b9ed0ed1284d38) Signed-off-by: Sean Owen <sowen@cloudera.com>	28 August 2015, 08:32:32 UTC
f0c4470	hyukjinkwon	20 August 2015, 00:13:25 UTC	[SPARK-10035] [SQL] Parquet filters does not process EqualNullSafe filter. As discussed with Lian, 1. I added EqualNullSafe to ParquetFilters - It uses the same equality comparison filter as EqualTo, since the Parquet filter actually performs null-safe equality comparison. 2. Updated the test code (ParquetFilterSuite) - Convert catalyst.Expression to sources.Filter - Removed Cast since only Literal is picked up as a proper Filter in DataSourceStrategy - Added EqualNullSafe comparison 3. Removed deprecated createFilter for catalyst.Expression Author: hyukjinkwon <gurwls223@gmail.com> Author: 권혁진 <gurwls223@gmail.com> Closes #8275 from HyukjinKwon/master. (cherry picked from commit ba5f7e1842f2c5852b5309910c0d39926643da69) Signed-off-by: Cheng Lian <lian@databricks.com>	28 August 2015, 08:18:02 UTC
9b7f8f2	Shivaram Venkataraman	28 August 2015, 07:37:50 UTC	[SPARK-10328] [SPARKR] Fix generic for na.omit S3 function is at https://stat.ethz.ch/R-manual/R-patched/library/stats/html/na.fail.html Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com> Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8495 from shivaram/na-omit-fix. (cherry picked from commit 2f99c37273c1d82e2ba39476e4429ea4aaba7ec6) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>	28 August 2015, 07:40:01 UTC
bcb8fa8	noelsmith	28 August 2015, 06:59:30 UTC	[SPARK-10188] [PYSPARK] Pyspark CrossValidator with RMSE selects incorrect model * Added isLargerBetter() method to Pyspark Evaluator to match the Scala version. * JavaEvaluator delegates isLargerBetter() to underlying Scala object. * Added check for isLargerBetter() in CrossValidator to determine whether to use argmin or argmax. * Added test cases for where smaller is better (RMSE) and larger is better (R-Squared). (This contribution is my original work and that I license the work to the project under Sparks' open source license) Author: noelsmith <mail@noelsmith.com> Closes #8399 from noel-smith/pyspark-rmse-xval-fix. (cherry picked from commit 7583681e6b0824d7eed471dc4d8fa0b2addf9ffc) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	28 August 2015, 06:59:40 UTC
c77cf86	Cheng Lian	28 August 2015, 05:30:01 UTC	[SPARK-SQL] [MINOR] Fixes some typos in HiveContext Author: Cheng Lian <lian@databricks.com> Closes #8481 from liancheng/hive-context-typo. (cherry picked from commit 89b943438512fcfb239c268b43431397de46cbcf) Signed-off-by: Reynold Xin <rxin@databricks.com>	28 August 2015, 05:30:09 UTC
ede8c62	Feynman Liang	28 August 2015, 04:55:20 UTC	[SPARK-9905] [ML] [DOC] Adds LinearRegressionSummary user guide * Adds user guide for `LinearRegressionSummary` * Fixes unresolved issues in #8197 CC jkbradley mengxr Author: Feynman Liang <fliang@databricks.com> Closes #8491 from feynmanliang/SPARK-9905. (cherry picked from commit af0e1249b1c881c0fa7a921fd21fd2c27214b980) Signed-off-by: Xiangrui Meng <meng@databricks.com>	28 August 2015, 04:55:28 UTC
6ccc0df	MechCoder	28 August 2015, 04:44:06 UTC	[SPARK-9911] [DOC] [ML] Update Userguide for Evaluator I added a small note about the different types of evaluator and the metrics used. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8304 from MechCoder/multiclass_evaluator. (cherry picked from commit 30734d45fbbb269437c062241a9161e198805a76) Signed-off-by: Xiangrui Meng <meng@databricks.com>	28 August 2015, 04:44:14 UTC
fc4c3bf	Davies Liu	27 August 2015, 23:38:00 UTC	[SPARK-10321] sizeInBytes in HadoopFsRelation Having sizeInBytes in HadoopFsRelation to enable broadcast join. cc marmbrus Author: Davies Liu <davies@databricks.com> Closes #8490 from davies/sizeInByte. (cherry picked from commit 54cda0deb6bebf1470f16ba5bcc6c4fb842bdac1) Signed-off-by: Michael Armbrust <michael@databricks.com>	27 August 2015, 23:38:10 UTC
351e849	Yin Huai	27 August 2015, 23:11:25 UTC	[SPARK-10287] [SQL] Fixes JSONRelation refreshing on read path https://issues.apache.org/jira/browse/SPARK-10287 After porting json to HadoopFsRelation, it seems hard to keep the behavior of picking up new files automatically for JSON. This PR removes this behavior, so JSON is consistent with others (ORC and Parquet). Author: Yin Huai <yhuai@databricks.com> Closes #8469 from yhuai/jsonRefresh. (cherry picked from commit b3dd569ad40905f8861a547a1e25ed3ca8e1d272) Signed-off-by: Yin Huai <yhuai@databricks.com>	27 August 2015, 23:11:39 UTC
3239911	Feynman Liang	27 August 2015, 23:10:37 UTC	[SPARK-9680] [MLLIB] [DOC] StopWordsRemovers user guide and Java compatibility test * Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine * Cleans up scaladocs for public methods * Adds test for Java compatibility * Follow up Python user guide code example is tracked by SPARK-10249 Author: Feynman Liang <fliang@databricks.com> Closes #8436 from feynmanliang/SPARK-10230. (cherry picked from commit 5bfe9e1111d9862084586549a7dc79476f67bab9) Signed-off-by: Xiangrui Meng <meng@databricks.com>	27 August 2015, 23:10:45 UTC
501e10a	MechCoder	27 August 2015, 22:33:43 UTC	[SPARK-9906] [ML] User guide for LogisticRegressionSummary User guide for LogisticRegression summaries Author: MechCoder <manojkumarsivaraj334@gmail.com> Author: Manoj Kumar <mks542@nyu.edu> Author: Feynman Liang <fliang@databricks.com> Closes #8197 from MechCoder/log_summary_user_guide. (cherry picked from commit c94ecdfc5b3c0fe6c38a170dc2af9259354dc9e3) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	27 August 2015, 22:33:51 UTC
66db9cd	Yuhao Yang	27 August 2015, 20:57:20 UTC	[SPARK-9901] User guide for RowMatrix Tall-and-skinny QR jira: https://issues.apache.org/jira/browse/SPARK-9901 The jira covers only the document update. I can further provide example code for QR (like the ones for SVD and PCA) in a separate PR. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #8462 from hhbyyh/qrDoc. (cherry picked from commit 6185cdd2afcd492b77ff225b477b3624e3bc7bb2) Signed-off-by: Xiangrui Meng <meng@databricks.com>	27 August 2015, 20:57:37 UTC
db19715	CodingCat	27 August 2015, 19:19:09 UTC	[SPARK-10315] remove document on spark.akka.failure-detector.threshold https://issues.apache.org/jira/browse/SPARK-10315 This parameter is no longer used, and the current document contains a mistake: it should be 'akka.remote.watch-failure-detector.threshold'. Author: CodingCat <zhunansjtu@gmail.com> Closes #8483 from CodingCat/SPARK_10315. (cherry picked from commit 84baa5e9b5edc8c55871fbed5057324450bf097f) Signed-off-by: Sean Owen <sowen@cloudera.com>	27 August 2015, 19:19:23 UTC
965b3bb	Michael Armbrust	27 August 2015, 18:45:15 UTC	[SPARK-9148] [SPARK-10252] [SQL] Update SQL Programming Guide Author: Michael Armbrust <michael@databricks.com> Closes #8441 from marmbrus/documentation. (cherry picked from commit dc86a227e4fc8a9d8c3e8c68da8dff9298447fd0) Signed-off-by: Michael Armbrust <michael@databricks.com>	27 August 2015, 18:45:28 UTC
30f0f7e	Moussa Taifi	27 August 2015, 09:34:47 UTC	[DOCS] [STREAMING] [KAFKA] Fix typo in exactly once semantics Fix Typo in exactly once semantics [Semantics of output operations] link Author: Moussa Taifi <moutai10@gmail.com> Closes #8468 from moutai/patch-3. (cherry picked from commit 9625d13d575c97bbff264f6a94838aae72c9202d) Signed-off-by: Sean Owen <sowen@cloudera.com>	27 August 2015, 09:35:32 UTC
165be9a	Shivaram Venkataraman	27 August 2015, 05:27:31 UTC	[SPARK-10219] [SPARKR] Fix varargsToEnv and add test case cc sun-rui davies Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #8475 from shivaram/varargs-fix. (cherry picked from commit e936cf8088a06d6aefce44305f3904bbeb17b432) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>	27 August 2015, 05:27:42 UTC
04c85a8	Cheng Lian	27 August 2015, 01:14:54 UTC	[SPARK-9424] [SQL] Parquet programming guide updates for 1.5 Author: Cheng Lian <lian@databricks.com> Closes #8467 from liancheng/spark-9424/parquet-docs-for-1.5.	27 August 2015, 01:58:48 UTC
cef707d	Shivaram Venkataraman	27 August 2015, 01:13:07 UTC	[SPARK-10308] [SPARKR] Add %in% to the exported namespace I also checked all the other functions defined in column.R, functions.R and DataFrame.R and everything else looked fine. cc yu-iskw Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #8473 from shivaram/in-namespace. (cherry picked from commit ad7f0f160be096c0fdae6e6cf7e3b6ba4a606de7) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>	27 August 2015, 01:13:16 UTC
0bdb800	Davies Liu	26 August 2015, 23:04:44 UTC	[SPARK-10305] [SQL] fix create DataFrame from Python class cc jkbradley Author: Davies Liu <davies@databricks.com> Closes #8470 from davies/fix_create_df. (cherry picked from commit d41d6c48207159490c1e1d9cc54015725cfa41b2) Signed-off-by: Davies Liu <davies.liu@gmail.com>	26 August 2015, 23:04:53 UTC
efbd7af	Xiangrui Meng	26 August 2015, 21:02:19 UTC	[SPARK-10241] [MLLIB] update since versions in mllib.recommendation Same as #8421 but for `mllib.recommendation`. cc srowen coderxiang Author: Xiangrui Meng <meng@databricks.com> Closes #8432 from mengxr/SPARK-10241. (cherry picked from commit 086d4681df3ebfccfc04188262c10482f44553b0) Signed-off-by: Xiangrui Meng <meng@databricks.com>	26 August 2015, 21:02:32 UTC
b0dde36	Xiangrui Meng	26 August 2015, 18:47:05 UTC	[SPARK-9665] [MLLIB] audit MLlib API annotations I only found `ml.NaiveBayes` missing `Experimental` annotation. This PR doesn't cover Python APIs. cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8452 from mengxr/SPARK-9665. (cherry picked from commit 6519fd06cc8175c9182ef16cf8a37d7f255eb846) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	26 August 2015, 18:47:14 UTC
5220db9	felixcheung	26 August 2015, 06:48:16 UTC	[SPARK-9316] [SPARKR] Add support for filtering using `[` (synonym for filter / select) Add support for ``` df[df$name == "Smith", c(1,2)] df[df$age %in% c(19, 30), 1:2] ``` shivaram Author: felixcheung <felixcheung_m@hotmail.com> Closes #8394 from felixcheung/rsubset. (cherry picked from commit 75d4773aa50e24972c533e8b48697fde586429eb) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>	26 August 2015, 06:48:27 UTC
21a10a8	Xiangrui Meng	26 August 2015, 06:45:41 UTC	[SPARK-10236] [MLLIB] update since versions in mllib.feature Same as #8421 but for `mllib.feature`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits: 0e8d658 [Xiangrui Meng] remove unnecessary comment ad70b03 [Xiangrui Meng] update since versions in mllib.feature (cherry picked from commit 321d7759691bed9867b1f0470f12eab2faa50aff) Signed-off-by: DB Tsai <dbt@netflix.com>	26 August 2015, 06:45:53 UTC
08d390f	Xiangrui Meng	26 August 2015, 05:49:33 UTC	[SPARK-10235] [MLLIB] update since versions in mllib.regression Same as #8421 but for `mllib.regression`. cc freeman-lab dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8426 from mengxr/SPARK-10235 and squashes the following commits: 6cd28e4 [Xiangrui Meng] update since versions in mllib.regression (cherry picked from commit 4657fa1f37d41dd4c7240a960342b68c7c591f48) Signed-off-by: DB Tsai <dbt@netflix.com>	26 August 2015, 05:49:46 UTC
6d8ebc8	Xiangrui Meng	26 August 2015, 05:35:49 UTC	[SPARK-10243] [MLLIB] update since versions in mllib.tree Same as #8421 but for `mllib.tree`. cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8442 from mengxr/SPARK-10236. (cherry picked from commit fb7e12fe2e14af8de4c206ca8096b2e8113bfddc) Signed-off-by: Xiangrui Meng <meng@databricks.com>	26 August 2015, 05:35:56 UTC
be0c991	Xiangrui Meng	26 August 2015, 05:33:48 UTC	[SPARK-10234] [MLLIB] update since version in mllib.clustering Same as #8421 but for `mllib.clustering`. cc feynmanliang yu-iskw Author: Xiangrui Meng <meng@databricks.com> Closes #8435 from mengxr/SPARK-10234. (cherry picked from commit d703372f86d6a59383ba8569fcd9d379849cffbf) Signed-off-by: Xiangrui Meng <meng@databricks.com>	26 August 2015, 05:33:55 UTC
b776669	Xiangrui Meng	26 August 2015, 05:31:23 UTC	[SPARK-10240] [SPARK-10242] [MLLIB] update since versions in mllib.random and mllib.stat The same as #8421 but for `mllib.stat` and `mllib.random`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8439 from mengxr/SPARK-10242. (cherry picked from commit c3a54843c0c8a14059da4e6716c1ad45c69bbe6c) Signed-off-by: Xiangrui Meng <meng@databricks.com>	26 August 2015, 05:31:33 UTC
46750b9	Xiangrui Meng	26 August 2015, 03:07:56 UTC	[SPARK-10238] [MLLIB] update since versions in mllib.linalg Same as #8421 but for `mllib.linalg`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8440 from mengxr/SPARK-10238 and squashes the following commits: b38437e [Xiangrui Meng] update since versions in mllib.linalg (cherry picked from commit ab431f8a970b85fba34ccb506c0f8815e55c63bf) Signed-off-by: DB Tsai <dbt@netflix.com>	26 August 2015, 03:08:09 UTC
af98e51	Xiangrui Meng	26 August 2015, 01:17:54 UTC	[SPARK-10233] [MLLIB] update since version in mllib.evaluation Same as #8421 but for `mllib.evaluation`. cc avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8423 from mengxr/SPARK-10233. (cherry picked from commit 8668ead2e7097b9591069599fbfccf67c53db659) Signed-off-by: Xiangrui Meng <meng@databricks.com>	26 August 2015, 01:18:27 UTC
5cf266f	Feynman Liang	26 August 2015, 00:39:20 UTC	[SPARK-9888] [MLLIB] User guide for new LDA features * Adds two new sections to LDA's user guide; one for each optimizer/model * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization) * Cleans up a TODO and sets a default parameter in LDA code jkbradley hhbyyh Author: Feynman Liang <fliang@databricks.com> Closes #8254 from feynmanliang/SPARK-9888. (cherry picked from commit 125205cdb35530cdb4a8fff3e1ee49cf4a299583) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	26 August 2015, 00:39:40 UTC
4c03cb4	Patrick Wendell	25 August 2015, 22:56:44 UTC	Preparing development version 1.5.1-SNAPSHOT	25 August 2015, 22:56:44 UTC
7277713	Patrick Wendell	25 August 2015, 22:56:37 UTC	Preparing Spark release v1.5.0-rc2	25 August 2015, 22:56:37 UTC
ab7d46d	Davies Liu	25 August 2015, 22:19:41 UTC	[SPARK-10215] [SQL] Fix precision of division (follow the rule in Hive) Follow the rule in Hive for decimal division. see https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java#L113 cc chenghao-intel Author: Davies Liu <davies@databricks.com> Closes #8415 from davies/decimal_div2. (cherry picked from commit 7467b52ed07f174d93dfc4cb544dc4b69a2c2826) Signed-off-by: Yin Huai <yhuai@databricks.com>	25 August 2015, 22:20:42 UTC
8925896	Davies Liu	25 August 2015, 21:55:34 UTC	[SPARK-10245] [SQL] Fix decimal literals with precision < scale In BigDecimal or java.math.BigDecimal, the precision can be smaller than the scale; for example, BigDecimal("0.001") has precision = 1 and scale = 3. But DecimalType requires that the precision not be smaller than the scale, so we should use the maximum of precision and scale when inferring the schema from a decimal literal. Author: Davies Liu <davies@databricks.com> Closes #8428 from davies/smaller_decimal. (cherry picked from commit ec89bd840a6862751999d612f586a962cae63f6d) Signed-off-by: Yin Huai <yhuai@databricks.com>	25 August 2015, 21:55:45 UTC
6f05b7a	Xiangrui Meng	25 August 2015, 21:11:38 UTC	[SPARK-10239] [SPARK-10244] [MLLIB] update since versions in mllib.pmml and mllib.util Same as #8421 but for `mllib.pmml` and `mllib.util`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8430 from mengxr/SPARK-10239 and squashes the following commits: a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util (cherry picked from commit 00ae4be97f7b205432db2967ba6d506286ef2ca6) Signed-off-by: DB Tsai <dbt@netflix.com>	25 August 2015, 21:11:50 UTC
055387c	Feynman Liang	25 August 2015, 20:23:15 UTC	[SPARK-9797] [MLLIB] [DOC] StreamingLinearRegressionWithSGD.setConvergenceTol default value Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc Author: Feynman Liang <fliang@databricks.com> Closes #8424 from feynmanliang/SPARK-9797. (cherry picked from commit 9205907876cf65695e56c2a94bedd83df3675c03) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	25 August 2015, 20:23:25 UTC
186326d	Xiangrui Meng	25 August 2015, 20:22:38 UTC	[SPARK-10237] [MLLIB] update since versions in mllib.fpm Same as #8421 but for `mllib.fpm`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8429 from mengxr/SPARK-10237. (cherry picked from commit c619c7552f22d28cfa321ce671fc9ca854dd655f) Signed-off-by: Xiangrui Meng <meng@databricks.com>	25 August 2015, 20:22:45 UTC
95e44b4	Feynman Liang	25 August 2015, 20:21:05 UTC	[SPARK-9800] Adds docs for GradientDescent$.runMiniBatchSGD alias * Adds doc for alias of runMiniBatchSGD documenting default value for convergeTol * Cleans up a note in code Author: Feynman Liang <fliang@databricks.com> Closes #8425 from feynmanliang/SPARK-9800. (cherry picked from commit c0e9ff1588b4d9313cc6ec6e00e5c7663eb67910) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	25 August 2015, 20:21:16 UTC
5a32ed7	Xiangrui Meng	25 August 2015, 19:16:23 UTC	[SPARK-10231] [MLLIB] update @Since annotation for mllib.classification Update `Since` annotation in `mllib.classification`: 1. add version to classes, objects, constructors, and public variables declared in constructors 2. correct some versions 3. remove `Since` on `toString` MechCoder dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8421 from mengxr/SPARK-10231 and squashes the following commits: b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification (cherry picked from commit 16a2be1a84c0a274a60c0a584faaf58b55d4942b) Signed-off-by: DB Tsai <dbt@netflix.com>	25 August 2015, 19:16:41 UTC
c740f5d	Feynman Liang	25 August 2015, 18:58:47 UTC	[SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentration See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770) CC jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8422 from feynmanliang/SPARK-10230. (cherry picked from commit 881208a8e849facf54166bdd69d3634407f952e7) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	25 August 2015, 18:58:55 UTC
742c82e	Yuhao Yang	25 August 2015, 17:54:03 UTC	[SPARK-8531] [ML] Update ML user guide for MinMaxScaler jira: https://issues.apache.org/jira/browse/SPARK-8531 Update ML user guide for MinMaxScaler Author: Yuhao Yang <hhbyyh@gmail.com> Author: unknown <yuhaoyan@yuhaoyan-MOBL1.ccr.corp.intel.com> Closes #7211 from hhbyyh/minmaxdoc. (cherry picked from commit b37f0cc1b4c064d6f09edb161250fa8b783de52a) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	25 August 2015, 17:54:12 UTC
0402f12	Michael Armbrust	25 August 2015, 17:22:54 UTC	[SPARK-10198] [SQL] Turn off partition verification by default Author: Michael Armbrust <michael@databricks.com> Closes #8404 from marmbrus/turnOffPartitionVerification. (cherry picked from commit 5c08c86bfa43462fb2ca5f7c5980ddfb44dd57f8) Signed-off-by: Michael Armbrust <michael@databricks.com>	25 August 2015, 17:23:08 UTC
bdcc8e6	ehnalis	25 August 2015, 11:30:06 UTC	Fixed a typo in DAGScheduler. Author: ehnalis <zoltan.zvara@gmail.com> Closes #8308 from ehnalis/master. (cherry picked from commit 7f1e507bf7e82bff323c5dec3c1ee044687c4173) Signed-off-by: Sean Owen <sowen@cloudera.com>	25 August 2015, 11:30:18 UTC
5d68405	Zhang, Liye	25 August 2015, 10:48:55 UTC	[DOC] add missing parameters in SparkContext.scala for scala doc Author: Zhang, Liye <liye.zhang@intel.com> Closes #8412 from liyezhang556520/minorDoc. (cherry picked from commit 5c14890159a5711072bf395f662b2433a389edf9) Signed-off-by: Sean Owen <sowen@cloudera.com>	25 August 2015, 10:49:07 UTC
73f1dd1	Yin Huai	25 August 2015, 08:19:34 UTC	[SPARK-10197] [SQL] Add null check in wrapperFor (inside HiveInspectors). https://issues.apache.org/jira/browse/SPARK-10197 Author: Yin Huai <yhuai@databricks.com> Closes #8407 from yhuai/ORCSPARK-10197. (cherry picked from commit 0e6368ffaec1965d0c7f89420e04a974675c7f6e) Signed-off-by: Cheng Lian <lian@databricks.com>	25 August 2015, 08:20:13 UTC
a0f22cf	Josh Rosen	25 August 2015, 08:06:36 UTC	[SPARK-10195] [SQL] Data sources Filter should not expose internal types Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third-parties. This issue caused incompatibilities when upgrading our `spark-redshift` library to work against Spark 1.5.0. To avoid these issues in the future we should only expose public types through these Filter objects. This patch accomplishes this by using CatalystTypeConverters to add the appropriate conversions. Author: Josh Rosen <joshrosen@databricks.com> Closes #8403 from JoshRosen/datasources-internal-vs-external-types. (cherry picked from commit 7bc9a8c6249300ded31ea931c463d0a8f798e193) Signed-off-by: Reynold Xin <rxin@databricks.com>	25 August 2015, 08:06:51 UTC
e5cea56	Davies Liu	25 August 2015, 08:00:44 UTC	[SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive We misunderstood the Julian days and nanoseconds of the day in parquet (as TimestampType) from Hive/Impala: they overlap, so they can't be added together directly. In order to avoid confusing rounding when converting, we use `2440588` as the Julian Day of epoch of unix timestamp (which should be 2440587.5). Author: Davies Liu <davies@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #8400 from davies/timestamp_parquet. (cherry picked from commit 2f493f7e3924b769160a16f73cccbebf21973b91) Signed-off-by: Cheng Lian <lian@databricks.com>	25 August 2015, 08:00:58 UTC
2032d66	Tathagata Das	25 August 2015, 07:35:51 UTC	[SPARK-10210] [STREAMING] Filter out non-existent blocks before creating BlockRDD When write ahead log is not enabled, a recovered streaming driver still tries to run jobs using pre-failure block ids, and fails as the blocks do not exist in memory any more (and cannot be recovered as receiver WAL is not enabled). This occurs because the driver-side WAL of ReceivedBlockTracker recovers that past block information, and ReceiverInputDStream creates BlockRDDs even if those blocks do not exist. The solution in this PR is to filter out block ids that do not exist before creating the BlockRDD. In addition, it adds unit tests to verify other logic in ReceiverInputDStream. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8405 from tdas/SPARK-10210. (cherry picked from commit 1fc37581a52530bac5d555dbf14927a5780c3b75) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	25 August 2015, 07:36:01 UTC
4841ebb	Sean Owen	25 August 2015, 07:32:20 UTC	[SPARK-6196] [BUILD] Remove MapR profiles in favor of hadoop-provided Follow up to https://github.com/apache/spark/pull/7047 pwendell mentioned that MapR should use `hadoop-provided` now, and indeed the new build script does not produce `mapr3`/`mapr4` artifacts anymore. Hence the action seems to be to remove the profiles, which are now not used. CC trystanleftwich Author: Sean Owen <sowen@cloudera.com> Closes #8338 from srowen/SPARK-6196. (cherry picked from commit 57b960bf3706728513f9e089455a533f0244312e) Signed-off-by: Sean Owen <sowen@cloudera.com>	25 August 2015, 07:32:31 UTC
76d920f	Yu ISHIKAWA	25 August 2015, 07:28:51 UTC	[SPARK-10214] [SPARKR] [DOCS] Improve SparkR Column, DataFrame API docs cc: shivaram ## Summary - Add name tags to each methods in DataFrame.R and column.R - Replace `rdname column` with `rdname {each_func}`. i.e. alias method : `rdname column` => `rdname alias` ## Generated PDF File https://drive.google.com/file/d/0B9biIZIU47lLNHN2aFpnQXlSeGs/view?usp=sharing ## JIRA [[SPARK-10214] Improve SparkR Column, DataFrame API docs - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10214) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8414 from yu-iskw/SPARK-10214. (cherry picked from commit d4549fe58fa0d781e0e891bceff893420cb1d598) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>	25 August 2015, 07:28:58 UTC
b7c4ff1	Josh Rosen	25 August 2015, 07:04:10 UTC	[SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions. I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class. Author: Josh Rosen <joshrosen@databricks.com> Closes #7631 from JoshRosen/SPARK-9293. (cherry picked from commit 82268f07abfa658869df2354ae72f8d6ddd119e8) Signed-off-by: Michael Armbrust <michael@databricks.com>	25 August 2015, 07:04:23 UTC
95a14e9	Cheng Lian	25 August 2015, 06:58:42 UTC	[SPARK-10136] [SQL] A more robust fix for SPARK-10136 PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with parquet-format spec, especially details of `LIST` backwards-compatibility rules. Let me have a try to give an explanation here. The structure of the problematic Parquet schema generated by parquet-avro is something like this: ``` message m { <repetition> group f (LIST) { // Level 1 repeated group array (LIST) { // Level 2 repeated <primitive-type> array; // Level 3 } } } ``` (The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.) This structure consists of two nested legacy 2-level `LIST`-like structures: 1. The repeated group type at level 2 is the element type of the outer array defined at level 1 This group should map to an `CatalystArrayConverter.ElementConverter` when building converters. 2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2 This group should also map to an `CatalystArrayConverter.ElementConverter`. The root cause of SPARK-10136 is that, the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to parquet-format spec, the repeated primitive at level 3 is left as a so called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, thus a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it. According to parquet-format spec, unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated type appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix. 
(I didn't realize this when authoring #8341 though.) As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec: > If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. (The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.) This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2]. This PR delivers a more robust fix by adding this rule in the latter method. Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3]. [1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305 [2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463 [3]: https://issues.apache.org/jira/browse/PARQUET-364 Author: Cheng Lian <lian@databricks.com> Closes #8361 from liancheng/spark-10136/proper-version. (cherry picked from commit bf03fe68d62f33dda70dff45c3bda1f57b032dfc) Signed-off-by: Cheng Lian <lian@databricks.com>	25 August 2015, 06:58:57 UTC
0b425ed	Yin Huai	25 August 2015, 06:38:32 UTC	[SPARK-10196] [SQL] Correctly saving decimals in internal rows to JSON. https://issues.apache.org/jira/browse/SPARK-10196 Author: Yin Huai <yhuai@databricks.com> Closes #8408 from yhuai/DecimalJsonSPARK-10196. (cherry picked from commit df7041d02d3fd44b08a859f5d77bf6fb726895f0) Signed-off-by: Davies Liu <davies.liu@gmail.com>	25 August 2015, 06:38:42 UTC
bb1357f	zsxwing	25 August 2015, 06:34:50 UTC	[SPARK-10137] [STREAMING] Avoid to restart receivers if scheduleReceivers returns balanced results This PR fixes the following cases for `ReceiverSchedulingPolicy`. 1) Assume there are 4 executors: host1, host2, host3, host4, and 5 receivers: r1, r2, r3, r4, r5. Then `ReceiverSchedulingPolicy.scheduleReceivers` will return (r1 -> host1, r2 -> host2, r3 -> host3, r4 -> host4, r5 -> host1). Let's assume r1 starts first on `host1` as `scheduleReceivers` suggested, and tries to register with ReceiverTracker. But the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will return (host2, host3, host4) according to the current executor weights (host1 -> 1.0, host2 -> 0.5, host3 -> 0.5, host4 -> 0.5), so ReceiverTracker will reject `r1`. This is unexpected since r1 is starting exactly where `scheduleReceivers` suggested. This case can be fixed by ignoring the information of the receiver that is rescheduling in `receiverTrackingInfoMap`. 2) Assume there are 3 executors (host1, host2, host3), each executor has 3 cores, and there are 3 receivers: r1, r2, r3. Assume r1 is running on host1. Now r2 is restarting, and the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will always return (host1, host2, host3). So it's possible that r2 will be scheduled to host1 by TaskScheduler. r3 is similar. In the end, it's possible that there are 3 receivers running on host1, while host2 and host3 are idle. This issue can be fixed by returning only executors that have the minimum weight rather than returning at least 3 executors. Author: zsxwing <zsxwing@gmail.com> Closes #8340 from zsxwing/fix-receiver-scheduling. (cherry picked from commit f023aa2fcc1d1dbb82aee568be0a8f2457c309ae) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	25 August 2015, 06:35:02 UTC
88991dc	cody koeninger	25 August 2015, 06:26:14 UTC	[SPARK-9786] [STREAMING] [KAFKA] fix backpressure so it works with defa… …ult maxRatePerPartition setting of 0 Author: cody koeninger <cody@koeninger.org> Closes #8413 from koeninger/backpressure-testing-master. (cherry picked from commit d9c25dec87e6da7d66a47ff94e7eefa008081b9d) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	25 August 2015, 06:26:27 UTC
2239a20	Michael Armbrust	25 August 2015, 06:15:27 UTC	[SPARK-10178] [SQL] HiveComparisionTest should print out dependent tables In `HiveComparisionTest`s it is possible to fail a query of the form `SELECT * FROM dest1`, where `dest1` is the query that is actually computing the incorrect results. To aid debugging this patch improves the harness to also print these query plans and their results. Author: Michael Armbrust <michael@databricks.com> Closes #8388 from marmbrus/generatedTables. (cherry picked from commit 5175ca0c85b10045d12c3fb57b1e52278a413ecf) Signed-off-by: Reynold Xin <rxin@databricks.com>	25 August 2015, 06:15:34 UTC
c99f416	Yin Huai	25 August 2015, 04:49:50 UTC	[SPARK-10121] [SQL] Thrift server always use the latest class loader provided by the conf of executionHive's state https://issues.apache.org/jira/browse/SPARK-10121 Looks like the problem is that if we add a jar through another thread, the thread handling the JDBC session will not get the latest classloader. Author: Yin Huai <yhuai@databricks.com> Closes #8368 from yhuai/SPARK-10121. (cherry picked from commit a0c0aae1defe5e1e57704065631d201f8e3f6bac) Signed-off-by: Cheng Lian <lian@databricks.com>	25 August 2015, 04:50:44 UTC
2f7e4b4	Feynman Liang	25 August 2015, 02:45:41 UTC	[SQL] [MINOR] [DOC] Clarify docs for inferring DataFrame from RDD of Products * Makes `SQLImplicits.rddToDataFrameHolder` scaladoc consistent with `SQLContext.createDataFrame[A <: Product](rdd: RDD[A])` since the former is essentially a wrapper for the latter * Clarifies `createDataFrame[A <: Product]` scaladoc to apply for any `RDD[Product]`, not just case classes Author: Feynman Liang <fliang@databricks.com> Closes #8406 from feynmanliang/sql-doc-fixes. (cherry picked from commit 642c43c81c835139e3f35dfd6a215d668a474203) Signed-off-by: Reynold Xin <rxin@databricks.com>	25 August 2015, 02:45:48 UTC
ec5d09c	Yu ISHIKAWA	25 August 2015, 01:17:51 UTC	[SPARK-10118] [SPARKR] [DOCS] Improve SparkR API docs for 1.5 release cc: shivaram ## Summary - Modify `rdname` of expression functions. i.e. `ascii`: `rdname functions` => `rdname ascii` - Replace the dynamic function definitions with static ones for the sake of their documentation. ## Generated PDF File https://drive.google.com/file/d/0B9biIZIU47lLX2t6ZjRoRnBTSEU/view?usp=sharing ## JIRA [[SPARK-10118] Improve SparkR API docs for 1.5 release - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10118) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8386 from yu-iskw/SPARK-10118. (cherry picked from commit 6511bf559b736d8e23ae398901c8d78938e66869) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>	25 August 2015, 01:17:58 UTC
228e429	Michael Armbrust	25 August 2015, 01:10:51 UTC	[SPARK-10165] [SQL] Await child resolution in ResolveFunctions Currently, we eagerly attempt to resolve functions, even before their children are resolved. However, this is not valid in cases where we need to know the types of the input arguments (i.e. when resolving Hive UDFs). As a fix, this PR delays function resolution until the function's children are resolved. This change also necessitates a change to the way we resolve aggregate expressions that are not in aggregate operators (e.g., in `HAVING` or `ORDER BY` clauses). Specifically, we can't assume that these misplaced functions will be resolved, allowing us to differentiate aggregate functions from normal functions. To compensate for this change we now attempt to resolve these unresolved expressions in the context of the aggregate operator, before checking to see if any aggregate expressions are present. Author: Michael Armbrust <michael@databricks.com> Closes #8371 from marmbrus/hiveUDFResolution. (cherry picked from commit 2bf338c626e9d97ccc033cfadae8b36a82c66fd1) Signed-off-by: Michael Armbrust <michael@databricks.com>	25 August 2015, 01:11:04 UTC
8ca8bdd	Patrick Wendell	25 August 2015, 00:22:09 UTC	HOTFIX: Adding missing 1.4.1 ec2 version	25 August 2015, 00:22:09 UTC
a4bad5f	Josh Rosen	24 August 2015, 23:17:45 UTC	[SPARK-10190] Fix NPE in CatalystTypeConverters Decimal toScala converter This adds a missing null check to the Decimal `toScala` converter in `CatalystTypeConverters`, fixing an NPE. Author: Josh Rosen <joshrosen@databricks.com> Closes #8401 from JoshRosen/SPARK-10190. (cherry picked from commit d7b4c095271c36fcc7f9ded267ecf5ec66fac803) Signed-off-by: Reynold Xin <rxin@databricks.com>	24 August 2015, 23:17:52 UTC
aadb9de	Joseph K. Bradley	24 August 2015, 22:38:54 UTC	[SPARK-10061] [DOC] ML ensemble docs User guide for spark.ml GBTs and Random Forests. The examples are copied from the decision tree guide and modified to run. I caught some issues I had somehow missed in the tree guide as well. I have run all examples, including Java ones. (Of course, I thought I had previously as well...) CC: mengxr manishamde yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8369 from jkbradley/ml-ensemble-docs. (cherry picked from commit 13db11cb08eb90eb0ea3402c9fe0270aa282f247) Signed-off-by: Xiangrui Meng <meng@databricks.com>	24 August 2015, 22:39:01 UTC
9223443	Sean Owen	24 August 2015, 21:35:21 UTC	[SPARK-9758] [TEST] [SQL] Compilation issue for hive test / wrong package? Move `test.org.apache.spark.sql.hive` package tests to apparent intended `org.apache.spark.sql.hive` as they don't intend to test behavior from outside org.apache.spark.* Alternate take, per discussion at https://github.com/apache/spark/pull/8051 I think this is what vanzin and I had in mind but also CC rxin to cross-check, as this does indeed depend on whether these tests were accidentally in this package or not. Testing from a `test.org.apache.spark` package is legitimate but didn't seem to be the intent here. Author: Sean Owen <sowen@cloudera.com> Closes #8307 from srowen/SPARK-9758. (cherry picked from commit cb2d2e15844d7ae34b5dd7028b55e11586ed93fa) Signed-off-by: Sean Owen <sowen@cloudera.com>	24 August 2015, 21:35:31 UTC
d36f351	Cheng Lian	24 August 2015, 21:11:19 UTC	[SPARK-8580] [SQL] Refactors ParquetHiveCompatibilitySuite and adds more test cases This PR refactors `ParquetHiveCompatibilitySuite` so that it's easier to add new test cases. Hit two bugs, SPARK-10177 and HIVE-11625, while working on this, added test cases for them and marked as ignored for now. SPARK-10177 will be addressed in a separate PR. Author: Cheng Lian <lian@databricks.com> Closes #8392 from liancheng/spark-8580/parquet-hive-compat-tests. (cherry picked from commit a2f4cdceba32aaa0df59df335ca0ce1ac73fc6c2) Signed-off-by: Davies Liu <davies.liu@gmail.com>	24 August 2015, 21:11:30 UTC
831f78e	Andrew Or	24 August 2015, 21:10:50 UTC	[SPARK-10144] [UI] Actually show peak execution memory by default The peak execution memory metric was introduced in SPARK-8735. That was before Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` must be explicitly set to true. The result is that the memory is not displayed by default. Author: Andrew Or <andrew@databricks.com> Closes #8345 from andrewor14/show-memory-default. (cherry picked from commit 662bb9667669cb07cf6d2ccee0d8e76bb561cd89) Signed-off-by: Yin Huai <yhuai@databricks.com>	24 August 2015, 21:11:03 UTC
43dcf95	Burak Yavuz	24 August 2015, 20:48:01 UTC	[SPARK-7710] [SPARK-7998] [DOCS] Docs for DataFrameStatFunctions This PR contains examples on how to use some of the Stat Functions available for DataFrames under `df.stat`. rxin Author: Burak Yavuz <brkyvz@gmail.com> Closes #8378 from brkyvz/update-sql-docs. (cherry picked from commit 9ce0c7ad333f4a3c01207e5e9ed42bcafb99d894) Signed-off-by: Reynold Xin <rxin@databricks.com>	24 August 2015, 20:48:09 UTC
d003373	Tathagata Das	24 August 2015, 19:40:09 UTC	[SPARK-9791] [PACKAGE] Change private class to private[package] class to prevent unnecessary classes from showing up in the docs In addition, some random cleanup of import ordering Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8387 from tdas/SPARK-9791 and squashes the following commits: 67f3ee9 [Tathagata Das] Change private class to private[package] class to prevent them from showing up in the docs (cherry picked from commit 7478c8b66d6a2b1179f20c38b49e27e37b0caec3) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	24 August 2015, 19:40:23 UTC
36bc50c	zsxwing	24 August 2015, 19:38:01 UTC	[SPARK-10168] [STREAMING] Fix the issue that maven publishes wrong artifact jars This PR removed the `outputFile` configuration from pom.xml and updated `tests.py` to search jars for both sbt build and maven build. I ran ` mvn -Pkinesis-asl -DskipTests clean install` locally, and verified the jars in my local repository were correct. I also checked Python tests for maven build, and it passed all tests. Author: zsxwing <zsxwing@gmail.com> Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits: e0b5818 [zsxwing] Fix the sbt build c697627 [zsxwing] Add the jar pathes to the exception message be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars (cherry picked from commit 4e0395ddb764d092b5b38447af49e196e590e0f0) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	24 August 2015, 19:38:10 UTC
b40059d	Tathagata Das	24 August 2015, 02:24:32 UTC	[SPARK-10142] [STREAMING] Made python checkpoint recovery handle non-local checkpoint paths and existing SparkContexts The current code only checks checkpoint files in local filesystem, and always tries to create a new Python SparkContext (even if one already exists). The solution is to do the following: 1. Use the same code path as Java to check whether a valid checkpoint exists 2. Create a new Python SparkContext only if there is no active one. There is no test for the path as it's hard to test with distributed filesystem paths in a local unit test. I am going to test it with a distributed file system manually to verify that this patch works. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8366 from tdas/SPARK-10142 and squashes the following commits: 3afa666 [Tathagata Das] Added tests 2dd4ae5 [Tathagata Das] Added the check to not create a context if one already exists 9bf151b [Tathagata Das] Made python checkpoint recovery use java to find the checkpoint files (cherry picked from commit 053d94fcf32268369b5a40837271f15d6af41aa4) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	24 August 2015, 02:24:42 UTC
00f812d	Joseph K. Bradley	24 August 2015, 01:34:07 UTC	[SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bug GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests. This PR adds a unit test which checks this. It failed previously but works with this fix. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8370 from jkbradley/gmm-fix. (cherry picked from commit b963c19a803c5a26c9b65655d40ca6621acf8bd4) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	24 August 2015, 01:34:15 UTC
1c5a828	zsxwing	24 August 2015, 00:41:49 UTC	[SPARK-10148] [STREAMING] Display active and inactive receiver numbers in Streaming page Added the active and inactive receiver numbers in the summary section of Streaming page. <img width="1074" alt="screen shot 2015-08-21 at 2 08 54 pm" src="https://cloud.githubusercontent.com/assets/1000778/9402437/ff2806a2-480f-11e5-8f8e-efdf8e5d514d.png"> Author: zsxwing <zsxwing@gmail.com> Closes #8351 from zsxwing/receiver-number. (cherry picked from commit c6df5f66d9a8b9760f2cd46fcd930f977650c9c5) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	24 August 2015, 00:52:46 UTC
595f92f	Keiji Yoshida	23 August 2015, 10:04:29 UTC	Update streaming-programming-guide.md Update `See the Scala example` to `See the Java example`. Author: Keiji Yoshida <yoshida.keiji.84@gmail.com> Closes #8376 from yosssi/patch-1. (cherry picked from commit 623c675fde7a3a39957a62c7af26a54f4b01f8ce) Signed-off-by: Sean Owen <sowen@cloudera.com>	23 August 2015, 10:04:44 UTC
5f03b7a	Keiji Yoshida	22 August 2015, 09:38:10 UTC	Update programming-guide.md Update `lineLengths.persist();` to `lineLengths.persist(StorageLevel.MEMORY_ONLY());` because `JavaRDD#persist` needs a parameter of `StorageLevel`. Author: Keiji Yoshida <yoshida.keiji.84@gmail.com> Closes #8372 from yosssi/patch-1. (cherry picked from commit 46fcb9e0dbb2b28110f68a3d9f6c0c47bfd197b1) Signed-off-by: Reynold Xin <rxin@databricks.com>	22 August 2015, 09:38:18 UTC
fbf9a6e	Xusen Yin	21 August 2015, 23:30:12 UTC	[SPARK-9893] User guide with Java test suite for VectorSlicer Add user guide for `VectorSlicer`, with Java test suite and Python version VectorSlicer. Note that Python version does not support selecting by names now. Author: Xusen Yin <yinxusen@gmail.com> Closes #8267 from yinxusen/SPARK-9893. (cherry picked from commit 630a994e6a9785d1704f8e7fb604f32f5dea24f8) Signed-off-by: Xiangrui Meng <meng@databricks.com>	21 August 2015, 23:30:19 UTC
cb61c7b	Joseph K. Bradley	21 August 2015, 23:28:00 UTC	[SPARK-10163] [ML] Allow single-category features for GBT models Removed categorical feature info validation since no longer needed This is needed to make the ML user guide examples work (in another current PR). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8367 from jkbradley/gbt-single-cat. (cherry picked from commit f01c4220d2b791f470fa6596ffe11baa51517fbe) Signed-off-by: Xiangrui Meng <meng@databricks.com>	21 August 2015, 23:28:07 UTC
914da35	Patrick Wendell	21 August 2015, 21:56:50 UTC	Preparing development version 1.5.0-SNAPSHOT	21 August 2015, 21:56:50 UTC
