https://github.com/apache/spark

49c30c1 Preparing Spark release v1.5.2-rc2 03 November 2015, 16:51:57 UTC
9795666 Update branch-1.5 for 1.5.2 release. Author: Reynold Xin <rxin@databricks.com> Closes #9435 from rxin/patch1.5.2. 03 November 2015, 16:50:08 UTC
5604ce9 [SPARK-11188] [SQL] Elide stacktraces in bin/spark-sql for AnalysisExceptions Only print the error message to the console for AnalysisExceptions in the spark-sql shell. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #9374 from dilipbiswal/dkb-11188-v152 and squashes the following commits: a58cedc [Dilip Biswal] [SPARK-11188][SQL] Elide stacktraces in bin/spark-sql for AnalysisExceptions 03 November 2015, 11:14:01 UTC
b85bf8f [SPARK-11424] Guard against double-close() of RecordReaders **TL;DR**: We can rule out one rare but potential cause of input stream corruption via defensive programming. ## Background [MAPREDUCE-5918](https://issues.apache.org/jira/browse/MAPREDUCE-5918) is a bug where an instance of a decompressor ends up getting placed into a pool multiple times. Since the pool is backed by a list instead of a set, this can lead to the same decompressor being used in different places at the same time, which is not safe because those decompressors will overwrite each other's buffers. Sometimes this buffer sharing will lead to exceptions but other times it might silently result in invalid / garbled input. That Hadoop bug is fixed in Hadoop 2.7 but is still present in many Hadoop versions that we wish to support. As a result, I think that we should try to work around this issue in Spark via defensive programming to prevent RecordReaders from being closed multiple times. So far, I've had a hard time coming up with explanations of exactly how double-`close()`s occur in practice, but I do have a couple of explanations that work on paper. For instance, it looks like https://github.com/apache/spark/pull/7424, added in 1.5, introduces at least one extremely rare corner-case path where Spark could double-close() a LineRecordReader instance in a way that triggers the bug. Here are the steps involved in the bad execution that I brainstormed up: * [The task has finished reading input, so we call close()](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L168). * [While handling the close call and trying to close the reader, reader.close() throws an exception](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L190) * We don't set `reader = null` after handling this exception, so the [TaskCompletionListener also ends up calling NewHadoopRDD.close()](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L156), which, in turn, closes the record reader again. In this hypothetical situation, `LineRecordReader.close()` could [fail with an exception if its InputStream failed to close](https://github.com/apache/hadoop/blob/release-1.2.1/src/mapred/org/apache/hadoop/mapred/LineRecordReader.java#L212). I googled for "Exception in RecordReader.close()" and it looks like it's possible for a closed Hadoop FileSystem to trigger an error there: [SPARK-757](https://issues.apache.org/jira/browse/SPARK-757), [SPARK-2491](https://issues.apache.org/jira/browse/SPARK-2491) Looking at [SPARK-3052](https://issues.apache.org/jira/browse/SPARK-3052), it seems like it's possible to get spurious exceptions there when there is an error reading from Hadoop. If the Hadoop FileSystem were to get into an error state _right_ after reading the last record then it looks like we could hit the bug here in 1.5. ## The fix This patch guards against these issues by modifying `HadoopRDD.close()` and `NewHadoopRDD.close()` so that they set `reader = null` even if an exception occurs in the `reader.close()` call. In addition, I modified `NextIterator.closeIfNeeded()` to guard against double-close if the first `close()` call throws an exception. I don't have an easy way to test this, since I haven't been able to reproduce the bug that prompted this patch, but these changes seem safe and seem to rule out the on-paper reproductions that I was able to brainstorm up.
Author: Josh Rosen <joshrosen@databricks.com> Closes #9382 from JoshRosen/hadoop-decompressor-pooling-fix and squashes the following commits: 5ec97d7 [Josh Rosen] Add SqlNewHadoopRDD.unsetInputFileName() that I accidentally deleted. ae46cf4 [Josh Rosen] Merge remote-tracking branch 'origin/master' into hadoop-decompressor-pooling-fix 087aa63 [Josh Rosen] Guard against double-close() of RecordReaders. (cherry picked from commit ac4118db2dda802b936bb7a18a08844846c71285) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 31 October 2015, 17:47:56 UTC
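A minimal Scala sketch of the guard described above; the class and names are illustrative, not the actual NewHadoopRDD/NextIterator code. The idea is simply to drop the reference regardless of whether the close call succeeds, so a close() that throws cannot lead to the same resource being closed twice later.
```scala
import java.io.Closeable

// Illustrative helper, not Spark code: close a resource at most once, even if
// the first close() attempt throws.
class CloseOnce(c: Closeable) {
  private var resource: Closeable = c

  def closeIfNeeded(): Unit = {
    val r = resource
    resource = null           // forget the reference first, unconditionally
    if (r != null) r.close()  // then close; a repeat call is now a no-op
  }
}

object CloseOnceDemo {
  def main(args: Array[String]): Unit = {
    // The file path is just an example of something Closeable.
    val once = new CloseOnce(new java.io.FileInputStream("/etc/hosts"))
    once.closeIfNeeded()
    once.closeIfNeeded() // safe: the second call does nothing
  }
}
```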
c9ac0e9 [SPARK-11434][SPARK-11103][SQL] Fix test ": Filter applied on merged Parquet schema with new column fails" https://issues.apache.org/jira/browse/SPARK-11434 Author: Yin Huai <yhuai@databricks.com> Closes #9387 from yhuai/SPARK-11434. (cherry picked from commit 3c471885dc4f86bea95ab542e0d48d22ae748404) Signed-off-by: Yin Huai <yhuai@databricks.com> 31 October 2015, 03:05:20 UTC
6b10ea5 [SPARK-10829] [SPARK-11301] [SQL] fix 2 bugs for filter on partitioned columns (1.5 backport) [SPARK-10829](https://github.com/apache/spark/pull/8916) Filter combine partition key and attribute doesn't work in DataSource scan [SPARK-11301](https://github.com/apache/spark/pull/9271) fix case sensitivity for filter on partitioned columns Author: Wenchen Fan <wenchen@databricks.com> This patch had conflicts when merged, resolved by Committer: Yin Huai <yhuai@databricks.com> Closes #9371 from cloud-fan/branch-1.5. 30 October 2015, 19:14:53 UTC
06d3257 [SPARK-11103][SQL] Filter applied on merged Parquet schema with new column fails When schema merging and predicate filtering are both enabled, this fails since Parquet does not accept filters pushed down when the columns of the filters do not exist in the schema. This is related to a Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389). For now, this PR simply disables predicate push-down when using a merged schema. Author: hyukjinkwon <gurwls223@gmail.com> Closes #9327 from HyukjinKwon/SPARK-11103. (cherry picked from commit 59db9e9c382fab40aac0633f2c779bee8cf2025f) Signed-off-by: Cheng Lian <lian@databricks.com> 30 October 2015, 10:21:52 UTC
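For context, a spark-shell sketch (1.5.x) of the combination this change affects; the path and column name below are made up for illustration:
```scala
// Schema merging turned on for a Parquet read, followed by a filter on a column
// that only exists in some of the part-files. With this change, predicate
// push-down is skipped when a merged schema is in play.
scala> val df = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/events")
scala> df.filter("new_col = 1").count()
```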
0df2c78 [SPARK-11417] [SQL] no @Override in codegen Older versions of Janino (before 2.7) do not support `@Override`, so we should not use it in codegen. Author: Davies Liu <davies@databricks.com> Closes #9372 from davies/no_override. (cherry picked from commit eb59b94c450fe6391d24d44ff7ea9bd4c6893af8) Signed-off-by: Davies Liu <davies.liu@gmail.com> Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratePredicate.scala 30 October 2015, 07:38:07 UTC
bb3b362 [SPARK-11032] [SQL] correctly handle HAVING We should not stop resolving the HAVING clause once the having condition is resolved, or something like `count(1)` will crash. Author: Wenchen Fan <cloud0fan@163.com> Closes #9105 from cloud-fan/having. (cherry picked from commit e170c22160bb452f98c340489ebf8390116a8cbb) Signed-off-by: Yin Huai <yhuai@databricks.com> Conflicts: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 29 October 2015, 15:05:31 UTC
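Illustrative query shape (assumed table `src` with column `key`) of the case that used to crash: a HAVING condition built from an aggregate such as `count(1)`.
```scala
scala> sqlContext.sql("SELECT key FROM src GROUP BY key HAVING count(1) > 1").show()
```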
76d7423 [SPARK-11246] [SQL] Table cache for Parquet broken in 1.5 The root cause is that when spark.sql.hive.convertMetastoreParquet=true (the default), the cached InMemoryRelation of the ParquetRelation cannot be looked up from the cachedData of CacheManager, because the key comparison fails even though it is the same LogicalPlan representing the Subquery that wraps the ParquetRelation. The solution in this PR is to override the LogicalPlan.sameResult function in the Subquery case class to eliminate the subquery node first before directly comparing the child (ParquetRelation), which will find the key to the cached InMemoryRelation. Author: xin Wu <xinwu@us.ibm.com> Closes #9326 from xwu0226/spark-11246-commit. (cherry picked from commit f7a51deebad1b4c3b970a051f25d286110b94438) Signed-off-by: Yin Huai <yhuai@databricks.com> Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala 29 October 2015, 14:57:10 UTC
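A spark-shell sketch of the scenario (the Hive Parquet table name is assumed): cache a metastore Parquet table and query it; before this fix the cached InMemoryRelation was not found and the table was scanned again.
```scala
// In a spark-shell built with Hive support, sqlContext is a HiveContext.
scala> sqlContext.cacheTable("parquet_tbl")
scala> sqlContext.sql("SELECT count(*) FROM parquet_tbl").show()  // should now hit the cache
```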
9e3197a Typo in mllib-evaluation-metrics.md Recall by threshold snippet was using "precisionByThreshold" Author: Mageswaran.D <mageswaran1989@gmail.com> Closes #9333 from Mageswaran1989/Typo_in_mllib-evaluation-metrics.md. (cherry picked from commit fd9e345ceeff385ba614a16d478097650caa98d0) Signed-off-by: Xiangrui Meng <meng@databricks.com> 28 October 2015, 15:46:40 UTC
3bd596d [SPARK-11303][SQL] filter should not be pushed down into sample When sampling and then filtering a DataFrame, the SQL optimizer will push the filter down into the sample and produce a wrong result. This is because the sampler is calculated based on the original scope rather than the scope after filtering. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9294 from yanboliang/spark-11303. 28 October 2015, 13:21:31 UTC
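A spark-shell sketch of the affected pattern (the data is arbitrary): sampling first, then filtering. Before this fix the optimizer could push the filter below the sample, changing the population the sampler sees and therefore the result.
```scala
scala> val df = sqlContext.range(0, 10000)
scala> df.sample(withReplacement = false, fraction = 0.1).filter("id > 5000").count()
```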
86ee81e [SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test. Supersedes https://github.com/apache/spark/pull/9293 Author: Sean Owen <sowen@cloudera.com> Closes #9309 from srowen/SPARK-11302.2. (cherry picked from commit 826e1e304b57abbc56b8b7ffd663d53942ab3c7c) Signed-off-by: Xiangrui Meng <meng@databricks.com> 28 October 2015, 06:07:48 UTC
abb0ca7 [SPARK-11270][STREAMING] Add improved equality testing for TopicAndPartition from the Kafka Streaming API jerryshao tdas I know this is kind of minor, and I know you all are busy, but this brings this class in line with the `OffsetRange` class, and makes tests a little more concise. Instead of doing something like: ``` assert topic_and_partition_instance._topic == "foo" assert topic_and_partition_instance._partition == 0 ``` You can do something like: ``` assert topic_and_partition_instance == TopicAndPartition("foo", 0) ``` Before: ``` >>> from pyspark.streaming.kafka import TopicAndPartition >>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0) False ``` After: ``` >>> from pyspark.streaming.kafka import TopicAndPartition >>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0) True ``` I couldn't find any tests - am I missing something? Author: Nick Evans <me@nicolasevans.org> Closes #9236 from manygrams/topic_and_partition_equality. (cherry picked from commit 8f888eea1aef5a28916ec406a99fc19648681ecf) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 27 October 2015, 08:30:29 UTC
8a6e63c [SQL][DOC] Minor document fixes in interfaces.scala rxin just noticed this while reading the code. Author: Alexander Slesarenko <avslesarenko@gmail.com> Closes #9284 from aslesarenko/doc-typos. (cherry picked from commit 4bb2b3698ffed58cc5159db36f8b11573ad26b23) Signed-off-by: Reynold Xin <rxin@databricks.com> 26 October 2015, 22:49:48 UTC
a355d0d [SPARK-5966][WIP] Spark-submit deploy-mode cluster is not compatible with master local Author: Kevin Yu <qyu@us.ibm.com> Closes #9220 from kevinyu98/working_on_spark-5966. (cherry picked from commit 616be29c7f2ebc184bd5ec97210da36a2174d80c) Signed-off-by: Sean Owen <sowen@cloudera.com> 26 October 2015, 09:35:52 UTC
74921c2 [SPARK-11287] Fixed class name to properly start TestExecutor from deploy.client.TestClient Executing deploy.client.TestClient fails due to bad class name for TestExecutor in ApplicationDescription. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9255 from BryanCutler/fix-TestClient-classname-SPARK-11287. (cherry picked from commit 80279ac1875d488f7000f352a958a35536bd4c2e) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 October 2015, 19:05:55 UTC
36fddb0 [SPARK-11299][DOC] Fix link to Scala DataFrame Functions reference The SQL programming guide's link to the DataFrame functions reference points to the wrong location; this patch fixes that. Author: Josh Rosen <joshrosen@databricks.com> Closes #9269 from JoshRosen/SPARK-11299. (cherry picked from commit b67dc6a4342577e73b0600b51052c286c4569960) Signed-off-by: Reynold Xin <rxin@databricks.com> 25 October 2015, 09:31:54 UTC
5200a6e Fix typos Two typos squashed. BTW, let me know how to proceed with other typos if I run across any. I don't feel comfortable leaving them aside, but sending pull requests with such tiny changes feels odd too. Guide me. Author: Jacek Laskowski <jacek.laskowski@deepsense.io> Closes #9250 from jaceklaskowski/typos-hunting. (cherry picked from commit 146da0d8100490a6e49a6c076ec253cdaf9f8905) Signed-off-by: Sean Owen <sowen@cloudera.com> 25 October 2015, 00:33:37 UTC
1cd2d9c [SPARK-11264] bin/spark-class can't find assembly jars with certain GREP_OPTIONS set Temporarily remove GREP_OPTIONS if set in bin/spark-class. Some GREP_OPTIONS will modify the output of the grep commands that are looking for the assembly jars. For example, if the -n option is specified, the grep output will look like: 5:spark-assembly-1.5.1-hadoop2.4.0.jar This will not match the regular expressions, and so the jar files will not be found. We could improve the regular expression to handle this case and trim off extra characters, but it is difficult to know which options may or may not be set. Unsetting GREP_OPTIONS within the script handles all the cases and gives the desired output. Author: Jeffrey Naisbitt <jnaisbitt@familysearch.org> Closes #9231 from naisbitt/unset-GREP_OPTIONS. (cherry picked from commit 28132ceb10d0c127495ce8cb36135e1cb54164d7) Signed-off-by: Sean Owen <sowen@cloudera.com> 24 October 2015, 17:21:47 UTC
56f0bb6 [SPARK-11294][SPARKR] Improve R doc for read.df, write.df, saveAsTable Add examples for read.df, write.df; fix grouping for read.df, loadDF; fix formatting and text truncation for write.df, saveAsTable. Several text issues: ![image](https://cloud.githubusercontent.com/assets/8969467/10708590/1303a44e-79c3-11e5-854f-3a2e16854cd7.png) - text collapsed into a single paragraph - text truncated at 2 places, eg. "overwrite: Existing data is expected to be overwritten by the contents of error:" shivaram Author: felixcheung <felixcheung_m@hotmail.com> Closes #9261 from felixcheung/rdocreadwritedf. (cherry picked from commit 5e458125018029cef5cde3390f4a55dd4e164fde) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 24 October 2015, 04:42:14 UTC
9695f45 [SPARK-10971][SPARKR] RRunner should allow setting path to Rscript. Add a new spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes. The existing spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395). BTW, [environment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate the R shell on the local host. For your information, PySpark has two environment variables serving a similar purpose: PYSPARK_PYTHON, the Python binary executable to use for PySpark in both driver and workers (default is `python`), and PYSPARK_DRIVER_PYTHON, the Python binary executable to use for PySpark in the driver only (default is PYSPARK_PYTHON). PySpark uses the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the Python executable for a Python script. Author: Sun Rui <rui.sun@intel.com> Closes #9179 from sun-rui/SPARK-10971. (cherry picked from commit 2462dbcce89d657bca17ae311c99c2a4bee4a5fa) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 24 October 2015, 04:39:00 UTC
03d3ad4 Fix typo "Received" to "Receiver" in streaming-kafka-integration.md Removed typo on line 8 in markdown : "Received" -> "Receiver" Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu> Closes #9242 from RohanBhanderi/patch-1. (cherry picked from commit 16dc9f344c08deee104090106cb0a537a90e33fc) Signed-off-by: Reynold Xin <rxin@databricks.com> 23 October 2015, 08:10:56 UTC
be3e343 Preparing development version 1.5.3-SNAPSHOT 22 October 2015, 23:02:11 UTC
ad6ade1 Preparing Spark release v1.5.2-rc1 22 October 2015, 23:02:05 UTC
a76cf51 [SPARK-11251] Fix page size calculation in local mode ``` // My machine only has 8 cores $ bin/spark-shell --master local[32] scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b") scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count() Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351) ``` Author: Andrew Or <andrew@databricks.com> Closes #9209 from andrewor14/fix-local-page-size. (cherry picked from commit 34e71c6d89c1f2b6236dbf0d75cd12da08003c84) Signed-off-by: Reynold Xin <rxin@databricks.com> 22 October 2015, 22:58:17 UTC
e405c2a [SPARK-10812] [YARN] Fix shutdown of token renewer. A recent change to fix the referenced bug caused this exception in the `SparkContext.stop()` path: org.apache.spark.SparkException: YarnSparkHadoopUtil is not available in non-YARN mode! at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:167) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:182) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:440) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1579) at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1730) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185) at org.apache.spark.SparkContext.stop(SparkContext.scala:1729) Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8996 from vanzin/SPARK-10812. (cherry picked from commit 4b74755122d51edb1257d4f3785fb24508681068) 22 October 2015, 20:14:26 UTC
c49e0c3 [SPARK-10812] [YARN] Spark hadoop util support switching to yarn While this is likely not a huge issue for real production systems, for test systems which may set up a SparkContext, tear it down, and stand up a SparkContext with a different master (e.g. some local-mode and some yarn-mode tests), this can be an issue. Discovered during work on spark-testing-base on Spark 1.4.1, but it seems like the logic that triggers it is present in master (see SparkHadoopUtil object). A valid workaround for users encountering this issue is to fork a different JVM, however this can be heavyweight. ``` [info] SampleMiniClusterTest: [info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED *** [info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil [info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163) [info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257) [info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561) [info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) [info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57) [info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) [info] at org.apache.spark.SparkContext.<init>(SparkContext.scala:497) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186) [info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26) [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103) ``` Author: Holden Karau <holden@pigscanfly.ca> Closes #8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn. (cherry picked from commit d8d50ed388d2e695b69d2b93a620045ef2f0bc18) 22 October 2015, 20:14:21 UTC
f9ad0e5 [SPARK-11244][SPARKR] sparkR.stop() should remove SQLContext SparkR should remove `.sparkRSQLsc` and `.sparkRHivesc` when `sparkR.stop()` is called. Otherwise, even when the SparkContext is reinitialized, `sparkRSQL.init` returns the stale copy of the object and complains: ```r sc <- sparkR.init("local") sqlContext <- sparkRSQL.init(sc) sparkR.stop() sc <- sparkR.init("local") sqlContext <- sparkRSQL.init(sc) sqlContext ``` producing ```r Error in callJMethod(x, "getClass") : Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed. ``` I have added the check and removal only when the SparkContext itself is initialized. I have also added a corresponding test for this fix. Let me know if you want me to move the test to the SQL test suite instead. p.s. I tried lint-r but ended up with a lot of errors on existing code. Author: Forest Fang <forest.fang@outlook.com> Closes #9205 from saurfang/sparkR.stop. (cherry picked from commit 94e2064fa1b04c05c805d9175c7c78bf583db5c6) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 22 October 2015, 16:34:18 UTC
fe5c3ae [SPARK-11233][SQL] register cosh in function registry Author: Shagun Sodhani <sshagunsodhani@gmail.com> Closes #9199 from shagunsodhani/proposed-fix-#11233. (cherry picked from commit 19ad18638e27cc7b403ea98c4f9f40a940932e30) Signed-off-by: Reynold Xin <rxin@databricks.com> 21 October 2015, 21:18:13 UTC
59747d0 [SPARK-10534] [SQL] ORDER BY clause allows only columns that are present in the select projection list Find the missing attributes by recursively looking at the sort order expression; the rest of the code takes care of projecting them out. Added description from cloud-fan: I wanna explain a bit more about this bug. When we resolve sort ordering, we use a special method which only resolves UnresolvedAttributes and UnresolvedExtractValue. However, for something like Floor('a), even when 'a is resolved, the floor expression may still be unresolved because of a data type mismatch (for example, 'a is string type and Floor needs double type); thus it can't pass this filter, and we can't push down this missing attribute 'a. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #9123 from dilipbiswal/SPARK-10534. (cherry picked from commit 49ea0e9d7ce805d312d94a5b2936eec2053bc052) Signed-off-by: Yin Huai <yhuai@databricks.com> 21 October 2015, 18:10:44 UTC
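An illustrative query of the shape this fixes (table and columns are assumed): the ORDER BY expression references a column that is not in the select projection list and needs an implicit cast, so it previously stayed unresolved.
```scala
// "age" is assumed to be a string column here, so floor(age) needs a cast to double.
scala> sqlContext.sql("SELECT name FROM people ORDER BY floor(age)").show()
```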
0887e5e [SPARK-11153][SQL] Disables Parquet filter push-down for string and binary columns Due to PARQUET-251, `BINARY` columns in existing Parquet files may be written with corrupted statistics information. This information is used by filter push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by default, we may end up with wrong query results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. This affects all Spark SQL data types that can be mapped to Parquet `BINARY`, namely: - `StringType` - `BinaryType` - `DecimalType` (But Spark SQL doesn't support pushing down filters involving `DecimalType` columns for now.) To avoid wrong query results, we should disable filter push-down for columns of `StringType` and `BinaryType` until we upgrade to parquet-mr 1.8. Author: Cheng Lian <lian@databricks.com> Closes #9152 from liancheng/spark-11153.workaround-parquet-251. 21 October 2015, 01:02:20 UTC
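Related knob for reference: `spark.sql.parquet.filterPushdown` is the existing user-facing option that controls Parquet push-down globally; setting it to false (a coarser workaround than this patch) disables push-down for all column types, not just strings and binaries.
```scala
scala> sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
```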
a3ab671 [HOT-FIX] Remove unnecessary changes introduced by the backport of SPARK-10577 (https://github.com/apache/spark/commit/30eea40fff97391b8ee3201dd7c6ea7440521386). 20 October 2015, 01:01:21 UTC
2195fec [SPARK-11051][CORE] Do not allow local checkpointing after the RDD is materialized and checkpointed JIRA: https://issues.apache.org/jira/browse/SPARK-11051 When an `RDD` is materialized and checkpointed, its partitions and dependencies are cleared. If we allow local checkpointing on it, `LocalRDDCheckpointData` is assigned to its `checkpointData`, and the next time the RDD is materialized an error is thrown. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9072 from viirya/no-localcheckpoint-after-checkpoint. (cherry picked from commit a1413b3662250dd5e980e8b1f7c3dc4585ab4766) Signed-off-by: Andrew Or <andrew@databricks.com> 19 October 2015, 23:16:39 UTC
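A spark-shell sketch of the now-rejected sequence (the checkpoint directory is illustrative):
```scala
scala> sc.setCheckpointDir("/tmp/ckpt")
scala> val rdd = sc.parallelize(1 to 100)
scala> rdd.checkpoint()
scala> rdd.count()            // materializes and checkpoints; partitions/dependencies are cleared
scala> rdd.localCheckpoint()  // after this change, this is rejected instead of corrupting state
```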
5186ec8 [SPARK-11063] [STREAMING] Change preferredLocations of Receiver's RDD to hosts rather than hostports The format of RDD's preferredLocations must be hostname but the format of Streaming Receiver's scheduling executors is hostport. So it doesn't work. This PR converts `schedulerExecutors` to `hosts` before creating Receiver's RDD. Author: zsxwing <zsxwing@gmail.com> Closes #9075 from zsxwing/SPARK-11063. (cherry picked from commit 67582132bffbaaeaadc5cf8218f6239d03c39da0) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 19 October 2015, 22:35:46 UTC
6480740 [SPARK-11126][SQL] Fix the potential flaky test The unit test added in #9132 is flaky. This is a follow up PR to add `listenerBus.waitUntilEmpty` to fix it. Author: zsxwing <zsxwing@gmail.com> Closes #9163 from zsxwing/SPARK-11126-follow-up. (cherry picked from commit beb8bc1ea588b7f9ab7effff707c0f784421364d) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 19 October 2015, 07:08:02 UTC
803339c [SPARK-11126][SQL] Fix a memory leak in SQLListener._stageIdToStageMetrics SQLListener adds all stage infos to `_stageIdToStageMetrics`, but only removes stage infos belonging to SQL executions. This PR fixed it by ignoring stages that don't belong to SQL executions. Reported by Terry Hoo in https://www.mail-archive.com/userspark.apache.org/msg38810.html Author: zsxwing <zsxwing@gmail.com> Closes #9132 from zsxwing/SPARK-11126. (cherry picked from commit 94c8fef296e5cdac9a93ed34acc079e51839caa7) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 18 October 2015, 20:52:07 UTC
fa7c3ea [SPARK-11104] [STREAMING] Fix a deadlock in StreamingContext.stop The following deadlock may happen if shutdownHook and StreamingContext.stop are running at the same time. ``` Java stack information for the threads listed above: =================================================== "Thread-2": at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699) - waiting to lock <0x00000005405a1680> (a org.apache.spark.streaming.StreamingContext) at org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729) at org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236) - locked <0x00000005405b6a00> (a org.apache.spark.util.SparkShutdownHookManager) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) "main": at org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248) - waiting to lock <0x00000005405b6a00> (a org.apache.spark.util.SparkShutdownHookManager) at org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712) - locked <0x00000005405a1680> (a org.apache.spark.streaming.StreamingContext) at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684) - locked <0x00000005405a1680> (a org.apache.spark.streaming.StreamingContext) at org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108) at org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ``` This PR just moved `ShutdownHookManager.removeShutdownHook` out of `synchronized` to avoid deadlock.
Author: zsxwing <zsxwing@gmail.com> Closes #9116 from zsxwing/stop-deadlock. (cherry picked from commit e1eef248f13f6c334fe4eea8a29a1de5470a2e62) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 16 October 2015, 20:57:25 UTC
9814de0 [SPARK-10581] [DOCS] Groups are not resolved in scaladoc in sql classes Groups are not resolved properly in scaladoc in following classes: sql/core/src/main/scala/org/apache/spark/sql/Column.scala sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala sql/core/src/main/scala/org/apache/spark/sql/functions.scala Author: Pravin Gadakh <pravingadakh177@gmail.com> Closes #9148 from pravingadakh/master. (cherry picked from commit 3d683a139b333456a6bd8801ac5f113d1ac3fd18) Signed-off-by: Reynold Xin <rxin@databricks.com> 16 October 2015, 20:39:49 UTC
9903e05 [SPARK-11094] Strip extra strings from Java version in test runner Removes any extra strings from the Java version, fixing subsequent integer parsing. This is required since some OpenJDK versions (specifically in Debian testing), append an extra "-internal" string to the version field. Author: Jakob Odersky <jodersky@gmail.com> Closes #9111 from jodersky/fixtestrunner. (cherry picked from commit 08698ee1d6f29b2c999416f18a074d5193cdacd5) Signed-off-by: Sean Owen <sowen@cloudera.com> 16 October 2015, 13:26:46 UTC
7aaf485 [SPARK-11135] [SQL] Exchange incorrectly skips sorts when existing ordering is non-empty subset of required ordering In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases where the data has already been sorted by a superset of the requested sorting columns. For instance, let's say that a query calls for an operator's input to be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` alone will not satisfy the ordering requirements, requiring an additional sort to be planned by Exchange. However, the current Exchange code gets this wrong and incorrectly skips sorting when the existing output ordering is a subset of the required ordering. This is simple to fix, however. This bug was introduced in https://github.com/apache/spark/pull/7458, so it affects 1.5.0+. This patch fixes the bug and significantly improves the unit test coverage of Exchange's sort-planning logic. Author: Josh Rosen <joshrosen@databricks.com> Closes #9140 from JoshRosen/SPARK-11135. (cherry picked from commit eb0b4d6e2ddfb765f082d0d88472626336ad2609) Signed-off-by: Michael Armbrust <michael@databricks.com> 16 October 2015, 00:37:07 UTC
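A toy model of the corrected check, not Exchange's real code: an existing output ordering satisfies a required ordering only when the required ordering is a prefix of it; being a mere subset is not enough.
```scala
def orderingSatisfied(existing: Seq[String], required: Seq[String]): Boolean =
  required.isEmpty || existing.startsWith(required)

// Sorted by [a.asc, b.asc], need only a.asc: no extra sort required.
orderingSatisfied(Seq("a.asc", "b.asc"), Seq("a.asc"))   // true
// Sorted by [a.asc], need [a.asc, b.asc]: Exchange must plan another sort.
orderingSatisfied(Seq("a.asc"), Seq("a.asc", "b.asc"))   // false
```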
13920d5 [SPARK-10515] When killing executors, the pending replacement executors should not be lost If the heartbeat receiver kills executors (and new ones are not registered to replace them), the idle timeout for the old executors will be lost (which then changes the total number of executors requested by the driver), so new ones will not be asked to replace them. For example, executorsPendingToRemove=Set(1), and executor 2 hits its idle timeout before a new executor is asked to replace executor 1. Then the driver kills executor 2 and sends RequestExecutors to the AM. But executorsPendingToRemove=Set(1,2), so the AM doesn't allocate an executor to replace 1. See: https://github.com/apache/spark/pull/8668 Author: KaiXinXiaoLei <huleilei1@huawei.com> Author: huleilei <huleilei1@huawei.com> Closes #8945 from KaiXinXiaoLei/pendingexecutor. 15 October 2015, 21:48:46 UTC
166fdf4 [SPARK-11039][Documentation][Web UI] Document additional ui configurations Add documentation for configuration: - spark.sql.ui.retainedExecutions - spark.streaming.ui.retainedBatches Author: Nick Pritchard <nicholas.pritchard@falkonry.com> Closes #9052 from pnpritchard/SPARK-11039. (cherry picked from commit b591de7c07ba8e71092f71e34001520bec995a8a) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 15 October 2015, 19:45:49 UTC
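For reference, an illustrative way to set the two options documented here (the values are arbitrary):
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.ui.retainedExecutions", "200")
  .set("spark.streaming.ui.retainedBatches", "200")
```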
025e16b [SPARK-11047] Internal accumulators miss the internal flag when replaying events in the history server Internal accumulators don't write the internal flag to event log. So on the history server Web UI, all accumulators are not internal. This causes incorrect peak execution memory and unwanted accumulator table displayed on the stage page. To fix it, I add the "internal" property of AccumulableInfo when writing the event log. Author: Carson Wang <carson.wang@intel.com> Closes #9061 from carsonwang/accumulableBug. (cherry picked from commit d45a0d3ca23df86cf0a95508ccc3b4b98f1b611c) Signed-off-by: Reynold Xin <rxin@databricks.com> 15 October 2015, 17:37:04 UTC
63ca9f9 [SPARK-11066] Update DAGScheduler's "misbehaved ResultHandler" Restrict the job to a single task to ensure that the exception asserted for the job failure is the deliberately thrown DAGSchedulerSuiteDummyException, not an UnsupportedOperationException from a second or subsequent task that can propagate from a race condition during code execution. Author: shellberg <sah@zepler.org> Closes #9076 from shellberg/shellberg-DAGSchedulerSuite-misbehavedResultHandlerTest-patch-1. (cherry picked from commit 523adc24a683930304f408d477607edfe9de7b76) Signed-off-by: Sean Owen <sowen@cloudera.com> 15 October 2015, 17:07:21 UTC
c27e190 [SPARK-8386] [SQL] add write.mode for insertIntoJDBC when the param overwrite is false The fix is for JIRA https://issues.apache.org/jira/browse/SPARK-8386. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #9042 from huaxingao/spark8386. (cherry picked from commit 7e1308d37f6ca35f063e67e4b87a77e932ad89a5) Signed-off-by: Reynold Xin <rxin@databricks.com> 14 October 2015, 19:31:39 UTC
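A spark-shell sketch of the writer-API path this relates to (the JDBC URL and table name are placeholders): choosing the save mode explicitly instead of the boolean overwrite flag of insertIntoJDBC.
```scala
scala> val df = sqlContext.range(0, 10)
scala> val props = new java.util.Properties()
scala> df.write.mode("append").jdbc("jdbc:postgresql://dbhost/mydb", "my_table", props)
```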
30eea40 [SPARK-10577] [PYSPARK] DataFrame hint for broadcast join https://issues.apache.org/jira/browse/SPARK-10577 Author: Jian Feng <jzhang.chs@gmail.com> Closes #8801 from Jianfeng-chs/master. (cherry picked from commit 0180b849dbaf191826231eda7dfaaf146a19602b) Signed-off-by: Reynold Xin <rxin@databricks.com> Conflicts: python/pyspark/sql/tests.py 14 October 2015, 19:23:22 UTC
f366249 [SPARK-10845] [SQL] Makes spark.sql.hive.version a SQLConfEntry When refactoring SQL options from plain strings to the strongly typed `SQLConfEntry`, `spark.sql.hive.version` wasn't migrated, and doesn't show up in the result of `SET -v`, as `SET -v` only shows public `SQLConfEntry` instances. This affects compatibility with Simba ODBC driver. This PR migrates this SQL option as a `SQLConfEntry` to fix this issue. Author: Cheng Lian <lian@databricks.com> Closes #8925 from liancheng/spark-10845/hive-version-conf. (cherry picked from commit 6f94d56a95e8c3a410a8d0c6a24ccca043227ba9) Signed-off-by: Reynold Xin <rxin@databricks.com> 14 October 2015, 19:19:42 UTC
ad5bf0e [SPARK-10619] Can't sort columns on Executor Page This should be picked into Spark 1.5.2 also. https://issues.apache.org/jira/browse/SPARK-10619 It looks like this was broken by commit https://github.com/apache/spark/commit/fb1d06fc242ec00320f1a3049673fbb03c4a6eb9#diff-b8adb646ef90f616c34eb5c98d1ebd16. Some things were changed to use UIUtils.listingTable, but the executor page wasn't converted, so when sortable was removed from UIUtils.TABLE_CLASS_NOT_STRIPED it broke this page. Simply adding the sortable tag back in fixes both the active UI and the history server UI. Author: Tom Graves <tgraves@yahoo-inc.com> Closes #9101 from tgravescs/SPARK-10619. (cherry picked from commit 135a2ce5b0b927b512c832d61c25e7b9d57e30be) Signed-off-by: Reynold Xin <rxin@databricks.com> 14 October 2015, 17:12:43 UTC
3e9d56e [SPARK-10981] [SPARKR] SparkR Join improvements I was having issues with collect() and orderBy() in Spark 1.5.0 so I used the DataFrame.R file and test_sparkSQL.R file from the Spark 1.5.1 download. I only modified the join() function in DataFrame.R to include "full", "fullouter", "left", "right", and "leftsemi" and added corresponding test cases in the test for join() and merge() in test_sparkSQL.R file. Pull request because I filed this JIRA bug report: https://issues.apache.org/jira/browse/SPARK-10981 Author: Monica Liu <liu.monica.f@gmail.com> Closes #9029 from mfliu/master. (cherry picked from commit 8b32885704502ab2a715cf5142d7517181074428) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 14 October 2015, 05:25:03 UTC
94e6d8f [SPARK-10389] [SQL] [1.5] support order by non-attribute grouping expression on Aggregate backport https://github.com/apache/spark/pull/8548 to 1.5 Author: Wenchen Fan <cloud0fan@163.com> Closes #9102 from cloud-fan/branch-1.5. 13 October 2015, 23:16:08 UTC
15d2736 [SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with given regParam, and StreamingLinearRegressionWithSGD intercept param is not in the correct position. regParam was being passed into the StreamingLogisticRegressionWithSGD constructor, but not transferred to the call for model training. The param is added as a named argument to the call. For StreamingLinearRegressionWithSGD the intercept parameter was not in the correct position and was being passed in as the regularization value. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9087 from BryanCutler/StreamingSGD-convergenceTol-bug-10959-branch-1.5. 13 October 2015, 22:59:36 UTC
77eeaad [SPARK-10932] [PROJECT INFRA] Port two minor changes to release-build.sh from scripts' old repo Spark's release packaging scripts used to live in a separate repository. Although these scripts are now part of the Spark repo, there are some minor patches made against the old repos that are missing in Spark's copy of the script. This PR ports those changes. /cc shivaram, who originally submitted these changes against https://github.com/rxin/spark-utils Author: Josh Rosen <joshrosen@databricks.com> Closes #8986 from JoshRosen/port-release-build-fixes-from-rxin-repo. 13 October 2015, 22:19:29 UTC
edc5095 [SPARK-11009] [SQL] fix wrong result of Window function in cluster mode Currently, all window functions can sometimes generate wrong results in cluster mode. The root cause is that AttributeReference is created in the executor, so its id may not be unique with respect to others created in the driver. Here is the script that could reproduce the problem (run in local cluster): ``` from pyspark import SparkContext, HiveContext from pyspark.sql.window import Window from pyspark.sql.functions import rowNumber sqlContext = HiveContext(SparkContext()) sqlContext.setConf("spark.sql.shuffle.partitions", "3") df = sqlContext.range(1<<20) df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias('B')) ws = Window.partitionBy(df2.A).orderBy(df2.B) df3 = df2.select("client", "date", rowNumber().over(ws).alias("rn")).filter("rn < 0") assert df3.count() == 0 ``` Author: Davies Liu <davies@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #9050 from davies/wrong_window. Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala 13 October 2015, 16:47:28 UTC
47bc6c0 [SPARK-11026] [YARN] spark.yarn.user.classpath.first doesn't work for 'spark-submit --jars hdfs://user/foo.jar' When spark.yarn.user.classpath.first=true and using 'spark-submit --jars hdfs://user/foo.jar', foo.jar is not put on the system classpath, so we need to put yarn's linkNames of the jars on the system classpath. vanzin tgravescs Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #9045 from lianhuiwang/spark-11026. (cherry picked from commit 626aab79c9b4d4ac9d65bf5fa45b81dd9cbc609c) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com> 13 October 2015, 13:31:00 UTC
2217f4f Revert "[SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters" This reverts commit f95129c17523ea60220a37576b8a9390943cf98e. 12 October 2015, 21:53:13 UTC
ebbff39 [SPARK-11056] Improve documentation of SBT build. This commit improves the documentation around building Spark to (1) recommend using SBT interactive mode to avoid the overhead of launching SBT and (2) refer to the wiki page that documents using SPARK_PREPEND_CLASSES to avoid creating the assembly jar for each compile. cc srowen Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9068 from kayousterhout/SPARK-11056. (cherry picked from commit 091c2c3ecd69803d78c2b15a1487046701059d38) Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com> 12 October 2015, 21:23:56 UTC
be68a4b [SPARK-10973] [ML] [PYTHON] Fix IndexError exception on SparseVector when asked for index after the last non-zero entry See https://github.com/apache/spark/pull/9009 for details. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9064 from zero323/SPARK-10973_1.5. 12 October 2015, 19:09:06 UTC
cd55fbc [SPARK-11023] [YARN] Avoid creating URIs from local paths directly. The issue is that local paths on Windows, when provided with drive letters or backslashes, are not valid URIs. Instead of trying to figure out whether paths are URIs or not, use Utils.resolveURI() which does that for us. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9049 from vanzin/SPARK-11023 and squashes the following commits: 77021f2 [Marcelo Vanzin] [SPARK-11023] [yarn] Avoid creating URIs from local paths directly. (cherry picked from commit 149472a01d12828c64b0a852982d48c123984182) 12 October 2015, 17:42:06 UTC
6dc23e6 [SPARK-10960] [SQL] SQL with windowing function should be able to refer to a column in an inner select JIRA: https://issues.apache.org/jira/browse/SPARK-10960 When accessing a column of an inner select from a select with a window function, an `AnalysisException` will be thrown. For example, a query like this: select area, rank() over (partition by area order by tmp.month) + tmp.tmp1 as c1 from (select month, area, product, 1 as tmp1 from windowData) tmp Currently, the rule `ExtractWindowExpressions` in `Analyzer` only extracts regular expressions from `WindowFunction`, `WindowSpecDefinition` and `AggregateExpression`. We need to also extract other attributes, such as the one in `Alias`, as shown in the above query. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9011 from viirya/fix-window-inner-column. (cherry picked from commit fcb37a04177edc2376e39dd0b910f0268f7c72ec) Signed-off-by: Yin Huai <yhuai@databricks.com> 12 October 2015, 16:16:32 UTC
156ac27 [SPARK-10858] YARN: archives/jar/files rename with # doesn't work unl https://issues.apache.org/jira/browse/SPARK-10858 The issue here is that in resolveURI we default to calling new File(path).getAbsoluteFile().toURI(). But if the path passed in already has a # in it, then File(path) will think that it is supposed to be part of the actual file path and not a fragment, so it changes # to %23. Then when we try to parse that later in Client as a URI, it doesn't recognize that there is a fragment. So to fix it, we just check if there is a fragment, still create the File like we did before, and then add the fragment back on. Author: Tom Graves <tgraves@yahoo-inc.com> Closes #9035 from tgravescs/SPARK-10858. (cherry picked from commit 63c340a710b24869410d56602b712fbfe443e6f0) 09 October 2015, 21:08:21 UTC
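A small self-contained Scala illustration of the underlying problem (the path is made up): converting a local path that already carries a # fragment through java.io.File escapes the # to %23, so the fragment is no longer visible when the string is parsed as a URI later.
```scala
import java.io.File
import java.net.URI

val path = "/tmp/foo.jar#renamed.jar"
val viaFile = new File(path).getAbsoluteFile.toURI   // file:/tmp/foo.jar%23renamed.jar
println(new URI(viaFile.toString).getFragment)       // null -- the fragment was lost
```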
f95129c [SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters These params were being passed into the StreamingLogisticRegressionWithSGD constructor, but not transferred to the call for model training. Same with StreamingLinearRegressionWithSGD. I added the params as named arguments to the call and also fixed the intercept parameter, which was being passed as regularization value. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9002 from BryanCutler/StreamingSGD-convergenceTol-bug-10959. (cherry picked from commit 5410747a84e9be1cea44159dfc2216d5e0728ab4) Signed-off-by: Xiangrui Meng <meng@databricks.com> 09 October 2015, 05:23:16 UTC
3df7500 [SPARK-10955] [STREAMING] Add a warning if dynamic allocation is enabled for Streaming applications Dynamic allocation can be painful for streaming apps and can lose data. Log a warning for streaming applications if dynamic allocation is enabled. Author: Hari Shreedharan <hshreedharan@apache.org> Closes #8998 from harishreedharan/ss-log-error and squashes the following commits: 462b264 [Hari Shreedharan] Improve log message. 2733d94 [Hari Shreedharan] Minor change to warning message. eaa48cc [Hari Shreedharan] Log a warning instead of failing the application if dynamic allocation is enabled. 725f090 [Hari Shreedharan] Add config parameter to allow dynamic allocation if the user explicitly sets it. b3f9a95 [Hari Shreedharan] Disable dynamic allocation and kill app if it is enabled. a4a5212 [Hari Shreedharan] [streaming] SPARK-10955. Disable dynamic allocation for Streaming applications. (cherry picked from commit 09841290055770a619a2e72fbaef1a5e694916ae) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 09 October 2015, 01:53:52 UTC
ba601b1 [SPARK-10914] UnsafeRow serialization breaks when two machines have different Oops size. UnsafeRow contains 3 pieces of information when pointing to some data in memory (an object, a base offset, and length). When the row is serialized with Java/Kryo serialization, the object layout in memory can change if two machines have different pointer width (Oops in JVM). To reproduce, launch Spark using MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" And then run the following scala> sql("select 1 xx").collect() Author: Reynold Xin <rxin@databricks.com> Closes #9030 from rxin/SPARK-10914. (cherry picked from commit 84ea287178247c163226e835490c9c70b17d8d3b) Signed-off-by: Reynold Xin <rxin@databricks.com> 09 October 2015, 00:25:23 UTC
57978ae [SPARK-10980] [SQL] fix bug in create Decimal The created decimal is wrong when using `Decimal(unscaled, precision, scale)` with unscaled > 1e18, precision > 18 and scale > 0. This bug has existed since the beginning. Author: Davies Liu <davies@databricks.com> Closes #9014 from davies/fix_decimal. (cherry picked from commit 37526aca2430e36a931fbe6e01a152e701a1b94e) Signed-off-by: Davies Liu <davies.liu@gmail.com> 07 October 2015, 22:51:22 UTC
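A spark-shell sketch of the affected call (the constructor is the real `org.apache.spark.sql.types.Decimal` API; the particular numbers are just one example in the affected range, unscaled > 1e18 with precision > 18 and scale > 0):
```scala
scala> import org.apache.spark.sql.types.Decimal
scala> Decimal(2000000000000000002L, 20, 2)   // previously produced a wrong value in this range
```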
b6a0933 [SPARK-10952] Only add hive to classpath if HIVE_HOME is set. Currently, if it isn't set, it scans `/lib/*` and adds every dir to the classpath, which makes the env too large, and every command called afterwards fails. Author: Kevin Cox <kevincox@kevincox.ca> Closes #8994 from kevincox/kevincox-only-add-hive-to-classpath-if-var-is-set. 07 October 2015, 16:55:11 UTC
84f510c [SPARK-10885] [STREAMING] Display the failed output op in Streaming UI This PR implements the following features for both `master` and `branch-1.5`. 1. Display the failed output op count in the batch list 2. Display the failure reason of output op in the batch detail page Screenshots: <img width="1356" alt="1" src="https://cloud.githubusercontent.com/assets/1000778/10198387/5b2b97ec-67ce-11e5-81c2-f818b9d2f3ad.png"> <img width="1356" alt="2" src="https://cloud.githubusercontent.com/assets/1000778/10198388/5b76ac14-67ce-11e5-8c8b-de2683c5b485.png"> There are still two remaining problems in the UI. 1. If an output operation doesn't run any Spark job, we cannot get its duration, since currently it is the sum of all jobs' durations. 2. If an output operation doesn't run any Spark job, we cannot get the description, since it is taken from the latest job's call site. We need to add a new `StreamingListenerEvent` about output operations to fix them. So I'd like to fix them only for `master` in another PR. Author: zsxwing <zsxwing@gmail.com> Closes #8950 from zsxwing/batch-failure. (cherry picked from commit ffe6831e49e28eb855f857fdfa5dd99341e80c9d) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 06 October 2015, 23:51:31 UTC
6847be6 [SPARK-10901] [YARN] spark.yarn.user.classpath.first doesn't work This should go into 1.5.2 also. The issue is we were no longer adding the __app__.jar to the system classpath. Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com> Author: Tom Graves <tgraves@yahoo-inc.com> Closes #8959 from tgravescs/SPARK-10901. (cherry picked from commit e9783601599758df87418bf61a7b4636f06714fa) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 06 October 2015, 17:19:05 UTC
c8392cd [SPARK-10934] [SQL] handle hashCode of unsafe array correctly `Murmur3_x86_32.hashUnsafeWords` only accepts word-aligned bytes, but an unsafe array is not word-aligned. Author: Wenchen Fan <cloud0fan@163.com> Closes #8987 from cloud-fan/hash. 06 October 2015, 00:33:34 UTC
d323e5e [SPARK-10889] [STREAMING] Bump KCL to add MillisBehindLatest metric I don't believe the API changed at all. Author: Avrohom Katz <iambpentameter@gmail.com> Closes #8957 from akatz/kcl-upgrade. (cherry picked from commit 883bd8fccf83aae7a2a847c9a6ca129fac86e6a3) Signed-off-by: Sean Owen <sowen@cloudera.com> 04 October 2015, 08:36:18 UTC
8836ac3 [SPARK-10904] [SPARKR] Fix to support `select(df, c("col1", "col2"))` The fix is to coerce `c("a", "b")` into a list such that it could be serialized to call JVM with. Author: felixcheung <felixcheung_m@hotmail.com> Closes #8961 from felixcheung/rselect. (cherry picked from commit 721e8b5f35b230ff426c1757a9bdc1399fb19afa) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 04 October 2015, 05:48:43 UTC
cbc6aec [SPARK-10058] [CORE] [TESTS] Fix the flaky tests in HeartbeatReceiverSuite Fixed the test failure here: https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/ This failure is because `HeartbeatReceiverSuite.heartbeatReceiver` may receive `SparkListenerExecutorAdded("driver")` sent from [LocalBackend](https://github.com/apache/spark/blob/8fb3a65cbb714120d612e58ef9d12b0521a83260/core/src/main/scala/org/apache/spark/scheduler/local/LocalBackend.scala#L121). There are other race conditions in `HeartbeatReceiverSuite` because `HeartbeatReceiver.onExecutorAdded` and `HeartbeatReceiver.onExecutorRemoved` are asynchronous. This PR also fixed them. Author: zsxwing <zsxwing@gmail.com> Closes #8946 from zsxwing/SPARK-10058. (cherry picked from commit 9b3e7768a27d51ddd4711c4a68a428a6875bd6d7) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 October 2015, 14:10:12 UTC
3b23873 [SPARK-10871] include number of executor failures in error msg Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #8939 from ryan-williams/errmsg. (cherry picked from commit b7ad54ec793af1c84973b402f5cceb88307f7996) Signed-off-by: Andrew Or <andrew@databricks.com> 29 September 2015, 20:19:52 UTC
d544932 [SPARK-10825] [CORE] [TESTS] Fix race conditions in StandaloneDynamicAllocationSuite Fix the following issues in StandaloneDynamicAllocationSuite: 1. It should not assume master and workers start in order 2. It should not assume master and workers get ready at once 3. It should not assume the application is already registered with master after creating SparkContext 4. It should not access Master.app and idToApp which are not thread safe The changes includes: * Use `eventually` to wait until master and workers are ready to fix 1 and 2 * Use `eventually` to wait until the application is registered with master to fix 3 * Use `askWithRetry[MasterStateResponse](RequestMasterState)` to get the application info to fix 4 Author: zsxwing <zsxwing@gmail.com> Closes #8914 from zsxwing/fix-StandaloneDynamicAllocationSuite. (cherry picked from commit dba95ea03216e6b8e623db4a36e1018c6ed95538) Signed-off-by: Andrew Or <andrew@databricks.com> 29 September 2015, 18:53:38 UTC
9b3014b [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps on permissively-licensed dependencies has a different interpretation than the one we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment, Spark's LICENSE has a pointer to the project's license in the other project's source tree. The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or include their text in a "licenses" subdirectory and point to that. Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way. The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing. Author: Sean Owen <sowen@cloudera.com> Closes #8919 from srowen/SPARK-10833. (cherry picked from commit bf4199e261c3c8dd2970e2a154c97b46fb339f02) Signed-off-by: Sean Owen <sowen@cloudera.com> 29 September 2015, 02:56:59 UTC
a367840 [SPARK-10859] [SQL] fix stats of StringType in columnar cache The UTF8String may come from an UnsafeRow, in which case its underlying buffer is not copied, so we should clone it in order to hold it in Stats. cc yhuai Author: Davies Liu <davies@databricks.com> Closes #8929 from davies/pushdown_string. (cherry picked from commit ea02e5513a8f9853094d5612c962fc8c1a340f50) Signed-off-by: Yin Huai <yhuai@databricks.com> 28 September 2015, 21:40:52 UTC
de25931 [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes This bug is introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092), `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead will meet the problem as mentioned in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790). Also consolidate and simplify some similar code snippets to keep the consistent semantics. Author: jerryshao <sshao@hortonworks.com> Closes #8910 from jerryshao/SPARK-10790. (cherry picked from commit 353c30bd7dfbd3b76fc8bc9a6dfab9321439a34b) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 28 September 2015, 13:39:13 UTC
e0c3212 [SPARK-10741] [SQL] Hive Query Having/OrderBy against Parquet table is not working https://issues.apache.org/jira/browse/SPARK-10741 I choose the second approach: do not change output exprIds when convert MetastoreRelation to LogicalRelation Author: Wenchen Fan <cloud0fan@163.com> Closes #8889 from cloud-fan/hot-bug. (cherry picked from commit 418e5e4cbdaab87addb91ac0bb2245ff0213ac81) Signed-off-by: Yin Huai <yhuai@databricks.com> 27 September 2015, 16:08:51 UTC
3fb011a Preparing development version 1.5.2-SNAPSHOT 24 September 2015, 05:49:40 UTC
4df9793 Preparing Spark release v1.5.1-rc1 24 September 2015, 05:49:35 UTC
179f36e Preparing development version 1.5.2-SNAPSHOT 24 September 2015, 04:32:16 UTC
4f894dd Preparing Spark release v1.5.1-rc1 24 September 2015, 04:32:10 UTC
cdc4ac0 Preparing development version 1.5.2-SNAPSHOT 24 September 2015, 02:55:27 UTC
20db818 Preparing Spark release v1.5.1-rc1 24 September 2015, 02:55:19 UTC
c8a3d66 Update release notes. 24 September 2015, 02:53:56 UTC
4c48593 [SPARK-10692] [STREAMING] Expose failureReasons in BatchInfo for streaming UI to clear failed batches Slightly modified version of #8818, all credit goes to zsxwing Author: zsxwing <zsxwing@gmail.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8892 from tdas/SPARK-10692. (cherry picked from commit 758c9d25e92417f8c06328c3af7ea2ef0212c79f) Signed-off-by: Reynold Xin <rxin@databricks.com> 24 September 2015, 02:52:10 UTC
1000b5d Update branch-1.5 for 1.5.1 release. Author: Reynold Xin <rxin@databricks.com> Closes #8890 from rxin/release-1.5.1. 24 September 2015, 02:46:13 UTC
1f47e68 [SPARK-10474] [SQL] Aggregation fails to allocate memory for pointer array (round 2) This patch reverts most of the changes in a previous fix #8827. The real cause of the issue is that in `TungstenAggregate`'s prepare method we only reserve 1 page, but later when we switch to sort-based aggregation we try to acquire 1 page AND a pointer array. The longer-term fix should be to reserve also the pointer array, but for now ***we will simply not track the pointer array***. (Note that elsewhere we already don't track the pointer array, e.g. [here](https://github.com/apache/spark/blob/a18208047f06a4244703c17023bb20cbe1f59d73/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java#L88)) Note: This patch reuses the unit test added in #8827 so it doesn't show up in the diff. Author: Andrew Or <andrew@databricks.com> Closes #8888 from andrewor14/dont-track-pointer-array. (cherry picked from commit 83f6f54d12a418f5158ee7ee985b54eef8cc1cf0) Signed-off-by: Andrew Or <andrew@databricks.com> 24 September 2015, 02:34:47 UTC
7564c24 [SPARK-10731] [SQL] Delegate to Scala's DataFrame.take implementation in Python DataFrame. Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take). This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion. Author: Reynold Xin <rxin@databricks.com> Closes #8876 from rxin/SPARK-10731. (cherry picked from commit 9952217749118ae78fe794ca11e1c4a87a4ae8ba) Signed-off-by: Reynold Xin <rxin@databricks.com> 23 September 2015, 23:43:34 UTC
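A hedged Scala-side sketch of the idea (the actual change is on the Python side, which now calls into the JVM `DataFrame`): fetching the first n rows should execute a plan that carries the limit, rather than shipping rows from every partition back to the driver.

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Sketch only: roughly the JVM-side work PySpark's take(n) now delegates to.
object TakeSketch {
  def take(df: DataFrame, n: Int): Array[Row] =
    df.limit(n).collect()  // the limit is part of the executed plan, so no full scan
}
```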
64cc62c [SPARK-10403] Allow UnsafeRowSerializer to work with tungsten-sort ShuffleManager This patch attempts to fix an issue where Spark SQL's UnsafeRowSerializer was incompatible with the `tungsten-sort` ShuffleManager. Author: Josh Rosen <joshrosen@databricks.com> Closes #8873 from JoshRosen/SPARK-10403. (cherry picked from commit a18208047f06a4244703c17023bb20cbe1f59d73) Signed-off-by: Michael Armbrust <michael@databricks.com> 23 September 2015, 18:31:14 UTC
6c6cadb [SPARK-9710] [TEST] Fix RPackageUtilsSuite when R is not available. RUtils.isRInstalled throws an exception if R is not installed, instead of returning false. Fix that. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8008 from vanzin/SPARK-9710 and squashes the following commits: df72d8c [Marcelo Vanzin] [SPARK-9710] [test] Fix RPackageUtilsSuite when R is not available. (cherry picked from commit 0f3366a4c740147a7a7519922642912e2dd238f8) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 23 September 2015, 14:38:31 UTC
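A minimal sketch of the intended behavior, not the actual `RUtils` code: probing for the R binary should report false when R is missing instead of letting the underlying exception escape.

```scala
import scala.sys.process._
import scala.util.Try

object RCheckSketch {
  // Returns false both when R exits non-zero and when the binary is absent
  // (which would otherwise surface as an IOException from the process launch).
  def isRInstalled: Boolean =
    Try(Seq("R", "--version").! == 0).getOrElse(false)
}
```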
4174b94 [SPARK-10769] [STREAMING] [TESTS] Fix o.a.s.streaming.CheckpointSuite.maintains rate controller Fixed the following failure in https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1787/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_maintains_rate_controller/
```
sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 660 times over 10.000044392000001 seconds. Last failure message: 9223372036854775807 did not equal 200.
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply$mcV$sp(CheckpointSuite.scala:413)
at org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
at org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
```
In this test, `advanceTimeWithRealDelay(ssc, 2)` is called to run two batch jobs. However, there is a race condition: these two jobs can finish before the receiver is registered. Then `UpdateRateLimit` won't be sent to the receiver and `getDefaultBlockGeneratorRateLimit` cannot be updated. Here are the logs related to this issue:
```
15/09/22 19:28:26.154 pool-1-thread-1-ScalaTest-running-CheckpointSuite INFO CheckpointSuite: Manual clock before advancing = 2500
15/09/22 19:28:26.869 JobScheduler INFO JobScheduler: Finished job streaming job 3000 ms.0 from job set of time 3000 ms
15/09/22 19:28:26.869 JobScheduler INFO JobScheduler: Total delay: 1442975303.869 s for time 3000 ms (execution: 0.711 s)
15/09/22 19:28:26.873 JobScheduler INFO JobScheduler: Finished job streaming job 3500 ms.0 from job set of time 3500 ms
15/09/22 19:28:26.873 JobScheduler INFO JobScheduler: Total delay: 1442975303.373 s for time 3500 ms (execution: 0.004 s)
15/09/22 19:28:26.879 sparkDriver-akka.actor.default-dispatcher-3 INFO ReceiverTracker: Registered receiver for stream 0 from localhost:57749
15/09/22 19:28:27.154 pool-1-thread-1-ScalaTest-running-CheckpointSuite INFO CheckpointSuite: Manual clock after advancing = 3500
```
`advanceTimeWithRealDelay(ssc, 2)` triggered the 3000 ms and 3500 ms jobs, but the receiver was registered only after both jobs had finished. So we should make sure the receiver is online before running `advanceTimeWithRealDelay(ssc, 2)`. Author: zsxwing <zsxwing@gmail.com> Closes #8877 from zsxwing/SPARK-10769. (cherry picked from commit 50e4634236668a0195390f0080d0ac230d428d05) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 23 September 2015, 08:30:21 UTC
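The fix on the test side has roughly this shape, sketched here with a hypothetical `isRegistered` predicate standing in for however the suite observes receiver registration: block until the receiver has registered before advancing the manual clock, so that `UpdateRateLimit` can actually reach it.

```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

object WaitForReceiverSketch {
  def waitUntilReceiverRegistered(isRegistered: () => Boolean): Unit = {
    // Poll until registration is observed (or fail after 10 seconds), in the same
    // eventually-based style the failing assertion already uses.
    eventually(timeout(10.seconds), interval(10.milliseconds)) {
      assert(isRegistered(), "receiver has not registered yet")
    }
  }
}
```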
6a616d0 [SPARK-10224] [STREAMING] Fix the issue that blockIntervalTimer won't call updateCurrentBuffer when stopping `blockIntervalTimer.stop(interruptTimer = false)` doesn't guarantee calling `updateCurrentBuffer`. So it's possible that `blockIntervalTimer` will exit while `currentBuffer` is not empty. Then the data in `currentBuffer` will be lost. To reproduce it, you can add `Thread.sleep(200)` at this line (https://github.com/apache/spark/blob/69c9c177160e32a2fbc9b36ecc52156077fca6fc/streaming/src/main/scala/org/apache/spark/streaming/util/RecurringTimer.scala#L100) and run `StreamingContextSuite`. I cannot write a unit test to reproduce it because I cannot find an approach to force `RecurringTimer` to suspend at this line for a few milliseconds. There was a failure in Jenkins here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41455/console This PR updates RecurringTimer to make sure `stop(interruptTimer = false)` will call `callback` at least once after the `stop` method is called. Author: zsxwing <zsxwing@gmail.com> Closes #8417 from zsxwing/SPARK-10224. (cherry picked from commit 44c28abf120754c0175c65ffd3d4587a350b3798) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 23 September 2015, 08:28:16 UTC
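A simplified sketch of the guarantee the patch adds; this is not the real `RecurringTimer`, just its shape: once a non-interrupting stop is requested, the loop still runs the callback one final time before the thread exits, so data buffered between the last tick and the stop call is flushed rather than dropped.

```scala
// Sketch only, assuming a callback that flushes whatever has accumulated so far.
class RecurringTimerSketch(periodMs: Long, callback: Long => Unit) {
  @volatile private var stopped = false

  private val thread = new Thread("recurring-timer-sketch") {
    override def run(): Unit = {
      var time = System.currentTimeMillis()
      while (!stopped) {
        callback(time)
        time += periodMs
        val sleepMs = time - System.currentTimeMillis()
        if (sleepMs > 0) Thread.sleep(sleepMs)
      }
      // The point of the fix: one last invocation after stop() has been observed,
      // so a non-empty buffer is not silently lost.
      callback(time)
    }
  }

  def start(): Unit = thread.start()

  def stop(): Unit = {  // non-interrupting stop, analogous to stop(interruptTimer = false)
    stopped = true
    thread.join()
  }
}
```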
8a23ef5 [SPARK-10652] [SPARK-10742] [STREAMING] Set meaningful job descriptions for all streaming jobs Here are the screenshots after adding the job descriptions to threads that run receivers and the scheduler thread running the batch jobs.
## All jobs page
* Added job descriptions with links to relevant batch details page ![image](https://cloud.githubusercontent.com/assets/663212/9924165/cda4a372-5cb1-11e5-91ca-d43a32c699e9.png)
## All stages page
* Added stage descriptions with links to relevant batch details page ![image](https://cloud.githubusercontent.com/assets/663212/9923814/2cce266a-5cae-11e5-8a3f-dad84d06c50e.png)
## Streaming batch details page
* Added the +details link ![image](https://cloud.githubusercontent.com/assets/663212/9921977/24014a32-5c98-11e5-958e-457b6c38065b.png)

Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8791 from tdas/SPARK-10652. (cherry picked from commit 5548a254755bb84edae2768b94ab1816e1b49b91) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 23 September 2015, 05:45:05 UTC
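A hedged usage sketch, separate from the streaming scheduler itself: `SparkContext.setJobDescription` tags every job submitted from the calling thread, which is the mechanism behind these descriptions; the streaming pages can then link the description back to the batch. The app name, master, and description text below are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobDescriptionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))
    sc.setJobDescription("Streaming job from batch at 1443000000000 ms")  // illustrative text
    sc.parallelize(1 to 100).count()  // this job shows up in the UI with the description above
    sc.stop()
  }
}
```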
7f07cc6 [SPARK-10663] Removed unnecessary invocation of DataFrame.toDF method. The Scala example under the "Example: Pipeline" heading in this document initializes the "test" variable to a DataFrame. Because test is already a DF, there is no need to call test.toDF as the example does in a subsequent line: model.transform(test.toDF). So, I removed the extraneous toDF invocation. Author: Matt Hagen <anonz3000@gmail.com> Closes #8875 from hagenhaus/SPARK-10663. (cherry picked from commit 558e9c7e60a7c0d85ba26634e97562ad2163e91d) Signed-off-by: Xiangrui Meng <meng@databricks.com> 23 September 2015, 04:14:34 UTC
73d0621 [SPARK-10310] [SQL] Fixes script transformation field/line delimiters **Please attribute this PR to `Zhichao Li <zhichao.li@intel.com>`.** This PR is based on PR #8476 authored by zhichao-li. It fixes SPARK-10310 by adding the field delimiter SerDe property to the default `LazySimpleSerDe`, and by enabling the default record reader/writer classes. Currently, we only support `LazySimpleSerDe`, used together with `TextRecordReader` and `TextRecordWriter`, and don't support customizing record readers/writers using `RECORDREADER`/`RECORDWRITER` clauses. This should be addressed in separate PR(s). Author: Cheng Lian <lian@databricks.com> Closes #8860 from liancheng/spark-10310/fix-script-trans-delimiters. (cherry picked from commit 84f81e035e1dab1b42c36563041df6ba16e7b287) Signed-off-by: Yin Huai <yhuai@databricks.com> 23 September 2015, 03:09:46 UTC
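A hedged usage sketch of the kind of query this patch fixes; the table name `src` is illustrative and a Hive-enabled build is assumed: a script transformation that pipes columns through an external command using the default tab-delimited `LazySimpleSerDe` round trip.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ScriptTransformSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))
    val hiveContext = new HiveContext(sc)
    // Columns are serialized with the default LazySimpleSerDe, piped through 'cat',
    // and deserialized back -- the round trip whose delimiters this patch fixes.
    hiveContext.sql(
      """SELECT TRANSFORM (key, value)
        |USING 'cat'
        |AS (k, v)
        |FROM src""".stripMargin).show()
    sc.stop()
  }
}
```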
26187ab [SPARK-10640] History server fails to parse TaskCommitDenied ... simply because the code is missing! Author: Andrew Or <andrew@databricks.com> Closes #8828 from andrewor14/task-end-reason-json. Conflicts: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala 23 September 2015, 00:35:05 UTC
118ebd4 Revert "[SPARK-10640] History server fails to parse TaskCommitDenied" This reverts commit 5ffd0841e016301807b0a008af7c3346e9f59e7a. 23 September 2015, 00:10:58 UTC
5ffd084 [SPARK-10640] History server fails to parse TaskCommitDenied ... simply because the code is missing! Author: Andrew Or <andrew@databricks.com> Closes #8828 from andrewor14/task-end-reason-json. Conflicts: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala 22 September 2015, 23:52:47 UTC
3339916 [SPARK-10714] [SPARK-8632] [SPARK-10685] [SQL] Refactor Python UDF handling This patch refactors Python UDF handling:
1. Extract the per-partition Python UDF calling logic from PythonRDD into a PythonRunner. PythonRunner itself expects iterators as input/output, and thus has no dependency on RDD. This way, we can use PythonRunner directly in a mapPartitions call, or in the future in an environment without RDDs (a rough sketch of this shape follows below).
2. Use PythonRunner in Spark SQL's BatchPythonEvaluation.
3. Update BatchPythonEvaluation to only use its input once, rather than twice. This should fix the Python UDF performance regression in Spark 1.5.

There are a number of small cleanups I wanted to do when I looked at the code, but I kept most of those out so the diff looks small. This basically implements the approach in https://github.com/apache/spark/pull/8833, but with some code moving around so the correctness doesn't depend on the inner workings of Spark serialization and task execution. Author: Reynold Xin <rxin@databricks.com> Closes #8835 from rxin/python-iter-refactor. (cherry picked from commit a96ba40f7ee1352288ea676d8844e1c8174202eb) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 22 September 2015, 21:22:40 UTC
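A hedged sketch of the iterator-based shape described in point 1 above, with illustrative names rather than the real PythonRunner API:

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Per-partition logic expressed purely in terms of iterators, with no RDD dependency,
// so it can be reused from mapPartitions or any other execution path.
trait IteratorRunner[IN, OUT] extends Serializable {
  def compute(input: Iterator[IN], partitionIndex: Int, context: TaskContext): Iterator[OUT]
}

// The RDD then becomes a thin adapter around the runner.
class RunnerRDD[IN: ClassTag, OUT: ClassTag](
    parent: RDD[IN],
    runner: IteratorRunner[IN, OUT]) extends RDD[OUT](parent) {

  override def getPartitions: Array[Partition] = parent.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[OUT] =
    runner.compute(parent.iterator(split, context), split.index, context)
}
```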