https://github.com/apache/spark

11ee9d1 [SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec jira: https://issues.apache.org/jira/browse/SPARK-11813 I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits. 1. Performance improvement from less serialization. 2. It increases the capacity of Word2Vec a lot. Currently in the fit of word2vec, the closure mainly includes serialization of Word2Vec and 2 global tables. The main part of Word2Vec is the vocab, of size vocab * 40 * 2 * 4 = 320 * vocab bytes; the 2 global tables take vocab * vectorSize * 8 bytes. If vectorSize = 20, that's 160 * vocab bytes. Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab helps decrease the size of the closure serialization, especially when vectorSize is small, thus allowing a larger vocabulary. Actually there's another possible fix: make local copies of fields to avoid including Word2Vec in the closure. Let me know if that's preferred. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9803 from hhbyyh/w2vVocab. (cherry picked from commit e391abdf2cb6098a35347bd123b815ee9ac5b689) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 November 2015, 21:26:39 UTC
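The alternative fix mentioned above — copying fields into local variables so the enclosing object stays out of the task closure — is a common Spark pattern. A minimal sketch with hypothetical names (`Trainer`, `fit`), not the actual MLlib code:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical class illustrating the "local copy" pattern.
class Trainer extends Serializable {
  val vectorSize: Int = 20                          // small field
  val vocab: Array[Long] = new Array[Long](1 << 20) // large field we do NOT want shipped

  def fit(corpus: RDD[String]): Long = {
    // Referring to `vectorSize` directly inside the closure would capture `this`,
    // dragging `vocab` into the serialized task closure as well.
    val localVectorSize = vectorSize // local copy: only this Int is captured
    corpus.map(_.length.toLong * localVectorSize).reduce(_ + _)
  }
}
```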
19835ec [MINOR] [MLLIB] [ML] [DOC] fixed typo: label for negative result should be 0.0 (original: 1.0) Small typo in the example for `LabeledPoint` in the MLlib docs. Author: Sean Paradiso <seanparadiso@gmail.com> Closes #8680 from sparadiso/docs_mllib_smalltypo. (cherry picked from commit 1dc7548c598c4eb4ecc7d5bb8962a735bbd2c0f7) Signed-off-by: Xiangrui Meng <meng@databricks.com> 10 September 2015, 05:10:33 UTC
119d6cf [SPARK-9633] [BUILD] SBT download locations outdated; need an update Remove 2 defunct SBT download URLs and replace with the 1 known download URL. Also, use https. Follow up on https://github.com/apache/spark/pull/7792 Author: Sean Owen <sowen@cloudera.com> Closes #7956 from srowen/SPARK-9633 and squashes the following commits: caa40bd [Sean Owen] Remove 2 defunct SBT download URLs and replace with the 1 known download URL. Also, use https. Conflicts: sbt/sbt-launch-lib.bash 10 August 2015, 20:14:46 UTC
326988e [SPARK-9198] [MLLIB] [PYTHON] Fixed typo in pyspark sparsevector doc tests Several places in the PySpark SparseVector docs have one defined as: ``` SparseVector(4, [2, 4], [1.0, 2.0]) ``` The index 4 goes out of bounds (but this is not checked). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #7541 from jkbradley/sparsevec-doc-typo-fix and squashes the following commits: c806a65 [Joseph K. Bradley] fixed doc test e2dcb23 [Joseph K. Bradley] Fixed typo in pyspark sparsevector doc tests (cherry picked from commit a5d05819afcc9b19aeae4817d842205f32b34335) Signed-off-by: Xiangrui Meng <meng@databricks.com> 20 July 2015, 23:54:03 UTC
e0c2ccf [SPARK-8563] [MLLIB] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k I'm sorry that I closed https://github.com/apache/spark/pull/6949 by mistake. I pushed the code again and added a test. > There is a bug where `U.numCols() = self.nCols` in `IndexedRowMatrix.computeSVD()`; it should have been `U.numCols() = k = svd.U.numCols()` > ``` self = U * sigma * V.transpose (m x n) = (m x n) * (k x k) * (k x n) //ASIS --> (m x n) = (m x k) * (k x k) * (k x n) //TOBE ``` Author: lee19 <lee19@live.co.kr> Closes #6953 from lee19/MLlibBugfix and squashes the following commits: c1812a0 [lee19] [SPARK-8563] [MLlib] Used nRows instead of numRows() to reduce a burden. 4b9803b [lee19] [SPARK-8563] [MLlib] Fixed a build error. c2ccd89 [lee19] Added a unit test that validates matrix sizes of svd for [SPARK-8563][MLlib] 8373424 [lee19] [SPARK-8563][MLlib] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k (cherry picked from commit e72526227fdcf93b7a33375ef954746ac08753f5) Signed-off-by: Xiangrui Meng <meng@databricks.com> 30 June 2015, 21:12:12 UTC
faa35ca [SPARK-8525] [MLLIB] fix LabeledPoint parser when there is a whitespace between label and features vector fix LabeledPoint parser when there is a whitespace between label and features vector, e.g. (y, [x1, x2, x3]) Author: Oleksiy Dyagilev <oleksiy_dyagilev@epam.com> Closes #6954 from fe2s/SPARK-8525 and squashes the following commits: 0755b9d [Oleksiy Dyagilev] [SPARK-8525][MLLIB] addressing comment, removing dep on commons-lang c1abc2b [Oleksiy Dyagilev] [SPARK-8525][MLLIB] fix LabeledPoint parser when there is a whitespace on specific position (cherry picked from commit a8031183aff2e23de9204ddfc7e7f5edbf052a7e) Signed-off-by: Xiangrui Meng <meng@databricks.com> 23 June 2015, 20:18:05 UTC
36eed2f [SPARK-8032] [PYSPARK] Make version checking for NumPy in MLlib more robust The current check compares version strings, so `1.x` is treated as less than `1.4` lexicographically; this fails whenever x has more than one digit, since e.g. `"1.10" < "1.4"` holds as strings even though 10 > 4. It fails on my system since I have version `1.10` :P Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6579 from MechCoder/np_ver and squashes the following commits: 15430f8 [MechCoder] fix syntax error 893fb7e [MechCoder] remove equal to e35f0d4 [MechCoder] minor e89376c [MechCoder] Better checking 22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust (cherry picked from commit 452eb82dd722e5dfd00ee47bb8b6353933b0016e) Signed-off-by: Xiangrui Meng <meng@databricks.com> 03 June 2015, 06:26:20 UTC
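The same pitfall reproduces in any language that compares version strings lexicographically. A small illustrative sketch of the bug class (the actual fix lives in PySpark's Python code):

```scala
object VersionCheck extends App {
  import scala.math.Ordering.Implicits._

  // Lexicographic string comparison: "1.10" sorts before "1.4".
  val broken = "1.10" < "1.4"               // true -- wrong for versions

  // Safer: compare numeric components.
  def parts(v: String): Seq[Int] = v.split('.').toSeq.map(_.toInt)
  val robust = parts("1.10") < parts("1.4") // false -- 1.10 is newer than 1.4

  println(s"string compare: $broken, numeric compare: $robust")
}
```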
672f322 [SPARK-7883] [DOCS] [MLLIB] Fixing broken trainImplicit Scala example in MLlib Collaborative Filtering documentation. Fixing broken trainImplicit Scala example in MLlib Collaborative Filtering documentation to match one of the possible ALS.trainImplicit function signatures. Author: Mike Dusenberry <dusenberrymw@gmail.com> Closes #6422 from dusenberrymw/Fix_MLlib_Collab_Filtering_trainImplicit_Example and squashes the following commits: 36492f4 [Mike Dusenberry] Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation to match one of the possible ALS.trainImplicit function signatures. (cherry picked from commit 0463428b6e8f364f0b1f39445a60cd85ae7c07bc) Signed-off-by: Xiangrui Meng <meng@databricks.com> 27 May 2015, 01:09:35 UTC
ee06e92 [SPARK-6753] Clone SparkConf in ShuffleSuite tests Prior to this change, the unit test for SPARK-3426 did not clone the original SparkConf, which meant that that test did not use the options set by suites that subclass ShuffleSuite.scala. This commit fixes that problem. JoshRosen would be great if you could take a look at this, since you wrote this test originally. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #5401 from kayousterhout/SPARK-6753 and squashes the following commits: 368c540 [Kay Ousterhout] [SPARK-6753] Clone SparkConf in ShuffleSuite tests (cherry picked from commit 9d44ddce1d1e19011026605549c37d0db6d6afa1) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 08 April 2015, 17:27:37 UTC
39761f5 [SPARK-6132][HOTFIX] ContextCleaner InterruptedException should be quiet If the cleaner is stopped, we shouldn't print a huge stack trace when the cleaner thread is interrupted, because we interrupted it on purpose. Author: Andrew Or <andrew@databricks.com> Closes #4882 from andrewor14/cleaner-interrupt and squashes the following commits: 8652120 [Andrew Or] Just a hot fix 22 March 2015, 13:08:34 UTC
e445ce6 [SPARK-6132] ContextCleaner race condition across SparkContexts The problem is that `ContextCleaner` may clean variables that belong to a different `SparkContext`. This can happen if the `SparkContext` to which the cleaner belongs stops, and a new one is started immediately afterwards in the same JVM. In this case, if the cleaner is in the middle of cleaning a broadcast, for instance, it will do so through `SparkEnv.get.blockManager`, which could be one that belongs to a different `SparkContext`. JoshRosen and I suspect that this is the cause of many flaky tests, most notably the `JavaAPISuite`. We were able to reproduce the failure locally (though it is not deterministic and very hard to reproduce). Author: Andrew Or <andrew@databricks.com> Closes #4869 from andrewor14/cleaner-masquerade and squashes the following commits: 29168c0 [Andrew Or] Synchronize ContextCleaner stop 22 March 2015, 13:08:14 UTC
c583681 [EXAMPLES] fix typo. Author: Makoto Fukuhara <fukuo33@gmail.com> Closes #4724 from fukuo33/fix-typo and squashes the following commits: 8c806b9 [Makoto Fukuhara] fix typo. 04 March 2015, 23:13:45 UTC
d70754d Revert "[SPARK-5423][Core] Cleanup resources in DiskMapIterator.finalize to ensure deleting the temp file" This reverts commit 36f3c499fd1ad53a68a084d6a16a2c68099e7049. 03 March 2015, 21:05:13 UTC
91d0eff [SPARK-6055] [PySpark] fix incorrect DataType.__eq__ (for 1.1) The __eq__ of DataType is not correct, so the class cache is not used correctly (a created class cannot be found by its dataType), and lots of classes end up being created (saved in _cached_cls) and never released. Also, all instances of the same DataType have the same hash code, so many objects in a dict share one hash code; this degenerates like a hash-collision attack, making access to the dict very slow (depending on the CPython implementation). Author: Davies Liu <davies@databricks.com> Closes #4810 from davies/leak3 and squashes the following commits: 48d643d [Davies Liu] Update sql.py 968a28c [Davies Liu] fix __eq__ of singleton ac9db57 [Davies Liu] fix tests f748114 [Davies Liu] fix incorrect DataType.__eq__ 28 February 2015, 04:06:03 UTC
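The dict slowdown described above follows from a general rule: when every key reports the same hash code, a hash map's buckets collapse into one long chain. A minimal sketch of that failure mode (illustrative only; the actual fix is in PySpark's `DataType.__eq__`/`__hash__`):

```scala
import scala.collection.mutable

object ConstantHashDemo extends App {
  // Every key hashes to the same bucket; equals still distinguishes keys,
  // so lookups stay correct but degrade to a linear scan of one chain.
  final case class BadKey(name: String) {
    override def hashCode: Int = 42
  }

  val m = mutable.HashMap[BadKey, Int]()
  (1 to 10000).foreach(i => m(BadKey("k" + i)) = i) // all 10000 entries collide
  println(m(BadKey("k9999")))                        // correct answer, O(n) per access
}
```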
814934d Fix SPARK-6033: clarify the spark.worker.cleanup behavior in standalone mode JIRA: https://issues.apache.org/jira/browse/SPARK-6033 In standalone deploy mode, the cleanup only removes the stopped applications' directories; the original description of the cleanup behavior was incorrect. Author: 许鹏 <peng.xu@fraudmetrix.cn> Closes #4803 from hseagle/spark-6033 and squashes the following commits: 927a6a0 [许鹏] fix the incorrect description about the spark.worker.cleanup in standalone mode 27 February 2015, 07:07:36 UTC
2785210 Add a note for context termination for History server on Yarn The history server on YARN only shows completed jobs. This adds a note about the explicit context termination needed at the end of a Spark job, which is a best practice anyway. Related to SPARK-2972 and SPARK-3458 Author: moussa taifi <moutai10@gmail.com> Closes #4721 from moutai/add-history-server-note-for-closing-the-spark-context and squashes the following commits: 9f5b6c3 [moussa taifi] Fix upper case typo for YARN 3ad3db4 [moussa taifi] Add context termination for History server on Yarn (cherry picked from commit c871e2dae0182e914135560d14304242e1f97f7e) Signed-off-by: Andrew Or <andrew@databricks.com> 26 February 2015, 22:20:51 UTC
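A minimal sketch of the recommended shutdown pattern (hypothetical app name; the point is that `stop()` runs even on failure, so the application is marked completed and appears in the YARN history server):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HistoryFriendlyApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HistoryFriendlyApp"))
    try {
      sc.parallelize(1 to 100).count()
    } finally {
      sc.stop() // without this, the job may never show up as completed
    }
  }
}
```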
36f3c49 [SPARK-5423][Core] Cleanup resources in DiskMapIterator.finalize to ensure deleting the temp file This PR adds a `finalize` method in DiskMapIterator to clean up the resources even if some exception happens during processing data. Author: zsxwing <zsxwing@gmail.com> Closes #4219 from zsxwing/SPARK-5423 and squashes the following commits: d4b2ca6 [zsxwing] Cleanup resources in DiskMapIterator.finalize to ensure deleting the temp file (cherry picked from commit 90095bf3ce9304d09a32ceffaa99069079071b59) Signed-off-by: Ubuntu <ubuntu@ip-172-31-36-14.us-west-2.compute.internal> 19 February 2015, 18:38:27 UTC
651ceae [SQL] Fix flaky SET test Author: Michael Armbrust <michael@databricks.com> Closes #4480 from marmbrus/fixSetTests and squashes the following commits: f2e501e [Michael Armbrust] [SQL] Fix flaky SET test 09 February 2015, 22:57:55 UTC
03d4097 [SPARK-5691] Fixing wrong data structure lookup for dupe app registration In Master's registerApplication method, it checks whether the application has already been registered by examining the addressToWorker hash map. It should instead refer to the addressToApp data structure, as this is what really tracks which apps have been registered. 09 February 2015, 21:21:59 UTC
40bce63 SPARK-5425: Use synchronised methods in system properties to create SparkConf SPARK-5425: Fixed usages of system properties This patch fixes a few problems caused by the fact that the Scala wrapper over system properties is not thread-safe, and is basically invalid because it doesn't take into account the default values which could have been set in the properties object. The problem is fixed by modifying the `Utils.getSystemProperties` method so that it uses the `stringPropertyNames` method of the `Properties` class, which is thread-safe (internally it creates a defensive copy in a synchronized method) and returns the keys of properties which were set explicitly as well as those defined as defaults. The other related problem, fixed here as well, was in the `ResetSystemProperties` mix-in: it created a copy of the system properties in the wrong way. This patch also introduces a test case for the thread-safety of SparkConf creation. Refer to the discussion in https://github.com/apache/spark/pull/4220 for more details. Author: Jacek Lewandowski <lewandowski.jacek@gmail.com> Closes #4220 from jacek-lewandowski/SPARK-5425-1.1 and squashes the following commits: 6c48a1f [Jacek Lewandowski] SPARK-5425: Modified Utils.getSystemProperties to return a map of all system properties - explicit + defaults 74b4489 [Jacek Lewandowski] SPARK-5425: Use SerializationUtils to save properties in ResetSystemProperties trait 685780e [Jacek Lewandowski] SPARK-5425: Use synchronised methods in system properties to create SparkConf 02 February 2015, 22:06:23 UTC
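A sketch of the snapshot approach described above, assuming only the standard `java.util.Properties` API (not the actual patch):

```scala
import scala.collection.JavaConverters._

object SystemPropsSnapshot {
  // stringPropertyNames() makes a defensive copy inside a synchronized method
  // and includes keys that are only defined as defaults, avoiding both the
  // thread-safety and the missing-defaults problems of the Scala wrapper.
  def getSystemProperties: Map[String, String] = {
    val props = System.getProperties
    props.stringPropertyNames().asScala
      .map(key => key -> props.getProperty(key))
      .toMap
  }
}
```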
e19c70f [SPARK-4989][CORE] backport for branch-1.1 catch eventlog exception for wrong eventlog conf JIRA is [SPARK-4989](https://issues.apache.org/jira/browse/SPARK-4989) Author: Zhang, Liye <liye.zhang@intel.com> Closes #3970 from liyezhang556520/apache-branch-1.1 and squashes the following commits: 93fa427 [Zhang, Liye] catch eventlog exception for wrong eventlog conf 28 January 2015, 21:09:33 UTC
3bb5d13 SPARK-4430 [STREAMING] [TEST] Apache RAT Checks fail spuriously on test files Another trivial one. The RAT failure was due to temp files from `FailureSuite` not being cleaned up. This just makes the cleanup more reliable by using the standard temp dir mechanism. Author: Sean Owen <sowen@cloudera.com> Closes #4189 from srowen/SPARK-4430 and squashes the following commits: 9ea63ff [Sean Owen] Properly acquire a temp directory to ensure it is cleaned up at shutdown, which helps avoid a RAT check failure (cherry picked from commit 0528b85cf96f9c9c074b5fbb5b9c5dd8071c0bc7) Signed-off-by: Andrew Or <andrew@databricks.com> 26 January 2015, 03:16:59 UTC
49e1f10 SPARK-4506 [DOCS] Addendum: Update more docs to reflect that standalone works in cluster mode This is a trivial addendum to SPARK-4506, which was already resolved, as noted by Asim Jalis in SPARK-4506. Author: Sean Owen <sowen@cloudera.com> Closes #4160 from srowen/SPARK-4506 and squashes the following commits: 5f5f7df [Sean Owen] Update more docs to reflect that standalone works in cluster mode Conflicts: docs/submitting-applications.md 25 January 2015, 23:26:44 UTC
7805480 [SPARK-4161]Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf Author: GuoQiang Li <witgo@qq.com> Closes #3050 from witgo/SPARK-4161 and squashes the following commits: abb6fa4 [GuoQiang Li] move usejavacp opt to spark-shell 89e39e7 [GuoQiang Li] review commit c2a6f04 [GuoQiang Li] Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf 21 January 2015, 18:57:28 UTC
8b9b4b2 [SPARK-4569] Rename 'externalSorting' in Aggregator Hi all - I've renamed the unhelpfully named variable and added a comment clarifying what's actually happening. Author: Ilya Ganelin <ilya.ganelin@capitalone.com> Closes #3666 from ilganeli/SPARK-4569B and squashes the following commits: 1810394 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator e2d2092 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator d7cefec [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator 5b3f39c [Ilya Ganelin] [SPARK-4569] Rename in Aggregator 21 January 2015, 18:50:27 UTC
ecdeca0 SPARK-4660: Use correct class loader in JavaSerializer (copy of PR #3840 by Piotr Kolaczkowski) Author: Jacek Lewandowski <lewandowski.jacek@gmail.com> Closes #4113 from jacek-lewandowski/SPARK-4660-master and squashes the following commits: a5e84ca [Jacek Lewandowski] SPARK-4660: Use correct class loader in JavaSerializer (copy of PR #3840 by Piotr Kolaczkowski) (cherry picked from commit c93a57f0d6dc32b127aa68dbe4092ab0b22a9667) Signed-off-by: Patrick Wendell <patrick@databricks.com> 20 January 2015, 20:38:20 UTC
3ee5c17 [SPARK-4504][Examples] fix run-example failure if multiple assembly jars exist Fix run-example script to fail fast with useful error message if multiple example assembly JARs are present. Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com> Closes #3377 from gvramana/run-example_fails and squashes the following commits: fa7f481 [Venkata Ramana Gollamudi] Fixed review comments, avoiding ls output scanning. 6aa1ab7 [Venkata Ramana Gollamudi] Fix run-examples script error during multiple jars (cherry picked from commit 74de94ea6db96a04b278c6106264313504d7b8f3) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: bin/compute-classpath.sh bin/run-example 19 January 2015, 21:45:44 UTC
68df734 [SPARK-4920][UI]: back port the PR-3763 to branch 1.1 Author: uncleGen <hustyugm@gmail.com> Closes #3768 from uncleGen/branch-1.1-1223 and squashes the following commits: ec2365f [uncleGen] error fix 67a27e5 [uncleGen] [SPARK-4920][UI]: back port the PR-3763 to branch 1.1 18 January 2015, 01:26:13 UTC
c411182 [DOCS] Fix typo in return type of cogroup This fixes a simple typo in the cogroup docs noted in http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAMAsSdJ8_24evMAMg7fOZCQjwimisbYWa9v8BN6Rc3JCauja6wmail.gmail.com%3E I didn't bother with a JIRA Author: Sean Owen <sowen@cloudera.com> Closes #4072 from srowen/CogroupDocFix and squashes the following commits: 43c850b [Sean Owen] Fix typo in return type of cogroup (cherry picked from commit f6b852aade7668c99f37c69f606c64763cb265d2) Signed-off-by: Andrew Or <andrew@databricks.com> 16 January 2015, 17:28:59 UTC
810a224 [SPARK-3433][BUILD] Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations. The false positive reported was actually due to the Mima generator not picking up the new jars in the presence of old jars (theoretically this should not have happened). So as a workaround, we run them both separately and just append the results together. Author: Prashant Sharma <prashant@apache.org> Author: Prashant Sharma <prashant.s@imaginea.com> Closes #2285 from ScrapCodes/mima-fix and squashes the following commits: 093c76f [Prashant Sharma] Update mima 59012a8 [Prashant Sharma] Update mima 35b6c71 [Prashant Sharma] SPARK-3433 Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations. (cherry picked from commit ecf0c02935815f0d4018c0e30ec4c784e60a5db0) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: project/MimaExcludes.scala 13 January 2015, 00:43:19 UTC
c98dc0e [SPARK-4348] [SPARK-4821] Backport PySpark random.py -> rand.py fix to branch-1.1 This backports #3216 and #3669 to `branch-1.1` in order to fix the PySpark unit tests. Author: Joseph K. Bradley <joseph@databricks.com> Author: Davies Liu <davies@databricks.com> Closes #4011 from JoshRosen/pyspark-rand-fix-1.1-backport and squashes the following commits: ace4cb6 [Joseph K. Bradley] [SPARK-4821] [mllib] [python] [docs] Fix for pyspark.mllib.rand doc 7ae5a1c [Davies Liu] [SPARK-4348] [PySpark] [MLlib] rename random.py to rand.py 13 January 2015, 00:16:36 UTC
ee33699 [SPARK-5200] Disable web UI in Hive ThriftServer tests Disables the Spark web UI in HiveThriftServer2Suite in order to prevent Jenkins test failures due to port contention. Author: Josh Rosen <joshrosen@databricks.com> Closes #3998 from JoshRosen/SPARK-5200 and squashes the following commits: a384416 [Josh Rosen] [SPARK-5200] Disable web UI in Hive Thriftserver tests. (cherry picked from commit 82fd38dcdcc9f7df18930c0e08cc8ec34eaee828) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suite.scala 12 January 2015, 18:50:52 UTC
55325af [SPARK-5132][Core]Correct stage Attempt Id key in stageInfofromJson SPARK-5132: stageInfoToJson writes the key as "Stage Attempt Id", but stageInfoFromJson reads it as "Attempt Id". Author: hushan[胡珊] <hushan@xiaomi.com> Closes #3932 from suyanNone/json-stage and squashes the following commits: 41419ab [hushan[胡珊]] Correct stage Attempt Id key in stageInfofromJson (cherry picked from commit d345ebebd554ac3faa4e870bd7800ed02e89da58) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 07 January 2015, 20:10:07 UTC
6d1ca23 [SPARK-4787] Stop SparkContext if a DAGScheduler init error occurs Author: Dale <tigerquoll@outlook.com> Closes #3809 from tigerquoll/SPARK-4787 and squashes the following commits: 5661e01 [Dale] [SPARK-4787] Ensure that call to stop() doesn't lose the exception by using a finally block. 2172578 [Dale] [SPARK-4787] Stop context properly if an exception occurs during DAGScheduler initialization. (cherry picked from commit 3fddc9468fa50e7683caa973fec6c52e1132268d) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 04 January 2015, 21:30:05 UTC
c532acf [HOTFIX] Bind web UI to ephemeral port in DriverSuite The job launched by DriverSuite should bind the web UI to an ephemeral port, since it looks like port contention in this test has caused a large number of Jenkins failures when many builds are started simultaneously. Our tests already disable the web UI, but this doesn't affect subprocesses launched by our tests. In this case, I've opted to bind to an ephemeral port instead of disabling the UI because disabling features in this test may mask its ability to catch certain bugs. See also: e24d3a9 Author: Josh Rosen <joshrosen@databricks.com> Closes #3873 from JoshRosen/driversuite-webui-port and squashes the following commits: 48cd05c [Josh Rosen] [HOTFIX] Bind web UI to ephemeral port in DriverSuite. (cherry picked from commit 012839807c3dc6e7c8c41ac6e956d52a550bb031) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 01 January 2015, 23:04:24 UTC
61eb9be [SPARK-5035] [Streaming] ReceiverMessage trait should extend Serializable Spark Streaming's ReceiverMessage trait should extend Serializable in order to fix a subtle bug that only occurs when running on a real cluster: If you attempt to send a fire-and-forget message to a remote Akka actor and that message cannot be serialized, then this seems to lead to more-or-less silent failures. As an optimization, Akka skips message serialization for messages sent within the same JVM. As a result, Spark's unit tests will never fail due to non-serializable Akka messages, but these will cause mostly-silent failures when running on a real cluster. Before this patch, here was the code for ReceiverMessage: ``` /** Messages sent to the NetworkReceiver. */ private[streaming] sealed trait ReceiverMessage private[streaming] object StopReceiver extends ReceiverMessage ``` Since ReceiverMessage does not extend Serializable and StopReceiver is a regular `object`, not a `case object`, StopReceiver will throw serialization errors. As a result, graceful receiver shutdown is broken on real clusters (and local-cluster mode) but works in local modes. If you want to reproduce this, try running the word count example from the Streaming Programming Guide in the Spark shell: ``` import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ val ssc = new StreamingContext(sc, Seconds(10)) // Create a DStream that will connect to hostname:port, like localhost:9999 val lines = ssc.socketTextStream("localhost", 9999) // Split each line into words val words = lines.flatMap(_.split(" ")) import org.apache.spark.streaming.StreamingContext._ // Count each word in each batch val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) // Print the first ten elements of each RDD generated in this DStream to the console wordCounts.print() ssc.start() Thread.sleep(10000) ssc.stop(true, true) ``` Prior to this patch, this would work correctly in local mode but fail when running against a real cluster (it would report that some receivers were not shut down). Author: Josh Rosen <joshrosen@databricks.com> Closes #3857 from JoshRosen/SPARK-5035 and squashes the following commits: 71d0eae [Josh Rosen] [SPARK-5035] ReceiverMessage trait should extend Serializable. (cherry picked from commit fe6efacc0b865e9e827a1565877077000e63976e) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 01 January 2015, 00:03:24 UTC
1034707 [HOTFIX] Disable Spark UI in SparkSubmitSuite tests This should fix a major cause of build breaks when running many parallel tests. 31 December 2014, 22:14:22 UTC
08d4f70 [SPARK-4298][Core] - The spark-submit cannot read Main-Class from Manifest. Resolves a bug where the `Main-Class` from a .jar file wasn't being read in properly. This was caused by the fact that the `primaryResource` object was a URI and needed to be normalized through a call to `.getPath` before it could be passed into the `JarFile` object. Author: Brennon York <brennon.york@capitalone.com> Closes #3561 from brennonyork/SPARK-4298 and squashes the following commits: 5e0fce1 [Brennon York] Use string interpolation for error messages, moved comment line from original code to above its necessary code segment 14daa20 [Brennon York] pushed mainClass assignment into match statement, removed spurious spaces, removed { } from case statements, removed return values c6dad68 [Brennon York] Set case statement to support multiple jar URI's and enabled the 'file' URI to load the main-class 8d20936 [Brennon York] updated to reset the error message back to the default a043039 [Brennon York] updated to split the uri and jar vals 8da7cbf [Brennon York] fixes SPARK-4298 (cherry picked from commit 8e14c5eb551ab06c94859c7f6d8c6b62b4d00d59) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala 31 December 2014, 19:57:11 UTC
babcafa [SPARK-1010] Clean up uses of System.setProperty in unit tests Several of our tests call System.setProperty (or test code which implicitly sets system properties) and don't always reset/clear the modified properties, which can create ordering dependencies between tests and cause hard-to-diagnose failures. This patch removes most uses of System.setProperty from our tests, since in most cases we can use SparkConf to set these configurations (there are a few exceptions, including the tests of SparkConf itself). For the cases where we continue to use System.setProperty, this patch introduces a `ResetSystemProperties` ScalaTest mixin class which snapshots the system properties before individual tests and to automatically restores them on test completion / failure. See the block comment at the top of the ResetSystemProperties class for more details. Author: Josh Rosen <joshrosen@databricks.com> Closes #3739 from JoshRosen/cleanup-system-properties-in-tests and squashes the following commits: 0236d66 [Josh Rosen] Replace setProperty uses in two example programs / tools 3888fe3 [Josh Rosen] Remove setProperty use in LocalJavaStreamingContext 4f4031d [Josh Rosen] Add note on why SparkSubmitSuite needs ResetSystemProperties 4742a5b [Josh Rosen] Clarify ResetSystemProperties trait inheritance ordering. 0eaf0b6 [Josh Rosen] Remove setProperty call in TaskResultGetterSuite. 7a3d224 [Josh Rosen] Fix trait ordering 3fdb554 [Josh Rosen] Remove setProperty call in TaskSchedulerImplSuite bee20df [Josh Rosen] Remove setProperty calls in SparkContextSchedulerCreationSuite 655587c [Josh Rosen] Remove setProperty calls in JobCancellationSuite 3f2f955 [Josh Rosen] Remove System.setProperty calls in DistributedSuite cfe9cce [Josh Rosen] Remove use of system properties in SparkContextSuite 8783ab0 [Josh Rosen] Remove TestUtils.setSystemProperty, since it is subsumed by the ResetSystemProperties trait. 633a84a [Josh Rosen] Remove use of system properties in FileServerSuite 25bfce2 [Josh Rosen] Use ResetSystemProperties in UtilsSuite 1d1aa5a [Josh Rosen] Use ResetSystemProperties in SizeEstimatorSuite dd9492b [Josh Rosen] Use ResetSystemProperties in AkkaUtilsSuite b0daff2 [Josh Rosen] Use ResetSystemProperties in BlockManagerSuite e9ded62 [Josh Rosen] Use ResetSystemProperties in TaskSchedulerImplSuite 5b3cb54 [Josh Rosen] Use ResetSystemProperties in SparkListenerSuite 0995c4b [Josh Rosen] Use ResetSystemProperties in SparkContextSchedulerCreationSuite c83ded8 [Josh Rosen] Use ResetSystemProperties in SparkConfSuite 51aa870 [Josh Rosen] Use withSystemProperty in ShuffleSuite 60a63a1 [Josh Rosen] Use ResetSystemProperties in JobCancellationSuite 14a92e4 [Josh Rosen] Use withSystemProperty in FileServerSuite 628f46c [Josh Rosen] Use ResetSystemProperties in DistributedSuite 9e3e0dd [Josh Rosen] Add ResetSystemProperties test fixture mixin; use it in SparkSubmitSuite. 4dcea38 [Josh Rosen] Move withSystemProperty to TestUtils class. 
(cherry picked from commit 352ed6bbe3c3b67e52e298e7c535ae414d96beca) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: core/src/test/scala/org/apache/spark/ShuffleSuite.scala core/src/test/scala/org/apache/spark/SparkConfSuite.scala core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala core/src/test/scala/org/apache/spark/SparkContextSuite.scala core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala core/src/test/scala/org/apache/spark/util/UtilsSuite.scala external/flume/src/test/java/org/apache/spark/streaming/LocalJavaStreamingContext.java external/mqtt/src/test/java/org/apache/spark/streaming/LocalJavaStreamingContext.java external/twitter/src/test/java/org/apache/spark/streaming/LocalJavaStreamingContext.java external/zeromq/src/test/java/org/apache/spark/streaming/LocalJavaStreamingContext.java tools/src/main/scala/org/apache/spark/tools/StoragePerfTester.scala 31 December 2014, 19:31:44 UTC
eac740e [SPARK-4813][Streaming] Fix the issue that ContextWaiter didn't handle 'spurious wakeup' Used `Condition` to rewrite `ContextWaiter` because it provides a convenient `awaitNanos` API for timeouts. Author: zsxwing <zsxwing@gmail.com> Closes #3661 from zsxwing/SPARK-4813 and squashes the following commits: 52247f5 [zsxwing] Add explicit unit type be42bcf [zsxwing] Update as per review suggestion e06bd4f [zsxwing] Fix the issue that ContextWaiter didn't handle 'spurious wakeup' (cherry picked from commit 6a897829444e2ef273586511f93a40d36e64fb0b) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 30 December 2014, 22:39:53 UTC
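The standard spurious-wakeup-safe idiom with `Condition.awaitNanos` looks roughly like this (a sketch with illustrative names, not the actual `ContextWaiter` code):

```scala
import java.util.concurrent.locks.ReentrantLock

class Waiter {
  private val lock = new ReentrantLock()
  private val condition = lock.newCondition()
  private var stopped = false

  def notifyStop(): Unit = {
    lock.lock()
    try { stopped = true; condition.signalAll() } finally { lock.unlock() }
  }

  /** Returns true if stopped, false on timeout; the loop absorbs spurious wakeups. */
  def waitForStop(timeoutMs: Long): Boolean = {
    var nanos = timeoutMs * 1000000L
    lock.lock()
    try {
      while (!stopped && nanos > 0) {
        nanos = condition.awaitNanos(nanos) // returns the remaining wait time
      }
      stopped
    } finally { lock.unlock() }
  }
}
```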
d6b8d2c Revert "[SPARK-4882] Register PythonBroadcast with Kryo so that PySpark works with KryoSerializer" This reverts commit 822a0b42f79acea2771d0b298e803c11c37aab81. This fix does not apply to branch-1.1 or branch-1.0, since PythonBroadcast is new in 1.2. 30 December 2014, 17:33:07 UTC
822a0b4 [SPARK-4882] Register PythonBroadcast with Kryo so that PySpark works with KryoSerializer This PR fixes an issue where PySpark broadcast variables caused NullPointerExceptions if KryoSerializer was used. The fix is to register PythonBroadcast with Kryo so that it's deserialized with a KryoJavaSerializer. Author: Josh Rosen <joshrosen@databricks.com> Closes #3831 from JoshRosen/SPARK-4882 and squashes the following commits: 0466c7a [Josh Rosen] Register PythonBroadcast with Kryo. d5b409f [Josh Rosen] Enable registrationRequired, which would have caught this bug. 069d8a7 [Josh Rosen] Add failing test for SPARK-4882 (cherry picked from commit efa80a531ecd485f6cf0cdc24ffa42ba17eea46d) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 30 December 2014, 17:30:32 UTC
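Registering a class so that Kryo delegates to Java serialization for it looks roughly like this (a hedged sketch with a hypothetical payload class standing in for PythonBroadcast):

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical stand-in for a class that only supports Java serialization.
class OpaquePayload extends Serializable

class PayloadRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Kryo's JavaSerializer adapter (not Spark's) falls back to
    // ObjectOutputStream/ObjectInputStream for this class.
    kryo.register(classOf[OpaquePayload], new JavaSerializer())
  }
}
```

Such a registrator would be enabled via the `spark.kryo.registrator` configuration property.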
d5e0a45 [HOTFIX] Add SPARK_VERSION to Spark package object. This helps to avoid build breaks when backporting patches that use org.apache.spark.SPARK_VERSION. 30 December 2014, 06:03:07 UTC
3442b7b [SPARK-4952][Core]Handle ConcurrentModificationExceptions in SparkEnv.environmentDetails Author: GuoQiang Li <witgo@qq.com> Closes #3788 from witgo/SPARK-4952 and squashes the following commits: d903529 [GuoQiang Li] Handle ConcurrentModificationExceptions in SparkEnv.environmentDetails (cherry picked from commit 080ceb771a1e6b9f844cfd4f1baa01133c106888) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 27 December 2014, 07:32:00 UTC
d21347d [SPARK-4537][Streaming] Expand StreamingSource to add more metrics Add `processingDelay`, `schedulingDelay` and `totalDelay` for the last completed batch. Add `lastReceivedBatchRecords` and `totalReceivedBatchRecords` to the received records counting. Author: jerryshao <saisai.shao@intel.com> Closes #3466 from jerryshao/SPARK-4537 and squashes the following commits: 00f5f7f [jerryshao] Change the code style and add totalProcessedRecords 44721a6 [jerryshao] Further address the comments c097ddc [jerryshao] Address the comments 02dd44f [jerryshao] Fix the addressed comments c7a9376 [jerryshao] Expand StreamingSource to add more metrics (cherry picked from commit f205fe477c33a541053c198cd43a5811d6cf9fe2) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 26 December 2014, 03:42:17 UTC
dd0287c [SPARK-4606] Send EOF to child JVM when there's no more data to read. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #3460 from vanzin/SPARK-4606 and squashes the following commits: 031207d [Marcelo Vanzin] [SPARK-4606] Send EOF to child JVM when there's no more data to read. (cherry picked from commit 7e2deb71c4239564631b19c748e95c3d1aa1c77d) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 24 December 2014, 00:09:34 UTC
b1de461 [SPARK-4802] [streaming] Remove receiverInfo once receiver is de-registered Once the streaming receiver is de-registered at the executor, the `ReceiverTrackerActor` needs to remove the corresponding receiverInfo from the `receiverInfo` map at `ReceiverTracker`. Author: Ilayaperumal Gopinathan <igopinathan@pivotal.io> Closes #3647 from ilayaperumalg/receiverInfo-RTracker and squashes the following commits: 6eb97d5 [Ilayaperumal Gopinathan] Polishing based on the review 3640c86 [Ilayaperumal Gopinathan] Remove receiverInfo once receiver is de-registered (cherry picked from commit 10d69e9cbfdabe95d0e513176d5347d7b59da0ee) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> Conflicts: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala 23 December 2014, 23:23:39 UTC
3bce43f [SPARK-4818][Core] Add 'iterator' to reduce memory consumed by join In Scala, `map` and `flatMap` of `Iterable` will copy the contents of the `Iterable` to a new `Seq`. For example, ```Scala val iterable = Seq(1, 2, 3).map(v => { println(v) v }) println("Iterable map done") val iterator = Seq(1, 2, 3).iterator.map(v => { println(v) v }) println("Iterator map done") ``` outputs ``` 1 2 3 Iterable map done Iterator map done ``` So we should use 'iterator' to reduce the memory consumed by join. Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E Author: zsxwing <zsxwing@gmail.com> Closes #3671 from zsxwing/SPARK-4824 and squashes the following commits: 48ee7b9 [zsxwing] Remove the explicit types 95d59d6 [zsxwing] Add 'iterator' to reduce memory consumed by join (cherry picked from commit c233ab3d8d75a33495298964fe73dbf7dd8fe305) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 22 December 2014, 22:27:43 UTC
e5f2752 [Minor] Build Failed: value defaultProperties not found The Maven build failed with "value defaultProperties not found". Maybe related to this PR: https://github.com/apache/spark/commit/1d648123a77bbcd9b7a34cc0d66c14fa85edfecd andrewor14, can you look at this problem? Author: huangzhaowei <carlmartinmax@gmail.com> Closes #3749 from SaintBacchus/Mvn-Build-Fail and squashes the following commits: 8e2917c [huangzhaowei] Build Failed: value defaultProperties not found (cherry picked from commit a764960b3b6d842eef7fa4777c8fa99d3f60fa1e) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 20 December 2014, 07:33:42 UTC
3597c2e SPARK-2641: Passing num executors to spark arguments from properties file Since we can set the executor memory and executor cores from a properties file, we should also be able to set the number of executor instances. Author: Kanwaljit Singh <kanwaljit.singh@guavus.com> Closes #1657 from kjsingh/branch-1.0 and squashes the following commits: d8a5a12 [Kanwaljit Singh] SPARK-2641: Fixing how spark arguments are loaded from properties file for num executors Conflicts: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala 20 December 2014, 03:28:29 UTC
546a239 [SPARK-4896] don’t redundantly overwrite executor JAR deps Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #2848 from ryan-williams/fetch-file and squashes the following commits: c14daff [Ryan Williams] Fix copy that was changed to a move inadvertently 8e39c16 [Ryan Williams] code review feedback 788ed41 [Ryan Williams] don’t redundantly overwrite executor JAR deps (cherry picked from commit 7981f969762e77f1752ef8f86c546d4fc32a1a4f) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: core/src/main/scala/org/apache/spark/util/Utils.scala 19 December 2014, 23:55:12 UTC
2d66463 SPARK-3428. TaskMetrics for running tasks is missing GC time metrics Author: Sandy Ryza <sandy@cloudera.com> Closes #3684 from sryza/sandy-spark-3428 and squashes the following commits: cb827fe [Sandy Ryza] SPARK-3428. TaskMetrics for running tasks is missing GC time metrics (cherry picked from commit 283263ffaa941e7e9ba147cf0ad377d9202d3761) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 19 December 2014, 06:41:09 UTC
f4e6ffc [SPARK-4884]: Improve Partition docs Rewording was based on this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-td9804.html This is the associated JIRA ticket: https://issues.apache.org/jira/browse/SPARK-4884 Author: Madhu Siddalingaiah <madhu@madhu.com> Closes #3722 from msiddalingaiah/master and squashes the following commits: 79e679f [Madhu Siddalingaiah] [DOC]: improve documentation 51d14b9 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master' 38faca4 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master' cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again) 332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code> cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master' 0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions (cherry picked from commit d5a596d4188bfa85ff49ee85039f54255c19a4de) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 19 December 2014, 00:01:41 UTC
c15e7f2 [HOTFIX] Fix RAT exclusion for known_translations file Author: Josh Rosen <joshrosen@databricks.com> Closes #3719 from JoshRosen/rat-fix and squashes the following commits: 1542886 [Josh Rosen] [HOTFIX] Fix RAT exclusion for known_translations file (cherry picked from commit 3d0c37b8118f6057a663f959321a79b8061132b6) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 17 December 2014, 07:01:48 UTC
0efd691 [Release] Update contributors list format and sort it Additionally, we now warn the user when a duplicate author name arises, in which case he/she needs to resolve it manually. Conflicts: .rat-excludes 17 December 2014, 06:15:56 UTC
991748d [Release] Cache known author translations locally This bypasses unnecessary calls to the Github and JIRA API. Additionally, having a local cache allows us to remember names that we had to manually discover ourselves. 17 December 2014, 03:31:29 UTC
581f866 [Release] Major improvements to generate contributors script This commit introduces several major improvements to the script that generates the contributors list for release notes, notably: (1) Use release tags instead of a range of commits. Across branches, commits are not actually strictly two-dimensional, and so it is not sufficient to specify a start hash and an end hash. Otherwise, we end up counting commits that were already merged in an older branch. (2) Match PR numbers in addition to commit hashes. This is related to the first point in that if a PR is already merged in an older minor release tag, it should be filtered out here. This requires us to do some intelligent regex parsing on the commit description in addition to just relying on the GitHub API. (3) Relax author validity check. The old code fails on a name that has many middle names, for instance. The test was just too strict. (4) Use GitHub authentication. This allows us to make far more requests through the GitHub API than before (5000 as opposed to 60 per hour). (5) Translate from Github username, not commit author name. This is important because the commit author name is not always configured correctly by the user. For instance, the username "falaki" used to resolve to just "Hossein", which was treated as a github username and translated to something else that is completely arbitrary. (6) Add an option to use the untranslated name. If there is not a satisfactory candidate to replace the untranslated name with, at least allow the user to not translate it. 17 December 2014, 02:06:39 UTC
892685b SPARK-4814 [CORE] Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger This enables assertions for the Maven and SBT build, but overrides the Hive module to not enable assertions. Author: Sean Owen <sowen@cloudera.com> Closes #3692 from srowen/SPARK-4814 and squashes the following commits: caca704 [Sean Owen] Disable assertions just for Hive f71e783 [Sean Owen] Enable assertions for SBT and Maven build (cherry picked from commit 81112e4b573292e76c7feeed995751bd7a5fe489) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: pom.xml 16 December 2014, 01:14:14 UTC
fa3b3e3 SPARK-785 [CORE] ClosureCleaner not invoked on most PairRDDFunctions This looked like perhaps a simple and important one. `combineByKey` looks like it should clean its arguments' closures, and that in turn covers apparently all remaining functions in `PairRDDFunctions` which delegate to it. Author: Sean Owen <sowen@cloudera.com> Closes #3690 from srowen/SPARK-785 and squashes the following commits: 8df68fe [Sean Owen] Clean context of most remaining functions in PairRDDFunctions, which ultimately call combineByKey (cherry picked from commit 2a28bc61009a170af3853c78f7f36970898a6d56) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 16 December 2014, 00:07:30 UTC
0faea17 fixed spelling errors in documentation changed "form" to "from" in 3 documentation entries for Kafka integration Author: Peter Klipfel <peter@klipfel.me> Closes #3691 from peterklipfel/master and squashes the following commits: 0fe7fc5 [Peter Klipfel] fixed spelling errors in documentation (cherry picked from commit 2a2983f7c5edce8e4d1d7592adcf227cbd462ae9) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 14 December 2014, 08:02:10 UTC
396de67 [SPARK-4759] Fix driver hanging from coalescing partitions The driver hangs sometimes when we coalesce RDD partitions. See JIRA for more details and reproduction. This is because our use of empty string as default preferred location in `CoalescedRDDPartition` causes the `TaskSetManager` to schedule the corresponding task on host `""` (empty string). The intended semantics here, however, is that the partition does not have a preferred location, and the TSM should schedule the corresponding task accordingly. Author: Andrew Or <andrew@databricks.com> Closes #3633 from andrewor14/coalesce-preferred-loc and squashes the following commits: e520d6b [Andrew Or] Oops 3ebf8bd [Andrew Or] A few comments f370a4e [Andrew Or] Fix tests 2f7dfb6 [Andrew Or] Avoid using empty string as default preferred location (cherry picked from commit 4f93d0cabe5d1fc7c0fd0a33d992fd85df1fecb4) Signed-off-by: Andrew Or <andrew@databricks.com> 10 December 2014, 22:28:04 UTC
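One way to encode "no preference" without the empty-string sentinel is an `Option`, sketched here with hypothetical names (not the actual `CoalescedRDDPartition` code):

```scala
// An empty Seq means "no preference"; Seq("") would ask the scheduler to
// place the task on a host literally named "", which caused the hang above.
final case class PartitionLoc(index: Int, preferredHost: Option[String])

def preferredLocations(p: PartitionLoc): Seq[String] =
  p.preferredHost.toSeq // None => Seq.empty, never Seq("")
```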
273f2c8 [SPARK-4771][Docs] Document standalone cluster supervise mode tdas looks like streaming already refers to the supervise mode. The link from there is broken though. Author: Andrew Or <andrew@databricks.com> Closes #3627 from andrewor14/document-supervise and squashes the following commits: 9ca0908 [Andrew Or] Wording changes 2b55ed2 [Andrew Or] Document standalone cluster supervise mode 10 December 2014, 20:43:48 UTC
6dcafa7 [SPARK-4772] Clear local copies of accumulators as soon as we're done with them Accumulators keep thread-local copies of themselves. These copies were only cleared at the beginning of a task. This meant that (a) the memory they used was tied up until the next task ran on that thread, and (b) if a thread died, the memory it had used for accumulators was locked up forever on that worker. This PR clears the thread-local copies of accumulators at the end of each task, in the task's finally block, to make sure they are cleaned up between tasks. It also stores them in a ThreadLocal object, so that if, for some reason, the thread dies, any memory they are using at the time should be freed up. Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com> Closes #3570 from nkronenfeld/Accumulator-Improvements and squashes the following commits: a581f3f [Nathan Kronenfeld] Change Accumulators to private[spark] instead of adding mima exclude to get around false positive in mima tests b6c2180 [Nathan Kronenfeld] Include MiMa exclude as per build error instructions - this version incompatibility should be irrelevent, as it will only surface if a master is talking to a worker running a different version of spark. 537baad [Nathan Kronenfeld] Fuller refactoring as intended, incorporating JR's suggestions for ThreadLocal localAccums, and keeping clear(), but also calling it in tasks' finally block, rather than just at the beginning of the task. 39a82f2 [Nathan Kronenfeld] Clear local copies of accumulators as soon as we're done with them (cherry picked from commit 94b377f94487109a1cc3e07dd230b1df7a96e28d) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: core/src/main/scala/org/apache/spark/Accumulators.scala core/src/main/scala/org/apache/spark/executor/Executor.scala 10 December 2014, 07:56:59 UTC
9b99237 [SPARK-4714] BlockManager.dropFromMemory() should check whether block has been removed after synchronizing on BlockInfo instance. After synchronizing on the `info` lock in the `removeBlock`/`dropOldBlocks`/`dropFromMemory` methods in BlockManager, the block that `info` represents may already have been removed. The three methods have the same logic to get the `info` lock: ``` info = blockInfo.get(id) if (info != null) { info.synchronized { // do something } } ``` So there is a chance that when a thread enters the `info.synchronized` block, `info` has already been removed from the `blockInfo` map by some other thread that entered `info.synchronized` first. The `removeBlock` and `dropOldBlocks` methods are idempotent, so it's safe for them to run on blocks that have already been removed. But in `dropFromMemory` it may be problematic, since it may drop block data that has already been removed into the disk store, and this invokes data store operations that are not designed to handle missing blocks. This patch fixes the issue by adding a check to `dropFromMemory` to test whether the block has been removed by a racing thread. Author: hushan[胡珊] <hushan@xiaomi.com> Closes #3574 from suyanNone/refine-block-concurrency and squashes the following commits: edb989d [hushan[胡珊]] Refine code style and comments position 55fa4ba [hushan[胡珊]] refine code e57e270 [hushan[胡珊]] add check info is already remove or not while having gotten info.syn (cherry picked from commit 30dca924df0efbdc1b638fa7c705fe8743570783) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 09 December 2014, 23:55:10 UTC
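The shape of the added check is the classic "re-validate after locking" pattern, sketched here with illustrative names (not the actual BlockManager code):

```scala
import java.util.concurrent.ConcurrentHashMap

class BlockStore {
  private val blockInfo = new ConcurrentHashMap[String, AnyRef]()

  def dropFromMemory(id: String): Unit = {
    val info = blockInfo.get(id)
    if (info != null) {
      info.synchronized {
        // Re-check after acquiring the lock: a racing thread may have
        // removed the block between the lookup and entering this block.
        if (blockInfo.get(id) eq info) {
          // ... still registered: safe to drop the block's data here ...
        }
        // else: nothing to do, the block is already gone
      }
    }
  }
}
```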
fe7d7a9 SPARK-3926 [CORE] Reopened: result of JavaRDD collectAsMap() is not serializable My original 'fix' didn't fix at all. Now, there's a unit test to check whether it works. Of the two options to really fix it -- copy the `Map` to a `java.util.HashMap`, or copy and modify Scala's implementation in `Wrappers.MapWrapper`, I went with the latter. Author: Sean Owen <sowen@cloudera.com> Closes #3587 from srowen/SPARK-3926 and squashes the following commits: 8586bb9 [Sean Owen] Remove unneeded no-arg constructor, and add additional note about copied code in LICENSE 7bb0e66 [Sean Owen] Make SerializableMapWrapper actually serialize, and add unit test (cherry picked from commit e829bfa1ab9b68f44c489d26efb042f793fd9362) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 09 December 2014, 00:14:28 UTC
16bc77b [SPARK-4764] Ensure that files are fetched atomically tempFile is created in the same directory as targetFile, so that the move from tempFile to targetFile is always atomic Author: Christophe Préaud <christophe.preaud@kelkoo.com> Closes #2855 from preaudc/master and squashes the following commits: 9ba89ca [Christophe Préaud] Ensure that files are fetched atomically 54419ae [Christophe Préaud] Merge remote-tracking branch 'upstream/master' c6a5590 [Christophe Préaud] Revert commit 8ea871f8130b2490f1bad7374a819bf56f0ccbbd 7456a33 [Christophe Préaud] Merge remote-tracking branch 'upstream/master' 8ea871f [Christophe Préaud] Ensure that files are fetched atomically (cherry picked from commit ab2abcb5ef925f15fa0e08d34a79b94a7b6578ef) Signed-off-by: Josh Rosen <rosenville@gmail.com> Conflicts: core/src/main/scala/org/apache/spark/util/Utils.scala 08 December 2014, 19:50:10 UTC
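The same-directory temp file plus rename idiom can be sketched like this (illustrative helper, assuming `java.nio.file` semantics; a rename within a single directory stays on one filesystem, so it can be atomic):

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

object AtomicFetch {
  def fetchAtomically(writeTo: File => Unit, targetFile: File): Unit = {
    // The temp file lives in the SAME directory as the target, so the final
    // move is a same-filesystem rename and readers never see partial data.
    val tempFile = File.createTempFile("fetch-", ".tmp", targetFile.getParentFile)
    writeTo(tempFile)
    Files.move(tempFile.toPath, targetFile.toPath, StandardCopyOption.ATOMIC_MOVE)
  }
}
```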
8ee2d18 Fix typo in Spark SQL docs. Author: Andy Konwinski <andykonwinski@gmail.com> Closes #3611 from andyk/patch-3 and squashes the following commits: 7bab333 [Andy Konwinski] Fix typo in Spark SQL docs. (cherry picked from commit 15cf3b0125fe238dea2ce13e703034ba7cef477f) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 05 December 2014, 02:27:37 UTC
b09382a [SPARK-4421] Wrong link in spark-standalone.html Modified the link of building Spark. (backport version of #3279.) Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #3280 from tsudukim/feature/SPARK-4421-2 and squashes the following commits: 3b4d38d [Masayoshi TSUZUKI] [SPARK-4421] Wrong link in spark-standalone.html 05 December 2014, 02:13:23 UTC
bf637e0 [SPARK-4652][DOCS] Add docs about spark-git-repo option There may be cases when a work-in-progress Spark version needs to be run on an EC2 cluster. To make setting up this type of cluster easier, this adds a description of the --spark-git-repo option to the EC2 documentation. Author: lewuathe <lewuathe@me.com> Author: Josh Rosen <joshrosen@databricks.com> Closes #3513 from Lewuathe/doc-for-development-spark-cluster and squashes the following commits: 6dae8ee [lewuathe] Wrap consistent with other descriptions cfaf9be [lewuathe] Add docs about spark-git-repo option (Editing / cleanup by Josh Rosen) (cherry picked from commit ab8177da2defab1ecd8bc0cd5a21f07be5b8d2c5) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 04 December 2014, 23:25:41 UTC
e98aa54 [SPARK-4459] Change groupBy type parameter from K to U Please see https://issues.apache.org/jira/browse/SPARK-4459 Author: Saldanha <saldaal1@phusca-l24858.wlan.na.novartis.net> Closes #3327 from alokito/master and squashes the following commits: 54b1095 [Saldanha] [SPARK-4459] changed type parameter for keyBy from K to U d5f73c3 [Saldanha] [SPARK-4459] added keyBy test 316ad77 [Saldanha] SPARK-4459 changed type parameter for groupBy from K to U. 62ddd4b [Saldanha] SPARK-4459 added failing unit test (cherry picked from commit 743a889d2778f797aabc3b1e8146e7aa32b62a48) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 04 December 2014, 22:59:02 UTC
d01fdd3 [SPARK-4745] Fix get_existing_cluster() function with multiple security groups The current get_existing_cluster() function would only find an instance belonging to a cluster if the instance's security groups == cluster_name + "-master" (or "-slaves"). This fix allows for multiple security groups by checking whether the cluster_name + "-master" security group is in the list of groups for a particular instance. Author: alexdebrie <alexdebrie1@gmail.com> Closes #3596 from alexdebrie/master and squashes the following commits: 9d51232 [alexdebrie] Fix get_existing_cluster() function with multiple security groups (cherry picked from commit 794f3aec24acb578e258532ad0590554d07958ba) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 04 December 2014, 22:16:56 UTC
5ac55c8 [SPARK-4253] Ignore spark.driver.host in yarn-cluster and standalone-cluster modes In yarn-cluster and standalone-cluster modes, we don't know where driver will run until it is launched. If the `spark.driver.host` property is set on the submitting machine and propagated to the driver through SparkConf then this will lead to errors when the driver launches. This patch fixes this issue by dropping the `spark.driver.host` property in SparkSubmit when running in a cluster deploy mode. Author: WangTaoTheTonic <barneystinson@aliyun.com> Author: WangTao <barneystinson@aliyun.com> Closes #3112 from WangTaoTheTonic/SPARK4253 and squashes the following commits: ed1a25c [WangTaoTheTonic] revert unrelated formatting issue 02c4e49 [WangTao] add comment 32a3f3f [WangTaoTheTonic] ingore it in SparkSubmit instead of SparkContext 667cf24 [WangTaoTheTonic] document fix ff8d5f7 [WangTaoTheTonic] also ignore it in standalone cluster mode 2286e6b [WangTao] ignore spark.driver.host in yarn-cluster mode (cherry picked from commit 8106b1e36b2c2b9f5dc5d7252540e48cc3fc96d5) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala 04 December 2014, 20:08:47 UTC
6c53225 [Release] Correctly translate contributors name in release notes This commit involves three main changes: (1) It separates the translation of contributor names from the generation of the contributors list. This is largely motivated by the Github API limit; even if we exceed this limit, we should at least be able to proceed manually as before. This is why the translation logic is abstracted into its own script translate-contributors.py. (2) When we look for candidate replacements for invalid author names, we should look for the assignees of the associated JIRAs too. As a result, the intermediate file must keep track of these. (3) This provides an interactive mode with which the user can sit at the terminal and manually pick the candidate replacement that he/she thinks makes the most sense. As before, there is a non-interactive mode that picks the first candidate that the script considers "valid." TODO: We should have a known_contributors file that stores known mappings so we don't have to go through all of this translation every time. This is also valuable because some contributors simply cannot be automatically translated. Conflicts: .gitignore 04 December 2014, 03:20:05 UTC
17dfd41 [SPARK-4498][core] Don't transition ExecutorInfo to RUNNING until Driver adds Executor The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master. Else, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting. Author: Mark Hamstra <markhamstra@gmail.com> Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits: 8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver 03 December 2014, 23:11:16 UTC
3e3cd5a [SPARK-4642] Add description about spark.yarn.queue to running-on-YARN document. Added a description for spark.yarn.queue, and corrected the description of the default value of spark.yarn.submit.file.replication. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #3500 from tsudukim/feature/SPARK-4642 and squashes the following commits: ce99655 [Masayoshi TSUZUKI] better gramatically. 21cf624 [Masayoshi TSUZUKI] Removed intentionally undocumented properties. 88cac9b [Masayoshi TSUZUKI] [SPARK-4642] Documents about running-on-YARN needs update 03 December 2014, 21:20:18 UTC
af76954 [SPARK-4715][Core] Make sure tryToAcquire won't return a negative value ShuffleMemoryManager.tryToAcquire may return a negative value. The unit test demonstrates this bug. It will output `0 did not equal -200 granted is negative`. Author: zsxwing <zsxwing@gmail.com> Closes #3575 from zsxwing/SPARK-4715 and squashes the following commits: a193ae6 [zsxwing] Make sure tryToAcquire won't return a negative value 03 December 2014, 20:20:33 UTC
e484b8a [SPARK-4701] Typo in sbt/sbt Fixed a typo. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #3560 from tsudukim/feature/SPARK-4701 and squashes the following commits: ed2a3f1 [Masayoshi TSUZUKI] Another whitespace position error. 1af3a35 [Masayoshi TSUZUKI] [SPARK-4701] Typo in sbt/sbt (cherry picked from commit 96786e3ee53a13a57463b74bec0e77b172f719a3) Signed-off-by: Andrew Or <andrew@databricks.com> 03 December 2014, 20:08:26 UTC
aec20af [Release] Translate unknown author names automatically 03 December 2014, 00:37:33 UTC
f333e4f [SPARK-4686] Link to allowed master URLs is broken The link points to the old Scala programming guide; it should point to the submitting applications page. This should be backported to 1.1.2 (it has been broken since 1.0). Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #3542 from kayousterhout/SPARK-4686 and squashes the following commits: a8fc43b [Kay Ousterhout] [SPARK-4686] Link to allowed master URLs is broken (cherry picked from commit d9a148ba6a67a01e4bf77c35c41dd4cbc8918c82) Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com> 02 December 2014, 17:07:32 UTC
91eadd2 [DOC] Fixes formatting typo in SQL programming guide Author: Cheng Lian <lian@databricks.com> Closes #3498 from liancheng/fix-sql-doc-typo and squashes the following commits: 865ecd7 [Cheng Lian] Fixes formatting typo in SQL programming guide (cherry picked from commit 2a4d389f70b2066b1ac32b081bef44e61fefb03c) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 01 December 2014, 03:05:02 UTC
90d90b2 [HOTFIX] Fix build break in 1a2508b73f6d46a0faf7740b85a5c216c925c25a org.apache.spark.SPARK_VERSION is new in 1.2; in earlier versions, we have to use SparkContext.SPARK_VERSION. 30 November 2014, 19:48:46 UTC
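In code, the branch-1.1-compatible form amounts to the following sketch:

```scala
import org.apache.spark.SparkContext

object VersionCompat {
  // On branch-1.1 the version constant lives on the SparkContext companion
  // object; org.apache.spark.SPARK_VERSION only exists as of 1.2.
  val version: String = SparkContext.SPARK_VERSION
}
```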
1a2508b SPARK-2143 [WEB UI] Add Spark version to UI footer This PR adds the Spark version number to the UI footer; this is how it looks: ![screen shot 2014-11-21 at 22 58 40](https://cloud.githubusercontent.com/assets/822522/5157738/f4822094-7316-11e4-98f1-333a535fdcfa.png) Author: Sean Owen <sowen@cloudera.com> Closes #3410 from srowen/SPARK-2143 and squashes the following commits: e9b3a7a [Sean Owen] Add Spark version to footer 30 November 2014, 19:43:20 UTC
24b5c03 [SPARK-4597] Use proper exception and reset variable in Utils.createTempDir() `File.exists()` and `File.mkdirs()` only throw `SecurityException`, not `IOException`. In addition, when an exception is thrown, `dir` should be reset as well. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3449 from viirya/fix_createtempdir and squashes the following commits: 36cacbd [Liang-Chi Hsieh] Use proper exception and reset variable. (cherry picked from commit 49fe8797e64f10c574e0790b32a8c3fdc7e594a0) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 29 November 2014, 02:06:16 UTC
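A sketch of the corrected pattern, close in spirit but not identical to the actual Utils code:

```scala
import java.io.{File, IOException}
import java.util.UUID

object TempDirs {
  // mkdirs() signals failure via its Boolean result (or SecurityException),
  // not IOException, so the loop must re-generate `dir` on every failed try.
  def createTempDir(root: String = System.getProperty("java.io.tmpdir")): File = {
    val maxAttempts = 10
    var attempts = 0
    var dir: File = null
    while (dir == null) {
      attempts += 1
      if (attempts > maxAttempts) {
        throw new IOException(s"Failed to create a temp directory after $maxAttempts attempts")
      }
      try {
        dir = new File(root, "spark-" + UUID.randomUUID.toString)
        if (dir.exists() || !dir.mkdirs()) {
          dir = null // reset so the next iteration picks a fresh name
        }
      } catch {
        case _: SecurityException => dir = null // the exception exists()/mkdirs() actually throw
      }
    }
    dir
  }
}
```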
f8a4fd3 [BRANCH-1.1][SPARK-4626] Kill a task only if the executorId is (still) registered with the scheduler v1.1 backport for #3483 Author: roxchkplusony <roxchkplusony@gmail.com> Closes #3503 from roxchkplusony/bugfix/4626-1.1 and squashes the following commits: 234d350 [roxchkplusony] [SPARK-4626] Kill a task only if the executorId is (still) registered with the scheduler 28 November 2014, 08:34:41 UTC
a59c445 [Release] Automate generation of contributors list This commit provides a script that computes the contributors list by linking the github commits with JIRA issues. Automatically translating github usernames remains a TODO at this point. 27 November 2014, 07:19:30 UTC
1a7f414 [HOTFIX] Fixing broken build due to missing imports. 25 November 2014, 23:28:25 UTC
7aa592c [SPARK-4196][SPARK-4602][Streaming] Fix serialization issue in PairDStreamFunctions.saveAsNewAPIHadoopFiles Solves two JIRAs in one shot: - Makes the ForeachDStream created by saveAsNewAPIHadoopFiles serializable for checkpoints - Makes the default configuration object used by saveAsNewAPIHadoopFiles be Spark's Hadoop configuration Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #3457 from tdas/savefiles-fix and squashes the following commits: bb4729a [Tathagata Das] Same treatment for saveAsHadoopFiles b382ea9 [Tathagata Das] Fix serialization issue in PairDStreamFunctions.saveAsNewAPIHadoopFiles. (cherry picked from commit 8838ad7c135a585cde015dc38b5cb23314502dd9) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 25 November 2014, 22:42:59 UTC
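A hedged usage sketch of the fixed API; the host, port, and paths are illustrative. With this fix, the DStream created by the call survives checkpointing, and the configuration argument defaults to Spark's Hadoop configuration.

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream functions on 1.x

object SaveFilesDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("save-demo"), Seconds(10))
    ssc.checkpoint("hdfs:///tmp/checkpoints") // path is illustrative

    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .map { case (w, c) => (new Text(w), new IntWritable(c)) } // convert to Writables last
      .saveAsNewAPIHadoopFiles(
        "hdfs:///tmp/out/words", "txt", // filename prefix and suffix
        classOf[Text], classOf[IntWritable],
        classOf[TextOutputFormat[Text, IntWritable]])

    ssc.start()
    ssc.awaitTermination()
  }
}
```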
6371737 Update versions to 1.1.2-SNAPSHOT 24 November 2014, 22:29:40 UTC
1df1c1d [maven-release-plugin] prepare for next development iteration 19 November 2014, 20:11:02 UTC
3693ae5 [maven-release-plugin] prepare release v1.1.1-rc2 19 November 2014, 20:10:56 UTC
aa3c794 Update CHANGES.txt for 1.1.1-rc2 19 November 2014, 19:35:43 UTC
16bf5f3 [SPARK-4480] Avoid many small spills in external data structures (1.1) This is the branch-1.1 version of #3353. This requires a separate PR because the code in master has been refactored a little to eliminate duplicate code. I have tested this on a standalone cluster. The goal is to merge this into 1.1.1. Author: Andrew Or <andrew@databricks.com> Closes #3354 from andrewor14/avoid-small-spills-1.1 and squashes the following commits: f2e552c [Andrew Or] Fix tests 7012595 [Andrew Or] Avoid many small spills 19 November 2014, 18:45:42 UTC
e22a759 [SPARK-4380] Log more precise number of bytes spilled (1.1) This is the branch-1.1 version of #3243. Author: Andrew Or <andrew@databricks.com> Closes #3355 from andrewor14/spill-log-bytes-1.1 and squashes the following commits: 36ec152 [Andrew Or] Log more precise representation of bytes in spilling code 19 November 2014, 04:15:00 UTC
f9739b9 [SPARK-4468][SQL] Backports #3334 to branch-1.1 Author: Cheng Lian <lian@databricks.com> Closes #3338 from liancheng/spark-3334-for-1.1 and squashes the following commits: bd17512 [Cheng Lian] Backports #3334 to branch-1.1 19 November 2014, 01:40:24 UTC
ae9b1f6 [SPARK-4433] fix a race condition in zipWithIndex Spark hangs with the following code: ~~~ sc.parallelize(1 to 10).zipWithIndex.repartition(10).count() ~~~ This is because ZippedWithIndexRDD triggers a job in getPartitions, which causes a deadlock in DAGScheduler.getPreferredLocs (a synchronized method). The fix is to compute `startIndices` during construction. This should be applied to branch-1.0, branch-1.1, and branch-1.2. pwendell Author: Xiangrui Meng <meng@databricks.com> Closes #3291 from mengxr/SPARK-4433 and squashes the following commits: c284d9f [Xiangrui Meng] fix a race condition in zipWithIndex (cherry picked from commit bb46046154a438df4db30a0e1fd557bd3399ee7b) Signed-off-by: Xiangrui Meng <meng@databricks.com> 19 November 2014, 00:26:16 UTC
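A sketch of the fix's shape, with an illustrative class standing in for ZippedWithIndexRDD: the per-partition start offsets are derived eagerly at construction time, so getPartitions never has to submit a job while the scheduler lock is held.

```scala
// Illustrative, not the actual RDD: precompute where each partition's
// global index range begins.
class EagerStartIndices(partitionSizes: Array[Long]) {
  // scanLeft over sizes (3, 4, 2) yields start offsets (0, 3, 7)
  val startIndices: Array[Long] = partitionSizes.scanLeft(0L)(_ + _).init
}
```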
91b5fa8 [SPARK-4393] Fix memory leak in ConnectionManager ACK timeout TimerTasks; use HashedWheelTimer (For branch-1.1) This patch is intended to fix a subtle memory leak in ConnectionManager's ACK timeout TimerTasks: in the old code, each TimerTask held a reference to the message being sent, and a cancelled TimerTask is not necessarily garbage-collected until its scheduled run time. This caused huge buildups of messages that were not collected until their timeouts expired, leading to OOMs. This patch addresses the problem by capturing only the message ID in the TimerTask instead of the whole message, and by keeping a WeakReference to the promise in the TimerTask. I've also modified this code to use Netty's HashedWheelTimer, whose performance characteristics should be better for this use case. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3321 from sarutak/connection-manager-timeout-bugfix and squashes the following commits: 786af91 [Kousuke Saruta] Fixed memory leak issue of ConnectionManager 18 November 2014, 20:09:18 UTC
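An illustrative sketch of the leak-avoiding pattern, not the exact ConnectionManager code: the timer task closes over only the message ID, never the message payload, so a pending or cancelled task pins almost no memory while it sits in the timer wheel.

```scala
import java.util.concurrent.TimeUnit
import io.netty.util.{HashedWheelTimer, Timeout, TimerTask}

object AckTimeouts {
  private val ackTimer = new HashedWheelTimer()

  // Capture only the Long message ID in the closure; the callback looks the
  // message up (or ignores it) when the timeout actually fires.
  def scheduleAckTimeout(messageId: Long, timeoutSecs: Long)(onTimeout: Long => Unit): Timeout =
    ackTimer.newTimeout(new TimerTask {
      override def run(timeout: Timeout): Unit =
        if (!timeout.isCancelled) onTimeout(messageId)
    }, timeoutSecs, TimeUnit.SECONDS)
}
```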
aa9ebda [SPARK-4467] Partial fix for fetch failure in sort-based shuffle (1.1) This is the 1.1 version of #3302. There has been some refactoring in master, so we can't cherry-pick that PR. Author: Andrew Or <andrew@databricks.com> Closes #3330 from andrewor14/sort-fetch-fail and squashes the following commits: 486fc49 [Andrew Or] Reset `elementsRead` 18 November 2014, 02:10:49 UTC
e4f5695 Revert "[maven-release-plugin] prepare release v1.1.1-rc1" This reverts commit 72a4fdbe82203b962fe776d0edaed7f56898cb02. 17 November 2014, 19:49:48 UTC
cf8d0ef Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit 685bdd2b7e584c84e7d39e40de2d5f30c5388cb5. 17 November 2014, 19:49:33 UTC
b528367 Revert "[SPARK-4075] [Deploy] Jar url validation is not enough for Jar file" This reverts commit 098f83c7ccd7dad9f9228596da69fe5f55711a52. 17 November 2014, 19:25:38 UTC