https://github.com/apache/spark

Revision | Message | Commit Date
b6eaf77 Preparing Spark release v1.2.1-rc3 03 February 2015, 00:39:27 UTC
a64c7a8 Revert "Preparing Spark release v1.2.1-rc2" This reverts commit b77f87673d1f9f03d4c83cf583158227c551359b. 03 February 2015, 00:38:44 UTC
d944c0b Revert "Preparing development version 1.2.2-SNAPSHOT" This reverts commit 0a16abadc59082b7d3a24d7f3625236658632813. 03 February 2015, 00:38:42 UTC
88e0f2d Revert "[SPARK-5195][sql]Update HiveMetastoreCatalog.scala(override the MetastoreRelation's sameresult method only compare databasename and table name)" This reverts commit 54864403c4f132d9c1380c015122a849dd44dff8. 03 February 2015, 00:33:46 UTC
5486440 [SPARK-5195][sql]Update HiveMetastoreCatalog.scala(override the MetastoreRelation's sameresult method only compare databasename and table name) Override the MetastoreRelation's sameResult method so that it compares only the database name and table name. Previously, after `cache table t1; select count(*) from t1;` the count was read from memory, but `select count(*) from t1 t` read from HDFS instead: cached data is keyed by the logical plan and matched with sameResult, so the logical plan of an aliased table was not the same as the plan without the alias. Restricting sameResult to the database name and table name fixes this. Author: seayi <405078363@qq.com> Author: Michael Armbrust <michael@databricks.com> Closes #3898 from seayi/branch-1.2 and squashes the following commits: 8f0c7d2 [seayi] Update CachedTableSuite.scala a277120 [seayi] Update HiveMetastoreCatalog.scala 8d910aa [seayi] Update HiveMetastoreCatalog.scala 03 February 2015, 00:09:10 UTC
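The cache-miss scenario described in that entry, in spark-shell terms (a hedged illustration; assumes a Spark 1.2 `HiveContext` bound to `hiveCtx` and an existing Hive table `t1`):
```scala
hiveCtx.sql("CACHE TABLE t1")
hiveCtx.sql("SELECT count(*) FROM t1").collect()   // served from the in-memory cache
hiveCtx.sql("SELECT count(*) FROM t1 t").collect() // before the fix: read from HDFS,
// because the aliased plan failed the sameResult comparison against the cached plan
```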
b978c9f Disabling Utils.chmod700 for Windows This patch makes Spark 1.2.1rc2 work again on Windows. Without it, you get the following log output when creating a Spark context:
```
INFO org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in .... Ignoring this directory.
ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any local dir.
```
Author: Martin Weindel <martin.weindel@gmail.com> Author: mweindel <m.weindel@usu-software.de> Closes #4299 from MartinWeindel/branch-1.2 and squashes the following commits: 535cb7f [Martin Weindel] fixed last commit f17072e [Martin Weindel] moved condition to caller to avoid confusion on chmod700() return value 4de5e91 [Martin Weindel] reverted to unix line ends fe2740b [mweindel] moved comment ac4749c [mweindel] fixed chmod700 for Windows 02 February 2015, 21:46:18 UTC
00746a5 [Docs] Fix Building Spark link text Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #4312 from nchammas/patch-2 and squashes the following commits: 9d943aa [Nicholas Chammas] [Docs] Fix Building Spark link text (cherry picked from commit 3f941b68a2336aa7876aeda99865e7c19b53bc5c) Signed-off-by: Andrew Or <andrew@databricks.com> 02 February 2015, 20:33:56 UTC
0a16aba Preparing development version 1.2.2-SNAPSHOT 28 January 2015, 07:48:55 UTC
b77f876 Preparing Spark release v1.2.1-rc2 28 January 2015, 07:48:55 UTC
4026bba Revert "Preparing Spark release v1.2.1-rc1" This reverts commit 3e2d7d310b76c293b9ac787f204e6880f508f6ec. 28 January 2015, 07:47:00 UTC
063a4c5 Revert "Preparing development version 1.2.2-SNAPSHOT" This reverts commit f53a4319ba5f0843c077e64ae5a41e2fac835a5b. 28 January 2015, 07:46:57 UTC
fea9b43 [MLlib] fix python example of ALS in guide fix python example of ALS in guide, use Rating instead of np.array. Author: Davies Liu <davies@databricks.com> Closes #4226 from davies/fix_als_guide and squashes the following commits: 1433d76 [Davies Liu] fix python example of als in guide (cherry picked from commit fdaad4eb0388cfe43b5b6600927eb7b9182646f9) Signed-off-by: Xiangrui Meng <meng@databricks.com> 27 January 2015, 23:33:15 UTC
8090448 SPARK-5308 [BUILD] MD5 / SHA1 hash format doesn't match standard Maven output Here's one way to make the hashes match what Maven's plugins would create. It takes a little extra footwork since OS X doesn't have the same command line tools. An alternative is just to make Maven output these of course - would that be better? I ask in case there is a reason I'm missing, like, we need to hash files that Maven doesn't build. Author: Sean Owen <sowen@cloudera.com> Closes #4161 from srowen/SPARK-5308 and squashes the following commits: 70d09d0 [Sean Owen] Use $(...) syntax e25eff8 [Sean Owen] Generate MD5, SHA1 hashes in a format like Maven's plugin (cherry picked from commit ff356e2a21e31998cda3062e560a276a3bfaa7ab) Signed-off-by: Patrick Wendell <patrick@databricks.com> 27 January 2015, 18:22:57 UTC
f53a431 Preparing development version 1.2.2-SNAPSHOT 27 January 2015, 01:07:29 UTC
3e2d7d3 Preparing Spark release v1.2.1-rc1 27 January 2015, 01:07:29 UTC
8c46100 Revert "Preparing Spark release v1.2.1-rc1" This reverts commit e87eb2b42f137c22194cfbca2abf06fecdf943da. 27 January 2015, 01:06:22 UTC
e8da342 Revert "Preparing development version 1.2.2-SNAPSHOT" This reverts commit adfed7086f10fa8db4eeac7996c84cf98f625e9a. 27 January 2015, 01:06:19 UTC
adfed70 Preparing development version 1.2.2-SNAPSHOT 27 January 2015, 00:12:04 UTC
e87eb2b Preparing Spark release v1.2.1-rc1 27 January 2015, 00:12:04 UTC
07c0fd1 Updating versions for Spark 1.2.1 27 January 2015, 00:09:22 UTC
b378e9a SPARK-4147 [CORE] Reduce log4j dependency Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework. The only change is to push one half of the check in the original `if` condition inside. This is a trivial change, may or may not actually solve a problem, but I think it's all that makes sense to do for SPARK-4147. Author: Sean Owen <sowen@cloudera.com> Closes #4190 from srowen/SPARK-4147 and squashes the following commits: 4e99942 [Sean Owen] Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework. (cherry picked from commit 54e7b456dd56c9e52132154e699abca87563465b) Signed-off-by: Patrick Wendell <patrick@databricks.com> 26 January 2015, 22:23:56 UTC
ef6fe84 [SPARK-5355] use j.u.c.ConcurrentHashMap instead of TrieMap j.u.c.ConcurrentHashMap is more battle tested. cc rxin JoshRosen pwendell Author: Davies Liu <davies@databricks.com> Closes #4208 from davies/safe-conf and squashes the following commits: c2182dc [Davies Liu] address comments, fix tests 3a1d821 [Davies Liu] fix test da14ced [Davies Liu] Merge branch 'master' of github.com:apache/spark into safe-conf ae4d305 [Davies Liu] change to j.u.c.ConcurrentMap f8fa1cf [Davies Liu] change to TrieMap a1d769a [Davies Liu] make SparkConf thread-safe (cherry picked from commit 142093179a4c40bdd90744191034de7b94a963ff) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 26 January 2015, 21:22:17 UTC
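A minimal sketch of the data-structure swap this entry describes, assuming a map-backed config (the class and method names here are illustrative, not Spark's actual SparkConf internals):
```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

class SafeConf {
  // j.u.c.ConcurrentHashMap tolerates concurrent reads and writes without
  // external locking, unlike a plain mutable.HashMap.
  private val settings = new ConcurrentHashMap[String, String]()
  def set(key: String, value: String): Unit = settings.put(key, value)
  def get(key: String): Option[String] = Option(settings.get(key))
  // getAll sees a weakly consistent snapshot rather than a torn read.
  def getAll: Array[(String, String)] =
    settings.entrySet.asScala.map(e => (e.getKey, e.getValue)).toArray
}
```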
cf65620 SPARK-4430 [STREAMING] [TEST] Apache RAT Checks fail spuriously on test files Another trivial one. The RAT failure was due to temp files from `FailureSuite` not being cleaned up. This just makes the cleanup more reliable by using the standard temp dir mechanism. Author: Sean Owen <sowen@cloudera.com> Closes #4189 from srowen/SPARK-4430 and squashes the following commits: 9ea63ff [Sean Owen] Properly acquire a temp directory to ensure it is cleaned up at shutdown, which helps avoid a RAT check failure (cherry picked from commit 0528b85cf96f9c9c074b5fbb5b9c5dd8071c0bc7) Signed-off-by: Andrew Or <andrew@databricks.com> 26 January 2015, 03:16:50 UTC
2a2da42 Revert "[SPARK-5344][WebUI] HistoryServer cannot recognize that inprogress file was renamed to completed file" This reverts commit 8f55beeb51e6ea72e63af3f276497f61dd24d09b. 25 January 2015, 23:39:18 UTC
8f55bee [SPARK-5344][WebUI] HistoryServer cannot recognize that inprogress file was renamed to completed file `FsHistoryProvider` tries to update the application status, but if `checkForLogs` is called before the `.inprogress` file is renamed to the completed file, the file is not recognized as completed. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #4132 from sarutak/SPARK-5344 and squashes the following commits: 9658008 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5344 d2c72b6 [Kousuke Saruta] Fixed update issue of FsHistoryProvider (cherry picked from commit 8f5c827b01026bf45fc774ed7387f11a941abea8) Signed-off-by: Andrew Or <andrew@databricks.com> Conflicts: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala 25 January 2015, 23:36:24 UTC
f34c113 SPARK-4506 [DOCS] Addendum: Update more docs to reflect that standalone works in cluster mode This is a trivial addendum to SPARK-4506, which was already resolved. noted by Asim Jalis in SPARK-4506. Author: Sean Owen <sowen@cloudera.com> Closes #4160 from srowen/SPARK-4506 and squashes the following commits: 5f5f7df [Sean Owen] Update more docs to reflect that standalone works in cluster mode (cherry picked from commit 9f6435763d173d2abf82d16b5878983fa8bf3419) Signed-off-by: Andrew Or <andrew@databricks.com> 25 January 2015, 23:25:16 UTC
e82e960 SPARK-5382: Use SPARK_CONF_DIR in spark-class and spark-submit, spark-submit2.cmd if it is defined Author: Jacek Lewandowski <lewandowski.jacek@gmail.com> Closes #4177 from jacek-lewandowski/SPARK-5382-1.2 and squashes the following commits: 41cef25 [Jacek Lewandowski] SPARK-5382: Use SPARK_CONF_DIR in spark-class and spark-submit, spark-submit2.cmd if it is defined 25 January 2015, 23:20:50 UTC
7652809 SPARK-5382: Use SPARK_CONF_DIR in spark-class if it is defined Author: Jacek Lewandowski <lewandowski.jacek@gmail.com> Closes #4179 from jacek-lewandowski/SPARK-5382-1.3 and squashes the following commits: 55d7791 [Jacek Lewandowski] SPARK-5382: Use SPARK_CONF_DIR in spark-class if it is defined 25 January 2015, 23:18:20 UTC
a7e99ed SPARK-3852 [DOCS] Document spark.driver.extra* configs As per the JIRA. I copied the `spark.executor.extra*` text, but removed info that appears to be specific to the `executor` config and not `driver`. Author: Sean Owen <sowen@cloudera.com> Closes #4185 from srowen/SPARK-3852 and squashes the following commits: f60a8a1 [Sean Owen] Document spark.driver.extra* configs (cherry picked from commit c586b45dd25b50be7f195df2ce91b307e1ed71a9) Signed-off-by: Andrew Or <andrew@databricks.com> 25 January 2015, 23:08:41 UTC
c573af4 [SPARK-5402] log executor ID at executor-construction time also rename "slaveHostname" to "executorHostname" Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #4195 from ryan-williams/exec and squashes the following commits: e60a7bb [Ryan Williams] log executor ID at executor-construction time (cherry picked from commit aea25482c370fbcf712a464501605bc16ee4ed5d) Signed-off-by: Andrew Or <andrew@databricks.com> Conflicts: core/src/main/scala/org/apache/spark/executor/Executor.scala 25 January 2015, 22:21:21 UTC
1f8b718 [SPARK-5401] set executor ID before creating MetricsSystem Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #4194 from ryan-williams/metrics and squashes the following commits: 7c5a33f [Ryan Williams] set executor ID before creating MetricsSystem 25 January 2015, 22:19:09 UTC
ff2d7bd [SPARK-5058] Part 2. Typos and broken URL - Also fixed java link Author: Jongyoul Lee <jongyoul@gmail.com> Closes #4172 from jongyoul/SPARK-FIXDOC and squashes the following commits: 6be03e5 [Jongyoul Lee] [SPARK-5058] Part 2. Typos and broken URL - Also fixed java link (cherry picked from commit 09e09c548e7722fca1cdc89bd37de2cee58f4ce9) Signed-off-by: Reynold Xin <rxin@databricks.com> 24 January 2015, 07:34:21 UTC
73cb806 [SPARK-5351][GraphX] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImp... If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl), the following error occurs at ReplicatedVertexView.scala:72:
```
object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages(
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}

val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
```
```
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
  at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
  ...
```
Author: Takeshi Yamamuro <linguin.m.s@gmail.com> Closes #4136 from maropu/EdgePartitionBugFix and squashes the following commits: 0cd8942 [Ankur Dave] Use more concise getOrElse aad4a2c [Ankur Dave] Add unit test for non-default number of edge partitions 0a2f32b [Takeshi Yamamuro] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImpl (cherry picked from commit e224dbb011789297cd6c6ba095f702c042869ed6) Signed-off-by: Ankur Dave <ankurdave@gmail.com> 24 January 2015, 03:29:42 UTC
2ea782a [SPARK-5063] More helpful error messages for several invalid operations This patch adds more helpful error messages for invalid programs that define nested RDDs, broadcast RDDs, perform actions inside of transformations (e.g. calling `count()` from inside of `map()`), and call certain methods on stopped SparkContexts. Currently, these invalid programs lead to confusing NullPointerExceptions at runtime and have been a major source of questions on the mailing list and StackOverflow. In a few cases, I chose to log warnings instead of throwing exceptions in order to avoid any chance that this patch breaks programs that worked "by accident" in earlier Spark releases (e.g. programs that define nested RDDs but never run any jobs with them). In SparkContext, the new `assertNotStopped()` method is used to check whether methods are being invoked on a stopped SparkContext. In some cases, user programs will not crash in spite of calling methods on stopped SparkContexts, so I've only added `assertNotStopped()` calls to methods that always throw exceptions when called on stopped contexts (e.g. by dereferencing a null `dagScheduler` pointer). Author: Josh Rosen <joshrosen@databricks.com> Closes #3884 from JoshRosen/SPARK-5063 and squashes the following commits: a38774b [Josh Rosen] Fix spelling typo a943e00 [Josh Rosen] Convert two exceptions into warnings in order to avoid breaking user programs in some edge-cases. 2d0d7f7 [Josh Rosen] Fix test to reflect 1.2.1 compatibility 3f0ea0c [Josh Rosen] Revert two unintentional formatting changes 8e5da69 [Josh Rosen] Remove assertNotStopped() calls for methods that were sometimes safe to call on stopped SC's in Spark 1.2 8cff41a [Josh Rosen] IllegalStateException fix 6ef68d0 [Josh Rosen] Fix Python line length issues. 9f6a0b8 [Josh Rosen] Add improved error messages to PySpark. 13afd0f [Josh Rosen] SparkException -> IllegalStateException 8d404f3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-5063 b39e041 [Josh Rosen] Fix BroadcastSuite test which broadcasted an RDD 99cc09f [Josh Rosen] Guard against calling methods on stopped SparkContexts. 34833e8 [Josh Rosen] Add more descriptive error message. 57cc8a1 [Josh Rosen] Add error message when directly broadcasting RDD. 15b2e6b [Josh Rosen] [SPARK-5063] Useful error messages for nested RDDs and actions inside of transformations (cherry picked from commit cef1f092a628ac20709857b4388bb10e0b5143b0) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 24 January 2015, 01:53:47 UTC
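One of the invalid programs this patch targets, for illustration (assumes a spark-shell `sc`): referencing an RDD from inside a transformation ships it to executors, where its scheduler pointers are null.
```scala
val rdd = sc.parallelize(1 to 100)
// Invalid: calling an action on an RDD inside another RDD's transformation.
// Previously this died with an opaque NullPointerException on the executors;
// with this patch it fails with a descriptive exception instead.
rdd.map { x => rdd.count() + x }.collect()
```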
5aaf0e0 [SPARK-5233][Streaming] Fix error replaying of WAL introduced bug Because of lacking of `BlockAllocationEvent` in WAL recovery, the dangled event will mix into the new batch, which will lead to the wrong result. Details can be seen in [SPARK-5233](https://issues.apache.org/jira/browse/SPARK-5233). Author: jerryshao <saisai.shao@intel.com> Closes #4032 from jerryshao/SPARK-5233 and squashes the following commits: f0b0c0b [jerryshao] Further address the comments a237c75 [jerryshao] Address the comments e356258 [jerryshao] Fix bug in unit test 558bdc3 [jerryshao] Correctly replay the WAL log when recovering from failure (cherry picked from commit 3c3fa632e6ba45ce536065aa1145698385301fb2) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 23 January 2015, 05:59:22 UTC
5d07488 [HOTFIX] Fixed compilation error due to missing SparkContext._ implicit conversions. 22 January 2015, 09:38:09 UTC
cab410c [SPARK-5147][Streaming] Delete the received data WAL log periodically This is a refactored fix based on jerryshao 's PR #4037 This enabled deletion of old WAL files containing the received block data. Improvements over #4037 - Respecting the rememberDuration of all receiver streams. In #4037, if there were two receiver streams with multiple remember durations, the deletion would have delete based on the shortest remember duration, thus deleting data prematurely for the receiver stream with longer remember duration. - Added unit test to test creation of receiver WAL, automatic deletion, and respecting of remember duration. jerryshao I am going to merge this ASAP to make it 1.2.1 Thanks for the initial draft of this PR. Made my job much easier. Author: Tathagata Das <tathagata.das1565@gmail.com> Author: jerryshao <saisai.shao@intel.com> Closes #4149 from tdas/SPARK-5147 and squashes the following commits: 730798b [Tathagata Das] Added comments. c4cf067 [Tathagata Das] Minor fixes 2579b27 [Tathagata Das] Refactored the fix to make sure that the cleanup respects the remember duration of all the receiver streams 2736fd1 [jerryshao] Delete the old WAL log periodically (cherry picked from commit 3027f06b4127ab23a43c5ce8cebf721e3b6766e5) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 22 January 2015, 07:42:04 UTC
dd18429 [SPARK-5355] make SparkConf thread-safe The SparkConf is not thread-safe, but is accessed by many threads. getAll() could return only part of the configs if another thread is modifying it concurrently. This PR changes SparkConf.settings to a thread-safe TrieMap. Author: Davies Liu <davies@databricks.com> Closes #4143 from davies/safe-conf and squashes the following commits: f8fa1cf [Davies Liu] change to TrieMap a1d769a [Davies Liu] make SparkConf thread-safe (cherry picked from commit 9bad062268676aaa66dcbddd1e0ab7f2d7742425) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 22 January 2015, 00:52:46 UTC
079b3be Make sure only owner can read / write to directories created for the job. Whenever a directory is created by the utility method, immediately restrict its permissions so that only the owner has access to its contents. Signed-off-by: Josh Rosen <joshrosen@databricks.com> 21 January 2015, 22:38:14 UTC
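A hedged sketch of what "owner-only" directory permissions amount to with the java.io.File API, close in spirit to Utils.chmod700 (the exact Spark code may differ):
```scala
import java.io.File

def ownerOnly(dir: File): Boolean = {
  // Clear each permission for everyone, then grant it back to the owner only.
  dir.setReadable(false, false) && dir.setReadable(true, true) &&
  dir.setWritable(false, false) && dir.setWritable(true, true) &&
  dir.setExecutable(false, false) && dir.setExecutable(true, true)
}
// Note: these java.io.File calls are POSIX-oriented; as the b978c9f entry
// above notes, the check had to be skipped on Windows.
```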
bb8bd11 [SPARK-5006][Deploy]spark.port.maxRetries doesn't work https://issues.apache.org/jira/browse/SPARK-5006 I think the issue was introduced in https://github.com/apache/spark/pull/1777. I haven't dug into the Mesos backend yet; the same logic may need to be added there as well. Author: WangTaoTheTonic <barneystinson@aliyun.com> Author: WangTao <barneystinson@aliyun.com> Closes #3841 from WangTaoTheTonic/SPARK-5006 and squashes the following commits: 8cdf96d [WangTao] indent thing 2d86d65 [WangTaoTheTonic] fix line length 7cdfd98 [WangTaoTheTonic] fit for new HttpServer constructor 61a370d [WangTaoTheTonic] some minor fixes bc6e1ec [WangTaoTheTonic] rebase 67bcb46 [WangTaoTheTonic] put conf at 3rd position, modify suite class, add comments f450cd1 [WangTaoTheTonic] startServiceOnPort will use a SparkConf arg 29b751b [WangTaoTheTonic] rebase as ExecutorRunnableUtil changed to ExecutorRunnable 396c226 [WangTaoTheTonic] make the grammar more like scala 191face [WangTaoTheTonic] invalid value name 62ec336 [WangTaoTheTonic] spark.port.maxRetries doesn't work Conflicts: external/mqtt/src/test/scala/org/apache/spark/streaming/mqtt/MQTTStreamSuite.scala 21 January 2015, 21:05:09 UTC
37db20c [SPARK-5064][GraphX] Add numEdges upperbound validation for R-MAT graph generator to prevent infinite loop I looked into GraphGenerators#chooseCell and found that chooseCell can't generate more edges than pow(2, (2 * (log2(numVertices)-1))) when making a power-law graph. (Ex. numVertices:4 upperbound:4, numVertices:8 upperbound:16, numVertices:16 upperbound:64) If we request more edges than this upper bound, rmatGraph falls into an infinite loop. So, how about adding an argument validation? Author: Kenji Kikushima <kikushima.kenji@lab.ntt.co.jp> Closes #3950 from kj-ki/SPARK-5064 and squashes the following commits: 4ee18c7 [Ankur Dave] Reword error message and add unit test d760bc7 [Kenji Kikushima] Add numEdges upperbound validation for R-MAT graph generator to prevent infinite loop. (cherry picked from commit 3ee3ab592eee831d759c940eb68231817ad6d083) Signed-off-by: Ankur Dave <ankurdave@gmail.com> 21 January 2015, 20:38:09 UTC
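A sketch of the argument validation proposed above, using the upper bound quoted from chooseCell (the helper name and message are illustrative, not the merged code):
```scala
def requireValidNumEdges(numVertices: Int, requestedEdges: Long): Unit = {
  val level = (math.log(numVertices) / math.log(2)).toInt // log2(numVertices)
  val maxEdges = math.pow(2, 2 * (level - 1)).toLong      // e.g. 4 vertices -> 4 edges
  require(requestedEdges <= maxEdges,
    s"R-MAT can generate at most $maxEdges edges for $numVertices vertices")
}
```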
e90f6b5 [SPARK-4161]Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf Author: GuoQiang Li <witgo@qq.com> Closes #3050 from witgo/SPARK-4161 and squashes the following commits: abb6fa4 [GuoQiang Li] move usejavacp opt to spark-shell 89e39e7 [GuoQiang Li] review commit c2a6f04 [GuoQiang Li] Spark shell class path is not correctly set if "spark.driver.extraClassPath" is set in defaults.conf 21 January 2015, 18:57:15 UTC
1d73017 [SPARK-4569] Rename 'externalSorting' in Aggregator Hi all - I've renamed the unhelpfully named variable and added a comment clarifying what's actually happening. Author: Ilya Ganelin <ilya.ganelin@capitalone.com> Closes #3666 from ilganeli/SPARK-4569B and squashes the following commits: 1810394 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator e2d2092 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator d7cefec [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator 5b3f39c [Ilya Ganelin] [SPARK-4569] Rename in Aggregator 21 January 2015, 18:50:38 UTC
0c13eed [SPARK-4759] Fix driver hanging from coalescing partitions The driver hangs sometimes when we coalesce RDD partitions. See JIRA for more details and reproduction. This is because our use of empty string as default preferred location in `CoalescedRDDPartition` causes the `TaskSetManager` to schedule the corresponding task on host `""` (empty string). The intended semantics here, however, is that the partition does not have a preferred location, and the TSM should schedule the corresponding task accordingly. Author: Andrew Or <andrew@databricks.com> Closes #3633 from andrewor14/coalesce-preferred-loc and squashes the following commits: e520d6b [Andrew Or] Oops 3ebf8bd [Andrew Or] A few comments f370a4e [Andrew Or] Fix tests 2f7dfb6 [Andrew Or] Avoid using empty string as default preferred location (cherry picked from commit 4f93d0cabe5d1fc7c0fd0a33d992fd85df1fecb4) Signed-off-by: Andrew Or <andrew@databricks.com> 21 January 2015, 18:42:35 UTC
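A sketch of that change's intent (the real CoalescedRDDPartition has more fields; this only shows the representation change): model "no preferred location" as None rather than the empty string.
```scala
case class CoalescedPartitionSketch(index: Int, preferredLocation: Option[String]) {
  // The scheduler sees an empty list instead of the bogus host "".
  def preferredLocations: Seq[String] = preferredLocation.toSeq
}
```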
fd6266f [HOTFIX] Update pom.xml to pull MapR's Hadoop version 2.4.1. Author: Kannan Rajah <rkannan82@gmail.com> Closes #4108 from rkannan82/master and squashes the following commits: eca095b [Kannan Rajah] Update pom.xml to pull MapR's Hadoop version 2.4.1. (cherry picked from commit ec5b0f2cef4b30047c7f88bdc00d10b6aa308124) Signed-off-by: Patrick Wendell <patrick@databricks.com> 21 January 2015, 07:34:34 UTC
410b908 [SPARK-5275] [Streaming] include python source code Include the python source code into assembly jar. cc mengxr pwendell Author: Davies Liu <davies@databricks.com> Closes #4128 from davies/build_streaming2 and squashes the following commits: 546af4c [Davies Liu] fix indent 48859b2 [Davies Liu] include python source code (cherry picked from commit bad6c5721167153d7ed834b49f87bf2980c6ed67) Signed-off-by: Patrick Wendell <patrick@databricks.com> 21 January 2015, 06:45:34 UTC
92c238c [SPARK-4959][SQL] Attributes are case sensitive when using a select query from a projection(Backport to Spark-1.2) This is a follow up of #3796 , which can not be merged back to Spark-1.2. Manually merge it. Author: Cheng Hao <hao.cheng@intel.com> Closes #4013 from chenghao-intel/spark_4959_backport and squashes the following commits: 1f6c93d [Cheng Hao] backport to Spark-1.2 21 January 2015, 06:41:38 UTC
692dc5b SPARK-4660: Use correct class loader in JavaSerializer (copy of PR #3840 by Piotr Kolaczkowski) Author: Jacek Lewandowski <lewandowski.jacek@gmail.com> Closes #4113 from jacek-lewandowski/SPARK-4660-master and squashes the following commits: a5e84ca [Jacek Lewandowski] SPARK-4660: Use correct class loader in JavaSerializer (copy of PR #3840 by Piotr Kolaczkowski) (cherry picked from commit c93a57f0d6dc32b127aa68dbe4092ab0b22a9667) Signed-off-by: Patrick Wendell <patrick@databricks.com> 20 January 2015, 20:38:09 UTC
228bf6c [SPARK-4803] [streaming] Remove duplicate RegisterReceiver message The ReceiverTracker receives `RegisterReceiver` messages twice: 1) when the actor at `ReceiverSupervisorImpl`'s preStart is invoked, and 2) after the receiver is started at the executor, via `onReceiverStart()` at `ReceiverSupervisorImpl`. Although the RegisterReceiver message uses the same streamId and the receiverInfo gets updated every time the message is processed at the `ReceiverTracker`, it makes sense to register the receiver only after it has started. Author: Ilayaperumal Gopinathan <igopinathan@pivotal.io> Closes #3648 from ilayaperumalg/RTActor-remove-prestart and squashes the following commits: 868efab [Ilayaperumal Gopinathan] Increase receiverInfo collector timeout to 2 secs 3118e5e [Ilayaperumal Gopinathan] Fix StreamingListenerSuite's startedReceiverStreamIds size 634abde [Ilayaperumal Gopinathan] Remove duplicate RegisterReceiver message (cherry picked from commit 4afad9c7702239f6d5b1b49dc48ee08580964e17) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 20 January 2015, 09:46:58 UTC
6599f50 [SPARK-4504][Examples] fix run-example failure if multiple assembly jars exist Fix run-example script to fail fast with useful error message if multiple example assembly JARs are present. Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com> Closes #3377 from gvramana/run-example_fails and squashes the following commits: fa7f481 [Venkata Ramana Gollamudi] Fixed review comments, avoiding ls output scanning. 6aa1ab7 [Venkata Ramana Gollamudi] Fix run-examples script error during multiple jars (cherry picked from commit 74de94ea6db96a04b278c6106264313504d7b8f3) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: bin/compute-classpath.sh 19 January 2015, 20:03:55 UTC
5d5ee40 [SPARK-5282][mllib]: RowMatrix easily gets int overflow in the memory size warning JIRA: https://issues.apache.org/jira/browse/SPARK-5282 fix the possible int overflow in the memory computation warning Author: Yuhao Yang <hhbyyh@gmail.com> Closes #4069 from hhbyyh/addscStop and squashes the following commits: e54e5c8 [Yuhao Yang] change to MB based number 7afac23 [Yuhao Yang] 5282: fix int overflow in the warning (cherry picked from commit 4432568aac1d4a44fa1a7c3469f095eb7a6ce945) Signed-off-by: Xiangrui Meng <meng@databricks.com> 19 January 2015, 18:10:22 UTC
94bafd8 [SPARK-5289]: Backport publishing of repl, yarn into branch-1.2. This change was done in SPARK-4048 as part of a larger refactoring, but we need to backport this publishing of yarn and repl into Spark 1.2, so that we can cut a 1.2.1 release with these artifacts. Author: Patrick Wendell <patrick@databricks.com> Closes #4079 from pwendell/skip-deps and squashes the following commits: 807b833 [Patrick Wendell] [SPARK-5289]: Backport publishing of repl, yarn into branch-1.2. 18 January 2015, 01:05:04 UTC
4a550ac [SPARK-733] Add documentation on use of accumulators in lazy transformation I've added documentation clarifying the particular lack of clarity highlighted in the relevant JIRA. I've also added code examples for this issue to clarify the explanation. Author: Ilya Ganelin <ilya.ganelin@capitalone.com> Closes #4022 from ilganeli/SPARK-733 and squashes the following commits: 587def5 [Ilya Ganelin] Updated to clarify verbage df3afd7 [Ilya Ganelin] Revert "Partially updated task metrics to make some vars private" 3f6c512 [Ilya Ganelin] Revert "Completed refactoring to make vars in TaskMetrics class private" 58034fb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733 4dc2cdb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733 3a38db1 [Ilya Ganelin] Verified documentation update by building via jekyll 33b5a2d [Ilya Ganelin] Added code examples for java and python 1fd59b2 [Ilya Ganelin] Updated documentation for accumulators to highlight lazy evaluation issue 5525c20 [Ilya Ganelin] Completed refactoring to make vars in TaskMetrics class private c64da4f [Ilya Ganelin] Partially updated task metrics to make some vars private (cherry picked from commit fd3a8a1d15ad516ea056089e30d6fd14e2f2d9a1) Signed-off-by: Imran Rashid <irashid@cloudera.com> 16 January 2015, 21:25:47 UTC
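The pitfall those docs call out, in spark-shell terms (assumes `sc`; Spark 1.2's accumulator API):
```scala
val acc = sc.accumulator(0)
val mapped = sc.parallelize(1 to 10).map { x => acc += x; x }
// Nothing has run yet: transformations are lazy, so the accumulator is untouched.
println(acc.value) // 0
mapped.count()     // the action forces evaluation...
println(acc.value) // ...and only now is it 55
```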
473777e [DOCS] Fix typo in return type of cogroup This fixes a simple typo in the cogroup docs noted in http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAMAsSdJ8_24evMAMg7fOZCQjwimisbYWa9v8BN6Rc3JCauja6wmail.gmail.com%3E I didn't bother with a JIRA Author: Sean Owen <sowen@cloudera.com> Closes #4072 from srowen/CogroupDocFix and squashes the following commits: 43c850b [Sean Owen] Fix typo in return type of cogroup (cherry picked from commit f6b852aade7668c99f37c69f606c64763cb265d2) Signed-off-by: Andrew Or <andrew@databricks.com> 16 January 2015, 17:28:49 UTC
e38cb29 [SPARK-5201][CORE] deal with int overflow in the ParallelCollectionRDD.slice method There is an int overflow in the ParallelCollectionRDD.slice method, originally reported by SaintBacchus:
```
sc.makeRDD(1 to (Int.MaxValue)).count       // result = 0
sc.makeRDD(1 to (Int.MaxValue - 1)).count   // result = 2147483646 = Int.MaxValue - 1
sc.makeRDD(1 until (Int.MaxValue)).count    // result = 2147483646 = Int.MaxValue - 1
```
See https://github.com/apache/spark/pull/2874 for more details. This PR tries to fix the overflow. However, there's another issue it does not address:
```
val largeRange = Int.MinValue to Int.MaxValue
largeRange.length // throws java.lang.IllegalArgumentException: -2147483648 to 2147483647 by 1: seqs cannot contain more than Int.MaxValue elements.
```
So the range we feed to sc.makeRDD cannot contain more than Int.MaxValue elements; this is a limitation of Scala. We may want to support that kind of range, but the fix is beyond this PR. srowen andrewor14 would you mind taking a look at this PR? Author: Ye Xianjin <advancedxy@gmail.com> Closes #4002 from advancedxy/SPARk-5201 and squashes the following commits: 96265a1 [Ye Xianjin] Update slice method comment and some responding docs. e143d7a [Ye Xianjin] Update inclusive range check for splitting inclusive range. b3f5577 [Ye Xianjin] We can include the last element in the last slice in general for inclusive range, hence eliminate the need to check Int.MaxValue or Int.MinValue. 7d39b9e [Ye Xianjin] Convert the two cases pattern matching to one case. 651c959 [Ye Xianjin] rename sign to needsInclusiveRange. add some comments 196f8a8 [Ye Xianjin] Add test cases for ranges end with Int.MaxValue or Int.MinValue e66e60a [Ye Xianjin] Deal with inclusive and exclusive ranges in one case. If the range is inclusive and the end of the range is (Int.MaxValue or Int.MinValue), we should use inclusive range instead of exclusive 16 January 2015, 17:24:44 UTC
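A hedged sketch of the overflow-safe slicing idea: do the boundary arithmetic in Long space so `i * length` never wraps, then convert back to Int positions.
```scala
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] =
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt     // i * length promoted to Long
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)
  }
```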
89a0990 [SPARK-4033][Examples]Input of the SparkPi too big causes the emption exception If the SparkPi input argument is larger than 25000, the integer 'n' inside the code overflows and may become negative. This makes the (0 until n) Seq empty, so the 'reduce' action throws UnsupportedOperationException("empty collection"). The max size of the input to sc.parallelize is Int.MaxValue - 1, not Int.MaxValue. Author: huangzhaowei <carlmartinmax@gmail.com> Closes #2874 from SaintBacchus/SparkPi and squashes the following commits: 62d7cd7 [huangzhaowei] Add a commit to explain the modify 4cdc388 [huangzhaowei] Update SparkPi.scala 9a2fb7b [huangzhaowei] Input of the SparkPi is too big 16 January 2015, 17:24:36 UTC
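The overflow itself, and one way to avoid it consistent with the note above (100000 is SparkPi's samples-per-slice constant; the cap follows the message's Int.MaxValue - 1 limit):
```scala
val slices = 25001
val broken = 100000 * slices // Int arithmetic wraps negative once slices > 21474
val n = math.min(100000L * slices, Int.MaxValue - 1).toInt // compute in Long, then cap
// (0 until broken) would be empty; (0 until n) is not, so reduce has work to do
```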
b3f6df3 [SPARK-5224] [PySpark] improve performance of parallelize list/ndarray After the default batchSize changed to 0 (batched based on the size of objects), parallelize() still used BatchedSerializer with batchSize=1; this PR uses batchSize=1024 for parallelize by default. Also, BatchedSerializer did not work well with list and numpy.ndarray; this improves BatchedSerializer by using __len__ and __getslice__. Here is the benchmark for parallelizing 1 million ints with list or ndarray:

|               | before | after | improvements |
|---------------|--------|-------|--------------|
| list          | 11.7 s | 0.8 s | 14x          |
| numpy.ndarray | 32 s   | 0.7 s | 40x          |

Author: Davies Liu <davies@databricks.com> Closes #4024 from davies/opt_numpy and squashes the following commits: 7618c7c [Davies Liu] improve performance of parallelize list/ndarray (cherry picked from commit 3c8650c12ad7a97852e7bd76153210493fd83e92) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 15 January 2015, 19:41:05 UTC
3813547 [SPARK-5254][MLLIB] remove developers section from spark.ml guide Forgot to remove this section in #4052. Author: Xiangrui Meng <meng@databricks.com> Closes #4053 from mengxr/SPARK-5254-update and squashes the following commits: f295bde [Xiangrui Meng] remove developers section from spark.ml guide (cherry picked from commit 6abc45e340d3be5f07236adc104db5f8dda0d514) Signed-off-by: Xiangrui Meng <meng@databricks.com> 15 January 2015, 02:54:28 UTC
47fb0d0 [SPARK-5254][MLLIB] Update the user guide to position spark.ml better The current statement in the user guide may deliver a confusing message to users. spark.ml contains high-level APIs for building ML pipelines, but that doesn't mean spark.mllib is being deprecated. First of all, the pipeline API is in its alpha stage and we need to see more use cases from the community to stabilize it, which may take several releases. Secondly, the components in spark.ml are simple wrappers over spark.mllib implementations. Neither the APIs nor the implementations from spark.mllib are being deprecated. We expect users to use spark.ml pipeline APIs to build their ML pipelines, but we will keep supporting and adding features to spark.mllib. For example, there are many features in review at https://spark-prs.appspot.com/#mllib. So users should be comfortable with using spark.mllib features and expect more coming. The user guide needs to be updated to make the message clear. Author: Xiangrui Meng <meng@databricks.com> Closes #4052 from mengxr/SPARK-5254 and squashes the following commits: 6d5f1d3 [Xiangrui Meng] typo 0cc935b [Xiangrui Meng] update user guide to position spark.ml better (cherry picked from commit 13d2406781714daea2bbf3bfb7fec0dead10760c) Signed-off-by: Xiangrui Meng <meng@databricks.com> 15 January 2015, 01:50:41 UTC
f7bbe29 [SPARK-5234][ml]examples for ml don't have sparkContext.stop JIRA issue: https://issues.apache.org/jira/browse/SPARK-5234 simply add the call. Author: Yuhao Yang <yuhao@yuhaodevbox.sh.intel.com> Closes #4044 from hhbyyh/addscStop and squashes the following commits: c1f75ac [Yuhao Yang] add SparkContext.stop to 3 ml examples (cherry picked from commit 76389c5b99183e456ff85fd92ea68d95c4c13e82) Signed-off-by: Xiangrui Meng <meng@databricks.com> 14 January 2015, 19:55:38 UTC
1b6596e [SPARK-5223] [MLlib] [PySpark] fix MapConverter and ListConverter in MLlib It will introduce problems if the object in dict/list/tuple can not support by py4j, such as Vector. Also, pickle may have better performance for larger object (less RPC). In some cases that the object in dict/list can not be pickled (such as JavaObject), we should still use MapConvert/ListConvert. This PR should be ported into branch-1.2 Author: Davies Liu <davies@databricks.com> Closes #4023 from davies/listconvert and squashes the following commits: 55d4ab2 [Davies Liu] fix MapConverter and ListConverter in MLlib (cherry picked from commit 8ead999fd627b12837fb2f082a0e76e9d121d269) Signed-off-by: Xiangrui Meng <meng@databricks.com> 13 January 2015, 20:50:39 UTC
7809683 [SPARK-5131][Streaming][DOC]: There is a discrepancy in WAL implementation and configuration doc. There is a discrepancy in WAL implementation and configuration doc. Author: uncleGen <hustyugm@gmail.com> Closes #3930 from uncleGen/master-clean-doc and squashes the following commits: 3a4245f [uncleGen] doc typo 8e407d3 [uncleGen] doc typo (cherry picked from commit 39e333ec4350ddafe29ee0958c37eec07bec85df) Signed-off-by: Andrew Or <andrew@databricks.com> 13 January 2015, 18:07:26 UTC
6d23af6 [SPARK-5049][SQL] Fix ordering of partition columns in ParquetTableScan Followup to #3870. Props to rahulaggarwalguavus for identifying the issue. Author: Michael Armbrust <michael@databricks.com> Closes #3990 from marmbrus/SPARK-5049 and squashes the following commits: dd03e4e [Michael Armbrust] Fill in the partition values of parquet scans instead of using JoinedRow (cherry picked from commit 5d9fa550820543ee1b0ce82997917745973a5d65) Signed-off-by: Michael Armbrust <michael@databricks.com> 12 January 2015, 23:19:35 UTC
5970f0b [SPARK-5078] Optionally read from SPARK_LOCAL_HOSTNAME Currently Spark lets you set the IP address using SPARK_LOCAL_IP, but this is given to Akka after doing a reverse DNS lookup, which makes it difficult to run Spark in Docker. You can already change the hostname that is used programmatically, but it would be nice to be able to do this with an environment variable as well. Author: Michael Armbrust <michael@databricks.com> Closes #3893 from marmbrus/localHostnameEnv and squashes the following commits: 85045b6 [Michael Armbrust] Optionally read from SPARK_LOCAL_HOSTNAME (cherry picked from commit a3978f3e156e0ca67e978f1795b238ddd69ff9a6) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 12 January 2015, 19:58:21 UTC
558be07 [SPARK-5102][Core]subclass of MapStatus needs to be registered with Kryo CompressedMapStatus and HighlyCompressedMapStatus needs to be registered with Kryo, because they are subclass of MapStatus. Author: lianhuiwang <lianhuiwang09@gmail.com> Closes #4007 from lianhuiwang/SPARK-5102 and squashes the following commits: 9d2238a [lianhuiwang] remove register of MapStatus 05a285d [lianhuiwang] subclass of MapStatus needs to be registered with Kryo (cherry picked from commit ef9224e08010420b570c21a0b9208d22792a24fe) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 12 January 2015, 18:57:24 UTC
a1ee09e [SPARK-5200] Disable web UI in Hive ThriftServer tests Disables the Spark web UI in HiveThriftServer2Suite in order to prevent Jenkins test failures due to port contention. Author: Josh Rosen <joshrosen@databricks.com> Closes #3998 from JoshRosen/SPARK-5200 and squashes the following commits: a384416 [Josh Rosen] [SPARK-5200] Disable web UI in Hive Thriftserver tests. (cherry picked from commit 82fd38dcdcc9f7df18930c0e08cc8ec34eaee828) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suite.scala 12 January 2015, 18:48:36 UTC
056149d [SPARK-4951][Core] Fix the issue that a busy executor may be killed A few changes to fix this issue: 1. Handle the case that receiving `SparkListenerTaskStart` before `SparkListenerBlockManagerAdded`. 2. Don't add `executorId` to `removeTimes` when the executor is busy. 3. Use `HashMap.retain` to safely traverse the HashMap and remove items. 4. Use the same lock in ExecutorAllocationManager and ExecutorAllocationListener to fix the race condition in `totalPendingTasks`. 5. Move the blocking codes out of the message processing code in YarnSchedulerActor. Author: zsxwing <zsxwing@gmail.com> Closes #3783 from zsxwing/SPARK-4951 and squashes the following commits: d51fa0d [zsxwing] Add comments 2e365ce [zsxwing] Remove expired executors from 'removeTimes' and add idle executors back when a new executor joins 49f61a9 [zsxwing] Eliminate duplicate executor registered warnings d4c4e9a [zsxwing] Minor fixes for the code style 05f6238 [zsxwing] Move the blocking codes out of the message processing code 105ba3a [zsxwing] Fix the race condition in totalPendingTasks d5c615d [zsxwing] Fix the issue that a busy executor may be killed (cherry picked from commit 6942b974adad396cba2799eac1fa90448cea4da7) Signed-off-by: Andrew Or <andrew@databricks.com> 12 January 2015, 00:23:56 UTC
f04ff9d [SPARK-5181] do not print writing WAL log when WAL is disabled https://issues.apache.org/jira/browse/SPARK-5181 Currently, even when the logManager is not created, we still see the log entry s"Writing to log $record". This is a simple fix to make the log more accurate. Author: CodingCat <zhunansjtu@gmail.com> Closes #3985 from CodingCat/SPARK-5181 and squashes the following commits: 0e27dc5 [CodingCat] do not print writing WAL log when WAL is disabled (cherry picked from commit f0d558b6e6ec0c97280d5844c98fb92c24954cbb) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 10 January 2015, 23:36:05 UTC
cce003d [SPARK-5187][SQL] Fix caching of tables with HiveUDFs in the WHERE clause Author: Michael Armbrust <michael@databricks.com> Closes #3987 from marmbrus/hiveUdfCaching and squashes the following commits: 8bca2fa [Michael Armbrust] [SPARK-5187][SQL] Fix caching of tables with HiveUDFs in the WHERE clause (cherry picked from commit 3684fd21e1ffdc0adaad8ff6b31394b637e866ce) Signed-off-by: Michael Armbrust <michael@databricks.com> 10 January 2015, 22:25:57 UTC
c6ea6d4 [SPARK-4943][SQL] Allow table name having dot for db/catalog The pull only fixes the parsing error and changes API to use tableIdentifier. Joining different catalog datasource related change is not done in this pull. Author: Alex Liu <alex_liu68@yahoo.com> Closes #3941 from alexliu68/SPARK-SQL-4943-3 and squashes the following commits: 343ae27 [Alex Liu] [SPARK-4943][SQL] refactoring according to review 29e5e55 [Alex Liu] [SPARK-4943][SQL] fix failed Hive CTAS tests 6ae77ce [Alex Liu] [SPARK-4943][SQL] fix TestHive matching error 3652997 [Alex Liu] [SPARK-4943][SQL] Allow table name having dot to support db/catalog ... (cherry picked from commit 4b39fd1e63188821fc84a13f7ccb6e94277f4be7) Signed-off-by: Michael Armbrust <michael@databricks.com> Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateTableAsSelect.scala 10 January 2015, 21:42:36 UTC
09eef3b [SPARK-4925][SQL] Publish Spark SQL hive-thriftserver maven artifact Author: Alex Liu <alex_liu68@yahoo.com> Closes #3766 from alexliu68/SPARK-SQL-4925 and squashes the following commits: 3137b51 [Alex Liu] [SPARK-4925][SQL] Remove sql/hive-thriftserver module from pom.xml 15f2e38 [Alex Liu] [SPARK-4925][SQL] Publish Spark SQL hive-thriftserver maven artifact (cherry picked from commit 1e56eba5d906bef793dfd6f199db735a6116a764) Signed-off-by: Michael Armbrust <michael@databricks.com> 10 January 2015, 21:19:31 UTC
2f4e73d SPARK-5136 [DOCS] Improve documentation around setting up Spark IntelliJ project This PR simply points to the IntelliJ wiki page instead of also including IntelliJ notes in the docs. The intent however is to also update the wiki page with updated tips. This is the text I propose for the IntelliJ section on the wiki. I realize it omits some of the existing instructions on the wiki, about enabling Hive, but I think those are actually optional. ------ IntelliJ supports both Maven- and SBT-based projects. It is recommended, however, to import Spark as a Maven project. Choose "Import Project..." from the File menu, and select the `pom.xml` file in the Spark root directory. It is fine to leave all settings at their default values in the Maven import wizard, with two caveats. First, it is usually useful to enable "Import Maven projects automatically", since changes to the project structure will automatically update the IntelliJ project. Second, note the step that prompts you to choose active Maven build profiles. As documented above, some build configurations require specific profiles to be enabled. The same profiles that are enabled with `-P[profile name]` above may be enabled on this screen. For example, if developing for Hadoop 2.4 with YARN support, enable profiles `yarn` and `hadoop-2.4`. These selections can be changed later by accessing the "Maven Projects" tool window from the View menu, and expanding the Profiles section. "Rebuild Project" can fail the first time the project is compiled, because generated source files are not created automatically. Try clicking the "Generate Sources and Update Folders For All Projects" button in the "Maven Projects" tool window to manually generate these sources. Compilation may fail with an error like "scalac: bad option: -P:/home/jakub/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar". If so, go to Preferences > Build, Execution, Deployment > Scala Compiler and clear the "Additional compiler options" field. It will work then, although the option will return when the project reimports. Author: Sean Owen <sowen@cloudera.com> Closes #3952 from srowen/SPARK-5136 and squashes the following commits: f3baa66 [Sean Owen] Point to new IJ / Eclipse wiki link 016b7df [Sean Owen] Point to IntelliJ wiki page instead of also including IntelliJ notes in the docs (cherry picked from commit 547df97715580f99ae573a49a86da12bf20cbc3d) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 09 January 2015, 17:36:06 UTC
71471bd [SPARK-4973][CORE] Local directory in the driver of client-mode continues remaining even if application finished when external shuffle is enabled When the external shuffle service is enabled, local directories in the driver of client mode remain even after the application has finished. I think local directories for drivers should be deleted. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3811 from sarutak/SPARK-4973 and squashes the following commits: ad944ab [Kousuke Saruta] Fixed DiskBlockManager to cleanup local directory if it's the driver 43770da [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973 88feecd [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973 d99718e [Kousuke Saruta] Fixed SparkSubmit.scala and DiskBlockManager.scala in order to delete local directories of the driver of local-mode when external shuffle service is enabled (cherry picked from commit a00af6bec57b8df8b286aaa5897232475aef441c) Signed-off-by: Andrew Or <andrew@databricks.com> 08 January 2015, 21:43:20 UTC
755f9cc [SPARK-5130][Deploy]Take yarn-cluster as cluster mode in spark-submit https://issues.apache.org/jira/browse/SPARK-5130 Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #3929 from WangTaoTheTonic/SPARK-5130 and squashes the following commits: c490648 [WangTaoTheTonic] take yarn-cluster as cluster mode in spark-submit (cherry picked from commit 0760787da885187b0c6dcd5c28753f0ab014d5ed) Signed-off-by: Andrew Or <andrew@databricks.com> 08 January 2015, 19:46:53 UTC
1770c51 [SPARK-5132][Core]Correct stage Attempt Id key in stageInfofromJson SPARK-5132: stageInfoToJson writes the key "Stage Attempt Id", but stageInfoFromJson reads "Attempt Id". Author: hushan[胡珊] <hushan@xiaomi.com> Closes #3932 from suyanNone/json-stage and squashes the following commits: 41419ab [hushan[胡珊]] Correct stage Attempt Id key in stageInfofromJson (cherry picked from commit d345ebebd554ac3faa4e870bd7800ed02e89da58) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 07 January 2015, 20:09:45 UTC
7a4be0b [YARN][SPARK-4929] Bug fix: fix the yarn-client code to support HA Nowadays, yarn-client exits directly when an HA change happens, no matter how many times the AM should retry. The reason may be that the default final status only considered sys.exit, so yarn-client HA can't benefit from it. We should distinguish the default final status between client and cluster modes, because the SUCCEEDED status may cause HA to fail in client mode, and UNDEFINED may cause erroneous reporting in cluster mode when using sys.exit. Author: huangzhaowei <carlmartinmax@gmail.com> Closes #3771 from SaintBacchus/YarnHA and squashes the following commits: c02bfcc [huangzhaowei] Improve the comment of the function 'getDefaultFinalStatus' 0e69924 [huangzhaowei] Bug fix: fix the yarn-client code to support HA (cherry picked from commit 5fde66163fe460d6f64b145047f76cc4ee33601a) Signed-off-by: Thomas Graves <tgraves@apache.org> 07 January 2015, 14:11:14 UTC
db83acb [HOTFIX] Add missing SparkContext._ import to fix 1.2 build. This fixes a build break caused by a0bb88e0067688886be594d209fc48c91ed73a11 06 January 2015, 00:58:25 UTC
cf55a2b [SPARK-5089][PYSPARK][MLLIB] Fix vector convert This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should automatically upcast to float64s, but currently this wasn't actually happening. As a result, non-float64 would be silently parsed inappropriately during SerDe, yielding erroneous results when running, for example, KMeans. The PR includes the fix, as well as a new test for the correct conversion behavior. davies Author: freeman <the.freeman.lab@gmail.com> Closes #3902 from freeman-lab/fix-vector-convert and squashes the following commits: 764db47 [freeman] Add a test for proper conversion behavior 704f97e [freeman] Return array after changing type (cherry picked from commit 6c6f32574023b8e43a24f2081ff17e6e446de2f3) Signed-off-by: Xiangrui Meng <meng@databricks.com> 05 January 2015, 21:11:47 UTC
f979205 [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - fixed a scope of runAsSparkUser from MesosExecutorDriver.run to MesosExecutorBackend.launchTask - See the Jira Issue for more details. Author: Jongyoul Lee <jongyoul@gmail.com> Closes #3741 from jongyoul/SPARK-4465 and squashes the following commits: 46ad71e [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Removed unused import 3d6631f [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Removed comments and adjusted indentations 2343f13 [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - fixed a scope of runAsSparkUser from MesosExecutorDriver.run to MesosExecutorBackend.launchTask (cherry picked from commit 1c0e7ce056c79e1db96f85b8c56a479b8b043970) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 05 January 2015, 20:05:45 UTC
a0bb88e [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery. Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists. SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat. In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times. In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions. When output spec. validation is enabled, the second calls to these actions will fail due to existing output. This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler. This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable. Author: Josh Rosen <joshrosen@databricks.com> Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits: 36eaf35 [Josh Rosen] Add comment explaining use of transform() in test. 6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform() 7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming. e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic. 762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs. (cherry picked from commit 939ba1f8f6e32fef9026cc43fce55b36e4b9bfd1) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 05 January 2015, 04:27:09 UTC
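A minimal sketch of the DynamicVariable mechanism described above (the object and field names are illustrative, not Spark's actual ones):
```scala
import scala.util.DynamicVariable

object OutputSpecValidation {
  val disabled = new DynamicVariable[Boolean](false)
}

// The streaming scheduler wraps job submission so the flag is set only for
// the dynamic extent of this call, on this thread -- no SparkConf mutation,
// no global state left behind:
OutputSpecValidation.disabled.withValue(true) {
  // submit the streaming job; save actions would consult
  // OutputSpecValidation.disabled.value before running checkOutputSpecs
}
```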
67e2eb6 [SPARK-4631] unit test for MQTT Please review the unit test for MQTT Author: bilna <bilnap@am.amrita.edu> Author: Bilna P <bilna.p@gmail.com> Closes #3844 from Bilna/master and squashes the following commits: acea3a3 [bilna] Adding dependency with scope test 28681fa [bilna] Merge remote-tracking branch 'upstream/master' fac3904 [bilna] Correction in Indentation and coding style ed9db4c [bilna] Merge remote-tracking branch 'upstream/master' 4b34ee7 [Bilna P] Update MQTTStreamSuite.scala 04503cf [bilna] Added embedded broker service for mqtt test 89d804e [bilna] Merge remote-tracking branch 'upstream/master' fc8eb28 [bilna] Merge remote-tracking branch 'upstream/master' 4b58094 [Bilna P] Update MQTTStreamSuite.scala b1ac4ad [bilna] Added BeforeAndAfter 5f6bfd2 [bilna] Added BeforeAndAfter e8b6623 [Bilna P] Update MQTTStreamSuite.scala 5ca6691 [Bilna P] Update MQTTStreamSuite.scala 8616495 [bilna] [SPARK-4631] unit test for MQTT (cherry picked from commit e767d7ddac5c2330af553f2a74b8575dfc7afb67) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 05 January 2015, 03:39:03 UTC
9dbb62e [SPARK-4787] Stop SparkContext if a DAGScheduler init error occurs Author: Dale <tigerquoll@outlook.com> Closes #3809 from tigerquoll/SPARK-4787 and squashes the following commits: 5661e01 [Dale] [SPARK-4787] Ensure that call to stop() doesn't lose the exception by using a finally block. 2172578 [Dale] [SPARK-4787] Stop context properly if an exception occurs during DAGScheduler initialization. (cherry picked from commit 3fddc9468fa50e7683caa973fec6c52e1132268d) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 04 January 2015, 21:29:48 UTC
93617dd [SPARK-5058] Updated broken links Updated the broken link pointing to the KafkaWordCount example to the correct one. Author: sigmoidanalytics <mayur@sigmoidanalytics.com> Closes #3877 from sigmoidanalytics/patch-1 and squashes the following commits: 3e19b31 [sigmoidanalytics] Updated broken links (cherry picked from commit 342612b65f3d77c660383a332f0346872f076647) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 04 January 2015, 03:47:04 UTC
33f0b14 Fixed typos in streaming-kafka-integration.md Changed projrect to project :) Author: Akhil Das <akhld@darktech.ca> Closes #3876 from akhld/patch-1 and squashes the following commits: e0cf9ef [Akhil Das] Fixed typos in streaming-kafka-integration.md 02 January 2015, 23:18:05 UTC
da9a4b9 [HOTFIX] Bind web UI to ephemeral port in DriverSuite The job launched by DriverSuite should bind the web UI to an ephemeral port, since it looks like port contention in this test has caused a large number of Jenkins failures when many builds are started simultaneously. Our tests already disable the web UI, but this doesn't affect subprocesses launched by our tests. In this case, I've opted to bind to an ephemeral port instead of disabling the UI because disabling features in this test may mask its ability to catch certain bugs. See also: e24d3a9 Author: Josh Rosen <joshrosen@databricks.com> Closes #3873 from JoshRosen/driversuite-webui-port and squashes the following commits: 48cd05c [Josh Rosen] [HOTFIX] Bind web UI to ephemeral port in DriverSuite. (cherry picked from commit 012839807c3dc6e7c8c41ac6e956d52a550bb031) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 01 January 2015, 23:04:08 UTC
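For context, binding a Spark web UI to an ephemeral port means asking the OS for any free port, i.e. port 0 (a hedged illustration; the suite's actual conf change may differ):
```scala
import org.apache.spark.SparkConf

// 0 lets the OS pick a free ephemeral port, avoiding contention between
// concurrently running test JVMs.
val conf = new SparkConf().set("spark.ui.port", "0")
```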
434ea00 [SPARK-5035] [Streaming] ReceiverMessage trait should extend Serializable Spark Streaming's ReceiverMessage trait should extend Serializable in order to fix a subtle bug that only occurs when running on a real cluster: If you attempt to send a fire-and-forget message to a remote Akka actor and that message cannot be serialized, then this seems to lead to more-or-less silent failures. As an optimization, Akka skips message serialization for messages sent within the same JVM. As a result, Spark's unit tests will never fail due to non-serializable Akka messages, but these will cause mostly-silent failures when running on a real cluster. Before this patch, here was the code for ReceiverMessage:

```
/** Messages sent to the NetworkReceiver. */
private[streaming] sealed trait ReceiverMessage
private[streaming] object StopReceiver extends ReceiverMessage
```

Since ReceiverMessage does not extend Serializable and StopReceiver is a regular `object`, not a `case object`, StopReceiver will throw serialization errors. As a result, graceful receiver shutdown is broken on real clusters (and local-cluster mode) but works in local modes. If you want to reproduce this, try running the word count example from the Streaming Programming Guide in the Spark shell:

```
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(10))
// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)
// Split each line into words
val words = lines.flatMap(_.split(" "))
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
ssc.start()
Thread.sleep(10000)
ssc.stop(true, true)
```

Prior to this patch, this would work correctly in local mode but fail when running against a real cluster (it would report that some receivers were not shut down). Author: Josh Rosen <joshrosen@databricks.com> Closes #3857 from JoshRosen/SPARK-5035 and squashes the following commits: 71d0eae [Josh Rosen] [SPARK-5035] ReceiverMessage trait should extend Serializable. (cherry picked from commit fe6efacc0b865e9e827a1565877077000e63976e) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 01 January 2015, 00:03:03 UTC
14dbd83 [SPARK-5028][Streaming]Add total received and processed records metrics to Streaming UI This is a follow-up work of [SPARK-4537](https://issues.apache.org/jira/browse/SPARK-4537). Adding total received records and processed records metrics back to UI. ![screenshot](https://dl.dropboxusercontent.com/u/19230832/screenshot.png) Author: jerryshao <saisai.shao@intel.com> Closes #3852 from jerryshao/SPARK-5028 and squashes the following commits: c8c4877 [jerryshao] Add total received and processed metrics to Streaming UI (cherry picked from commit fdc2aa4918fd4c510f04812b782cc0bfef9a2107) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 31 December 2014, 22:45:51 UTC
bd70ff9 [SPARK-4790][STREAMING] Fix ReceivedBlockTrackerSuite: wait for old files to get deleted before continuing. Since the deletes happen asynchronously, the getFileStatus call might throw an exception in older HDFS versions if the delete happens between the time listFiles is called on the directory and getFileStatus is called on the file. This PR addresses this by adding an option to delete the files synchronously and then waiting for the deletion to complete before proceeding. Author: Hari Shreedharan <hshreedharan@apache.org> Closes #3726 from harishreedharan/spark-4790 and squashes the following commits: bbbacd1 [Hari Shreedharan] Call cleanUpOldLogs only once in the tests. 3255f17 [Hari Shreedharan] Add test for async deletion. Remove method from ReceiverTracker that does not take waitForCompletion. e4c83ec [Hari Shreedharan] Making waitForCompletion a mandatory param. Remove eventually from WALSuite since the cleanup method returns only after all files are deleted. af00fd1 [Hari Shreedharan] [SPARK-4790][STREAMING] Fix ReceivedBlockTrackerSuite waits for old files to get deleted before continuing. (cherry picked from commit 3610d3c615112faef98d94f04efaea602cc4aa8f) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 31 December 2014, 22:35:25 UTC
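A sketch of the synchronous-deletion option this patch describes: when `waitForCompletion` is set, the deleting thread is joined before the method returns. The helper below is hypothetical, not the actual `ReceivedBlockTracker` code:

```scala
import java.io.File

// Hypothetical helper: delete old log files on a background thread, but
// optionally block until the deletes finish so callers (e.g. tests)
// never observe a half-deleted directory.
def cleanUpOldLogs(dir: File, waitForCompletion: Boolean): Unit = {
  val oldFiles = Option(dir.listFiles()).getOrElse(Array.empty[File])
  val deleter = new Thread(new Runnable {
    def run(): Unit = oldFiles.foreach(_.delete())
  })
  deleter.start()
  if (waitForCompletion) deleter.join() // synchronous mode: wait for deletion
}
```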
076de46 [HOTFIX] Disable Spark UI in SparkSubmitSuite tests This should fix a major cause of build breaks when running many parallel tests. 31 December 2014, 22:13:43 UTC
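A sketch of what such a suite-level fix typically looks like; `spark.ui.enabled` is a real Spark setting that skips starting the web UI entirely:

```scala
import org.apache.spark.SparkConf

// With the UI disabled there is no port to contend for, which matters
// when many test suites create SparkContexts in parallel.
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("SparkSubmitSuite-style test")
  .set("spark.ui.enabled", "false")
```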
7c9c25b [SPARK-4298][Core] spark-submit cannot read Main-Class from Manifest. Resolves a bug where the `Main-Class` from a .jar file wasn't being read in properly. This was caused by the fact that the `primaryResource` object was a URI and needed to be normalized through a call to `.getPath` before it could be passed into the `JarFile` object. Author: Brennon York <brennon.york@capitalone.com> Closes #3561 from brennonyork/SPARK-4298 and squashes the following commits: 5e0fce1 [Brennon York] Use string interpolation for error messages, moved comment line from original code to above its necessary code segment 14daa20 [Brennon York] pushed mainClass assignment into match statement, removed spurious spaces, removed { } from case statements, removed return values c6dad68 [Brennon York] Set case statement to support multiple jar URI's and enabled the 'file' URI to load the main-class 8d20936 [Brennon York] updated to reset the error message back to the default a043039 [Brennon York] updated to split the uri and jar vals 8da7cbf [Brennon York] fixes SPARK-4298 (cherry picked from commit 8e14c5eb551ab06c94859c7f6d8c6b62b4d00d59) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 31 December 2014, 19:54:38 UTC
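A minimal sketch of the normalization step involved: a `file:` URI string is not a valid filesystem path, so it must go through `URI.getPath` before `java.util.jar.JarFile` can open it. The helper name is illustrative:

```scala
import java.net.URI
import java.util.jar.JarFile

// Illustrative helper: read Main-Class from a local jar referenced by URI.
def readMainClass(primaryResource: String): Option[String] = {
  val uri = new URI(primaryResource)
  // "file:/tmp/app.jar" -> "/tmp/app.jar"; bail out on non-local schemes.
  if (uri.getScheme != null && uri.getScheme != "file") return None
  val jar = new JarFile(uri.getPath)
  try {
    Option(jar.getManifest)
      .flatMap(m => Option(m.getMainAttributes.getValue("Main-Class")))
  } finally {
    jar.close()
  }
}
```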
ad3dc81 [SPARK-1010] Clean up uses of System.setProperty in unit tests Several of our tests call System.setProperty (or test code which implicitly sets system properties) and don't always reset/clear the modified properties, which can create ordering dependencies between tests and cause hard-to-diagnose failures. This patch removes most uses of System.setProperty from our tests, since in most cases we can use SparkConf to set these configurations (there are a few exceptions, including the tests of SparkConf itself). For the cases where we continue to use System.setProperty, this patch introduces a `ResetSystemProperties` ScalaTest mixin class which snapshots the system properties before each test and automatically restores them on test completion or failure. See the block comment at the top of the ResetSystemProperties class for more details. Author: Josh Rosen <joshrosen@databricks.com> Closes #3739 from JoshRosen/cleanup-system-properties-in-tests and squashes the following commits: 0236d66 [Josh Rosen] Replace setProperty uses in two example programs / tools 3888fe3 [Josh Rosen] Remove setProperty use in LocalJavaStreamingContext 4f4031d [Josh Rosen] Add note on why SparkSubmitSuite needs ResetSystemProperties 4742a5b [Josh Rosen] Clarify ResetSystemProperties trait inheritance ordering. 0eaf0b6 [Josh Rosen] Remove setProperty call in TaskResultGetterSuite. 7a3d224 [Josh Rosen] Fix trait ordering 3fdb554 [Josh Rosen] Remove setProperty call in TaskSchedulerImplSuite bee20df [Josh Rosen] Remove setProperty calls in SparkContextSchedulerCreationSuite 655587c [Josh Rosen] Remove setProperty calls in JobCancellationSuite 3f2f955 [Josh Rosen] Remove System.setProperty calls in DistributedSuite cfe9cce [Josh Rosen] Remove use of system properties in SparkContextSuite 8783ab0 [Josh Rosen] Remove TestUtils.setSystemProperty, since it is subsumed by the ResetSystemProperties trait. 633a84a [Josh Rosen] Remove use of system properties in FileServerSuite 25bfce2 [Josh Rosen] Use ResetSystemProperties in UtilsSuite 1d1aa5a [Josh Rosen] Use ResetSystemProperties in SizeEstimatorSuite dd9492b [Josh Rosen] Use ResetSystemProperties in AkkaUtilsSuite b0daff2 [Josh Rosen] Use ResetSystemProperties in BlockManagerSuite e9ded62 [Josh Rosen] Use ResetSystemProperties in TaskSchedulerImplSuite 5b3cb54 [Josh Rosen] Use ResetSystemProperties in SparkListenerSuite 0995c4b [Josh Rosen] Use ResetSystemProperties in SparkContextSchedulerCreationSuite c83ded8 [Josh Rosen] Use ResetSystemProperties in SparkConfSuite 51aa870 [Josh Rosen] Use withSystemProperty in ShuffleSuite 60a63a1 [Josh Rosen] Use ResetSystemProperties in JobCancellationSuite 14a92e4 [Josh Rosen] Use withSystemProperty in FileServerSuite 628f46c [Josh Rosen] Use ResetSystemProperties in DistributedSuite 9e3e0dd [Josh Rosen] Add ResetSystemProperties test fixture mixin; use it in SparkSubmitSuite. 4dcea38 [Josh Rosen] Move withSystemProperty to TestUtils class. (cherry picked from commit 352ed6bbe3c3b67e52e298e7c535ae414d96beca) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 31 December 2014, 19:04:39 UTC
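A minimal sketch of such a snapshot/restore mixin, assuming ScalaTest's `BeforeAndAfterEach` (the real trait takes extra care around how `Properties` is cloned):

```scala
import java.util.Properties
import org.scalatest.{BeforeAndAfterEach, Suite}

// Snapshot the JVM-wide system properties before each test and swap the
// snapshot back in afterwards, so a test calling System.setProperty
// cannot leak configuration into later tests.
trait ResetSystemProperties extends BeforeAndAfterEach { this: Suite =>
  private var saved: Properties = _

  override def beforeEach(): Unit = {
    saved = System.getProperties.clone().asInstanceOf[Properties]
    super.beforeEach()
  }

  override def afterEach(): Unit =
    try super.afterEach()
    finally System.setProperties(saved)
}
```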
edc96d8 [SPARK-4813][Streaming] Fix the issue that ContextWaiter didn't handle 'spurious wakeup' Rewrote `ContextWaiter` using `Condition`, which provides a convenient `awaitNanos` API for timeouts. Author: zsxwing <zsxwing@gmail.com> Closes #3661 from zsxwing/SPARK-4813 and squashes the following commits: 52247f5 [zsxwing] Add explicit unit type be42bcf [zsxwing] Update as per review suggestion e06bd4f [zsxwing] Fix the issue that ContextWaiter didn't handle 'spurious wakeup' (cherry picked from commit 6a897829444e2ef273586511f93a40d36e64fb0b) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 30 December 2014, 22:39:36 UTC
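A sketch of the spurious-wakeup-safe shape this fix uses: wait inside a loop guarded by the real predicate, and let `awaitNanos` report how much of the timeout remains. This is a simplified stand-in for `ContextWaiter`, not its actual code:

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock

class Waiter {
  private val lock = new ReentrantLock()
  private val done = lock.newCondition()
  private var stopped = false

  def notifyStop(): Unit = {
    lock.lock()
    try { stopped = true; done.signalAll() } finally lock.unlock()
  }

  /** Returns true if stopped within the timeout. */
  def waitForStop(timeoutMs: Long): Boolean = {
    var remaining = TimeUnit.MILLISECONDS.toNanos(timeoutMs)
    lock.lock()
    try {
      // A wakeup with `stopped` still false is spurious: just wait
      // again for whatever time is left.
      while (!stopped && remaining > 0) {
        remaining = done.awaitNanos(remaining)
      }
      stopped
    } finally lock.unlock()
  }
}
```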
7a24541 [SPARK-4386] Improve performance when writing Parquet files Convert the type of RowWriteSupport.attributes to Array. Performance analysis of writing very wide tables shows that time is spent predominantly in the apply method on the attributes var. The type of attributes was previously LinearSeqOptimized, whose apply is O(N), making each row write O(N²). Measurements on a 575-column table showed this change made a 6x improvement in write times. Author: Michael Davies <Michael.BellDavies@gmail.com> Closes #3843 from MickDavies/SPARK-4386 and squashes the following commits: 892519d [Michael Davies] [SPARK-4386] Improve performance when writing Parquet files (cherry picked from commit 7425bec320227bf8818dc2844c12d5373d166364) Signed-off-by: Michael Armbrust <michael@databricks.com> 30 December 2014, 21:41:08 UTC
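A toy illustration of the complexity issue (not the actual `RowWriteSupport` code): indexed access on a linear `Seq` is O(i), so touching all N columns by index costs O(N²) per row, while a one-time `toArray` makes each access O(1):

```scala
// 575 columns, mirroring the table size measured in the PR.
val attributes: List[String] = List.tabulate(575)(i => s"col$i")
val attrArray: Array[String] = attributes.toArray // one-time O(N) copy

// O(N^2) per row: List.apply(i) walks i cells on every call.
def writeRowSlow(write: String => Unit): Unit =
  for (i <- 0 until attributes.length) write(attributes(i))

// O(N) per row: Array.apply(i) is constant time.
def writeRowFast(write: String => Unit): Unit =
  for (i <- 0 until attrArray.length) write(attrArray(i))
```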
cde8a31 [SPARK-4908][SQL] Prevent multiple concurrent hive native commands This is just a quick fix that locks when calling `runHive`. If we can find a way to avoid the error without a global lock that would be better. Author: Michael Armbrust <michael@databricks.com> Closes #3834 from marmbrus/hiveConcurrency and squashes the following commits: bf25300 [Michael Armbrust] prevent multiple concurrent hive native commands (cherry picked from commit 480bd1d2edd1de06af607b0cf3ff3c0b16089add) Signed-off-by: Michael Armbrust <michael@databricks.com> 30 December 2014, 19:24:59 UTC
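The shape of the quick fix, sketched with hypothetical names: one global lock object serializes every native command, trading concurrency for correctness:

```scala
// Hypothetical lock object; the real fix synchronizes inside runHive.
object HiveNativeCommandLock

def runHive(cmd: String): Seq[String] = HiveNativeCommandLock.synchronized {
  // ... hand `cmd` to the (non-thread-safe) Hive driver and collect output ...
  Seq.empty
}
```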
42809db [SPARK-4882] Register PythonBroadcast with Kryo so that PySpark works with KryoSerializer This PR fixes an issue where PySpark broadcast variables caused NullPointerExceptions if KryoSerializer was used. The fix is to register PythonBroadcast with Kryo so that it's deserialized with a KryoJavaSerializer. Author: Josh Rosen <joshrosen@databricks.com> Closes #3831 from JoshRosen/SPARK-4882 and squashes the following commits: 0466c7a [Josh Rosen] Register PythonBroadcast with Kryo. d5b409f [Josh Rosen] Enable registrationRequired, which would have caught this bug. 069d8a7 [Josh Rosen] Add failing test for SPARK-4882 (cherry picked from commit efa80a531ecd485f6cf0cdc24ffa42ba17eea46d) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 30 December 2014, 17:30:15 UTC
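A sketch of the registration technique with a stand-in class: Kryo can delegate a class to Java serialization by registering it with Kryo's `JavaSerializer`, and `setRegistrationRequired(true)` makes unregistered classes fail fast (as the PR notes, that setting would have caught this bug):

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer

// Stand-in for PythonBroadcast: java.io.Serializable but not Kryo-friendly.
class BroadcastLike extends java.io.Serializable

val kryo = new Kryo()
kryo.setRegistrationRequired(true)                          // fail fast when unregistered
kryo.register(classOf[BroadcastLike], new JavaSerializer()) // delegate to Java serialization
```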
e20d632 [SPARK-4920][UI] add version on master and worker page for standalone mode Author: Zhang, Liye <liye.zhang@intel.com> Closes #3769 from liyezhang556520/spark-4920_WebVersion and squashes the following commits: 3bb7e0d [Zhang, Liye] add version on master and worker page (cherry picked from commit 9077e721cd36adfecd50cbd1fd7735d28e5be8b5) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 30 December 2014, 17:20:25 UTC
e81c869 SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions takeOrdered should skip the reduce step when the mapped RDDs have no partitions. This prevents the exception below, seen for example when running the query SELECT * FROM testTable WHERE market = 'market2' ORDER BY End_Time DESC LIMIT 100;

```
java.lang.UnsupportedOperationException: empty collection
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:863)
    at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1136)
```

Author: Yash Datta <Yash.Datta@guavus.com> Closes #3830 from saucam/fix_takeorder and squashes the following commits: 5974d10 [Yash Datta] SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions (cherry picked from commit 9bc0df6804f241aff24520d9c6ec54d9b11f5785) Signed-off-by: Reynold Xin <rxin@databricks.com> 29 December 2014, 21:50:34 UTC
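A simplified sketch of the guard (the real `takeOrdered` uses a bounded priority queue rather than sorting whole partitions): if the mapped RDD ends up with zero partitions, return an empty result instead of calling `reduce`, which throws on empty collections:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def takeOrderedSafe[T: Ordering: ClassTag](rdd: RDD[T], num: Int): Array[T] = {
  val mapped = rdd.mapPartitions { items =>
    // Keep only the `num` smallest items of each partition.
    Iterator.single(items.toArray.sorted.take(num))
  }
  if (mapped.partitions.length == 0) {
    Array.empty[T]                        // the guard: nothing to reduce
  } else {
    mapped.reduce((a, b) => (a ++ b).sorted.take(num))
  }
}
```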
7604666 [SPARK-4982][DOC] `spark.ui.retainedJobs` description is wrong in Spark UI configuration guide Author: wangxiaojing <u9jing@gmail.com> Closes #3818 from wangxiaojing/SPARK-4982 and squashes the following commits: fe2ad5f [wangxiaojing] change stages to jobs (cherry picked from commit 6645e52580747990321e22340ae742f26d2f2504) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 29 December 2014, 18:46:13 UTC
2cd446a [SPARK-4966][YARN] The MemoryOverhead value is not set correctly Author: meiyoula <1039320815@qq.com> Closes #3797 from XuTingjun/MemoryOverhead and squashes the following commits: 5a780fc [meiyoula] Update ClientArguments.scala (cherry picked from commit 14fa87bdf4b89cd392270864ee063ce01bd31887) Signed-off-by: Thomas Graves <tgraves@apache.org> 29 December 2014, 14:21:19 UTC
23d64cf [SPARK-4952][Core]Handle ConcurrentModificationExceptions in SparkEnv.environmentDetails Author: GuoQiang Li <witgo@qq.com> Closes #3788 from witgo/SPARK-4952 and squashes the following commits: d903529 [GuoQiang Li] Handle ConcurrentModificationExceptions in SparkEnv.environmentDetails (cherry picked from commit 080ceb771a1e6b9f844cfd4f1baa01133c106888) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 27 December 2014, 07:31:43 UTC
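A sketch of the defensive pattern this class of fix involves: system properties are a shared mutable `Hashtable`, so iterating them while another thread calls `System.setProperty` can throw `ConcurrentModificationException`; taking a synchronized snapshot first avoids that. The helper name is illustrative:

```scala
import scala.collection.JavaConverters._

// Copy the shared properties into an immutable Map under the object's
// own lock before doing any iteration or formatting work.
def snapshotSystemProperties(): Map[String, String] = {
  val props = System.getProperties
  props.synchronized {
    props.stringPropertyNames().asScala
      .map(k => k -> props.getProperty(k))
      .toMap
  }
}
```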