https://github.com/apache/spark

95d28ff [maven-release-plugin] prepare release v0.9.0-incubating 24 January 2014, 06:15:08 UTC
2ac96e7 Updating changes file 24 January 2014, 05:55:49 UTC
c91f44a Revert "[maven-release-plugin] prepare release v0.9.0-incubating" This reverts commit 0771df675363c69622404cb514bd751bc90526af. 24 January 2014, 05:53:42 UTC
5992157 Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit e1dc5bedb48d2ac9b9e1c9b3b1a15c41b7d90ad8. 24 January 2014, 05:53:37 UTC
d0a105d Merge pull request #505 from JoshRosen/SPARK-1026 Deprecate mapPartitionsWithSplit in PySpark (SPARK-1026) This commit deprecates `mapPartitionsWithSplit` in PySpark (see [SPARK-1026](https://spark-project.atlassian.net/browse/SPARK-1026)) and removes the remaining references to it from the docs. (cherry picked from commit 05be7047744c88e64e7e6bd973f9bcfacd00da5f) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 24 January 2014, 04:53:31 UTC
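The same rename exists on the Scala side, where `mapPartitionsWithSplit` is deprecated in favor of `mapPartitionsWithIndex`. A minimal migration sketch in Scala (assuming a `SparkContext` named `sc`, as in the spark-shell; the PySpark change mirrors this):

```scala
// Assuming a SparkContext `sc`, e.g. from the spark-shell.
val rdd = sc.parallelize(1 to 100, 4)

// Deprecated spelling:
// val tagged = rdd.mapPartitionsWithSplit((i, iter) => iter.map(x => (i, x)))

// Preferred replacement; the Int argument is the partition index:
val tagged = rdd.mapPartitionsWithIndex((i, iter) => iter.map(x => (i, x)))
```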
e66d4c2 Merge pull request #503 from pwendell/master Fix bug on read-side of external sort when using Snappy. This case wasn't handled correctly and this patch fixes it. (cherry picked from commit 3d6e75419330d27435becfdf8cfb0b6d20d56cf8) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 24 January 2014, 03:47:16 UTC
e8d3f2b Merge pull request #502 from pwendell/clone-1 Remove Hadoop object cloning and warn users making Hadoop RDDs. The code introduced in #359 used Hadoop's WritableUtils.clone() to duplicate objects when reading from Hadoop files. Some users have reported exceptions when cloning data in various file formats, including Avro and another custom format. This patch removes that functionality to ensure stability for the 0.9 release. Instead, it puts a clear warning in the documentation that copying may be necessary for Hadoop data sets. (cherry picked from commit c3196171f3dffde6c9e67e3d35c398a01fbba846) Conflicts: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala 24 January 2014, 03:20:22 UTC
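For readers hitting that documentation warning: the usual way to copy records yourself is to map the reused `Writable` objects to immutable values before caching or collecting. A hedged sketch (the HDFS path is a placeholder and a `SparkContext` named `sc` is assumed):

```scala
import org.apache.hadoop.io.Text

// Hadoop input formats may reuse one Writable instance for every record,
// so copy each record out before caching or collecting.
val pairs = sc.sequenceFile("hdfs:///some/path", classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) } // toString materializes a copy
  .cache()
```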
7a62353 Merge pull request #501 from JoshRosen/cartesian-rdd-fixes Fix two bugs in PySpark cartesian(): SPARK-978 and SPARK-1034 This pull request fixes two bugs in PySpark's `cartesian()` method: - [SPARK-978](https://spark-project.atlassian.net/browse/SPARK-978): PySpark's cartesian method throws a ClassCastException - [SPARK-1034](https://spark-project.atlassian.net/browse/SPARK-1034): Py4JException on PySpark Cartesian Result The JIRAs have more details describing the fixes. (cherry picked from commit cad3002fead89d3c9a8de4fa989e88f367bc0b05) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 24 January 2014, 03:09:25 UTC
51960b8 Merge pull request #496 from pwendell/master Fix bug in worker clean-up in UI Introduced in d5a96fec (/cc @aarondav). This should be picked into 0.8 and 0.9 as well. The bug causes old (zombie) workers on a node to not disappear immediately from the UI when a new one registers. (cherry picked from commit a1cd185122602c96fb8ae16c0b506702283bf6e2) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 23 January 2014, 03:37:50 UTC
828f7b4 Merge pull request #495 from srowen/GraphXCommonsMathDependency Fix graphx Commons Math dependency `graphx` depends on Commons Math (2.x) in `SVDPlusPlus.scala`, but the module doesn't declare this dependency. It happens to work because it is included by Hadoop artifacts, but that stopped being true a month or so ago: building against recent Hadoop fails. (That's how we noticed.) The simple fix is to declare the dependency, as it should be. But it's also worth noting that `commons-math` is the old-ish 2.x line, while `commons-math3` is where the newer 3.x releases are: a drop-in replacement, but a different artifact and package name. Changing this (the only usage) to `commons-math3` works and tests pass, so it is probably also worth doing. (A comment in some test code also references `commons-math3`, FWIW.) It does raise another question though: `mllib` looks like it uses the `jblas` `DoubleMatrix` for general-purpose vector/matrix work. Should `graphx` really use Commons Math for this? Beyond the tiny scope here, but worth asking. (cherry picked from commit 3184facdc5b1e9ded89133f9b1e4985c9ac78c55) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 22 January 2014, 23:45:18 UTC
dc5857a Merge pull request #492 from skicavs/master fixed job name and usage information for the JavaSparkPi example (cherry picked from commit a1238bb5fcab763d32c729ea7ed99cb3c05c896f) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 22 January 2014, 22:33:25 UTC
dd533c9 Merge pull request #478 from sryza/sandy-spark-1033 SPARK-1033. Ask for cores in Yarn container requests Tested on a pseudo-distributed cluster against the Fair Scheduler and observed a worker taking more than a single core. (cherry picked from commit 576c4a4c502ccca5fcd6b3552dd93cc2f3c50666) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 22 January 2014, 22:15:58 UTC
e1dc5be [maven-release-plugin] prepare for next development iteration 21 January 2014, 10:30:40 UTC
0771df6 [maven-release-plugin] prepare release v0.9.0-incubating 21 January 2014, 10:30:33 UTC
334a848 Revert "[maven-release-plugin] prepare release v0.9.0-incubating" This reverts commit cd65c150d7e4be55695ca54a1709d577fdd509ba. 21 January 2014, 10:19:04 UTC
a3bd28a Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit ab188b57b75267b118a45dc1c5ce35c1839d2ad6. 21 January 2014, 10:15:50 UTC
ab188b5 [maven-release-plugin] prepare for next development iteration 21 January 2014, 10:10:12 UTC
cd65c15 [maven-release-plugin] prepare release v0.9.0-incubating 21 January 2014, 10:10:06 UTC
51b5e04 Revert "[maven-release-plugin] prepare release v0.9.0-incubating" This reverts commit 7653cf39e851cb63f1fda899cd7904ffab9a7a51. 21 January 2014, 09:29:55 UTC
7653cf3 [maven-release-plugin] prepare release v0.9.0-incubating 21 January 2014, 09:23:01 UTC
3254566 Updating CHANGES.txt file 21 January 2014, 08:13:09 UTC
6b31963 Revert "[maven-release-plugin] prepare release v0.9.0-incubating" This reverts commit a7760eff4ea6a474cab68896a88550f63bae8b0d. 21 January 2014, 08:12:35 UTC
808a9f0 Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit 50b88ffcc6f80c86438b19788ec0eaf8f3a10ee4. 21 January 2014, 08:12:32 UTC
b6fd3cd Merge pull request #480 from pwendell/0.9-fixes Handful of 0.9 fixes This patch addresses a few fixes for Spark 0.9.0 based on the last release candidate. @mridulm gets credit for reporting most of the issues here. Many of the fixes here are based on his work in #477 and follow up discussion with him. (cherry picked from commit 77b986f6616e6f7e0be9e46bb355829686f9845b) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 21 January 2014, 08:12:01 UTC
e5f8917 Merge pull request #484 from tdas/run-example-fix Made run-example respect SPARK_JAVA_OPTS and SPARK_MEM. The bin/run-example script was not passing Java properties set through SPARK_JAVA_OPTS to the example. This is important for examples like the Twitter ones, as the Twitter authentication information must be set through Java properties. Hence the same JAVA_OPTS handling was added to run-example as in the bin/spark-class script. Also added SPARK_MEM, in case someone wants to run an example with a different amount of memory. This can be removed if it is not in tune with the intended semantics of the run-example script. @matei Please check this soon; I want this to go into 0.9-rc4. (cherry picked from commit c67d3d8beb101fff2ea6397b759dd1bfdf9fcfa5) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 21 January 2014, 07:35:07 UTC
410ba06 Merge pull request #482 from tdas/streaming-example-fix Added StreamingContext.awaitTermination to streaming examples StreamingContext.start() currently starts a non-daemon thread which prevents termination of a Spark Streaming program even if the main function has exited. Since the expected behavior of a streaming program is to run until explicitly killed, this was sort of fine when Spark Streaming applications were launched from the command line. However, when launched in yarn-standalone mode, this did not work, as the driver effectively got terminated when the main function exited. So Spark Streaming examples did not work on YARN. This addition to the examples ensures that the examples work on YARN, and also teaches that StreamingContext.awaitTermination() is necessary for Spark Streaming programs to keep running. The true bug fix of making sure all threads created by Spark Streaming are daemon threads is left for post-0.9. (cherry picked from commit 0367981d47761cdccd8a44fc6fe803079979c5e3) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 21 January 2014, 06:26:14 UTC
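The pattern the examples now follow, sketched with the 0.9-era Scala API (a hedged sketch; host and port are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Without the final awaitTermination(), the driver returns from main()
// immediately, which is what broke these examples in yarn-standalone mode.
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
words.map(w => (w, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination() // block until the context is explicitly stopped
```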
f137947 Merge pull request #483 from pwendell/gitignore Restricting /lib to top level directory in .gitignore This patch was proposed by Sean Mackrory. (cherry picked from commit 7373ffb5e794d3163d3f8d1801836c891e0d6cca) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 21 January 2014, 06:24:07 UTC
50b88ff [maven-release-plugin] prepare for next development iteration 19 January 2014, 21:15:39 UTC
a7760ef [maven-release-plugin] prepare release v0.9.0-incubating 19 January 2014, 21:15:33 UTC
130b543 Updating CHANGES.txt 19 January 2014, 21:00:44 UTC
303b33f Revert "[maven-release-plugin] prepare release v0.9.0-incubating" This reverts commit 00c847af1d4be2fe5fad887a57857eead1e517dc. 19 January 2014, 20:59:22 UTC
4b4011b Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit 34ae65b06128077751ec2b923c9740a429d8299d. 19 January 2014, 20:59:20 UTC
94ae25d Merge pull request #470 from tgravescs/fix_spark_examples_yarn Only log an error on a missing jar to allow the Spark examples to run on YARN. Right now, to run the Spark examples on YARN you have to use the --addJars option and put the jar in HDFS. To make that nicer, so the user doesn't have to specify the --addJars option, change it to simply log an error instead of throwing. (cherry picked from commit 792d9084e2bc9f778a00a56fa7dcfe4084153aea) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 19 January 2014, 19:33:51 UTC
0f077b5 Merge pull request #458 from tdas/docs-update Updated Java API docs for streaming, along with very minor changes in the code examples. Docs updated for: Scala: StreamingContext, DStream, PairDStreamFunctions; Java: JavaStreamingContext, JavaDStream, JavaPairDStream. Examples updated: JavaQueueStream: no longer uses a deprecated method; ActorWordCount: uses the public interface the right way. (cherry picked from commit 256a3553c447db0865ea8807a8fdbccb66a97b28) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 19 January 2014, 18:30:29 UTC
34ae65b [maven-release-plugin] prepare for next development iteration 19 January 2014, 05:45:20 UTC
00c847a [maven-release-plugin] prepare release v0.9.0-incubating 19 January 2014, 05:45:13 UTC
eddd347 Updating CHANGES.txt file 19 January 2014, 05:31:37 UTC
91c9709 Revert "[maven-release-plugin] prepare release v0.9.0-incubating" This reverts commit 77c32470a1b02d6f1475bda2cfb9ae5bd4b53dde. 19 January 2014, 05:29:40 UTC
3368699 Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit 4f8f86c2c66dc2f6a17d5b0e4fdeeb06a71ba52f. 19 January 2014, 05:29:31 UTC
4f8f86c [maven-release-plugin] prepare for next development iteration 19 January 2014, 01:14:28 UTC
77c3247 [maven-release-plugin] prepare release v0.9.0-incubating 19 January 2014, 01:14:22 UTC
49a2c81 Typo fix in build versions 19 January 2014, 00:58:44 UTC
a4b316f Rolling back versions for 0.9.0 release 19 January 2014, 00:37:23 UTC
03019d1 Merge pull request #459 from srowen/UpdaterL2Regularization Correct L2 regularized weight update with canonical form Per thread on the user@ mailing list, and comments from Ameet, I believe the weight update for L2 regularization needs to be corrected. See http://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAH3_EVMetuQuhj3__NdUniDLc4P-FMmmrmxw9TS14or8nT4BNQ%40mail.gmail.com%3E (cherry picked from commit fe8a3546f40394466a41fc750cb60f6fc73d8bbb) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 19 January 2014, 00:29:43 UTC
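For reference, the canonical gradient update for L2 regularization that the thread converges on, written out (a hedged reconstruction; α_t is the step size and λ the regularization parameter):

```latex
% Minimizing L(w) + (\lambda/2)\,\|w\|^2 by gradient steps:
w_{t+1} = w_t - \alpha_t \bigl( \nabla L(w_t) + \lambda w_t \bigr)
        = (1 - \alpha_t \lambda)\, w_t - \alpha_t\, \nabla L(w_t)
```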
76147a2 Merge pull request #437 from mridulm/master Minor API usability changes - Expose the checkpoint directory, since it is autogenerated now - Null check for jars - Expose SparkHadoopUtil, so that configuration creation is abstracted even from user code, avoiding duplication of functionality already in Spark. (cherry picked from commit 73dfd42fba5e526cc57e2a2ed78be323b63cb8fa) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 19 January 2014, 00:24:16 UTC
4ac8cab Merge pull request #426 from mateiz/py-ml-tests Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+) We disabled these earlier because Jenkins didn't have these versions. (cherry picked from commit 4c16f79ce45a68ee613a3d565b0e8676b724f867) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 19 January 2014, 00:22:46 UTC
34e911c Merge pull request #462 from mateiz/conf-file-fix Remove Typesafe Config usage and conf files to fix nested property names With Typesafe Config we had the subtle problem of no longer allowing nested property names, which are used for a few of our properties: http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html This PR is for branch 0.9 but should be added into master too. 19 January 2014, 00:17:34 UTC
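To make the failure mode concrete: with flat string properties both of the following keys can be set, whereas a hierarchical config tree treats `spark.speculation` as both a leaf and a parent node and rejects one of them. A hedged sketch using the 0.9-era `SparkConf` API:

```scala
import org.apache.spark.SparkConf

// Both keys coexist under plain string properties; under Typesafe Config,
// "spark.speculation" would be both a value and a parent node and conflict.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.multiplier", "1.5")
```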
ff7201c Merge pull request #461 from pwendell/master Use renamed shuffle spill config in CoGroupedRDD.scala This one got missed when it was renamed. (cherry picked from commit aa981e4e97a11dbd5a4d012bfbdb395982968372) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 18 January 2014, 20:50:02 UTC
c8f9273 Remove Typesafe Config usage and conf files to fix nested property names With Typesafe Config we had the subtle problem of no longer allowing nested property names, which are used for a few of our properties: http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html 18 January 2014, 20:48:49 UTC
7b0d5a5 Merge pull request #451 from Qiuzhuang/master Fixed Windows spark-shell launch script error. JIRA SPARK-1029: https://spark-project.atlassian.net/browse/SPARK-1029 (cherry picked from commit d749d472b37448edb322bc7208a3db925c9a4fc2) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 17 January 2014, 07:18:48 UTC
4ae8a4b [maven-release-plugin] prepare for next development iteration 15 January 2014, 22:53:11 UTC
7348893 [maven-release-plugin] prepare release v0.9.0-incubating 15 January 2014, 22:53:02 UTC
7749b98 Change log for release 0.9.0-incubating 15 January 2014, 22:33:37 UTC
4ccedb3 Merge pull request #444 from mateiz/py-version Clarify that Python 2.7 is only needed for MLlib (cherry picked from commit 4f0c361b0e140f5f6879f019b2e1a16c683c705c) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 15 January 2014, 22:26:48 UTC
e3fa36f Merge pull request #442 from pwendell/standalone Workers should use working directory as spark home if it's not specified If users don't set SPARK_HOME in their environment file when launching an application, the standalone cluster should default to the spark home of the worker. (cherry picked from commit 59f475c79fc8fd6d3485e4d0adf6768b6a9225a4) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 15 January 2014, 21:56:04 UTC
29c76d9 Merge pull request #443 from tdas/filestream-fix Made some classes private[streaming] and deprecated a method in JavaStreamingContext. The classes `RawTextHelper`, `RawTextSender` and `RateLimitedOutputStream` are not useful in the streaming API: they are not used by the core functionality and were there as support classes for an obscure example. One of them, RawTextSender, has a main function that can be executed using bin/spark-class even if it is made private[streaming]. In the future, I will probably remove these classes completely; for the time being, I am just converting them to private[streaming]. Accessing the underlying JavaSparkContext in JavaStreamingContext was through `JavaStreamingContext.sc`. This is deprecated; the preferred method is `JavaStreamingContext.sparkContext`, to keep it consistent with `StreamingContext.sparkContext`. (cherry picked from commit 2a05403a7ced4ecf6084c96f582ee3a24f3cc874) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 15 January 2014, 21:55:48 UTC
aca40aa Merge pull request #441 from pwendell/graphx-build GraphX shouldn't list Spark as provided. I noticed this when building an application against GraphX to audit the released artifacts. (cherry picked from commit 5fecd2516dc8de28b76fe6e0fbdca7922cc28d1c) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 15 January 2014, 19:15:47 UTC
e12c374 Merge pull request #433 from markhamstra/debFix Updated Debian packaging (cherry picked from commit 494d3c077496735e6ebca3217de4f0cc6b6419f2) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 15 January 2014, 18:01:43 UTC
2f015c2 Merge pull request #436 from ankurdave/VertexId-case Rename VertexID -> VertexId in GraphX (cherry picked from commit 3d9e66d92ada4fa93dd0bd78cb4c80f8169e6393) Signed-off-by: Reynold Xin <rxin@apache.org> 15 January 2014, 07:17:28 UTC
2859cab Merge pull request #435 from tdas/filestream-fix Fixed the flaky tests by making SparkConf not serializable SparkConf was being serialized with CoGroupedRDD and Aggregator, which somehow caused OptionalJavaException while being deserialized as part of a ShuffleMapTask. SparkConf should not even be serializable (according to conversation with Matei). This change fixes that. @mateiz @pwendell (cherry picked from commit 139c24ef08e6ffb090975c9808a2cba304eb79e0) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 15 January 2014, 07:08:19 UTC
fbfbb33 Merge pull request #434 from rxin/graphxmaven Fixed SVDPlusPlusSuite in Maven build. This should go into 0.9.0 also. (cherry picked from commit 087487e90e4d6269d7a027f7cb718120f6c10505) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 15 January 2014, 07:06:29 UTC
863dd72 Reverting release plugin changes 15 January 2014, 07:05:23 UTC
2c6c07f Merge pull request #424 from jegonzal/GraphXProgrammingGuide Additional edits for clarity in the graphx programming guide. Added an overview of the Graph and GraphOps functions and fixed numerous typos. (cherry picked from commit 3a386e238984c48a6ac07974b92647beae1199b3) Signed-off-by: Reynold Xin <rxin@apache.org> 15 January 2014, 05:53:05 UTC
a075a45 Merge branch 'branch-0.9' of https://git-wip-us.apache.org/repos/asf/incubator-spark into branch-0.9 15 January 2014, 05:52:13 UTC
6fa4e02 Merge pull request #431 from ankurdave/graphx-caching-doc Describe caching and uncaching in GraphX programming guide (cherry picked from commit ad294db326f57beb98f9734e2b4c45d9da1a4c89) Signed-off-by: Reynold Xin <rxin@apache.org> 15 January 2014, 05:51:25 UTC
51131bf [maven-release-plugin] prepare for next development iteration 14 January 2014, 23:57:59 UTC
40c97af [maven-release-plugin] prepare release v0.9.0-incubating 14 January 2014, 23:57:53 UTC
ce66ca7 Small change to maven build 14 January 2014, 23:16:46 UTC
2f930d5 Merge pull request #428 from pwendell/writeable-objects Don't clone records for text files (cherry picked from commit 74b46acdc57293c103ab5dd5af931d0d0e32c0ed) Signed-off-by: Reynold Xin <rxin@apache.org> 14 January 2014, 23:00:11 UTC
329c9df Merge pull request #429 from ankurdave/graphx-examples-pom.xml Add GraphX dependency to examples/pom.xml (cherry picked from commit 193a0757c87b717e3b6b4f005ecdbb56b04ad9b4) Signed-off-by: Reynold Xin <rxin@apache.org> 14 January 2014, 22:53:36 UTC
a14933d Merge pull request #427 from pwendell/deprecate-aggregator Deprecate rather than remove old combineValuesByKey function (cherry picked from commit d601a76d1fdd25b95020b2e32bacde583cf6aa50) Signed-off-by: Reynold Xin <rxin@apache.org> 14 January 2014, 22:52:42 UTC
119b6c5 Merge pull request #425 from rxin/scaladoc API doc update & make Broadcast public In #413 Broadcast was mistakenly made private[spark]. I changed it back to public. Also exposed id publicly, since the R frontend requires it. Copied some of the documentation from the programming guide to the API docs for Broadcast and Accumulator. This should be cherry-picked into branch-0.9 as well for the 0.9.0 release. (cherry picked from commit 2ce23a55a3c4033873bb262919d89e5afabb9134) Signed-off-by: Reynold Xin <rxin@apache.org> 14 January 2014, 21:29:08 UTC
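A minimal sketch of the now-public `Broadcast` surface (assuming a `SparkContext` named `sc`):

```scala
// Broadcast the lookup table once; `value` reads it on the executors and
// the publicly exposed `id` identifies the broadcast variable.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val total = sc.parallelize(Seq("a", "b", "a")).map(w => lookup.value(w)).reduce(_ + _)
println(s"broadcast ${lookup.id} summed to $total") // 1 + 2 + 1 = 4
```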
bf3b150 Merge pull request #423 from jegonzal/GraphXProgrammingGuide Improving the graphx-programming-guide This PR will track a few minor improvements to the content and formatting of the graphx-programming-guide. (cherry picked from commit 3fcc68bfa5e9ef4b7abfd5051b6847a833e1ad2f) Signed-off-by: Reynold Xin <rxin@apache.org> 14 January 2014, 17:45:22 UTC
1b4adc2 Merge pull request #420 from pwendell/header-files Add missing header files (cherry picked from commit fa75e5e1c50da7d1e6c6f41c2d6d591c1e8a025f) Signed-off-by: Patrick Wendell <pwendell@gmail.com> 14 January 2014, 09:19:24 UTC
b60840e Merge pull request #418 from pwendell/0.9-versions Version changes for release 0.9.0. 14 January 2014, 08:48:34 UTC
1d9c210 Version changes for release 0.9.0. 14 January 2014, 08:47:47 UTC
980250b Merge pull request #416 from tdas/filestream-fix Removed unnecessary DStream operations and updated docs Removed StreamingContext.registerInputStream and registerOutputStream - they were useless. InputDStream has been made to register itself, and just registering a DStream as an output stream causes RDD objects to be created, but the RDDs will not be computed at all. Also made DStream.register() private[streaming] for the same reasons. Updated docs, especially adding package documentation for the streaming package. Also changed NetworkWordCount's input storage level to MEMORY_ONLY; replication on the local machine causes warning messages (as replication fails), which is scary for a new user trying out his/her first example. 14 January 2014, 08:05:37 UTC
f8bd828 Fixed loose ends in docs. 14 January 2014, 08:03:46 UTC
f8e239e Merge remote-tracking branch 'apache/master' into filestream-fix Conflicts: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala 14 January 2014, 07:57:27 UTC
055be5c Merge pull request #415 from pwendell/shuffle-compress Enable compression by default for spills 14 January 2014, 07:26:44 UTC
0984647 Enable compression by default for spills 14 January 2014, 07:25:25 UTC
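The setting in question, shown explicitly (a hedged sketch; with this change the compressed default is on, so setting the flag is only needed to opt out):

```scala
import org.apache.spark.SparkConf

// External-sort spills are compressed by default after this commit;
// set the compress flag to false only to opt out.
val conf = new SparkConf()
  .set("spark.shuffle.spill", "true")
  .set("spark.shuffle.spill.compress", "false") // opting out of the new default
```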
4e497db Removed StreamingContext.registerInputStream and registerOutputStream - they were useless, as InputDStream has been made to register itself. Also made DStream.register() private[streaming] - not useful to expose this confusing function. Updated a lot of documentation. 14 January 2014, 07:23:46 UTC
fdaabdc Merge pull request #380 from mateiz/py-bayes Add Naive Bayes to Python MLlib, and some API fixes - Added a Python wrapper for Naive Bayes - Updated the Scala Naive Bayes to match the style of our other algorithms better and in particular make it easier to call from Java (added builder pattern, removed default value in train method) - Updated Python MLlib functions to not require a SparkContext; we can get that from the RDD the user gives - Added a toString method in LabeledPoint - Made the Python MLlib tests run as part of run-tests as well (before they could only be run individually through each file) 14 January 2014, 07:08:26 UTC
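A sketch of the Scala `NaiveBayes` API the new Python wrapper mirrors (hedged; 0.9-era signatures take features as `Array[Double]`, and a `SparkContext` named `sc` is assumed):

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint

// Two one-hot training points; the second argument is the smoothing parameter.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Array(1.0, 0.0)),
  LabeledPoint(1.0, Array(0.0, 1.0))))
val model = NaiveBayes.train(training, 1.0)
println(model.predict(Array(0.0, 1.0))) // expected: 1.0
```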
4a805af Merge pull request #367 from ankurdave/graphx GraphX: Unifying Graphs and Tables GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph structured data at scale. See http://amplab.github.io/graphx/. Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak. Tasks left: - [x] Graph-level uncache - [x] Uncache previous iterations in Pregel - [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release) - [x] - Describe GC issue with GraphLab - [ ] Write `docs/graphx-programming-guide.md` - [x] - Mention future Bagel support in docs - [ ] - Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again. - [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx - [x] Make Graph serializable to work around capture in Spark shell - [x] Rename graph -> graphx in package name and subproject - [x] Remove standalone PageRank - [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~ 14 January 2014, 06:58:38 UTC
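A toy property graph in the new `org.apache.spark.graphx` package (a hedged sketch of the 0.9-era API; assumes a `SparkContext` named `sc`):

```scala
import org.apache.spark.graphx._

// Vertices and edges are ordinary RDDs; Graph unifies the two views.
val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
val graph = Graph(users, follows)
graph.triplets.collect().foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))
```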
80e73ed Adding minimal additional functionality to EdgeRDD 14 January 2014, 06:56:57 UTC
945fe7a Merge pull request #408 from pwendell/external-serializers Improvements to external sorting 1. Adds the option of compressing outputs. 2. Adds batching to the serialization to prevent OOM on the read side. 3. Slight renaming of config options. 4. Use Spark's buffer size for reads in addition to writes. 14 January 2014, 06:56:12 UTC
4bafc4f adding documentation about EdgeRDD 14 January 2014, 06:55:54 UTC
68641bc Merge pull request #413 from rxin/scaladoc Adjusted visibility of various components and documentation for 0.9.0 release. 14 January 2014, 06:54:13 UTC
0ca0d4d Merge pull request #401 from andrewor14/master External sorting - Add number of bytes spilled to Web UI Additionally, update test suite for external sorting to induce spilling. 14 January 2014, 06:32:21 UTC
af645be Fix all code examples in guide 14 January 2014, 06:29:45 UTC
2cd9358 Finish 6f6f8c928ce493357d4d32e46971c5e401682ea8 14 January 2014, 06:29:23 UTC
08b9fec Merge pull request #409 from tdas/unpersist Automatically unpersisting RDDs that have been cleaned up from DStreams Earlier, RDDs generated by DStreams were forgotten but not unpersisted; the system relied on the natural BlockManager LRU to drop the data. The cleaner.ttl was a hammer to clean up RDDs, but it is something that needs to be set separately, and very conservatively (at best, a few minutes). This automatic unpersisting lets the system handle cleanup itself, which reduces memory usage. As a side effect it will also improve GC performance, as fewer objects are stored in memory. In fact, for some workloads, it may allow RDDs to be cached as deserialized, which speeds up processing without too much GC overhead. This is disabled by default. To enable it, set the configuration spark.streaming.unpersist to true. In a future release, this will be set to true by default. Also reduced the sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second; from my conversation with Matei, there does not seem to be any good reason to wait so long for messages to be sent out. 14 January 2014, 06:29:03 UTC
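Opting in to the behavior described above (a hedged sketch; the flag name is taken from the commit message):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Off by default in this release; when enabled, RDDs generated by DStreams
// are unpersisted automatically once the system is done with them.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("UnpersistDemo")
  .set("spark.streaming.unpersist", "true")
val ssc = new StreamingContext(conf, Seconds(1))
```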
76ebdae Fix bug in GraphLoader.edgeListFile that caused srcId > dstId 14 January 2014, 06:20:45 UTC
c6dbfd1 Edge object must be public for Edge case class 14 January 2014, 06:08:44 UTC
6f6f8c9 Wrap methods in the appropriate class/object declaration 14 January 2014, 05:55:35 UTC
67795db Write Graph Builders section in guide 14 January 2014, 05:45:11 UTC
e14a14b Remove K-Core and LDA sections from guide; they are unimplemented 14 January 2014, 05:12:58 UTC
c28e5a0 Improve scaladoc links 14 January 2014, 05:11:39 UTC
59e4384 Fix Pregel SSSP example in programming guide 14 January 2014, 05:02:38 UTC
c6023be Fix infinite loop in GraphGenerators.generateRandomEdges The loop occurred when numEdges < numVertices. This commit fixes it by allowing generateRandomEdges to generate a multigraph. 14 January 2014, 05:02:37 UTC
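A hedged illustration of the idea behind that fix (not the actual patch): drawing destination vertices with replacement produces a multigraph and always terminates after exactly numEdges draws, with no rejection loop that could fail to make progress.

```scala
import scala.util.Random

// Sample numEdges out-edges for one source vertex, allowing duplicate edges.
def randomOutEdges(src: Long, numEdges: Int, numVertices: Int): Seq[(Long, Long)] =
  Seq.fill(numEdges)((src, Random.nextInt(numVertices).toLong))
```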