https://github.com/apache/spark

65cc451 [SPARK-12363] [MLLIB] [BACKPORT-1.3] Remove setRun and fix PowerIterationClustering failed test ## What changes were proposed in this pull request? Backport JIRA-SPARK-12363 to branch-1.3. ## How was this patch tested? Unit test. cc mengxr Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #11265 from viirya/backport-12363-1.3 and squashes the following commits: ec076dd [Liang-Chi Hsieh] Fix scala style. 7a3ef5f [Xiangrui Meng] use Graph instead of GraphImpl and update tests and example based on PIC paper b86018d [Liang-Chi Hsieh] Remove setRun and fix PowerIterationClustering failed test. 26 February 2016, 05:15:59 UTC
6ddde8e [SPARK-13464][STREAMING][PYSPARK] Fix failed streaming in pyspark in branch 1.3 JIRA: https://issues.apache.org/jira/browse/SPARK-13464 ## What changes were proposed in this pull request? While backporting an MLlib feature, I found that a freshly checked-out branch-1.3 codebase would fail at the test `test_reduce_by_key_and_window_with_none_invFunc` in pyspark/streaming. We should fix it. ## How was this patch tested? Unit test `test_reduce_by_key_and_window_with_none_invFunc` is fixed. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11339 from viirya/fix-streaming-test-branch-1.3. 25 February 2016, 20:28:50 UTC
387d818 [SPARK-11812][PYSPARK] invFunc=None works properly with python's reduceByKeyAndWindow invFunc is optional and can be None. Instead of invFunc (the parameter), invReduceFunc (a local function) was checked for truthiness (that is, for not being None, in this context). A local function is never None, so the case of invFunc=None (a common one when inverse reduction is not defined) was handled incorrectly, resulting in loss of data. In addition, the docstring used wrong parameter names; these are also fixed. Author: David Tolpin <david.tolpin@gmail.com> Closes #9775 from dtolpin/master. (cherry picked from commit 599a8c6e2bf7da70b20ef3046f5ce099dfd637f8) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 19 November 2015, 21:59:06 UTC
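A minimal sketch of the truthiness bug, with illustrative names (not PySpark's actual DStream internals):

```python
def reduce_by_key_and_window(func, invFunc=None):
    def invReduceFunc(a, b):            # local wrapper, defined unconditionally
        return invFunc(a, b)

    # Buggy check: a local function object is always truthy, so the inverse
    # path was taken even when invFunc was None, silently dropping data.
    # if invReduceFunc: ...
    # Fixed check: test the parameter the caller actually passed.
    if invFunc is not None:
        return "incremental window (inverse reduce)"
    return "window recomputed from scratch"

assert reduce_by_key_and_window(max) == "window recomputed from scratch"
```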
5278ef0 [SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec jira: https://issues.apache.org/jira/browse/SPARK-11813 I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits: 1. Performance improvement from less serialization. 2. A large increase in the capacity of Word2Vec. Currently in the fit of word2vec, the closure mainly includes serialization of Word2Vec and 2 global tables. The main part of Word2Vec is the vocab, of size vocab * 40 * 2 * 4 = 320 * vocab bytes. The 2 global tables: vocab * vectorSize * 8. If vectorSize = 20, that's 160 * vocab bytes. Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab helps decrease the size of the closure serialization, especially when vectorSize is small, thus allowing a larger vocabulary. Actually there's another possible fix: make local copies of fields to avoid including Word2Vec in the closure. Let me know if that's preferred. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9803 from hhbyyh/w2vVocab. (cherry picked from commit e391abdf2cb6098a35347bd123b815ee9ac5b689) Signed-off-by: Xiangrui Meng <meng@databricks.com> 18 November 2015, 21:26:05 UTC
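Back-of-the-envelope arithmetic from the description above, as a runnable check (the vocabulary size is an illustrative value, not from the PR):

```python
vocab = 4_000_000
vector_size = 20

vocab_bytes = vocab * 40 * 2 * 4        # ~320 bytes per vocabulary entry
table_bytes = vocab * vector_size * 8   # the two global tables combined

total = vocab_bytes + table_bytes
assert total < 2**31 - 1  # must fit in a ByteArrayOutputStream (Int.MaxValue)
print(f"closure payload ≈ {total / 1e9:.2f} GB of the ~2.1 GB limit")
```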
1bfa00d Fixed error in scaladoc of convertToCanonicalEdges The code convertToCanonicalEdges is such that srcIds are smaller than dstIds but the scaladoc suggested otherwise. Have fixed the same. Author: Gaurav Kumar <gauravkumar37@gmail.com> Closes #9666 from gauravkumar37/patch-1. (cherry picked from commit df0e318152165c8e50793aff13aaca5d2d9b8b9d) Signed-off-by: Reynold Xin <rxin@databricks.com> 12 November 2015, 20:14:35 UTC
b90e5cb [SPARK-11424] Guard against double-close() of RecordReaders (branch-1.3 backport) This is a branch-1.3 backport of #9382, a fix for SPARK-11424. Author: Josh Rosen <joshrosen@databricks.com> Closes #9423 from JoshRosen/hadoop-decompressor-pooling-fix-branch-1.3. 03 November 2015, 22:17:51 UTC
0ce1485 [SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test. Supersedes https://github.com/apache/spark/pull/9293 Author: Sean Owen <sowen@cloudera.com> Closes #9309 from srowen/SPARK-11302.2. (cherry picked from commit 826e1e304b57abbc56b8b7ffd663d53942ab3c7c) Signed-off-by: Xiangrui Meng <meng@databricks.com> 28 October 2015, 06:10:14 UTC
25203d9 [SPARK-10973] [ML] [PYTHON] Fix IndexError exception on SparseVector when asked for index after the last non-zero entry See https://github.com/apache/spark/pull/9009 for details. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9062 from zero323/SPARK-10973_1.3. 13 October 2015, 00:23:37 UTC
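A plain-Python sketch of the indexing fix (illustrative, not PySpark's actual `SparseVector.__getitem__`): indexing past the last stored entry must return 0.0, not fall off the end of the index array.

```python
import bisect

def sparse_get(size, indices, values, i):
    if i < 0 or i >= size:
        raise IndexError("index out of range")
    pos = bisect.bisect_left(indices, i)
    # Guard the position before dereferencing: past the last non-zero
    # entry, pos == len(indices), which previously raised IndexError.
    if pos < len(indices) and indices[pos] == i:
        return values[pos]
    return 0.0

assert sparse_get(4, [1, 3], [1.0, 5.5], 3) == 5.5
assert sparse_get(4, [1], [1.0], 3) == 0.0   # the formerly failing case
```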
035d6d7 [SPARK-10980] [SQL] fix bug in create Decimal The created decimal is wrong if using `Decimal(unscaled, precision, scale)` with unscaled > 1e18 and precision > 18 and scale > 0. This bug has existed since the beginning. Author: Davies Liu <davies@databricks.com> Closes #9014 from davies/fix_decimal. (cherry picked from commit 37526aca2430e36a931fbe6e01a152e701a1b94e) Signed-off-by: Davies Liu <davies.liu@gmail.com> Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala 07 October 2015, 22:54:07 UTC
9f4b926 Preparing development version 1.3.3-SNAPSHOT 24 September 2015, 01:59:20 UTC
5a13975 Preparing Spark release v1.3.2-rc1 24 September 2015, 01:59:15 UTC
392875a Update branch-1.3 for 1.3.2 release. Author: Reynold Xin <rxin@databricks.com> Closes #8894 from rxin/branch-1.3. 24 September 2015, 01:46:32 UTC
e54525f [SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator (branch-1.3 backport) This is a backport of #8544 to `branch-1.3` for inclusion in 1.3.2. Author: Josh Rosen <joshrosen@databricks.com> Closes #8790 from JoshRosen/SPARK-10381-1.3. 22 September 2015, 20:37:25 UTC
64730a3 [SPARK-10657] Remove SCP-based Jenkins log archiving As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to use our custom SCP-based mechanism for archiving Jenkins logs on the master machine; this has been superseded by the use of a Jenkins plugin which archives the logs and provides public links to view them. Per shaneknapp, we should remove this log syncing mechanism if it is no longer necessary; removing the need to SCP from the Jenkins workers to the masters is a desired step as part of some larger Jenkins infra refactoring. Author: Josh Rosen <joshrosen@databricks.com> Closes #8793 from JoshRosen/remove-jenkins-ssh-to-master. (cherry picked from commit f1c911552cf5d0d60831c79c1881016293aec66c) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 17 September 2015, 18:42:19 UTC
7494034 [SPARK-10642] [PYSPARK] Fix crash when calling rdd.lookup() on tuple keys JIRA: https://issues.apache.org/jira/browse/SPARK-10642 When calling `rdd.lookup()` on a RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8796 from viirya/fix-pyrdd-lookup. (cherry picked from commit 136c77d8bbf48f7c45dd7c3fbe261a0476f455fe) Signed-off-by: Davies Liu <davies.liu@gmail.com> 17 September 2015, 17:02:58 UTC
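The shape of the fix, sketched in plain Python (the folding constant is illustrative; PySpark's actual `portable_hash` differs): the hash handed to the JVM must fit a 32-bit signed int.

```python
def portable_hash_int(obj):
    # On a 64-bit CPython 2 interpreter, hash() of a tuple can exceed
    # 32 bits; fold it so the JVM side never sees a Long where it
    # expects an Integer.
    return hash(obj) & 0x7FFFFFFF

key = ("user", 42)
assert 0 <= portable_hash_int(key) < 2**31
```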
8c8d7ab [SPARK-10556] Remove explicit Scala version for sbt project build files Previously, project/plugins.sbt explicitly set scalaVersion to 2.10.4. This can cause issues when using a version of sbt that is compiled against a different version of Scala (for example sbt 0.13.9 uses 2.10.5). Removing this explicit setting will cause build files to be compiled and run against the same version of Scala that sbt is compiled against. Note that this only applies to the project build files (items in project/), it is distinct from the version of Scala we target for the actual spark compilation. Author: Ahir Reddy <ahirreddy@gmail.com> Closes #8709 from ahirreddy/sbt-scala-version-fix. (cherry picked from commit 9bbe33f318c866c0b13088917542715062f0787f) Signed-off-by: Sean Owen <sowen@cloudera.com> 11 September 2015, 12:06:55 UTC
d0d7ada [SPARK-6931] [PYSPARK] Cast Python time float values to int before serialization Python time values return a floating point value, need to cast to integer before serialize with struct.pack('!q', value) https://issues.apache.org/jira/browse/SPARK-6931 Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #8594 from BryanCutler/py-write_long-backport-6931-1.2. (cherry picked from commit 4862a80d2f992e41d30aa5121882f3452d8216b8) Signed-off-by: Davies Liu <davies.liu@gmail.com> 10 September 2015, 18:20:40 UTC
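The core of the change, as a runnable fragment:

```python
import struct
import time

value = time.time()                      # float seconds since the epoch
packed = struct.pack('!q', int(value))   # '!q' wants an integer; cast first
# struct.pack('!q', value) with the raw float raises (or, on old Pythons,
# warned) "required argument is not an integer".
assert struct.unpack('!q', packed)[0] == int(value)
```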
9fcd831 [MINOR] [MLLIB] [ML] [DOC] fixed typo: label for negative result should be 0.0 (original: 1.0) Small typo in the example for `LabeledPoint` in the MLlib docs. Author: Sean Paradiso <seanparadiso@gmail.com> Closes #8680 from sparadiso/docs_mllib_smalltypo. (cherry picked from commit 1dc7548c598c4eb4ecc7d5bb8962a735bbd2c0f7) Signed-off-by: Xiangrui Meng <meng@databricks.com> 10 September 2015, 05:10:09 UTC
29836e2 [SPARK-10353] [MLLIB] (1.3 backport) BLAS gemm not scaling when beta = 0.0 for some subset of matrix multiplications Apply fixes for alpha, beta parameter handling in gemm/gemv from #8525 to branch 1.3 CC mengxr brkyvz Author: Sean Owen <sowen@cloudera.com> Closes #8572 from srowen/SPARK-10353.2. 02 September 2015, 20:33:24 UTC
a58c1af [SPARK-10354] [MLLIB] fix some apparent memory issues in k-means|| initialization * do not cache first cost RDD * change following cost RDD cache level to MEMORY_AND_DISK * remove Vector wrapper to save an object per instance Further improvements will be addressed in SPARK-10329 cc: yu-iskw HuJiayin Author: Xiangrui Meng <meng@databricks.com> Closes #8526 from mengxr/SPARK-10354. (cherry picked from commit f0f563a3c43fc9683e6920890cce44611c0c5f4b) Signed-off-by: Xiangrui Meng <meng@databricks.com> 31 August 2015, 06:21:09 UTC
e8b0564 [SPARK-8400] [ML] Added check in ml.ALS for positive block size parameter setting Added check for positive block size with a note that -1 for auto-configuring is not supported Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #8363 from BryanCutler/ml.ALS-neg-blocksize-8400-1.3. 25 August 2015, 11:36:49 UTC
3d2eaf0 [SPARK-10169] [SQL] [BRANCH-1.3] Partial aggregation's plan is wrong when a grouping expression is used as an argument of the aggregate function https://issues.apache.org/jira/browse/SPARK-10169 Author: Wenchen Fan <cloud0fan@outlook.com> Author: Yin Huai <yhuai@databricks.com> Closes #8380 from yhuai/aggTransformDown-branch1.3. 24 August 2015, 20:00:49 UTC
a98603f [SPARK-9801] [STREAMING] Check if file exists before deleting temporary files. Spark streaming deletes the temp file and backup files without checking if they exist or not Author: Hao Zhu <viadeazhu@gmail.com> Closes #8082 from viadea/master and squashes the following commits: 242d05f [Hao Zhu] [SPARK-9801][Streaming]No need to check the existence of those files fd143f2 [Hao Zhu] [SPARK-9801][Streaming]Check if backupFile exists before deleting backupFile files. 087daf0 [Hao Zhu] SPARK-9801 (cherry picked from commit 3c9802d9400bea802984456683b2736a450ee17e) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 11 August 2015, 00:18:03 UTC
b104501 [SPARK-9633] [BUILD] SBT download locations outdated; need an update Remove 2 defunct SBT download URLs and replace with the 1 known download URL. Also, use https. Follow up on https://github.com/apache/spark/pull/7792 Author: Sean Owen <sowen@cloudera.com> Closes #7956 from srowen/SPARK-9633 and squashes the following commits: caa40bd [Sean Owen] Remove 2 defunct SBT download URLs and replace with the 1 known download URL. Also, use https. (cherry picked from commit 681e3024b6c2fcb54b42180d94d3ba3eed52a2d4) Signed-off-by: Sean Owen <sowen@cloudera.com> 06 August 2015, 22:44:25 UTC
384793d [SPARK-9607] [SPARK-9608] fix zinc-port handling in build/mvn - pass `$ZINC_PORT` to zinc status/shutdown commands - fix path check that sets `$ZINC_INSTALL_FLAG`, which was incorrectly causing zinc to be shutdown and restarted every time (with mismatched ports on those commands to boot) - pass `-DzincPort=${ZINC_PORT}` to maven, to use the correct zinc port when building Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #7944 from ryan-williams/zinc-status and squashes the following commits: 619c520 [Ryan Williams] fix zinc status/shutdown commands (cherry picked from commit e27a8c4cb3564f1b2d1ee5445dff341c8e0087b0) Signed-off-by: Sean Owen <sowen@cloudera.com> 05 August 2015, 10:11:22 UTC
cd5d1be [SPARK-3190] [GRAPHX] Fix VertexRDD.count() overflow regression SPARK-3190 was originally fixed by 96df92906978c5f58e0cc8ff5eebe5b35a08be3b, but a5ef58113667ff73562ce6db381cff96a0b354b0 introduced a regression during refactoring. This commit fixes the regression. Author: Ankur Dave <ankurdave@gmail.com> Closes #7923 from ankurdave/SPARK-3190-reopening and squashes the following commits: a3e1b23 [Ankur Dave] Fix VertexRDD.count() overflow regression (cherry picked from commit 9e952ecbce670e9b532a1c664a4d03b66e404112) Signed-off-by: Reynold Xin <rxin@databricks.com> 04 August 2015, 06:07:59 UTC
265ec35 [SPARK-7563] (backport for 1.3) OutputCommitCoordinator.stop() should only run on the driver Backport of "[SPARK-7563] OutputCommitCoordinator.stop() should only run on the driver" for 1.3 Author: Sean Owen <sowen@cloudera.com> Closes #7865 from srowen/SPARK-7563-1.3 and squashes the following commits: f4479bc [Sean Owen] Backport of "[SPARK-7563] OutputCommitCoordinator.stop() should only run on the driver" for 1.3 03 August 2015, 12:59:00 UTC
cc5f711 [SPARK-9254] [BUILD] [HOTFIX] sbt-launch-lib.bash should support HTTP/HTTPS redirection Target file(s) can be hosted on CDN nodes. HTTP/HTTPS redirection must be supported to download these files. Author: Cheng Lian <lian@databricks.com> Closes #7597 from liancheng/spark-9254 and squashes the following commits: fd266ca [Cheng Lian] Uses `--fail' to make curl return non-zero value and remove garbage output when the download fails a7cbfb3 [Cheng Lian] Supports HTTP/HTTPS redirection (cherry picked from commit b55a36bc30a628d76baa721d38789fc219eccc27) Signed-off-by: Sean Owen <sowen@cloudera.com> 02 August 2015, 12:56:48 UTC
047a613 [SPARK-9507] [BUILD] Remove dependency reduced POM hack now that shade plugin is updated Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of shade plugin version so children inherit it; bump assembly plugin version while here See https://issues.apache.org/jira/browse/SPARK-8819 I verified that `mvn clean package -DskipTests` works with Maven 3.3.3. pwendell are you up for trying this for the 1.5.0 release? Author: Sean Owen <sowen@cloudera.com> Closes #7826 from srowen/SPARK-9507 and squashes the following commits: e0b0fd2 [Sean Owen] Update to shade plugin 2.4.1, which removes the need for the dependency-reduced-POM workaround and the 'release' profile. Fix management of shade plugin version so children inherit it; bump assembly plugin version while here (cherry picked from commit 6e5fd613ea4b9aa0ab485ba681277a51a4367168) Signed-off-by: Sean Owen <sowen@cloudera.com> # Conflicts: # dev/create-release/create-release.sh # pom.xml 31 July 2015, 21:05:57 UTC
f941482 [SPARK-9236] [CORE] Make defaultPartitioner not reuse a parent RDD's partitioner if it has 0 partitions See also comments on https://issues.apache.org/jira/browse/SPARK-9236 Author: François Garillot <francois@garillot.net> Closes #7616 from huitseeker/issue/SPARK-9236 and squashes the following commits: 217f902 [François Garillot] [SPARK-9236] Make defaultPartitioner not reuse a parent RDD's partitioner if it has 0 partitions (cherry picked from commit 6cd28cc21ed585ab8d1e0e7147a1a48b044c9c8e) Signed-off-by: Sean Owen <sowen@cloudera.com> 24 July 2015, 14:41:35 UTC
b20a9ab [SPARK-9175] [MLLIB] BLAS.gemm fails to update matrix C when alpha==0 and beta!=1 Fix BLAS.gemm to update matrix C when alpha==0 and beta!=1 Also include unit tests to verify the fix. mengxr brkyvz Author: Meihua Wu <meihuawu@umich.edu> Closes #7503 from rotationsymmetry/fix_BLAS_gemm and squashes the following commits: fce199c [Meihua Wu] Fix BLAS.gemm to update C when alpha==0 and beta!=1 (cherry picked from commit ff3c72dbafa16c6158fc36619f3c38344c452ba0) Signed-off-by: Xiangrui Meng <meng@databricks.com> 21 July 2015, 00:04:05 UTC
c8b17da [SPARK-9198] [MLLIB] [PYTHON] Fixed typo in pyspark sparsevector doc tests Several places in the PySpark SparseVector docs have one defined as: ``` SparseVector(4, [2, 4], [1.0, 2.0]) ``` The index 4 goes out of bounds (but this is not checked). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #7541 from jkbradley/sparsevec-doc-typo-fix and squashes the following commits: c806a65 [Joseph K. Bradley] fixed doc test e2dcb23 [Joseph K. Bradley] Fixed typo in pyspark sparsevector doc tests (cherry picked from commit a5d05819afcc9b19aeae4817d842205f32b34335) Signed-off-by: Xiangrui Meng <meng@databricks.com> 20 July 2015, 23:51:13 UTC
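For reference, the out-of-bounds pattern from the old doctests, with an in-bounds equivalent (the corrected indices here are illustrative):

```python
# A vector of size 4 has valid indices 0..3; index 4 is out of bounds,
# and the constructor did not check it.
# SparseVector(4, [2, 4], [1.0, 2.0])   # broken example from the docs
# SparseVector(4, [2, 3], [1.0, 2.0])   # an in-bounds equivalent
```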
596a4cb Disable flaky test: ReceiverSuite "block generator throttling". 20 July 2015, 18:02:00 UTC
0163325 [SPARK-8865] [STREAMING] FIX BUG: check key in kafka params Author: guowei2 <guowei@growingio.com> Closes #7254 from guowei2/spark-8865 and squashes the following commits: 48ca17a [guowei2] fix contains key (cherry picked from commit 897700369f3aedf1a8fdb0984dd3d6d8e498e3af) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 09 July 2015, 22:02:47 UTC
960aec9 Revert "[SPARK-8781] Fix variables in published pom.xml are not resolved" This reverts commit 502e1fd68f9efc0311062146fa058dec3ef0e70b. Conflicts: pom.xml 07 July 2015, 02:28:45 UTC
5f1d1c0 [SPARK-8819] Fix build for maven 3.3.x This is a workaround for MSHADE-148, which leads to an infinite loop when building Spark with maven 3.3.x. This was originally caused by #6441, which added a bunch of test dependencies on the spark-core test module. Recently, it was revealed by #7193. This patch adds a `-Prelease` profile. If present, it will set `createDependencyReducedPom` to true. The consequences are: - If you are releasing Spark with this profile, you are fine as long as you use maven 3.2.x or before. - If you are releasing Spark without this profile, you will run into SPARK-8781. - If you are not releasing Spark but you are using this profile, you may run into SPARK-8819. - If you are not releasing Spark and you did not include this profile, you are fine. This is all documented in `pom.xml` and tested locally with both versions of maven. Author: Andrew Or <andrew@databricks.com> Closes #7219 from andrewor14/fix-maven-build and squashes the following commits: 1d37e87 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-maven-build 3574ae4 [Andrew Or] Review comments f39199c [Andrew Or] Create a -Prelease profile that flags `createDependencyReducedPom` Conflicts: dev/create-release/create-release.sh pom.xml 07 July 2015, 02:25:36 UTC
502e1fd [SPARK-8781] Fix variables in published pom.xml are not resolved The issue is summarized in the JIRA and is caused by this commit: 984ad60147c933f2d5a2040c87ae687c14eb1724. This patch reverts that commit and fixes the maven build in a different way. We limit the dependencies of `KinesisReceiverSuite` to avoid having to deal with the complexities in how maven deals with transitive test dependencies. Author: Andrew Or <andrew@databricks.com> Closes #7193 from andrewor14/fix-kinesis-pom and squashes the following commits: ca3d5d4 [Andrew Or] Limit kinesis test dependencies f24e09c [Andrew Or] Revert "[BUILD] Fix Maven build for Kinesis" (cherry picked from commit 82cf3315e690f4ac15b50edea6a3d673aa5be4c0) Signed-off-by: Andrew Or <andrew@databricks.com> Conflicts: extras/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisReceiverSuite.scala 02 July 2015, 20:53:05 UTC
3a71cf9 [SPARK-8535] [PYSPARK] PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name Because the implicit names in `pandas.columns` are Int, while `StructField` JSON expects `String`, the `pandas.columns` values should be converted to `String`. ### issue * [SPARK-8535 PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name](https://issues.apache.org/jira/browse/SPARK-8535) Author: x1- <viva008@gmail.com> Closes #7124 from x1-/SPARK-8535 and squashes the following commits: d68fd38 [x1-] modify unit-test using pandas. ea1897d [x1-] For implicit name of pandas.columns are Int, so should be convert to String. (cherry picked from commit b6e76edf3005c078b407f63b0a05d3a28c18c742) Signed-off-by: Davies Liu <davies@databricks.com> 01 July 2015, 03:38:45 UTC
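A short demonstration of the mismatch (requires pandas; the conversion shown is the idea behind the fix, not the patch itself):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])       # no explicit column names
print(df.columns.tolist())                # [0, 1] -- ints, not strings

# Converting names to strings avoids the StructField JSON type mismatch:
df.columns = [str(c) for c in df.columns]
print(df.columns.tolist())                # ['0', '1']
```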
d720426 [SPARK-8563] [MLLIB] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k I accidentally closed https://github.com/apache/spark/pull/6949, so I pushed the code again and added a test. There is a bug where `U.numCols() = self.nCols` in `IndexedRowMatrix.computeSVD()`; it should have been `U.numCols() = k = svd.U.numCols()`:
```
self = U * sigma * V.transpose
(m x n) = (m x n) * (k x k) * (k x n)  // AS-IS
(m x n) = (m x k) * (k x k) * (k x n)  // TO-BE
```
Author: lee19 <lee19@live.co.kr> Closes #6953 from lee19/MLlibBugfix and squashes the following commits: c1812a0 [lee19] [SPARK-8563] [MLlib] Used nRows instead of numRows() to reduce a burden. 4b9803b [lee19] [SPARK-8563] [MLlib] Fixed a build error. c2ccd89 [lee19] Added a unit test that validates matrix sizes of svd for [SPARK-8563][MLlib] 8373424 [lee19] [SPARK-8563][MLlib] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k (cherry picked from commit e72526227fdcf93b7a33375ef954746ac08753f5) Signed-off-by: Xiangrui Meng <meng@databricks.com> 30 June 2015, 21:08:18 UTC
0ce83db [SPARK-7810] [PYSPARK] solve python rdd socket connection problem Method "_load_from_socket" in rdd.py cannot load data from the JVM socket when IPv6 is used; the current method only works with IPv4. The new implementation works with both protocols. Author: Ai He <ai.he@ussuning.com> Author: AiHe <ai.he@ussuning.com> Closes #6338 from AiHe/pyspark-networking-issue and squashes the following commits: d4fc9c4 [Ai He] handle code review 2 e75c5c8 [Ai He] handle code review 5644953 [AiHe] solve python rdd socket connection problem to jvm (cherry picked from commit ecd3aacf2805bb231cfb44bab079319cfe73c3f1) Signed-off-by: Davies Liu <davies@databricks.com> 29 June 2015, 21:37:54 UTC
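A hedged sketch of the protocol-agnostic connect pattern (simplified from what rdd.py's `_load_from_socket` needs; names are illustrative):

```python
import socket

def connect_local(port):
    # Iterate over every address family getaddrinfo() reports for
    # localhost (IPv4 and IPv6) instead of hard-coding AF_INET.
    err = None
    for af, socktype, proto, _, sa in socket.getaddrinfo(
            "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        try:
            sock = socket.socket(af, socktype, proto)
            sock.connect(sa)
            return sock
        except OSError as e:
            err = e
    raise err or OSError("could not open socket")
```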
ac3591d [SPARK-8606] Prevent exceptions in RDD.getPreferredLocations() from crashing DAGScheduler If `RDD.getPreferredLocations()` throws an exception it may crash the DAGScheduler and SparkContext. This patch addresses this by adding a try-catch block. Author: Josh Rosen <joshrosen@databricks.com> Closes #7023 from JoshRosen/SPARK-8606 and squashes the following commits: 770b169 [Josh Rosen] Fix getPreferredLocations() DAGScheduler crash with try block. 44a9b55 [Josh Rosen] Add test of a buggy getPartitions() method 19aa9f7 [Josh Rosen] Add (failing) regression test for getPreferredLocations() DAGScheduler crash (cherry picked from commit 0b5abbf5f96a5f6bfd15a65e8788cf3fa96fe54c) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 27 June 2015, 21:54:38 UTC
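A simplified guard, in Python for consistency with the other sketches (the actual patch is Scala and surfaces the error as a job failure rather than swallowing it):

```python
def preferred_locations(rdd, partition):
    # getPreferredLocations() is user-overridable and may throw; catching
    # here keeps one misbehaving RDD from crashing the scheduler loop.
    try:
        return rdd.getPreferredLocations(partition)
    except Exception:
        return []
```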
88e303f [SPARK-8525] [MLLIB] fix LabeledPoint parser when there is a whitespace between label and features vector fix LabeledPoint parser when there is a whitespace between label and features vector, e.g. (y, [x1, x2, x3]) Author: Oleksiy Dyagilev <oleksiy_dyagilev@epam.com> Closes #6954 from fe2s/SPARK-8525 and squashes the following commits: 0755b9d [Oleksiy Dyagilev] [SPARK-8525][MLLIB] addressing comment, removing dep on commons-lang c1abc2b [Oleksiy Dyagilev] [SPARK-8525][MLLIB] fix LabeledPoint parser when there is a whitespace on specific position (cherry picked from commit a8031183aff2e23de9204ddfc7e7f5edbf052a7e) Signed-off-by: Xiangrui Meng <meng@databricks.com> 23 June 2015, 20:17:27 UTC
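A toy parser showing the whitespace tolerance (a simplified sketch, not MLlib's actual `LabeledPoint.parse`):

```python
def parse_labeled_point(text):
    # Accept "(y, [x1, x2, x3])" with or without a space after the comma.
    label_part, features_part = text.strip("()").split(",", 1)
    features = features_part.strip().strip("[]")   # strip() eats the space
    return float(label_part), [float(x) for x in features.split(",")]

assert parse_labeled_point("(1.0,[2.5,3.0])") == (1.0, [2.5, 3.0])
assert parse_labeled_point("(1.0, [2.5, 3.0])") == (1.0, [2.5, 3.0])
```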
716dcf6 [SPARK-8541] [PYSPARK] test the absolute error in approx doctests A minor change but one which is (presumably) visible on the public api docs webpage. Author: Scott Taylor <github@megatron.me.uk> Closes #6942 from megatron-me-uk/patch-3 and squashes the following commits: fbed000 [Scott Taylor] test the absolute error in approx doctests (cherry picked from commit f0dcbe8a7c2de510b47a21eb45cde34777638758) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 23 June 2015, 06:39:39 UTC
45b4527 [SPARK-8095][BACKPORT] Resolve dependencies of --packages in local ivy cache Backported PR #6788 cc andrewor14 Author: Burak Yavuz <brkyvz@gmail.com> Closes #6923 from brkyvz/backport-local-ivy and squashes the following commits: eb17384 [Burak Yavuz] [SPARK-8095][BACKPORT] Resolve dependencies of --packages in local ivy cache 22 June 2015, 21:45:52 UTC
0b8dce0 [SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files Clarify what may cause long-running Spark apps to preserve shuffle files Author: Sean Owen <sowen@cloudera.com> Closes #6901 from srowen/SPARK-5836 and squashes the following commits: a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files (cherry picked from commit 4be53d0395d3c7f61eef6b7d72db078e2e1199a7) Signed-off-by: Andrew Or <andrew@databricks.com> 19 June 2015, 18:03:22 UTC
1d44147 [SPARK-8451] [SPARK-7287] SparkSubmitSuite should check exit code This patch also reenables the tests. Now that we have access to the log4j logs it should be easier to debug the flakiness. yhuai brkyvz Author: Andrew Or <andrew@databricks.com> Closes #6886 from andrewor14/spark-submit-suite-fix and squashes the following commits: 3f99ff1 [Andrew Or] Move destroy to finally block 9a62188 [Andrew Or] Re-enable ignored tests 2382672 [Andrew Or] Check for exit code (cherry picked from commit 68a2dca292776d4a3f988353ba55adc73a7c1aa2) Signed-off-by: Andrew Or <andrew@databricks.com> Conflicts: core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala 19 June 2015, 17:57:13 UTC
cf232f0 [SPARK-8309] [CORE] Support for more than 12M items in OpenHashMap The problem occurs because the position mask `0xEFFFFFF` is incorrect: its 25th bit (bit 24) is zero, so when capacity grows beyond 2^24, `OpenHashMap` calculates an incorrect index into the `_values` array. I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices. Author: Vyacheslav Baranov <slavik.baranov@gmail.com> Closes #6763 from SlavikBaranov/SPARK-8309 and squashes the following commits: 8557445 [Vyacheslav Baranov] Resolved review comments 4d5b954 [Vyacheslav Baranov] Resolved review comments eaf1e68 [Vyacheslav Baranov] Fixed failing test f9284fd [Vyacheslav Baranov] Resolved review comments 3920656 [Vyacheslav Baranov] SPARK-8309: Support for more than 12M items in OpenHashMap (cherry picked from commit c13da20a55b80b8632d547240d2c8f97539969a1) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 June 2015, 08:44:15 UTC
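The mask arithmetic, checked in Python (0x1FFFFFFF is the corrected constant; the demo position is illustrative):

```python
BUGGY_MASK = 0xEFFFFFF    # bit 24 is zero: positions >= 2**24 wrap wrongly
FIXED_MASK = 0x1FFFFFFF   # all 29 low bits set

pos = (1 << 24) | 5       # a position once capacity exceeds 2**24
assert pos & FIXED_MASK == pos
assert pos & BUGGY_MASK == 5   # the 2**24 component is silently dropped
```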
5f1a8e7 [SPARK-8126] [BUILD] Make sure temp dir exists when running tests. If you ran "clean" at the top-level sbt project, the temp dir would go away, so running "test" without restarting sbt would fail. This fixes that by making sure the temp dir exists before running tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #6805 from vanzin/SPARK-8126-fix and squashes the following commits: 12d7768 [Marcelo Vanzin] [SPARK-8126] [build] Make sure temp dir exists when running tests. (cherry picked from commit cebf2411847706a98dc8df9c754ef53d6d12a87c) Signed-off-by: Sean Owen <sowen@cloudera.com> 16 June 2015, 20:10:36 UTC
9480aa3 [SPARK-8126] [BUILD] Use custom temp directory during build. Even with all the efforts to cleanup the temp directories created by unit tests, Spark leaves a lot of garbage in /tmp after a test run. This change overrides java.io.tmpdir to place those files under the build directory instead. After an sbt full unit test run, I was left with > 400 MB of temp files. Since they're now under the build dir, it's much easier to clean them up. Also make a slight change to a unit test to make it not pollute the source directory with test data. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #6674 from vanzin/SPARK-8126 and squashes the following commits: 0f8ad41 [Marcelo Vanzin] Make sure tmp dir exists when tests run. 643e916 [Marcelo Vanzin] [MINOR] [BUILD] Use custom temp directory during build. 09 June 2015, 06:57:06 UTC
582f437 Revert "[MINOR] [BUILD] Use custom temp directory during build." This reverts commit 5185ea9b4df3ee73807859b70ddfca8f02f1a659. 06 June 2015, 15:44:08 UTC
5185ea9 [MINOR] [BUILD] Use custom temp directory during build. Even with all the efforts to cleanup the temp directories created by unit tests, Spark leaves a lot of garbage in /tmp after a test run. This change overrides java.io.tmpdir to place those files under the build directory instead. After an sbt full unit test run, I was left with > 400 MB of temp files. Since they're now under the build dir, it's much easier to clean them up. Also make a slight change to a unit test to make it not pollute the source directory with test data. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #6653 from vanzin/unit-test-tmp and squashes the following commits: 31e2dd5 [Marcelo Vanzin] Fix tests that depend on each other. aa92944 [Marcelo Vanzin] [minor] [build] Use custom temp directory during build. (cherry picked from commit b16b5434ff44c42e4b3a337f9af147669ba44896) Signed-off-by: Sean Owen <sowen@cloudera.com> 05 June 2015, 12:16:05 UTC
5b96b69 [SPARK-7205] [SPARK-7224] [SPARK-7306] Backport packages fixes Main motivation is to fix the flaky `SparkSubmitUtilsSuite` in branch-1.3. brkyvz Author: Burak Yavuz <brkyvz@gmail.com> Closes #6657 from andrewor14/backport-pr-5892-1.3 and squashes the following commits: f4f7fa8 [Burak Yavuz] [SPARK-7224] [SPARK-7306] mock repository generator for --packages tests without nio.Path e696c21 [Burak Yavuz] [SPARK-7205] Support `.ivy2/local` and `.m2/repositories/` in --packages 04 June 2015, 23:51:29 UTC
5e77d69 [SPARK-8098] [WEBUI] Show correct length of bytes on log page The log page should only show desired length of bytes. Currently it shows bytes from the startIndex to the end of the file. The "Next" button on the page is always disabled. Author: Carson Wang <carson.wang@intel.com> Closes #6640 from carsonwang/logpage and squashes the following commits: 58cb3fd [Carson Wang] Show correct length of bytes on log page (cherry picked from commit 63bc0c4430680cce230dd7a10d34da0492351446) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 04 June 2015, 23:25:28 UTC
3e8b040 [HOTFIX] [BUILD] Fix Maven build; add core test jar This was added later in ee11be258251adf900680927ba200bf46512cc04 in branch-1.4 but not branch-1.3. This patch adds a step in the maven jar plugin to make a jar for core tests that other modules can depend on. 04 June 2015, 20:43:43 UTC
ce137b8 [BUILD] Fix Maven build for Kinesis A necessary dependency that is transitively referenced is not provided, causing compilation failures in builds that provide the kinesis-asl profile. 04 June 2015, 03:48:09 UTC
7445996 [MINOR] [UI] Improve confusing message on log page It's good practice to check if the input path is in the directory we expect to avoid potentially confusing error messages. 03 June 2015, 21:48:39 UTC
e5747ee [SPARK-7558] Demarcate tests in unit-tests.log (1.3) This includes the following commits: original: 9eb222c hotfix1: 8c99793 hotfix2: a4f2412 scalastyle check: 609c492 --- Original patch #6441 Branch-1.4 patch #6598 Author: Andrew Or <andrew@databricks.com> Closes #6602 from andrewor14/demarcate-tests-1.3 and squashes the following commits: a75ff8f [Andrew Or] Fix hive-thrift server log4j problem f782edd [Andrew Or] [SPARK-7558] Guard against direct uses of FunSuite / FunSuiteLike 2b7a4f4 [Andrew Or] Fix tests? fec05c2 [Andrew Or] Fix tests 5342d50 [Andrew Or] Various whitespace changes (minor) 9af2756 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite 192a47c [Andrew Or] Fix log message 95ff5eb [Andrew Or] Add core tests as dependencies in all modules 8dffa0e [Andrew Or] Introduce base abstract class for all test suites 03 June 2015, 17:38:56 UTC
bbd3772 [SPARK-8032] [PYSPARK] Make version checking for NumPy in MLlib more robust The current check compares version strings, so `1.x` is considered less than `1.4` whenever x has more than one digit: `1.10` is numerically newer than `1.4`, yet `'1.10' < '1.4'` as strings. It fails on my system since I have version `1.10` :P Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6579 from MechCoder/np_ver and squashes the following commits: 15430f8 [MechCoder] fix syntax error 893fb7e [MechCoder] remove equal to e35f0d4 [MechCoder] minor e89376c [MechCoder] Better checking 22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust (cherry picked from commit 452eb82dd722e5dfd00ee47bb8b6353933b0016e) Signed-off-by: Xiangrui Meng <meng@databricks.com> 03 June 2015, 06:25:42 UTC
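The failure mode and a robust alternative, runnable as-is (the helper name is illustrative):

```python
# Lexicographic comparison gets NumPy versions wrong:
assert '1.10' < '1.4'   # True as strings, but 1.10 is the newer release

def at_least(version, minimum):
    # Compare numeric components instead of raw strings.
    parts = tuple(int(x) for x in version.split('.')[:2])
    return parts >= minimum

assert at_least('1.10', (1, 4))
assert not at_least('1.3', (1, 4))
```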
476b87d [MINOR] [UI] Improve error message on log page Currently, if a bad log type is specified, we get a blank page. We should provide a more informative error message. 02 June 2015, 18:38:37 UTC
ad5daa3 [SPARK-7946] [MLLIB] DecayFactor wrongly set in StreamingKMeans Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6497 from MechCoder/spark-7946 and squashes the following commits: 2fdd0a3 [MechCoder] Add non-regression test 8c988c6 [MechCoder] [SPARK-7946] DecayFactor wrongly set in StreamingKMeans (cherry picked from commit 6181937f315480543d28e542d43269cfa591e9d0) Signed-off-by: Xiangrui Meng <meng@databricks.com> 29 May 2015, 18:36:58 UTC
d09a053 [DOCS] Fixing broken "IDE setup" link in the Building Spark documentation. The location of the IDE setup information has changed, so this just updates the link on the Building Spark page. Author: Mike Dusenberry <dusenberrymw@gmail.com> Closes #6467 from dusenberrymw/Fix_Broken_Link_On_Building_Spark_Doc and squashes the following commits: 75c533a [Mike Dusenberry] Fixing broken "IDE setup" link in the Building Spark documentation by pointing to new location. 28 May 2015, 22:18:44 UTC
7a3feb5 [MINOR] Fix a minor bug in the PageRank example. Fix the bug where entering only 1 arg causes an array-out-of-bounds exception in the PageRank example. Author: Li Yao <hnkfliyao@gmail.com> Closes #6455 from lastland/patch-1 and squashes the following commits: de06128 [Li Yao] Fix the bug that entering only 1 arg will cause array out of bounds exception. 28 May 2015, 20:41:59 UTC
33e1539 [MINOR] [CORE] Warn about caching if dynamic allocation is enabled (1.3) This is a resubmit of #5751 for branch-1.3. The previous cherry-pick caused a build break that was later [reverted](https://github.com/apache/spark/commit/2254576e10ee433423aa8accf2d84f12ec20fc97). Originally written by vanzin. Author: Andrew Or <andrew@databricks.com> Closes #6421 from andrewor14/warn-da-cache-1.3 and squashes the following commits: 25cbb53 [Andrew Or] If DA is enabled, warn about caching 28 May 2015, 19:40:13 UTC
68387e3 [SPARK-7883] [DOCS] [MLLIB] Fixing broken trainImplicit Scala example in MLlib Collaborative Filtering documentation. Fixing broken trainImplicit Scala example in MLlib Collaborative Filtering documentation to match one of the possible ALS.trainImplicit function signatures. Author: Mike Dusenberry <dusenberrymw@gmail.com> Closes #6422 from dusenberrymw/Fix_MLlib_Collab_Filtering_trainImplicit_Example and squashes the following commits: 36492f4 [Mike Dusenberry] Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation to match one of the possible ALS.trainImplicit function signatures. (cherry picked from commit 0463428b6e8f364f0b1f39445a60cd85ae7c07bc) Signed-off-by: Xiangrui Meng <meng@databricks.com> 27 May 2015, 01:09:14 UTC
f26e382 [SPARK-7624] Revert "[SPARK-4939] revive offers periodically in LocalBackend" in 1.3 branch This reverts commit e196da840978b61b0222a5fc9b59b5511cf04606. Author: Davies Liu <davies@databricks.com> Closes #6337 from davies/revert_revive and squashes the following commits: be73f96 [Davies Liu] Revert "[SPARK-4939] revive offers periodically in LocalBackend" 22 May 2015, 23:00:01 UTC
a64e097 [SPARK-7744] [DOCS] [MLLIB] "Distributed matrix" section in MLlib "Data Types" documentation should be reordered. The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the latter three types, and RowMatrix is considered the "basic" distributed matrix. This will improve comprehensibility of the "Distributed matrix" section, especially for the new reader. Author: Mike Dusenberry <dusenberrymw@gmail.com> Closes #6270 from dusenberrymw/Reorder_MLlib_Data_Types_Distributed_matrix_docs and squashes the following commits: 6313bab [Mike Dusenberry] The documentation for BlockMatrix should come after RowMatrix, IndexedRowMatrix, and CoordinateMatrix, as BlockMatrix references the latter three types, and RowMatrix is considered the "basic" distributed matrix. This will improve comprehensibility of the "Distributed matrix" section, especially for the new reader. (cherry picked from commit 3860520633770cc5719b2cdebe6dc3608798386d) Signed-off-by: Xiangrui Meng <meng@databricks.com> 20 May 2015, 00:18:29 UTC
fc1b4a4 [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners PR per [SPARK-7621](https://issues.apache.org/jira/browse/SPARK-7621), which makes both `KafkaReceiver` and `ReliableKafkaReceiver` report its errors to the `ReceiverTracker`, which in turn will add the events to the bus to fire off any registered `StreamingListener`s. Author: jerluc <jeremyalucas@gmail.com> Closes #6204 from jerluc/master and squashes the following commits: 82439a5 [jerluc] [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners 19 May 2015, 01:21:10 UTC
0d4cd30 [SPARK-7566][SQL] Add type to HiveContext.analyzer This makes HiveContext.analyzer overrideable. Author: Santiago M. Mola <santiago.mola@sap.com> Closes #6177 from smola/patch-batch-1.3 and squashes the following commits: c11a428 [Santiago M. Mola] [SPARK-7566][SQL] Add type to HiveContext.analyzer 19 May 2015, 01:11:28 UTC
0a63103 [SPARK-7660] Wrap SnappyOutputStream to work around snappy-java bug This patch wraps `SnappyOutputStream` to ensure that `close()` is idempotent and to guard against write-after-`close()` bugs. This is a workaround for https://github.com/xerial/snappy-java/issues/107, a bug where a non-idempotent `close()` method can lead to stream corruption. We can remove this workaround if we upgrade to a snappy-java version that contains my fix for this bug, but in the meantime this patch offers a backportable Spark fix. Author: Josh Rosen <joshrosen@databricks.com> Closes #6176 from JoshRosen/SPARK-7660-wrap-snappy and squashes the following commits: 8b77aae [Josh Rosen] Wrap SnappyOutputStream to fix SPARK-7660 (cherry picked from commit f2cc6b5bccc3a70fd7d69183b1a068800831fe19) Signed-off-by: Josh Rosen <joshrosen@databricks.com> Conflicts: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala core/src/test/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriterSuite.java 17 May 2015, 16:38:31 UTC
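The wrapper pattern, sketched in Python (the real fix wraps `SnappyOutputStream` in Scala/Java; this just shows the idempotent-close idea):

```python
class CloseOnceStream:
    """Make close() idempotent and fail fast on write-after-close."""

    def __init__(self, inner):
        self._inner = inner
        self._closed = False

    def write(self, data):
        if self._closed:
            raise IOError("write after close()")  # guard against corruption
        self._inner.write(data)

    def close(self):
        if not self._closed:      # a second close() becomes a no-op
            self._closed = True
            self._inner.close()
```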
91442fd [SPARK-6197][CORE] handle json exception when history file is not finished writing For details, please refer to [SPARK-6197](https://issues.apache.org/jira/browse/SPARK-6197) Author: Zhang, Liye <liye.zhang@intel.com> Closes #4927 from liyezhang556520/jsonParseError and squashes the following commits: 5cbdc82 [Zhang, Liye] without unnecessary wrap 2b48831 [Zhang, Liye] small changes with sean owen's comments 2973024 [Zhang, Liye] handle json exception when file not finished writing 16 May 2015, 11:49:55 UTC
d618df2 [SPARK-7651] [MLLIB] [PYSPARK] GMM predict, predictSoft should raise error on bad input In the Python API for Gaussian Mixture Model, predict() and predictSoft() methods should raise an error when the input argument is not an RDD. Author: FlytxtRnD <meethu.mathew@flytxt.com> Closes #6180 from FlytxtRnD/GmmPredictException and squashes the following commits: 4b6aa11 [FlytxtRnD] Raise error if the input to predict()/predictSoft() is not an RDD (cherry picked from commit 8f4aaba0e4e3350ab152a476d08ff60e9495c6d2) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> Conflicts: python/pyspark/mllib/clustering.py 15 May 2015, 21:53:17 UTC
4d77058 [SPARK-5412] [DEPLOY] Cannot bind Master to a specific hostname as per the documentation Pass args to start-master.sh through to start-daemon.sh, as other scripts do, so that things like --host have effect on start-master.sh as per docs Author: Sean Owen <sowen@cloudera.com> Closes #6185 from srowen/SPARK-5412 and squashes the following commits: b3ce9da [Sean Owen] Pass args to start-master.sh through to start-daemon.sh, as other scripts do, so that things like --host have effect on start-master.sh as per docs (cherry picked from commit 8ab1450d3995b0c3ef64c5991b88c258e17bcb12) Signed-off-by: Andrew Or <andrew@databricks.com> 15 May 2015, 18:31:01 UTC
da56e64 [SPARK-7668] [MLLIB] Preserve isTransposed property for Matrix after calling map function JIRA: https://issues.apache.org/jira/browse/SPARK-7668 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6188 from viirya/fix_matrix_map and squashes the following commits: 2a7cc97 [Liang-Chi Hsieh] Preserve isTransposed property for Matrix after calling map function. (cherry picked from commit f96b85ab44b82736363764ea39ee62884007f4a3) Signed-off-by: Xiangrui Meng <meng@databricks.com> 15 May 2015, 17:03:56 UTC
3baa82f [SPARK-7278] [PySpark] DateType should find datetime.datetime acceptable DateType should not be restricted to `datetime.date` but accept `datetime.datetime` objects as well. Could someone with a little more insight verify this? Author: ksonj <kson@siberie.de> Closes #6057 from ksonj/dates and squashes the following commits: 68a158e [ksonj] DateType should find datetime.datetime acceptable too (cherry picked from commit 5d7d4f887d509e6d037d8fc5247d2e5f8a4563c9) Signed-off-by: Reynold Xin <rxin@databricks.com> 14 May 2015, 22:46:31 UTC
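One way to express the relaxed check (illustrative; the actual patch touches PySpark's type-verification tables): `datetime.datetime` is a subclass of `datetime.date`, so a single isinstance test accepts both.

```python
import datetime

def acceptable_for_date_type(obj):
    # datetime.datetime subclasses datetime.date, so this accepts both.
    return isinstance(obj, datetime.date)

assert acceptable_for_date_type(datetime.date(2015, 5, 14))
assert acceptable_for_date_type(datetime.datetime(2015, 5, 14, 12, 30))
```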
9445814 [SPARK-7522] [EXAMPLES] Removed angle brackets from dataFormat option Applying this fix to branch 1.3, mengxr Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #6111 from BryanCutler/dataFormat-option-1_3-7522 and squashes the following commits: 1a4c814 [Bryan Cutler] [SPARK-7522] Removed angle brackets from dataFormat option 13 May 2015, 11:04:36 UTC
5f121fb [SPARK-7552] [STREAMING] [BACKPORT] Close WAL files correctly when iteration is finished tdas Author: jerryshao <saisai.shao@intel.com> Closes #6069 from jerryshao/SPARK-7552-1.3-backpport and squashes the following commits: 72b9fb9 [jerryshao] Close WAL files correctly when iteration is finished 12 May 2015, 20:05:35 UTC
92fe5b6 [SPARK-2018] [CORE] Upgrade LZF library to fix endian serialization problem Pick up newer version of dependency with fix for SPARK-2018. The update involved patching the ning/compress LZF library to handle big endian systems correctly. Credit goes to gireeshpunathil for diagnosing the problem, and cowtowncoder for fixing it. Spark tests run clean for me. Author: Tim Ellison <t.p.ellison@gmail.com> Closes #6077 from tellison/UpgradeLZF and squashes the following commits: ad8d4ef [Tim Ellison] [SPARK-2018] [CORE] Upgrade LZF library to fix endian serialization problem (cherry picked from commit 5438f49ccf374fed16bc2b7fc1556e4c0095b14c) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 May 2015, 19:49:00 UTC
b152c6c [SPARK-7331] [SQL] Re-use HiveConf in HiveQl Author: nitin2goyal <nitin2goyal@gmail.com> Closes #6037 from nitin2goyal/dev-nitin-1.3 and squashes the following commits: 414b80a [nitin2goyal] [SPARK-7331][SQL] Re-use HiveConf in HiveQl 12 May 2015, 02:05:35 UTC
2de111a Updated DataFrame.saveAsTable Hive warning to include SPARK-7550 ticket. So users that are interested in this can track it easily. Author: Reynold Xin <rxin@databricks.com> Closes #6067 from rxin/SPARK-7550 and squashes the following commits: ee0e34c [Reynold Xin] Updated DataFrame.saveAsTable Hive warning to include SPARK-7550 ticket. (cherry picked from commit 87229c95c6b597f5b84e36d518b9830e3ba63424) Signed-off-by: Reynold Xin <rxin@databricks.com> 12 May 2015, 01:11:01 UTC
8f99a49 [SPARK-7084] improve saveAsTable documentation Author: madhukar <phatak.dev@gmail.com> Closes #5654 from phatak-dev/master and squashes the following commits: 386f407 [madhukar] #5654 updated for all the methods 2c997c5 [madhukar] Merge branch 'master' of https://github.com/apache/spark 00bc819 [madhukar] Merge branch 'master' of https://github.com/apache/spark 2a802c6 [madhukar] #5654 updated the doc according to comments 866e8df [madhukar] [SPARK-7084] improve saveAsTable documentation (cherry picked from commit 57255dcd794222f4db5df1e549ebc7b896cebfdc) Signed-off-by: Reynold Xin <rxin@databricks.com> 12 May 2015, 00:06:21 UTC
d4eb590 [SQL] Show better error messages for incorrect join types in DataFrames. As a follow-up to https://github.com/apache/spark/pull/5944 Author: Reynold Xin <rxin@databricks.com> Closes #6064 from rxin/jointype-better-error and squashes the following commits: 7629bf7 [Reynold Xin] [SQL] Show better error messages for incorrect join types in DataFrames. (cherry picked from commit 4f4dbb030c208caba18f314a1ef1751696627d26) Signed-off-by: Reynold Xin <rxin@databricks.com> 12 May 2015, 00:02:28 UTC
2dc3ca6 [SPARK-7341] [STREAMING] [TESTS] Fix the flaky test: org.apache.spark.streaming.InputStreamsSuite.socket input stream (backport for branch 1.3) Remove non-deterministic "Thread.sleep" and use deterministic strategies to fix the flaky failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2127/testReport/junit/org.apache.spark.streaming/InputStreamsSuite/socket_input_stream/ Author: zsxwing <zsxwing@gmail.com> Closes #5995 from zsxwing/SPARK-7341-branch-1.3 and squashes the following commits: 0f09c2a [zsxwing] [SPARK-7341] [STREAMING] [TESTS] Fix the flaky test: org.apache.spark.stre... 11 May 2015, 17:48:54 UTC
f2b138d [SPARK-7345][SQL] Spark cannot detect renamed columns using JDBC connector The issue appears when one tries to create a DataFrame using a sqlContext.load("jdbc"...) statement where "dbtable" contains a query with renamed columns. If the original column is used in the SQL query once, the resulting DataFrame will contain the non-renamed column. If the original column is used in the SQL query several times with different aliases, sqlContext.load will fail. The original implementation of JDBCRDD.resolveTable uses getColumnName to detect column names in the RDD schema. The suggested implementation uses getColumnLabel, which is aware of the SQL "AS" clause, to handle column renames in the SQL statement. Readings: http://stackoverflow.com/questions/4271152/getcolumnlabel-vs-getcolumnname http://stackoverflow.com/questions/12259829/jdbc-getcolumnname-getcolumnlabel-db2 The official documentation is unfortunately a bit misleading about the purpose of the "suggested title", but it clearly defines the behavior of the AS keyword in SQL statements: http://docs.oracle.com/javase/7/docs/api/java/sql/ResultSetMetaData.html getColumnLabel - Gets the designated column's suggested title for use in printouts and displays. The suggested title is usually specified by the SQL AS clause. If a SQL AS is not specified, the value returned from getColumnLabel will be the same as the value returned by the getColumnName method. Author: Oleg Sidorkin <oleg.sidorkin@gmail.com> Closes #6032 from osidorkin/master and squashes the following commits: 10fc44b [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (resolved scala style test error) 2aaf6f7 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite (renamed fields in JDBC query) b7d5b22 [Oleg Sidorkin] [SPARK-7345][SQL] Regression test for JDBCSuite 09559a0 [Oleg Sidorkin] [SPARK-7345][SQL] Spark cannot detect renamed columns using JDBC connector (cherry picked from commit d7a37bcaf123389fb0828eefb92659c6d9cb3460) Signed-off-by: Reynold Xin <rxin@databricks.com> 10 May 2015, 08:31:56 UTC
6f01711 Revert "[SPARK-7490] [CORE] [Minor] MapOutputTracker.deserializeMapStatuses: close input streams" This reverts commit ef4a0ea7bab19d5ec8ecad3ff4f8556361abeebe. 09 May 2015, 02:07:11 UTC
ef4a0ea [SPARK-7490] [CORE] [Minor] MapOutputTracker.deserializeMapStatuses: close input streams GZIPInputStream allocates native memory that is not freed until close() or when the finalizer runs. It is best to close() these streams explicitly. stephenh made the same change for serializeMapStatuses in commit b0d884f0. This is the same change for deserialize. (I ran the unit test suite! it seems to have passed. I did not make a JIRA since this seems "trivial", and the guidelines suggest it is not required for trivial changes) Author: Evan Jones <ejones@twitter.com> Closes #5982 from evanj/master and squashes the following commits: 0d76e85 [Evan Jones] [CORE] MapOutputTracker.deserializeMapStatuses: close input streams (cherry picked from commit 25889d8d97094325f10fbf52f3b36412f212eeb2) Signed-off-by: Sean Owen <sowen@cloudera.com> 08 May 2015, 21:01:19 UTC
7fd212b [SPARK-7436] Fixed instantiation of custom recovery mode factory and added tests Author: Jacek Lewandowski <lewandowski.jacek@gmail.com> Closes #5975 from jacek-lewandowski/SPARK-7436-1.3 and squashes the following commits: 3988817 [Jacek Lewandowski] SPARK-7436: Fixed instantiation of custom recovery mode factory and added tests 08 May 2015, 18:38:47 UTC
edcd364 [SPARK-7330] [SQL] avoid NPE at jdbc rdd Thanks to nadavoosh for pointing this out in #5590 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #5877 from adrian-wang/jdbcrdd and squashes the following commits: cc11900 [Daoyuan Wang] avoid NPE in jdbcrdd (cherry picked from commit ed9be06a4797bbb678355b361054c8872ac20b75) Signed-off-by: Yin Huai <yhuai@databricks.com> 07 May 2015, 17:17:58 UTC
cbf232d [SPARK-5456] [SQL] fix decimal compare for jdbc rdd Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #5803 from adrian-wang/decimalcompare and squashes the following commits: aef0e96 [Daoyuan Wang] add null handle ec455b9 [Daoyuan Wang] fix decimal compare for jdbc rdd (cherry picked from commit 150f671c286c57deaf37ab1d8f837d68b5be82a0) Signed-off-by: Reynold Xin <rxin@databricks.com> 06 May 2015, 17:05:28 UTC
9278b7a [SPARK-5074] [CORE] [TESTS] Fix the flakey test 'run shuffle with map stage failure' in DAGSchedulerSuite Test failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/2240/testReport/junit/org.apache.spark.scheduler/DAGSchedulerSuite/run_shuffle_with_map_stage_failure/ This is because many tests share the same `JobListener`. Because after each test, `scheduler` isn't stopped. So actually it's still running. When running the test `run shuffle with map stage failure`, some previous test may trigger [ResubmitFailedStages](https://github.com/apache/spark/blob/ebc25a4ddfe07a67668217cec59893bc3b8cf730/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1120) logic, and report `jobFailed` and override the global `failure` variable. This PR uses `after` to call `scheduler.stop()` for each test. Author: zsxwing <zsxwing@gmail.com> Closes #5903 from zsxwing/SPARK-5074 and squashes the following commits: 1e6f13e [zsxwing] Fix the flakey test 'run shuffle with map stage failure' in DAGSchedulerSuite (cherry picked from commit 5ffc73e68b3a6ea30c25931e9e0495a4c7e5654c) Signed-off-by: Sean Owen <sowen@cloudera.com> 05 May 2015, 14:05:45 UTC
b34b5bd [MINOR] [BUILD] Declare ivy dependency in root pom. Without this, any dependency that pulls ivy transitively may override the version and potentially cause issues. On my machine, the hive tests were pulling an old version of ivy and subsequently failing with a "NoSuchMethodError". Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #5893 from vanzin/ivy-dep-fix and squashes the following commits: ea2112d [Marcelo Vanzin] [minor] [build] Declare ivy dependency in root pom. (cherry picked from commit c5790a2f772168351c18bb0da51a124cee89a06f) Signed-off-by: Sean Owen <sowen@cloudera.com> 05 May 2015, 07:56:39 UTC
702356c [SPARK-7323] [SPARK CORE] Use insertAll instead of insert while merging combiners in reducer Author: Mridul Muralidharan <mridulm@yahoo-inc.com> Closes #5862 from mridulm/optimize_aggregator and squashes the following commits: 61cf43a [Mridul Muralidharan] Use insertAll instead of insert - much more expensive to do it per tuple (cherry picked from commit da303526e54e9a0adfedb49417f383cde7870a69) Signed-off-by: Sean Owen <sowen@cloudera.com> 02 May 2015, 22:06:01 UTC
98ac39d [SPARK-6954] [YARN] ExecutorAllocationManager can end up requesting a negative number of executors Author: Sandy Ryza <sandy@cloudera.com> Closes #5704 from sryza/sandy-spark-6954 and squashes the following commits: b7890fb [Sandy Ryza] Avoid ramping up to an existing number of executors 6eb516a [Sandy Ryza] SPARK-6954. ExecutorAllocationManager can end up requesting a negative number of executors Author: Sandy Ryza <sandy@cloudera.com> Closes #5856 from sryza/sandy-backport-6954 and squashes the following commits: 1cb517a [Sandy Ryza] [SPARK-6954] [YARN] ExecutorAllocationManager can end up requesting a negative number of executors 02 May 2015, 09:59:07 UTC
d726949 Limit help option regex Added word-boundary delimiters so that embedded text such as "-h" within command line options and values doesn't trigger the usage script and exit. Author: Chris Biow <chris.biow@10gen.com> Closes #5816 from cbiow/patch-1 and squashes the following commits: 36b3726 [Chris Biow] Limit help option regex (cherry picked from commit c8c481da18688e684d4e34f14c5afa0b5d37a618) Signed-off-by: Sean Owen <sowen@cloudera.com> 01 May 2015, 18:27:12 UTC
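The word-boundary idea in a runnable form (the actual change is in a shell script; this Python regex is an illustration):

```python
import re

# Match -h / --help only as standalone tokens, not embedded in a value.
pattern = re.compile(r'(^|\s)(-h|--help)(\s|$)')

assert pattern.search('spark-shell -h')
assert not pattern.search('--master spark://my-host:7077')  # no false trigger
```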
f64b994 [SPARK-7196][SQL] Support precision and scale of decimal type for JDBC JIRA: https://issues.apache.org/jira/browse/SPARK-7196 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #5777 from viirya/jdbc_precision and squashes the following commits: f40f5e6 [Liang-Chi Hsieh] Support precision and scale for NUMERIC type. 49acbf9 [Liang-Chi Hsieh] Add unit test. a509e19 [Liang-Chi Hsieh] Support precision and scale of decimal type for JDBC. (cherry picked from commit 6702324b60f99dab55912c08ccd3d03805f6b7b2) Signed-off-by: Reynold Xin <rxin@databricks.com> 30 April 2015, 22:14:04 UTC
ec196ab [SPARK-5529] [CORE] Add expireDeadHosts in HeartbeatReceiver If a BlockManager has not sent a heartbeat for more than 120s, BlockManagerMasterActor will remove it. But CoarseGrainedSchedulerBackend can only remove an executor after a DisassociatedEvent. We should expireDeadHosts at HeartbeatReceiver. Author: Hong Shen <hongshen@tencent.com> Closes #4363 from shenh062326/my_change3 and squashes the following commits: 2c9a46a [Hong Shen] Change some code style. 1a042ff [Hong Shen] Change some code style. 2dc456e [Hong Shen] Change some code style. d221493 [Hong Shen] Fix test failed 7448ac6 [Hong Shen] A minor change in sparkContext and heartbeatReceiver b904aed [Hong Shen] Fix failed test 52725af [Hong Shen] Remove assert in SparkContext.killExecutors 5bedcb8 [Hong Shen] Remove assert in SparkContext.killExecutors a858fb5 [Hong Shen] A minor change in HeartbeatReceiver 3e221d9 [Hong Shen] A minor change in HeartbeatReceiver 6bab7aa [Hong Shen] Change a code style. 07952f3 [Hong Shen] Change configs name and code style. ce9257e [Hong Shen] Fix test failed bccd515 [Hong Shen] Fix test failed 8e77408 [Hong Shen] Fix test failed c1dfda1 [Hong Shen] Fix test failed e197e20 [Hong Shen] Fix test failed fb5df97 [Hong Shen] Remove ExpireDeadHosts in BlockManagerMessages b5c0441 [Hong Shen] Remove expireDeadHosts in BlockManagerMasterActor c922cb0 [Hong Shen] Add expireDeadHosts in HeartbeatReceiver Author: Hong Shen <hongshen@tencent.com> Closes #5793 from alexrovner/SPARK-5529-backport-1.3-v2 and squashes the following commits: f238f94 [Hong Shen] [SPARK-5529][CORE]Add expireDeadHosts in HeartbeatReceiver 30 April 2015, 16:05:27 UTC
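The expiry logic, as a minimal sketch (timeout value from the description; names are illustrative):

```python
import time

HEARTBEAT_TIMEOUT_S = 120

def expire_dead_hosts(last_heartbeat, now=None):
    """Drop executors whose last heartbeat is older than the timeout."""
    now = time.monotonic() if now is None else now
    dead = [h for h, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT_S]
    for h in dead:
        del last_heartbeat[h]
    return dead

beats = {"exec-1": 0.0, "exec-2": 500.0}
assert expire_dead_hosts(beats, now=600.0) == ["exec-1"]
```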
3bce87e [HOTFIX] Disabling flaky test (fix in progress as part of SPARK-7224) 30 April 2015, 08:03:43 UTC
ae461e7 [SPARK-7234][SQL] Fix DateType mismatch when codegen on. Author: 云峤 <chensong.cs@alibaba-inc.com> Closes #5778 from kaka1992/fix_codegenon_datetype_mismatch and squashes the following commits: 1ad4cff [云峤] SPARK-7234 fix dateType mismatch (cherry picked from commit 7143f6e9718bae9cffa0a73df03ba8c9860ee129) Signed-off-by: Reynold Xin <rxin@databricks.com> 30 April 2015, 01:23:48 UTC
2dd17d4 [SPARK-7229] [SQL] SpecificMutableRow should take integer type as internal representation for Date Author: Cheng Hao <hao.cheng@intel.com> Closes #5772 from chenghao-intel/specific_row and squashes the following commits: 2cd064d [Cheng Hao] scala style issue 60347a2 [Cheng Hao] SpecificMutableRow should take integer type as internal representation for DateType (cherry picked from commit f8cbb0a4b37b0d4ba49515d888cb52dea9eb01f1) Signed-off-by: Reynold Xin <rxin@databricks.com> 29 April 2015, 23:23:47 UTC
3a41a13 [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input See JIRA: https://issues.apache.org/jira/browse/SPARK-7155 SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:
```scala
sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```
will throw
```
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt
```
However, the other API hadoopFile() is able to process a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also able to process a comma-separated list of files correctly. That means the behaviors of hadoopFile() and newAPIHadoopFile() are not aligned. This pull request fixes this issue and allows newAPIHadoopFile() to support a comma-separated list of files as input. A unit test has also been added in SparkContextSuite.scala. It creates two temporary text files as the input and tests against sc.textFile(), sc.hadoopFile(), and sc.newAPIHadoopFile(). Note: The contribution is my original work and I license the work to the project under the project's open source license. Author: yongtang <yongtang@users.noreply.github.com> Closes #5708 from yongtang/SPARK-7155 and squashes the following commits: 654c80c [yongtang] [SPARK-7155] [CORE] Remove unneeded temp file deletion in unit test as parent dir is already temporary. 26faa6a [yongtang] [SPARK-7155] [CORE] Support comma-separated list of files as input for newAPIHadoopFile, wholeTextFiles, and binaryFiles. Use setInputPaths for consistency. 73e1f16 [yongtang] [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input. (cherry picked from commit 3fc6cfd079d8cdd35574605cb9a4178ca7f2613d) Signed-off-by: Sean Owen <sowen@cloudera.com> 29 April 2015, 22:56:00 UTC
5b893bd [SPARK-7181] [CORE] fix infinite loop in ExternalSorter's mergeWithAggregation see [SPARK-7181](https://issues.apache.org/jira/browse/SPARK-7181). Author: Qiping Li <liqiping1991@gmail.com> Closes #5737 from chouqin/externalsorter and squashes the following commits: 2924b93 [Qiping Li] fix inifite loop in Externalsorter's mergeWithAggregation (cherry picked from commit 7f4b583733714bbecb43fb0823134bf2ec720a17) Signed-off-by: Sean Owen <sowen@cloudera.com> 29 April 2015, 22:52:28 UTC