https://github.com/apache/spark

933d2c1 Preparing Spark release v2.0.1-rc4 28 September 2016, 23:27:45 UTC
0a69477 [SPARK-17641][SQL] Collect_list/Collect_set should not collect null values. ## What changes were proposed in this pull request? We added native versions of `collect_set` and `collect_list` in Spark 2.0. These currently also (try to) collect null values, which differs from the original Hive implementation. This PR fixes this by adding a null check to the `Collect.update` method. ## How was this patch tested? Added a regression test to `DataFrameAggregateSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15208 from hvanhovell/SPARK-17641. (cherry picked from commit 7d09232028967978d9db314ec041a762599f636b) Signed-off-by: Reynold Xin <rxin@databricks.com> 28 September 2016, 23:25:31 UTC
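A minimal spark-shell sketch of the fixed behavior above (column name and data are made up; assumes an active `spark` session):
```scala
import org.apache.spark.sql.functions.collect_list
import spark.implicits._

val df = Seq(Some(1), None, Some(2)).toDF("x")
// With the fix, the null row is skipped, matching Hive's behavior:
df.agg(collect_list($"x")).show()
// expected: a single row containing [1, 2]
```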
d358298 [SPARK-17673][SQL] Incorrect exchange reuse with RowDataSourceScan (backport) This backports https://github.com/apache/spark/pull/15273 to branch-2.0. Also verified the test passes after the patch was applied. rxin Author: Eric Liang <ekl@databricks.com> Closes #15282 from ericl/spark-17673-2. 28 September 2016, 23:19:06 UTC
4c694e4 [SPARK-17644][CORE] Do not add failedStages when abortStage for fetch failure

| Time | Thread 1, Job 1 | Thread 2, Job 2 |
|:----:|:----------------|:----------------|
| 1 | abort stage due to FetchFailed | |
| 2 | failedStages += failedStage | |
| 3 | | task failed due to FetchFailed |
| 4 | | cannot post ResubmitFailedStages because failedStages is not empty |

Then Job 2 of Thread 2 never resubmits the failed stage and hangs. We should not add to failedStages when abortStage is called for a fetch failure. Added a unit test. Author: w00228970 <wangfei1@huawei.com> Author: wangfei <wangfei_hello@126.com> Closes #15213 from scwf/dag-resubmit. (cherry picked from commit 46d1203bf2d01b219c4efc7e0e77a844c0c664da) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 28 September 2016, 19:08:56 UTC
4d73d5c [MINOR][PYSPARK][DOCS] Fix examples in PySpark documentation ## What changes were proposed in this pull request? This PR proposes to fix wrongly indented examples in the PySpark documentation:
```
- >>> json_sdf = spark.readStream.format("json")\
-     .schema(sdf_schema)\
-     .load(tempfile.mkdtemp())
+ >>> json_sdf = spark.readStream.format("json") \\
+ ...     .schema(sdf_schema) \\
+ ...     .load(tempfile.mkdtemp())
```
```
- people.filter(people.age > 30).join(department, people.deptId == department.id)\
+ people.filter(people.age > 30).join(department, people.deptId == department.id) \\
```
```
- >>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])), \
-     LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
+ >>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])),
+ ...     LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
```
```
- >>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 4.56e-7)])), \
-     LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
+ >>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 4.56e-7)])),
+ ...     LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
```
```
- ... for x in iterator:
- ...     print(x)
+ ...     for x in iterator:
+ ...         print(x)
```
## How was this patch tested? Manually tested. (Before/after screenshots are attached to the PR.) Author: hyukjinkwon <gurwls223@gmail.com> Closes #15242 from HyukjinKwon/minor-example-pyspark. (cherry picked from commit 2190037757a81d3172f75227f7891d968e1f0d90) Signed-off-by: Sean Owen <sowen@cloudera.com> 28 September 2016, 10:19:18 UTC
1b02f88 [SPARK-17666] Ensure that RecordReaders are closed by data source file scans (backport) This is a branch-2.0 backport of #15245. ## What changes were proposed in this pull request? This patch addresses a potential cause of resource leaks in data source file scans. As reported in [SPARK-17666](https://issues.apache.org/jira/browse/SPARK-17666), tasks which do not fully-consume their input may cause file handles / network connections (e.g. S3 connections) to be leaked. Spark's `NewHadoopRDD` uses a TaskContext callback to [close its record readers](https://github.com/apache/spark/blame/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L208), but the new data source file scans will only close record readers once their iterators are fully-consumed. This patch modifies `RecordReaderIterator` and `HadoopFileLinesReader` to add `close()` methods and modifies all six implementations of `FileFormat.buildReader()` to register TaskContext task completion callbacks to guarantee that cleanup is eventually performed. ## How was this patch tested? Tested manually for now. Author: Josh Rosen <joshrosen@databricks.com> Closes #15271 from JoshRosen/SPARK-17666-backport. 28 September 2016, 07:59:00 UTC
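The callback pattern described above can be sketched like this (the `registerCleanup` helper and its use of a generic `Closeable` are illustrative, not the actual internal classes):
```scala
import java.io.Closeable
import org.apache.spark.TaskContext

// Ensure the underlying reader is closed when the task finishes, even if
// the iterator over its records is never fully consumed.
def registerCleanup(reader: Closeable): Unit = {
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener { _ => reader.close() }
  }
}
```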
2cd327e [SPARK-17056][CORE] Fix a wrong assert regarding unroll memory in MemoryStore ## What changes were proposed in this pull request? There is an assert in MemoryStore's putIteratorAsValues method that is meant to check that unroll memory is not released too much. This assert looks wrong. ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #14642 from viirya/fix-unroll-memory. (cherry picked from commit e7bce9e1876de6ee975ccc89351db58119674aef) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 27 September 2016, 23:01:04 UTC
98bbc44 [SPARK-17618] Guard against invalid comparisons between UnsafeRow and other formats This patch ports changes from #15185 to Spark 2.x. That patch fixed a correctness bug in Spark 1.6.x caused by an invalid `equals()` comparison between an `UnsafeRow` and another row of a different format. Spark 2.x is not affected by that specific correctness bug, but it can still reap the error-prevention benefits of that patch's changes, which modify `UnsafeRow.equals()` to throw an IllegalArgumentException if it is called with an object that is not an `UnsafeRow`. Author: Josh Rosen <joshrosen@databricks.com> Closes #15265 from JoshRosen/SPARK-17618-master. (cherry picked from commit 2f84a686604b298537bfd4d087b41594d2aa7ec6) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 27 September 2016, 21:17:43 UTC
f459490 [Docs] Update spark-standalone.md to fix link Corrected a link to the configuration.html page; it was pointing to a page that does not exist (configurations.html). Documentation change, verified in preview. Author: Andrew Mills <ammills01@users.noreply.github.com> Closes #15244 from ammills01/master. (cherry picked from commit 00be16df642317137f17d2d7d2887c41edac3680) Signed-off-by: Andrew Or <andrewor14@gmail.com> 26 September 2016, 20:41:33 UTC
8a58f2e [SPARK-17652] Fix confusing exception message while reserving capacity ## What changes were proposed in this pull request? This minor patch fixes a confusing exception message while reserving additional capacity in the vectorized parquet reader. ## How was this patch tested? Existing unit tests Author: Sameer Agarwal <sameerag@cs.berkeley.edu> Closes #15225 from sameeragarwal/error-msg. (cherry picked from commit 7c7586aef9243081d02ea5065435234b5950ab66) Signed-off-by: Yin Huai <yhuai@databricks.com> 26 September 2016, 20:21:23 UTC
cf53241 [SPARK-17649][CORE] Log how many Spark events got dropped in LiveListenerBus ## What changes were proposed in this pull request? Log how many Spark events got dropped in LiveListenerBus so that the user can get insights on how to set a correct event queue size. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15220 from zsxwing/SPARK-17649. (cherry picked from commit bde85f8b70138a51052b613664facbc981378c38) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 26 September 2016, 17:44:44 UTC
88ba2e1 [SPARK-17650] malformed url's throw exceptions before bricking Executors ## What changes were proposed in this pull request? When a malformed URL is sent to executors through `sc.addJar` or `sc.addFile`, the executors become unusable, because they constantly throw `MalformedURLException`s and can never acknowledge that the file or jar is simply bad input. This PR fixes that problem by making sure malformed URLs can never be submitted through `sc.addJar` and `sc.addFile`. Another solution would be to blacklist bad files and jars on executors: maybe fail the first time, and then ignore the second time (but print a warning message). ## How was this patch tested? Unit tests in SparkContextSuite Author: Burak Yavuz <brkyvz@gmail.com> Closes #15224 from brkyvz/SPARK-17650. (cherry picked from commit 59d87d24079bc633e63ce032f0a5ddd18a3b02cb) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 26 September 2016, 05:57:42 UTC
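The guard amounts to something like the sketch below (a simplified illustration of the idea, not the exact code the PR adds):
```scala
import java.net.{MalformedURLException, URL}

// Fail fast in the driver so a bad URL never reaches the executors.
def validateUrl(path: String): Unit = {
  try {
    new URL(path)
  } catch {
    case e: MalformedURLException =>
      throw new IllegalArgumentException(s"Malformed URL passed to addJar/addFile: $path", e)
  }
}
```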
ed54576 [SPARK-10835][ML] Word2Vec should accept non-null string array, in addition to existing null string array ## What changes were proposed in this pull request? To match Tokenizer and for compatibility with Word2Vec, output a nullable string array type in NGram ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #15179 from srowen/SPARK-10835. (cherry picked from commit f3fe55439e4c865c26502487a1bccf255da33f4a) Signed-off-by: Sean Owen <sowen@cloudera.com> 24 September 2016, 07:06:56 UTC
9e91a10 [SPARK-15703][SCHEDULER][CORE][WEBUI] Make ListenerBus event queue size configurable (branch 2.0) ## What changes were proposed in this pull request? Backport #14269 to 2.0. ## How was this patch tested? Jenkins. Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #15222 from zsxwing/SPARK-15703-2.0. 23 September 2016, 21:59:53 UTC
5bc5b49 Preparing development version 2.0.2-SNAPSHOT 23 September 2016, 21:38:13 UTC
9d28cc1 Preparing Spark release v2.0.1-rc3 23 September 2016, 21:38:07 UTC
b111a81 [SPARK-17651][SPARKR] Set R package version number along with mvn This PR sets the R package version while tagging releases. Note that since R doesn't accept `-SNAPSHOT` in the version number field, we remove that while setting the next version. Tested manually by running locally. Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #15223 from shivaram/sparkr-version-change. (cherry picked from commit 7c382524a959a2bc9b3d2fca44f6f0b41aba4e3c) Signed-off-by: Reynold Xin <rxin@databricks.com> 23 September 2016, 21:36:01 UTC
452e468 [SPARK-17577][CORE][2.0 BACKPORT] Update SparkContext.addFile to make it work well on Windows ## What changes were proposed in this pull request? Update `SparkContext.addFile` to correct the use of `URI` and `Path` so that it works well on Windows. This is a branch-2.0 backport; more details at #15131. ## How was this patch tested? Backport, checked by appveyor. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15217 from yanboliang/uri-2.0. 23 September 2016, 19:50:22 UTC
1a8ea00 [SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running sparkr in RStudio ## What changes were proposed in this pull request? Spark will add sparkr.zip to the archives only in yarn mode (SparkSubmit.scala):
```scala
if (args.isR && clusterManager == YARN) {
  val sparkRPackagePath = RUtils.localSparkRPackagePath
  if (sparkRPackagePath.isEmpty) {
    printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
  }
  val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
  if (!sparkRPackageFile.exists()) {
    printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
  }
  val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString
  // Distribute the SparkR package.
  // Assigns a symbol link name "sparkr" to the shipped package.
  args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")
  // Distribute the R package archive containing all the built R packages.
  if (!RUtils.rPackages.isEmpty) {
    val rPackageFile = RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
    if (!rPackageFile.exists()) {
      printErrorAndExit("Failed to zip all the built R packages.")
    }
    val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
    // Assigns a symbol link name "rpkg" to the shipped package.
    args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
  }
}
```
So it is necessary to pass spark.master from the R process to the JVM; otherwise sparkr.zip won't be distributed to executors. Besides that, I also pass spark.yarn.keytab/spark.yarn.principal to the Spark side, because the JVM process needs them to access a secured cluster. ## How was this patch tested? Verified manually in RStudio using the following code:
```r
Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1"))
df <- as.DataFrame(mtcars)
head(df)
```
… Author: Jeff Zhang <zjffdu@apache.org> Closes #14784 from zjffdu/SPARK-17210. (cherry picked from commit f62ddc5983a08d4d54c0a9a8210dd6cbec555671) Signed-off-by: Felix Cheung <felixcheung@apache.org> 23 September 2016, 18:38:21 UTC
d3f90e7 [SPARK-17640][SQL] Avoid using -1 as the default batchId for FileStreamSource.FileEntry ## What changes were proposed in this pull request? Avoid using -1 as the default batchId for FileStreamSource.FileEntry so that we can make sure not writing any FileEntry(..., batchId = -1) into the log. This also avoids people misusing it in future (#15203 is an example). ## How was this patch tested? Jenkins. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15206 from zsxwing/cleanup. (cherry picked from commit 62ccf27ab4b55e734646678ae78b7e812262d14b) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 23 September 2016, 06:35:15 UTC
54d4eee [SPARK-16240][ML] ML persistence backward compatibility for LDA - 2.0 backport ## What changes were proposed in this pull request? Allow Spark 2.x to load instances of LDA, LocalLDAModel, and DistributedLDAModel saved from Spark 1.6. Backport of https://github.com/apache/spark/pull/15034 for branch-2.0 ## How was this patch tested? I tested this manually, saving the 3 types from 1.6 and loading them into master (2.x). In the future, we can add generic tests for testing backwards compatibility across all ML models in SPARK-15573. Author: Gayathri Murali <gayathri.m.softie@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #15205 from jkbradley/lda-backward-2.0. 23 September 2016, 05:44:20 UTC
22216d6 [SPARK-17502][17609][SQL][BACKPORT][2.0] Fix Multiple Bugs in DDL Statements on Temporary Views ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/15054 and https://github.com/apache/spark/pull/15160 to Spark 2.0.
- When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for partition-related ALTER TABLE commands. However, it always reports a confusing error message. For example,
```
Partition spec is invalid. The spec (a, b) must match the partition spec () defined in table '`testview`';
```
- When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for `ALTER TABLE ... UNSET TBLPROPERTIES`. However, it reports a missing table property. For example,
```
Attempted to unset non-existent property 'p' in table '`testView`';
```
- When `ANALYZE TABLE` is called on a view or a temporary view, we should issue an error message. However, it reports a strange error:
```
ANALYZE TABLE is not supported for Project
```
- When inserting into a temporary view that is generated from `Range`, we will get the following error message:
```
assertion failed: No plan for 'InsertIntoTable Range (0, 10, step=1, splits=Some(1)), false, false
+- Project [1 AS 1#20]
   +- OneRowRelation$
```
This PR is to fix the above four issues. There is no place in Spark SQL that needs `SessionCatalog.tableExists` to check temp views, so this PR makes `SessionCatalog.tableExists` only check permanent tables/views and removes some hacks. ### How was this patch tested? Added multiple test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #15174 from gatorsmile/PR15054Backport. 23 September 2016, 01:56:40 UTC
c393d86 Preparing development version 2.0.2-SNAPSHOT 23 September 2016, 00:43:58 UTC
04141ad Preparing Spark release v2.0.1-rc2 23 September 2016, 00:43:50 UTC
c2cb841 [SPARK-17599][SPARK-17569] Backport and to Spark 2.0 branch ## What changes were proposed in this pull request? This backports PR #15153 and PR #15122 to the Spark 2.0 branch for Structured Streaming. It is structured a bit differently because similar code paths already existed in the 2.0 branch. The unit test makes sure that both behaviors don't break. Author: Burak Yavuz <brkyvz@gmail.com> Closes #15202 from brkyvz/backports-to-streaming. 23 September 2016, 00:22:04 UTC
0a593db [SPARK-17616][SQL] Support a single distinct aggregate combined with a non-partial aggregate We currently cannot execute an aggregate that contains a single distinct aggregate function and one or more non-partially-plannable aggregate functions, for example:
```sql
select grp, collect_list(col1), count(distinct col2)
from tbl_a
group by 1
```
This is a regression from Spark 1.6, caused by the fact that the single-distinct-aggregation code path assumes that all aggregates can be planned in two phases (i.e. are partially aggregatable). This PR works around the issue by triggering `RewriteDistinctAggregates` in such cases (this is similar to the approach taken in 1.6). Created `RewriteDistinctAggregatesSuite`, which checks that aggregates with distinct aggregate functions get rewritten into two `Aggregate`s and an `Expand`. Added a regression test to `DataFrameAggregateSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15187 from hvanhovell/SPARK-17616. (cherry picked from commit 0d634875026ccf1eaf984996e9460d7673561f80) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 22 September 2016, 23:22:31 UTC
47fc0b9 [SPARK-17638][STREAMING] Stop JVM StreamingContext when the Python process is dead ## What changes were proposed in this pull request? When the Python process is dead, the JVM StreamingContext is still running. Hence we will see a lot of Py4jException before the JVM process exits. It's better to stop the JVM StreamingContext to avoid those annoying logs. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15201 from zsxwing/stop-jvm-ssc. (cherry picked from commit 3cdae0ff2f45643df7bc198cb48623526c7eb1a6) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 22 September 2016, 21:26:53 UTC
243bdb1 [SPARK-17613] S3A base paths with no '/' at the end return empty DataFrames Consider you have a bucket as `s3a://some-bucket` and under it you have files:
```
s3a://some-bucket/file1.parquet
s3a://some-bucket/file2.parquet
```
Getting the parent path of `s3a://some-bucket/file1.parquet` yields `s3a://some-bucket/`, and the ListingFileCatalog uses this as the key in the hash map. When catalog.allFiles is called, we use `s3a://some-bucket` (no slash at the end) to get the list of files, and we're left with an empty list! This PR fixes this by adding a `/` at the end of the `URI` iff the given `Path` doesn't have a parent, i.e. is the root. This is a no-op if the path already had a `/` at the end, and is handled through Hadoop `Path`'s merging semantics. Unit test in `FileCatalogSuite`. Author: Burak Yavuz <brkyvz@gmail.com> Closes #15169 from brkyvz/SPARK-17613. (cherry picked from commit 85d609cf25c1da2df3cd4f5d5aeaf3cbcf0d674c) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 22 September 2016, 20:06:15 UTC
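The fix boils down to something like this sketch (simplified; `qualifiedKey` is a hypothetical helper, and the real code goes through Hadoop's `Path` semantics):
```scala
import org.apache.hadoop.fs.Path

// A bucket root such as s3a://some-bucket has no parent; append the trailing
// slash there so the listing key and the lookup key agree. Non-root paths
// are returned unchanged.
def qualifiedKey(path: Path): String = {
  if (path.getParent == null) path.toUri.toString.stripSuffix("/") + "/"
  else path.toUri.toString
}
```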
f14f47f Skip building R vignettes if Spark is not built ## What changes were proposed in this pull request? When we build the docs separately we don't have the JAR files from the Spark build in the same tree. As the SparkR vignettes need to launch a SparkContext to be built, we skip building them if JAR files don't exist ## How was this patch tested? To test this we can run the following: ``` build/mvn -DskipTests -Psparkr clean ./R/create-docs.sh ``` You should see a line `Skipping R vignettes as Spark JARs not found` at the end Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #15200 from shivaram/sparkr-vignette-skip. (cherry picked from commit 9f24a17c59b1130d97efa7d313c06577f7344338) Signed-off-by: Reynold Xin <rxin@databricks.com> 22 September 2016, 18:54:51 UTC
b25a8e6 [SPARK-17421][DOCS] Documenting the current treatment of MAVEN_OPTS. ## What changes were proposed in this pull request? Modified the documentation to clarify that `build/mvn` and `pom.xml` always add Java 7-specific parameters to `MAVEN_OPTS`, and that developers can safely ignore warnings about `-XX:MaxPermSize` that may result from compiling or running tests with Java 8. ## How was this patch tested? Rebuilt HTML documentation, made sure that building-spark.html displays correctly in a browser. Author: frreiss <frreiss@us.ibm.com> Closes #15005 from frreiss/fred-17421a. (cherry picked from commit 646f383465c123062cbcce288a127e23984c7c7f) Signed-off-by: Sean Owen <sowen@cloudera.com> 22 September 2016, 09:31:28 UTC
e8b26be Preparing development version 2.0.2-SNAPSHOT 22 September 2016, 04:09:19 UTC
00f2e28 Preparing Spark release v2.0.1-rc1 22 September 2016, 04:09:08 UTC
053b20a Bump doc version for release 2.0.1. 22 September 2016, 04:06:47 UTC
ec377e7 [SPARK-17494][SQL] changePrecision() on compact decimal should respect rounding mode ## What changes were proposed in this pull request? Floor()/Ceil() of a decimal is implemented using changePrecision() by passing a rounding mode, but the rounding mode is not respected when the decimal is in compact mode (could fit within a Long). This updates changePrecision() to respect the rounding mode, which could be ROUND_FLOOR, ROUND_CEIL, ROUND_HALF_UP, or ROUND_HALF_EVEN. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #15154 from davies/decimal_round. (cherry picked from commit 8bde03bf9a0896ea59ceaa699df7700351a130fb) Signed-off-by: Reynold Xin <rxin@databricks.com> 22 September 2016, 04:02:42 UTC
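What "respecting the rounding mode" means, shown with plain `java.math.BigDecimal` rather than Spark's internal `Decimal` class (an analogy, not the patched code):
```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

val d = new JBigDecimal("-0.5")
d.setScale(0, RoundingMode.FLOOR)    // -1, what floor() needs
d.setScale(0, RoundingMode.CEILING)  //  0, what ceil() needs
d.setScale(0, RoundingMode.HALF_UP)  // -1
// The bug: the compact (Long-backed) Decimal path applied one hard-coded
// rounding regardless of which mode the caller passed in.
```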
966abd6 [SPARK-17627] Mark Streaming Providers Experimental All of structured streaming is experimental in its first release. We missed the annotation on two of the APIs. Author: Michael Armbrust <michael@databricks.com> Closes #15188 from marmbrus/experimentalApi. (cherry picked from commit 3497ebe511fee67e66387e9e737c843a2939ce45) Signed-off-by: Reynold Xin <rxin@databricks.com> 22 September 2016, 03:59:52 UTC
59e6ab1 [SPARK-17512][CORE] Avoid formatting to python path for yarn and mesos cluster mode ## What changes were proposed in this pull request? Yarn and mesos cluster modes support remote python paths (HDFS/S3 scheme) through their own mechanisms, so it is not necessary to check and reformat the Python path when running in these modes. This is a potential regression compared to 1.6, so here I propose to fix it. ## How was this patch tested? Unit test to verify SparkSubmit arguments, also with local cluster verification. Because of the lack of `MiniDFSCluster` support in Spark unit tests, there's no integration test added. Author: jerryshao <sshao@hortonworks.com> Closes #15137 from jerryshao/SPARK-17512. (cherry picked from commit 8c3ee2bc42e6320b9341cebdba51a00162c897ea) Signed-off-by: Andrew Or <andrewor14@gmail.com> 21 September 2016, 21:57:33 UTC
cd0bd89 [SPARK-17418] Prevent kinesis-asl-assembly artifacts from being published This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds. The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases. Author: Josh Rosen <joshrosen@databricks.com> Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly. (cherry picked from commit d7ee12211a99efae6f7395e47089236838461d61) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 21 September 2016, 18:38:55 UTC
45bccdd [BACKPORT 2.0][MINOR][BUILD] Fix CheckStyle Error ## What changes were proposed in this pull request? This PR is to fix the code style errors. ## How was this patch tested? Manual. Before:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[153] (sizes) LineLength: Line is longer than 100 characters (found 107).
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[196] (sizes) LineLength: Line is longer than 100 characters (found 108).
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[239] (sizes) LineLength: Line is longer than 100 characters (found 115).
[ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[119] (sizes) LineLength: Line is longer than 100 characters (found 107).
[ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[129] (sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] src/main/java/org/apache/spark/network/util/LevelDBProvider.java:[124,11] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[184] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[304] (regexp) RegexpSingleline: No trailing whitespace allowed.
```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #15175 from Sherry302/javastylefix. 21 September 2016, 14:18:02 UTC
65295ba [SPARK-17617][SQL] Remainder(%) expression.eval returns incorrect result on double value ## What changes were proposed in this pull request? Remainder(%) expression's `eval()` returns an incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do the "%", and that loses precision. This bug only affects `eval()`, which is used by constant folding; the codegen path is not impacted. ### Before change
```
scala> -5083676433652386516D % 10
res2: Double = -6.0

scala> spark.sql("select -5083676433652386516D % 10 as a").show
+---+
|  a|
+---+
|0.0|
+---+
```
### After change
```
scala> spark.sql("select -5083676433652386516D % 10 as a").show
+----+
|   a|
+----+
|-6.0|
+----+
```
## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #15171 from clockfly/SPARK-17617. (cherry picked from commit 3977223a3268aaf6913a325ee459139a4a302b1c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 September 2016, 08:53:55 UTC
726f057 [SPARK-17513][SQL] Make StreamExecution garbage-collect its metadata ## What changes were proposed in this pull request? This PR modifies StreamExecution such that it discards metadata for batches that have already been fully processed. I used the purge method that was added as part of SPARK-17235. This is a resubmission of #15126, which was based on work by frreiss in #15067, but fixed the test case along with some typos. ## How was this patch tested? A new test case in StreamingQuerySuite. The test case would fail without the changes in this pull request. Author: petermaxlee <petermaxlee@gmail.com> Closes #15166 from petermaxlee/SPARK-17513-2. (cherry picked from commit 976f3b1227c1a9e0b878e010531285fdba57b6a7) Signed-off-by: Reynold Xin <rxin@databricks.com> 21 September 2016, 02:08:15 UTC
8d8e233 [SPARK-15698][SQL][STREAMING] Add the ability to remove the old MetadataLog in FileStreamSource (branch-2.0) ## What changes were proposed in this pull request? Backport #13513 to branch 2.0. ## How was this patch tested? Jenkins Author: jerryshao <sshao@hortonworks.com> Closes #15163 from zsxwing/SPARK-15698-spark-2.0. 20 September 2016, 19:53:30 UTC
2bd37ce [SPARK-17549][SQL] Revert "[] Only collect table size stat in driver for cached relation." This reverts commit 39e2bad6a866d27c3ca594d15e574a1da3ee84cc because of the problem mentioned at https://issues.apache.org/jira/browse/SPARK-17549?focusedCommentId=15505060&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15505060 Author: Yin Huai <yhuai@databricks.com> Closes #15157 from yhuai/revert-SPARK-17549. (cherry picked from commit 9ac68dbc5720026ea92acc61d295ca64d0d3d132) Signed-off-by: Yin Huai <yhuai@databricks.com> 20 September 2016, 18:54:13 UTC
e76f4f4 [SPARK-17051][SQL] we should use hadoopConf in InsertIntoHiveTable ## What changes were proposed in this pull request? Hive confs in hive-site.xml will be loaded in `hadoopConf`, so we should use `hadoopConf` in `InsertIntoHiveTable` instead of `SessionState.conf` ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #14634 from cloud-fan/bug. (cherry picked from commit eb004c66200da7df36dd0a9a11999fc352197916) Signed-off-by: Yin Huai <yhuai@databricks.com> 20 September 2016, 16:53:38 UTC
643f161 Revert "[SPARK-17513][SQL] Make StreamExecution garbage-collect its metadata" This reverts commit 5456a1b4fcd85d0d7f2f1cc64e44967def0950bf. 20 September 2016, 12:44:44 UTC
5456a1b [SPARK-17513][SQL] Make StreamExecution garbage-collect its metadata ## What changes were proposed in this pull request? This PR modifies StreamExecution such that it discards metadata for batches that have already been fully processed. I used the purge method that was added as part of SPARK-17235. This is based on work by frreiss in #15067, but fixed the test case along with some typos. ## How was this patch tested? A new test case in StreamingQuerySuite. The test case would fail without the changes in this pull request. Author: petermaxlee <petermaxlee@gmail.com> Author: frreiss <frreiss@us.ibm.com> Closes #15126 from petermaxlee/SPARK-17513. (cherry picked from commit be9d57fc9d8b10e4234c01c06ed43fd7dd12c07b) Signed-off-by: Reynold Xin <rxin@databricks.com> 20 September 2016, 05:19:58 UTC
7026eb8 [SPARK-17160] Properly escape field names in code-generated error messages This patch addresses a corner-case escaping bug where field names which contain special characters were unsafely interpolated into error message string literals in generated Java code, leading to compilation errors. This patch addresses these issues by using `addReferenceObj` to store the error messages as string fields rather than inline string constants. Author: Josh Rosen <joshrosen@databricks.com> Closes #15156 from JoshRosen/SPARK-17160. (cherry picked from commit e719b1c045ba185d242d21bbfcdee2c84dafc587) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 20 September 2016, 03:21:25 UTC
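The failure mode can be reproduced abstractly (the field name and message below are hypothetical):
```scala
// A field name containing a quote, interpolated raw into generated Java source:
val fieldName = "a\"b"
val generated = s"""throw new RuntimeException("Null value in field $fieldName");"""
println(generated)
// throw new RuntimeException("Null value in field a"b");  <- does not compile
// Storing the message as a referenced object instead of a source-level string
// literal (the addReferenceObj approach above) avoids the escaping problem.
```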
c02bc92 [SPARK-17100] [SQL] fix Python udf in filter on top of outer join ## What changes were proposed in this pull request? In the optimizer, we try to evaluate the condition to see whether it's nullable or not, but some expressions are not evaluable, so we should check that before evaluating them. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #15103 from davies/udf_join. (cherry picked from commit d8104158a922d86dd4f00e50d5d7dddc7b777a21) Signed-off-by: Davies Liu <davies.liu@gmail.com> 19 September 2016, 20:24:25 UTC
fef3ec1 [SPARK-16439] [SQL] bring back the separator in SQL UI ## What changes were proposed in this pull request? Currently, the SQL metrics look like `number of rows: 111111111111`; it's very hard to read how large the number is. So a separator was added by #12425, but removed by #14142 because the separator is weird in some locales (for example, pl_PL). This PR adds it back, but always uses "," as the separator, since the SQL UI is all in English. ## How was this patch tested? Existing tests. (A screenshot of the metric rendering is attached to the PR.) Author: Davies Liu <davies@databricks.com> Closes #15106 from davies/metric_sep. (cherry picked from commit e0632062635c37cbc77df7ebd2a1846655193e12) Signed-off-by: Davies Liu <davies.liu@gmail.com> 19 September 2016, 18:49:34 UTC
d6191a0 [SPARK-17438][WEBUI] Show Application.executorLimit in the application page ## What changes were proposed in this pull request? This PR adds `Application.executorLimit` to the application page. ## How was this patch tested? Checked the UI manually for both cases: 1. dynamic allocation disabled; 2. dynamic allocation enabled. (Screenshots are attached to the PR.) Author: Shixiong Zhu <shixiong@databricks.com> Closes #15001 from zsxwing/fix-core-info. (cherry picked from commit 80d6655921bea9b1bb27c1d95c2b46654e7a8cca) Signed-off-by: Andrew Or <andrew@databricks.com> 19 September 2016, 18:01:02 UTC
f56035b [SPARK-17473][SQL] fixing docker integration tests error due to different versions of jars. ## What changes were proposed in this pull request? Docker tests are using an older version of the jersey jars (1.19), which was used in older releases of Spark. In the 2.0 release, Spark was upgraded to use the 2.x version of Jersey. After the upgrade, docker tests fail with AbstractMethodError. Now that Spark uses the 2.x jersey version, the shaded docker jars may no longer be required. Removed the exclusions/overrides of jersey-related classes from the pom file, and changed the docker-client to use the regular jar instead of the shaded one. ## How was this patch tested? Tested using existing docker-integration-tests Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #15114 from sureshthalamati/docker_testfix-spark-17473. (cherry picked from commit cdea1d1343d02f0077e1f3c92ca46d04a3d30414) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 19 September 2016, 17:29:57 UTC
c4660d6 [SPARK-17589][TEST][2.0] Fix test case `create external table` in MetastoreDataSourcesSuite ### What changes were proposed in this pull request? This PR is to fix a test failure on the branch 2.0 builds: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.0-test-maven-hadoop-2.7/711/
```
Error Message

"Table `default`.`createdJsonTable` already exists.;" did not contain "Table default.createdJsonTable already exists." We should complain that createdJsonTable already exists
```
### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #15145 from gatorsmile/fixTestCase. 19 September 2016, 17:21:33 UTC
ac06039 [SPARK-17297][DOCS] Clarify window/slide duration as absolute time, not relative to a calendar ## What changes were proposed in this pull request? Clarify that slide and window duration are absolute, and not relative to a calendar. ## How was this patch tested? Doc build (no functional change) Author: Sean Owen <sowen@cloudera.com> Closes #15142 from srowen/SPARK-17297. (cherry picked from commit d720a4019460b6c284d0473249303c349df60a1f) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 September 2016, 08:38:36 UTC
27ce39c [SPARK-17571][SQL] AssertOnQuery.condition should always return Boolean value ## What changes were proposed in this pull request? AssertOnQuery has two apply constructors: one that accepts a closure that returns boolean, and another that accepts a closure that returns Unit. This is actually very confusing because developers could mistakenly think that AssertOnQuery always requires a boolean return type and verifies the returned result, when in fact the value of the last statement is ignored in one of the constructors. This pull request makes the two constructors consistent and always requires a boolean value. It will overall make the test suites more robust against developer errors. As evidence of the confusing behavior, this change also identified a bug in an existing test case due to file system time granularity. This pull request fixes that test case as well. ## How was this patch tested? This is a test-only change. Author: petermaxlee <petermaxlee@gmail.com> Closes #15127 from petermaxlee/SPARK-17571. (cherry picked from commit 8f0c35a4d0dd458719627be5f524792bf244d70a) Signed-off-by: Reynold Xin <rxin@databricks.com> 18 September 2016, 22:22:08 UTC
151f808 [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly ## Problem CSV in Spark 2.0.0:
- does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType` -- this is a regression compared to 1.6;
- does not read empty values (specified by `options.nullValue`) as `null`s for `StringType` -- this is compatible with 1.6 but leads to problems like SPARK-16903.
## What changes were proposed in this pull request? This patch makes changes to read all empty values back as `null`s. ## How was this patch tested? New test cases. Author: Liwei Lin <lwlin7@gmail.com> Closes #14118 from lw-lin/csv-cast-null. (cherry picked from commit 1dbb725dbef30bf7633584ce8efdb573f2d92bca) Signed-off-by: Sean Owen <sowen@cloudera.com> 18 September 2016, 18:26:08 UTC
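A usage sketch of the behavior under test (the schema and input path are made up; assumes an active `spark` session):
```scala
import org.apache.spark.sql.types.{BooleanType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("flag", BooleanType),
  StructField("name", StringType)))

// With the fix, empty fields and the configured nullValue token come back as
// null for every type, including BooleanType and StringType.
val df = spark.read
  .schema(schema)
  .option("nullValue", "NA")
  .csv("/tmp/people.csv")  // hypothetical path
```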
6c67d86 [SPARK-17586][BUILD] Do not call static member via instance reference ## What changes were proposed in this pull request? This PR fixes a warning message as below:
```
[WARNING] .../UnsafeInMemorySorter.java:284: warning: [static] static method should be qualified by type name, TaskMemoryManager, instead of by an expression
[WARNING]     currentPageNumber = memoryManager.decodePageNumber(recordPointer)
```
by referencing the static member via the type name rather than an instance reference. ## How was this patch tested? Existing tests should cover this - Jenkins tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15141 from HyukjinKwon/SPARK-17586. (cherry picked from commit 7151011b38a841d9d4bc2e453b9a7cfe42f74f8f) Signed-off-by: Sean Owen <sowen@cloudera.com> 18 September 2016, 18:18:59 UTC
5619f09 [SPARK-17546][DEPLOY] start-* scripts should use hostname -f ## What changes were proposed in this pull request? Call `hostname -f` to get fully qualified host name ## How was this patch tested? Jenkins tests of course, but also verified output of command on OS X and Linux Author: Sean Owen <sowen@cloudera.com> Closes #15129 from srowen/SPARK-17546. (cherry picked from commit 342c0e65bec4b9a715017089ab6ea127f3c46540) Signed-off-by: Sean Owen <sowen@cloudera.com> 18 September 2016, 15:22:40 UTC
cf728b0 [SPARK-17541][SQL] fix some DDL bugs about table management when same-name temp view exists In `SessionCatalog`, we have several operations (`tableExists`, `dropTable`, `lookupRelation`, etc.) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp views only or metastore tables/views only. These bugs are: 1. `CREATE TABLE USING` will fail if a same-name temp view exists 2. `Catalog.dropTempView` will un-cache and drop a metastore table if a same-name table exists 3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp view exists. These bug fixes are pulled out from https://github.com/apache/spark/pull/14962 and target both master and the 2.0 branch. New regression tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #15099 from cloud-fan/fix-view. (cherry picked from commit 3fe630d314cf50d69868b7707ac8d8d2027080b8) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 September 2016, 13:50:05 UTC
5fd354b [SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls List.length/size which is O(n) This PR fixes more of the instances of the issue addressed in the previous PR. To make sure, I manually debugged and also checked the Scala source. `length` in [LinearSeqOptimized.scala#L49-L57](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/LinearSeqOptimized.scala#L49-L57) is O(n). Also, `size` calls `length` via [SeqLike.scala#L106](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/SeqLike.scala#L106). For debugging, I have created these as below:
```scala
ArrayBuffer(1, 2, 3)
Array(1, 2, 3)
List(1, 2, 3)
Seq(1, 2, 3)
```
and then called `size` and `length` for each to debug. I ran the bash commands below on Mac
```bash
find . -name *.scala -type f -exec grep -il "while (.*\\.length)" {} \; | grep "src/main"
find . -name *.scala -type f -exec grep -il "while (.*\\.size)" {} \; | grep "src/main"
```
and then checked each. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15093 from HyukjinKwon/SPARK-17480-followup. (cherry picked from commit 86c2d393a56bf1e5114bc5a781253c0460efb8af) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 September 2016, 21:27:22 UTC
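Why the `length`/`size`-in-loop-condition pattern matters, in a small self-contained example (illustrative only):
```scala
val xs = List.range(0, 100000)

// Quadratic: List.length is O(n) and is re-evaluated on every iteration.
var i = 0
while (i < xs.length) { i += 1 }

// Linear: hoist the length once (or use an indexed collection like Array).
val n = xs.length
i = 0
while (i < n) { i += 1 }
```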
0cfc046 Revert "[SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls List.length/size which is O(n)" This reverts commit a3bba372abce926351335d0a2936b70988f19b23. 17 September 2016, 21:18:40 UTC
bec0770 [SPARK-17491] Close serialization stream to fix wrong answer bug in putIteratorAsBytes() ## What changes were proposed in this pull request? `MemoryStore.putIteratorAsBytes()` may silently lose values when used with `KryoSerializer` because it does not properly close the serialization stream before attempting to deserialize the already-serialized values, which may cause values buffered in Kryo's internal buffers to not be read. This is the root cause behind a user-reported "wrong answer" bug in PySpark caching reported by bennoleslie on the Spark user mailing list in a thread titled "pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Due to Spark 2.0's automatic use of KryoSerializer for "safe" types (such as byte arrays, primitives, etc.) this misuse of serializers manifested itself as silent data corruption rather than a StreamCorrupted error (which you might get from JavaSerializer). The minimal fix, implemented here, is to close the serialization stream before attempting to deserialize written values. In addition, this patch adds several additional assertions / precondition checks to prevent misuse of `PartiallySerializedBlock` and `ChunkedByteBufferOutputStream`. ## How was this patch tested? The original bug was masked by an invalid assert in the memory store test cases: the old assert compared two results record-by-record with `zip` but didn't first check that the lengths of the two collections were equal, causing missing records to go unnoticed. The updated test case reproduced this bug. In addition, I added a new `PartiallySerializedBlockSuite` to unit test that component. Author: Josh Rosen <joshrosen@databricks.com> Closes #15043 from JoshRosen/partially-serialized-block-values-iterator-bugfix. (cherry picked from commit 8faa5217b44e8d52eab7eb2d53d0652abaaf43cd) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 17 September 2016, 18:46:39 UTC
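The root cause in miniature, using Kryo directly rather than Spark's serializer wrappers (a hedged sketch; requires the kryo library on the classpath):
```scala
import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}

val kryo = new Kryo()
val baos = new ByteArrayOutputStream()
val out = new Output(baos)
kryo.writeClassAndObject(out, "hello")
// Kryo buffers internally: without close() (or flush()), some bytes may never
// reach baos, and the read side silently loses trailing values.
out.close()

val in = new Input(baos.toByteArray)
val back = kryo.readClassAndObject(in)  // "hello"
```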
a3bba37 [SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls List.length/size which is O(n) This PR fixes more of the instances of the issue addressed in the previous PR. To make sure, I manually debugged and also checked the Scala source. `length` in [LinearSeqOptimized.scala#L49-L57](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/LinearSeqOptimized.scala#L49-L57) is O(n). Also, `size` calls `length` via [SeqLike.scala#L106](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/SeqLike.scala#L106). For debugging, I have created these as below:
```scala
ArrayBuffer(1, 2, 3)
Array(1, 2, 3)
List(1, 2, 3)
Seq(1, 2, 3)
```
and then called `size` and `length` for each to debug. I ran the bash commands below on Mac
```bash
find . -name *.scala -type f -exec grep -il "while (.*\\.length)" {} \; | grep "src/main"
find . -name *.scala -type f -exec grep -il "while (.*\\.size)" {} \; | grep "src/main"
```
and then checked each. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15093 from HyukjinKwon/SPARK-17480-followup. (cherry picked from commit 86c2d393a56bf1e5114bc5a781253c0460efb8af) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 September 2016, 16:06:44 UTC
ec2b736 [SPARK-17575][DOCS] Remove extra table tags in configuration document ## What changes were proposed in this pull request? Remove extra table tags in the configuration document. ## How was this patch tested? Ran all test cases and generated the document. (Before/after screenshots showing the rendering fix are attached to the PR.) Author: sandy <phalodi@gmail.com> Closes #15130 from phalodi/SPARK-17575. (cherry picked from commit bbe0b1d623741decce98827130cc67eb1fff1240) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 September 2016, 15:25:14 UTC
eb2675d [SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector ## What changes were proposed in this pull request? This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector. ## How was this patch tested? I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected. Author: William Benton <willb@redhat.com> Closes #15105 from willb/fix/findSynonyms. (cherry picked from commit 25cbbe6ca334140204e7035ab8b9d304da9b8a8a) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 September 2016, 11:50:09 UTC
c9bd67e [SPARK-17561][DOCS] DataFrameWriter documentation formatting problems Fix `<ul> / <li>` problems in SQL scaladoc. Scaladoc build and manual verification of generated HTML. Author: Sean Owen <sowen@cloudera.com> Closes #15117 from srowen/SPARK-17561. (cherry picked from commit b9323fc9381a09af510f542fd5c86473e029caf6) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 September 2016, 11:43:30 UTC
3ca0dc0 [SPARK-17567][DOCS] Use valid url to Spark RDD paper https://issues.apache.org/jira/browse/SPARK-17567 ## What changes were proposed in this pull request? Documentation (http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD) contains broken link to Spark paper (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I found it elsewhere (https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) and I hope it is the same one. It should be uploaded to and linked from some Apache controlled storage, so it won't break again. ## How was this patch tested? Tested manually on local laptop. Author: Xin Ren <iamshrek@126.com> Closes #15121 from keypointt/SPARK-17567. (cherry picked from commit f15d41be3ce7569736ccbf2ffe1bec265865f55d) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 September 2016, 11:30:36 UTC
9ff158b Correct fetchsize property name in docs ## What changes were proposed in this pull request? Replace `fetchSize` with `fetchsize` in the docs. ## How was this patch tested? I manually tested `fetchSize` and `fetchsize`. The latter has an effect. See also [`JdbcUtils.scala#L38`](https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L38) for the definition of the property. Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #14975 from darabos/patch-3. (cherry picked from commit 69cb0496974737347e2650cda436b39bbd51e581) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 September 2016, 11:29:01 UTC
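Usage sketch with the corrected option name (connection details are hypothetical):
```scala
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // made-up connection
  .option("dbtable", "orders")                          // made-up table
  .option("fetchsize", "1000")  // lowercase, per JdbcUtils.scala#L38
  .load()
```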
3fce125 [SPARK-17549][SQL] Only collect table size stat in driver for cached relation. The existing code caches all stats for all columns for each partition in the driver; for a large relation, this causes extreme memory usage, which leads to gc hell and application failures. It seems that only the size in bytes of the data is actually used in the driver, so instead just collect that. In executors, the full stats are still kept, but that's not a big problem; we expect the data to be distributed and thus not really incur too much memory pressure in each individual executor. There are also potential improvements on the executor side, since the data being stored currently is very wasteful (e.g. storing boxed types vs. primitive types for stats). But that's a separate issue. On a mildly related change, I'm also adding code to catch exceptions in the code generator, since Janino was breaking with the test data I tried this patch on. Tested with unit tests and by doing a count on a very wide table (20k columns) with many partitions. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #15112 from vanzin/SPARK-17549. (cherry picked from commit 39e2bad6a866d27c3ca594d15e574a1da3ee84cc) Signed-off-by: Yin Huai <yhuai@databricks.com> 16 September 2016, 21:03:08 UTC
5ad4395 [SPARK-17558] Bump Hadoop 2.7 version from 2.7.2 to 2.7.3 ## What changes were proposed in this pull request? This patch bumps the Hadoop version in hadoop-2.7 profile from 2.7.2 to 2.7.3, which was recently released and contained a number of bug fixes. ## How was this patch tested? The change should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #15115 from rxin/SPARK-17558. (cherry picked from commit dca771bec6edb1cd8fc75861d364e0ba9dccf7c3) Signed-off-by: Reynold Xin <rxin@databricks.com> 16 September 2016, 18:24:40 UTC
9c23f44 [SPARK-17484] Prevent invalid block locations from being reported after put() exceptions ## What changes were proposed in this pull request? If a BlockManager `put()` call failed after the BlockManagerMaster was notified of a block's availability then incomplete cleanup logic in a `finally` block would never send a second block status method to inform the master of the block's unavailability. This, in turn, leads to fetch failures and used to be capable of causing complete job failures before #15037 was fixed. This patch addresses this issue via multiple small changes: - The `finally` block now calls `removeBlockInternal` when cleaning up from a failed `put()`; in addition to removing the `BlockInfo` entry (which was _all_ that the old cleanup logic did), this code (redundantly) tries to remove the block from the memory and disk stores (as an added layer of defense against bugs lower down in the stack) and optionally notifies the master of block removal (which now happens during exception-triggered cleanup). - When a BlockManager receives a request for a block that it does not have it will now notify the master to update its block locations. This ensures that bad metadata pointing to non-existent blocks will eventually be fixed. Note that I could have implemented this logic in the block manager client (rather than in the remote server), but that would introduce the problem of distinguishing between transient and permanent failures; on the server, however, we know definitively that the block isn't present. - Catch `NonFatal` instead of `Exception` to avoid swallowing `InterruptedException`s thrown from synchronous block replication calls. This patch depends upon the refactorings in #15036, so that other patch will also have to be backported when backporting this fix. For more background on this issue, including example logs from a real production failure, see [SPARK-17484](https://issues.apache.org/jira/browse/SPARK-17484). ## How was this patch tested? Two new regression tests in BlockManagerSuite. Author: Josh Rosen <joshrosen@databricks.com> Closes #15085 from JoshRosen/SPARK-17484. (cherry picked from commit 1202075c95eabba0ffebc170077df798f271a139) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 15 September 2016, 18:54:39 UTC
0169c2e [SPARK-17364][SQL] Antlr lexer wrongly treats full qualified identifier as a decimal number token when parsing SQL string ## What changes were proposed in this pull request? The Antlr lexer we use to tokenize a SQL string may wrongly tokenize a fully qualified identifier as a decimal number token. For example, table identifier `default.123_table` is wrongly tokenized as
```
default   // Matches lexer rule IDENTIFIER
.123      // Matches lexer rule DECIMAL_VALUE
_TABLE    // Matches lexer rule IDENTIFIER
```
The correct tokenization for `default.123_table` should be:
```
default   // Matches lexer rule IDENTIFIER,
.         // Matches a single dot
123_TABLE // Matches lexer rule IDENTIFIER
```
This PR fixes the Antlr grammar so that it tokenizes fully qualified identifiers correctly: 1. Fully qualified table names can be parsed correctly, for example `select * from database.123_suffix`. 2. Fully qualified column names can be parsed correctly, for example `select a.123_suffix from a`. ### Before change #### Case 1: Failed to parse fully qualified column name
```
scala> spark.sql("select a.123_column from a").show
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.123' expecting {<EOF>, ... , IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 8)

== SQL ==
select a.123_column from a
--------^^^
```
#### Case 2: Failed to parse fully qualified table name
```
scala> spark.sql("select * from default.123_table")
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.123' expecting {<EOF>, ... IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 21)

== SQL ==
select * from default.123_table
---------------------^^^
```
### After Change #### Case 1: fully qualified column name, no ParseException thrown
```
scala> spark.sql("select a.123_column from a").show
```
#### Case 2: fully qualified table name, no ParseException thrown
```
scala> spark.sql("select * from default.123_table")
```
## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #15006 from clockfly/SPARK-17364. (cherry picked from commit a6b8182006d0c3dda67c06861067ca78383ecf1b) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 15 September 2016, 18:54:01 UTC
abb89c4 [SPARK-17483] Refactoring in BlockManager status reporting and block removal This patch makes three minor refactorings to the BlockManager: - Move the `if (info.tellMaster)` check out of `reportBlockStatus`; this fixes an issue where a debug logging message would incorrectly claim to have reported a block status to the master even though no message had been sent (in case `info.tellMaster == false`). This also makes it easier to write code which unconditionally sends block statuses to the master (which is necessary in another patch of mine). - Split `removeBlock()` into two methods, the existing method and an internal `removeBlockInternal()` method which is designed to be called by internal code that already holds a write lock on the block. This is also needed by a followup patch. - Instead of calling `getCurrentBlockStatus()` in `removeBlock()`, just pass `BlockStatus.empty`; the block status should always be empty following complete removal of a block. These changes were originally authored as part of a bug fix patch which is targeted at branch-2.0 and master; I've split them out here into their own separate PR in order to make them easier to review and so that the behavior-changing parts of my other patch can be isolated to their own PR. Author: Josh Rosen <joshrosen@databricks.com> Closes #15036 from JoshRosen/cache-failure-race-conditions-refactorings-only. (cherry picked from commit 3d40896f410590c0be044b3fa7e5d32115fac05e) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 15 September 2016, 18:51:43 UTC
62ab536 [SPARK-17114][SQL] Fix aggregates grouped by literals with empty input ## What changes were proposed in this pull request? This PR fixes an issue with aggregates that have an empty input and use literals as their grouping keys. These aggregates are currently interpreted as aggregates **without** grouping keys, which triggers the ungrouped code path (which always returns a single row). This PR fixes the `RemoveLiteralFromGroupExpressions` optimizer rule, which was changing the semantics of the Aggregate by eliminating all literal grouping keys. ## How was this patch tested? Added tests to `SQLQueryTestSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15101 from hvanhovell/SPARK-17114-3. (cherry picked from commit d403562eb4b5b1d804909861d3e8b75d8f6323b9) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 15 September 2016, 18:24:29 UTC
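To see why dropping a literal grouping key changes semantics, consider the hedged illustration below. It assumes a SparkSession named `spark`; `empty_table` is a hypothetical zero-row table.

```scala
// A sketch of the semantic difference this rule change restores.
spark.range(0).createOrReplaceTempView("empty_table")

// Ungrouped aggregate: always returns exactly one row, even on empty input.
spark.sql("SELECT COUNT(*) FROM empty_table").show()  // one row: 0

// Grouped aggregate, even when the only key is a literal: an empty input has
// no groups, so the correct result is zero rows. Eliminating the literal key
// entirely would wrongly send this query down the ungrouped path above.
spark.sql("SELECT COUNT(*) FROM empty_table GROUP BY 'x'").show()  // zero rows
```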
e77a437 [SPARK-17547] Ensure temp shuffle data file is cleaned up after error SPARK-8029 (#9610) modified shuffle writers to first stage their data to a temporary file in the same directory as the final destination file and then to atomically rename this temporary file at the end of the write job. However, this change introduced the potential for the temporary output file to be leaked if an exception occurs during the write because the shuffle writers' existing error cleanup code doesn't handle deletion of the temp file. This patch avoids this potential cause of disk-space leaks by adding `finally` blocks to ensure that temp files are always deleted if they haven't been renamed. Author: Josh Rosen <joshrosen@databricks.com> Closes #15104 from JoshRosen/cleanup-tmp-data-file-in-shuffle-writer. (cherry picked from commit 5b8f7377d54f83b93ef2bfc2a01ca65fae6d3032) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 15 September 2016, 18:23:17 UTC
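The cleanup pattern the patch describes is a standard stage-then-rename with a `finally` guard. Here is a minimal, self-contained sketch; the file names and the `write` callback are illustrative, not the shuffle writers' actual signatures.

```scala
import java.io.{File, IOException}

// A minimal sketch, assuming output is staged to a temp file in the same
// directory and renamed atomically on success.
def writeShuffleOutput(dir: File, write: File => Unit): File = {
  val tmp = File.createTempFile("shuffle", ".tmp", dir)
  val dest = new File(dir, "shuffle.data")
  try {
    write(tmp)
    if (!tmp.renameTo(dest)) {
      throw new IOException(s"failed to rename $tmp to $dest")
    }
    dest
  } finally {
    // If an exception prevented the rename, the temp file still exists and
    // must be deleted here to avoid leaking disk space.
    if (tmp.exists()) tmp.delete()
  }
}
```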
a09c258 [SPARK-17317][SPARKR] Add SparkR vignette to branch 2.0 ## What changes were proposed in this pull request? This PR adds SparkR vignette to branch 2.0, which works as a friendly guidance going through the functionality provided by SparkR. ## How was this patch tested? R unit test. Author: junyangq <qianjunyang@gmail.com> Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Author: Junyang Qian <junyangq@databricks.com> Closes #15100 from junyangq/SPARKR-vignette-2.0. 15 September 2016, 17:00:36 UTC
5c2bc83 [SPARK-17521] Error when I use sparkContext.makeRDD(Seq()) ## What changes were proposed in this pull request? When I use sc.makeRDD as below ``` val data3 = sc.makeRDD(Seq()) println(data3.partitions.length) ``` I get an error: Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required We can fix this bug by modifying the last line to take seq.size into account: ``` def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope { assertNotStopped() val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, defaultParallelism), indexToPrefs) } ``` ## How was this patch tested? Manual tests. Author: codlife <1004910847@qq.com> Author: codlife <wangjianfei15@otcaix.iscas.ac.cn> Closes #15077 from codlife/master. (cherry picked from commit 647ee05e5815bde361662a9286ac602c44b4d4e6) Signed-off-by: Sean Owen <sowen@cloudera.com> 15 September 2016, 08:38:22 UTC
bb2bdb4 [SPARK-17465][SPARK CORE] Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak The check `if (memoryMap(taskAttemptId) == 0) memoryMap.remove(taskAttemptId)` in the methods `releaseUnrollMemoryForThisTask` and `releasePendingUnrollMemoryForThisTask` should be performed after the release-memory operation, whether `memoryToRelease` is > 0 or not. If the memory of a task has already been set to 0 when `releaseUnrollMemoryForThisTask` or `releasePendingUnrollMemoryForThisTask` is called, the key in the memory map corresponding to that task will never be removed from the hash map. See the details in [SPARK-17465](https://issues.apache.org/jira/browse/SPARK-17465). Author: Xing SHI <shi-kou@indetail.co.jp> Closes #15022 from saturday-shi/SPARK-17465. 14 September 2016, 21:00:57 UTC
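The ordering bug is easiest to see in a simplified sketch. The map and method shape below are stand-ins for the MemoryStore internals, not the real code.

```scala
import scala.collection.mutable

// A simplified sketch: the zero-entry check must run *after* the release and
// unconditionally, or a task whose entry is already 0 is never removed.
val memoryMap = mutable.HashMap[Long, Long]()

def releaseUnrollMemoryForTask(taskAttemptId: Long, memoryToRelease: Long): Unit = {
  memoryMap.get(taskAttemptId).foreach { held =>
    if (memoryToRelease > 0) {
      memoryMap(taskAttemptId) = held - math.min(held, memoryToRelease)
    }
    // Correct placement: check for an empty entry even when memoryToRelease
    // is 0, so a zero-valued entry is always cleaned up.
    if (memoryMap(taskAttemptId) == 0) {
      memoryMap.remove(taskAttemptId)
    }
  }
}
```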
fffcec9 [SPARK-17463][CORE] Make CollectionAccumulator and SetAccumulator's values readable in a thread-safe manner ## What changes were proposed in this pull request? Make the values of CollectionAccumulator and SetAccumulator readable in a thread-safe manner, to fix the ConcurrentModificationException reported in [JIRA](https://issues.apache.org/jira/browse/SPARK-17463). ## How was this patch tested? Existing tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15063 from zsxwing/SPARK-17463. (cherry picked from commit e33bfaed3b160fbc617c878067af17477a0044f5) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 14 September 2016, 20:34:27 UTC
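The general approach is to guard reads and writes with the same lock and hand callers a copy rather than the live buffer. A hedged sketch of that idea, mirroring the pattern rather than Spark's exact accumulator code:

```scala
import java.util.{ArrayList, Collections, List => JList}

// A minimal sketch: `value` returns an unmodifiable snapshot taken under the
// same lock that guards `add`, so readers iterating the returned list cannot
// race with tasks that keep accumulating.
class ThreadSafeCollection[T] {
  private val buf: JList[T] = new ArrayList[T]()

  def add(v: T): Unit = synchronized { buf.add(v) }

  def value: JList[T] = synchronized {
    Collections.unmodifiableList(new ArrayList[T](buf))
  }
}
```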
fab77da [SPARK-17511] Yarn Dynamic Allocation: Avoid marking released container as Failed Due to race conditions, the `assert(numExecutorsRunning <= targetNumExecutors)` can fail, causing an `AssertionError`: ``` java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489) at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` This patch removes the assertion and instead moves the conditional check before launching a new container. This was manually tested using a large ForkAndJoin job with Dynamic Allocation enabled, validating that the previously failing job succeeds without any such exception. Author: Kishor Patil <kpatil@yahoo-inc.com> Closes #15069 from kishorvpatil/SPARK-17511. (cherry picked from commit ff6e4cbdc80e2ad84c5d70ee07f323fad9374e3e) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com> 14 September 2016, 19:33:40 UTC
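A hedged sketch of the control-flow change: rather than asserting the invariant after launch (which races with target changes), check the target before launching and release surplus containers. All names below are illustrative stand-ins for the YarnAllocator internals.

```scala
// A minimal sketch, assuming `launch` starts an executor in the container
// and `release` hands the container back to YARN.
def handleAllocatedContainer(running: Int, target: Int,
                             launch: () => Unit,
                             release: () => Unit): Unit = {
  if (running < target) {
    launch() // safe: the target was re-checked just before launching
  } else {
    // Surplus container due to a race with a shrinking target: release it
    // instead of failing an assertion and marking the executor as Failed.
    release()
  }
}
```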
6fe5972 [SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same in Python ## What changes were proposed in this pull request? In PySpark, `df.take(1)` runs a single-stage job which computes only one partition of the DataFrame, while `df.limit(1).collect()` computes all partitions and runs a two-stage job. This difference in performance is confusing. The reason why `limit(1).collect()` is so much slower is that `collect()` internally maps to `df.rdd.<some-pyspark-conversions>.toLocalIterator`, which causes Spark SQL to build a query where a global limit appears in the middle of the plan; this, in turn, ends up being executed inefficiently because limits in the middle of plans are now implemented by repartitioning to a single task rather than by running a `take()` job on the driver (this was done in #7334, a patch which was a prerequisite to allowing partition-local limits to be pushed beneath unions, etc.). In order to fix this performance problem I think that we should generalize the fix from SPARK-10731 / #8876 so that `DataFrame.collect()` also delegates to the Scala implementation and shares the same performance properties. This patch modifies `DataFrame.collect()` to first collect all results to the driver and then pass them to Python, allowing this query to be planned using Spark's `CollectLimit` optimizations. ## How was this patch tested? Added a regression test in `sql/tests.py` which asserts that the expected number of jobs, stages, and tasks are run for both queries. Author: Josh Rosen <joshrosen@databricks.com> Closes #15068 from JoshRosen/pyspark-collect-limit. (cherry picked from commit 6d06ff6f7e2dd72ba8fe96cd875e83eda6ebb2a9) Signed-off-by: Davies Liu <davies.liu@gmail.com> 14 September 2016, 17:10:11 UTC
5493107 [SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party packages ## What changes were proposed in this pull request? Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki. ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #15075 from srowen/SPARK-17445. (cherry picked from commit dc0a4c916151c795dc41b5714e9d23b4937f4636) Signed-off-by: Sean Owen <sowen@cloudera.com> 14 September 2016, 09:14:57 UTC
c6ea748 [SPARK-17480][SQL] Improve performance by removing or caching List.length which is O(n) ## What changes were proposed in this pull request? Scala's List.length method is O(N), which makes the gatherCompressibilityStats function O(N^2). Eliminate the List.length calls by writing the traversal in idiomatic Scala. https://github.com/scala/scala/blob/2.10.x/src/library/scala/collection/LinearSeqOptimized.scala#L36 As suggested, the fix was extended to the HiveInspectors and AggregationIterator classes as well. ## How was this patch tested? Profiled a Spark job and found that CompressibleColumnBuilder is using 39% of the CPU. Out of this 39%, CompressibleColumnBuilder->gatherCompressibilityStats is using 23%, and 6.24% of the CPU is spent on List.length, which is called inside gatherCompressibilityStats. After this change we save that 6.24% of the CPU. Author: Ergin Seyfe <eseyfe@fb.com> Closes #15032 from seyfe/gatherCompressibilityStats. (cherry picked from commit 4cea9da2ae88b40a5503111f8f37051e2372163e) Signed-off-by: Sean Owen <sowen@cloudera.com> 14 September 2016, 08:51:22 UTC
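The idiom change at the heart of this fix: calling `length` on a Scala `List` inside a loop condition costs O(n) per call (O(n^2) overall), while walking head/tail costs O(n) total. A sketch with a hypothetical stats computation standing in for gatherCompressibilityStats:

```scala
// A minimal sketch of the O(n) traversal; the summing is illustrative only.
def gatherStats(values: List[Int]): Long = {
  var sum = 0L
  var rest = values
  // No `values.length` in the loop condition: just consume the list.
  while (rest.nonEmpty) {
    sum += rest.head
    rest = rest.tail
  }
  sum
}
```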
12ebfbe [SPARK-17525][PYTHON] Remove SparkContext.clearFiles() from the PySpark API as it was removed from the Scala API prior to Spark 2.0.0 ## What changes were proposed in this pull request? This pull request removes the SparkContext.clearFiles() method from the PySpark API as the method was removed from the Scala API in 8ce645d4eeda203cf5e100c4bdba2d71edd44e6a. Using that method in PySpark leads to an exception as PySpark tries to call the non-existent method on the JVM side. ## How was this patch tested? Existing tests (though none of them tested this particular method). Author: Sami Jaktholm <sjakthol@outlook.com> Closes #15081 from sjakthol/pyspark-sc-clearfiles. (cherry picked from commit b5bfcddbfbc2e79d3d0fbd43942716946e6c4ba3) Signed-off-by: Sean Owen <sowen@cloudera.com> 14 September 2016, 08:38:39 UTC
c142645 [SPARK-17531] Don't initialize Hive Listeners for the Execution Client ## What changes were proposed in this pull request? If a user provides listeners inside the Hive Conf, the configuration for these listeners are passed to the Hive Execution Client as well. This may cause issues for two reasons: 1. The Execution Client will actually generate garbage 2. The listener class needs to be both in the Spark Classpath and Hive Classpath This PR empties the listener configurations in `HiveUtils.newTemporaryConfiguration` so that the execution client will not contain the listener confs, but the metadata client will. ## How was this patch tested? Unit tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #15086 from brkyvz/null-listeners. (cherry picked from commit 72edc7e958271cedb01932880550cfc2c0631204) Signed-off-by: Yin Huai <yhuai@databricks.com> 13 September 2016, 22:12:16 UTC
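A hedged sketch of the idea: when building the execution client's temporary configuration, blank out the listener keys so listener classes only need to be on the metastore client's classpath. The key names below are standard Hive settings, but treat the exact set Spark clears as an assumption of this sketch.

```scala
// A minimal sketch over a plain Map; HiveUtils works on a real HiveConf.
val listenerKeys = Seq(
  "hive.metastore.pre.event.listeners",
  "hive.metastore.event.listeners",
  "hive.metastore.end.function.listeners")

def withoutListeners(conf: Map[String, String]): Map[String, String] =
  conf ++ listenerKeys.map(_ -> "") // empty values disable the listeners
```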
b17f10c [SPARK-17515] CollectLimit.execute() should perform per-partition limits CollectLimit.execute() incorrectly omits per-partition limits, leading to performance regressions when this code path is hit (which should not happen in normal operation, but can occur in some cases; see #15068 for one example). Added a regression test in SQLQuerySuite that asserts the number of records scanned from the input RDD. Author: Josh Rosen <joshrosen@databricks.com> Closes #15070 from JoshRosen/SPARK-17515. (cherry picked from commit 3f6a2bb3f7beac4ce928eb660ee36258b5b9e8c8) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 13 September 2016, 11:00:11 UTC
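Conceptually, the per-partition limit bounds how many rows cross the shuffle before the final global limit. A hedged sketch on the public RDD API (not the actual SparkSQL CollectLimit operator):

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// A sketch of the shape of the optimization: limit within each partition
// first, then gather into one partition and apply the global limit.
def collectLimit[T: ClassTag](rdd: RDD[T], limit: Int): Array[T] = {
  rdd.mapPartitions(_.take(limit)) // per-partition limit: the missing step
     .coalesce(1)                  // gather into a single partition
     .mapPartitions(_.take(limit)) // final global limit
     .collect()
}
```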
1f72e77 [SPARK-17474] [SQL] fix python udf in TakeOrderedAndProjectExec ## What changes were proposed in this pull request? When there is a Python UDF in the Project between Sort and Limit, it is collected into TakeOrderedAndProjectExec, but ExtractPythonUDFs fails to pull the Python UDFs out because QueryPlan.expressions does not include expressions inside Option[Seq[Expression]]. Ideally, we should fix `QueryPlan.expressions`, but attempts to do so always ran into an infinite loop. In this PR, I changed TakeOrderedAndProjectExec to not use Option[Seq[Expression]] as a workaround. cc JoshRosen ## How was this patch tested? Added a regression test. Author: Davies Liu <davies@databricks.com> Closes #15030 from davies/all_expr. (cherry picked from commit a91ab705e8c124aa116c3e5b1f3ba88ce832dcde) Signed-off-by: Davies Liu <davies.liu@gmail.com> 12 September 2016, 23:35:52 UTC
a3fc576 [SPARK-17485] Prevent failed remote reads of cached blocks from failing entire job ## What changes were proposed in this pull request? In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring). In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable. Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`. ## How was this patch tested? Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method. I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master). Author: Josh Rosen <joshrosen@databricks.com> Closes #15037 from JoshRosen/SPARK-17485. (cherry picked from commit f9c580f11098d95f098936a0b90fa21d71021205) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 12 September 2016, 22:44:20 UTC
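The caller-side contract after this change is a simple fallback chain: local read, then remote read (which now surfaces fetch failures as `None`), then recompute. A hedged sketch with illustrative names, not Spark's actual method signatures:

```scala
// A minimal sketch of the fallback chain in RDD.getOrCompute, assuming the
// remote read returns None on fetch errors instead of throwing.
def getOrCompute[T](readLocal: () => Option[T],
                    readRemote: () => Option[T],
                    compute: () => T): T = {
  readLocal()
    .orElse(readRemote())   // None on failed remote fetch, not an exception
    .getOrElse(compute())   // no cached copy readable: recompute the block
}
```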
37f45bf [SPARK-14818] Post-2.0 MiMa exclusion and build changes This patch makes a handful of post-Spark-2.0 MiMa exclusion and build updates. It should be merged to master and a subset of it should be picked into branch-2.0 in order to test Spark 2.0.1-SNAPSHOT. - Remove the ` sketch`, `mllibLocal`, and `streamingKafka010` from the list of excluded subprojects so that MiMa checks them. - Remove now-unnecessary special-case handling of the Kafka 0.8 artifact in `mimaSettings`. - Move the exclusion added in SPARK-14743 from `v20excludes` to `v21excludes`, since that patch was only merged into master and not branch-2.0. - Add exclusions for an API change introduced by SPARK-17096 / #14675. - Add missing exclusions for the `o.a.spark.internal` and `o.a.spark.sql.internal` packages. Author: Josh Rosen <joshrosen@databricks.com> Closes #15061 from JoshRosen/post-2.0-mima-changes. (cherry picked from commit 7c51b99a428a965ff7d136e1cdda20305d260453) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 12 September 2016, 22:34:49 UTC
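For readers unfamiliar with MiMa, exclusions are typically expressed as problem filters in the build. A hedged sketch of how such entries usually look in project/MimaExcludes.scala; the symbols below are illustrative stand-ins, not the exact filters added by this patch.

```scala
import com.typesafe.tools.mima.core._

// A sketch, assuming the MiMa sbt plugin is on the build classpath.
val v21excludes = Seq(
  // Exclude whole internal packages that are not public API.
  ProblemFilters.exclude[Problem]("org.apache.spark.internal.*"),
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.internal.*"),
  // Exclude a specific class removed by an API change (hypothetical name).
  ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SomeRemovedClass")
)
```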
0a36e36 [SPARK-17503][CORE] Fix memory leak in Memory store when unable to cache the whole RDD in memory ## What changes were proposed in this pull request? MemoryStore may throw OutOfMemoryError when trying to cache a super big RDD that cannot fit in memory. ``` scala> sc.parallelize(1 to 1000000000, 100).map(x => new Array[Long](1000)).cache().count() java.lang.OutOfMemoryError: Java heap space at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24) at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232) at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683) at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` Spark MemoryStore uses SizeTrackingVector as a temporary unrolling buffer to store all input values that it has read so far before transferring the values to storage memory cache. The problem is that when the input RDD is too big for caching in memory, the temporary unrolling memory SizeTrackingVector is not garbage collected in time. As SizeTrackingVector can occupy all available storage memory, it may cause the executor JVM to run out of memory quickly. More info can be found at https://issues.apache.org/jira/browse/SPARK-17503 ## How was this patch tested? Unit test and manual test. ### Before change Heap memory consumption <img width="702" alt="screen shot 2016-09-12 at 4 16 15 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429524/60d73a26-7906-11e6-9768-6f286f5c58c8.png"> Heap dump <img width="1402" alt="screen shot 2016-09-12 at 4 34 19 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429577/cbc1ef20-7906-11e6-847b-b5903f450b3b.png"> ### After change Heap memory consumption <img width="706" alt="screen shot 2016-09-12 at 4 29 10 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429503/4abe9342-7906-11e6-844a-b2f815072624.png"> Author: Sean Zhong <seanzhong@databricks.com> Closes #15056 from clockfly/memory_store_leak. (cherry picked from commit 1742c3ab86d75ce3d352f7cddff65e62fb7c8dd4) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 12 September 2016, 18:33:37 UTC
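The essence of the fix is to drop the store's reference to the unrolling buffer as soon as it has been drained, so the partially unrolled values can be garbage collected while the rest of the partition streams through. A hedged sketch of that idea, with a plain `Vector` standing in for SizeTrackingVector:

```scala
// A minimal sketch of the reference-dropping pattern; not Spark's actual
// PartiallyUnrolledIterator.
class PartiallyUnrolled[T](private var unrolled: Vector[T], rest: Iterator[T])
  extends Iterator[T] {

  private var fromBuffer: Iterator[T] = unrolled.iterator

  override def hasNext: Boolean = fromBuffer.hasNext || rest.hasNext

  override def next(): T = {
    if (fromBuffer.hasNext) {
      val v = fromBuffer.next()
      if (!fromBuffer.hasNext) {
        unrolled = null          // release the buffer as soon as it drains
        fromBuffer = Iterator.empty
      }
      v
    } else {
      rest.next()
    }
  }
}
```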
3052152 [SPARK-17486] Remove unused TaskMetricsUIData.updatedBlockStatuses field The `TaskMetricsUIData.updatedBlockStatuses` field is assigned to but never read, increasing the memory consumption of the web UI. We should remove this field. Author: Josh Rosen <joshrosen@databricks.com> Closes #15038 from JoshRosen/remove-updated-block-statuses-from-TaskMetricsUIData. (cherry picked from commit 72eec70bdbf6fb67c977463db5d8d95dd3040ae8) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 12 September 2016, 04:51:33 UTC
d293062 [SPARK-17336][PYSPARK] Fix appending multiple times to PYTHONPATH from spark-config.sh ## What changes were proposed in this pull request? During startup of Spark standalone, the script file spark-config.sh appends to the PYTHONPATH and can be sourced many times, causing duplicates in the path. This change adds an env flag that is set when the PYTHONPATH is appended, so the append happens only once. ## How was this patch tested? Manually started standalone master/worker and verified PYTHONPATH has no duplicate entries. Author: Bryan Cutler <cutlerb@gmail.com> Closes #15028 from BryanCutler/fix-duplicate-pythonpath-SPARK-17336. (cherry picked from commit c76baff0cc4775c2191d075cc9a8176e4915fec8) Signed-off-by: Sean Owen <sowen@cloudera.com> 11 September 2016, 09:19:49 UTC
bde5452 [SPARK-17439][SQL] Fixing compression issues with approximate quantiles and adding more tests This PR builds on #14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors. It adds 8 unit tests that were failing without the fix. Author: Timothy Hunter <timhunter@databricks.com> Author: Sean Owen <sowen@cloudera.com> Closes #15002 from thunterdb/ml-1783. (cherry picked from commit 180796ecb3a00facde2d98affdb5aa38dd258875) Signed-off-by: Sean Owen <sowen@cloudera.com> 11 September 2016, 08:56:54 UTC
c2378a6 [SPARK-17396][CORE] Share the task support between UnionRDD instances. ## What changes were proposed in this pull request? Share the ForkJoinTaskSupport between UnionRDD instances to avoid creating a huge number of threads if lots of RDDs are created at the same time. ## How was this patch tested? This uses existing UnionRDD tests. Author: Ryan Blue <blue@apache.org> Closes #14985 from rdblue/SPARK-17396-use-shared-pool. (cherry picked from commit 6ea5055fa734d435b5f148cf52d3385a57926b60) Signed-off-by: Sean Owen <sowen@cloudera.com> 10 September 2016, 09:20:23 UTC
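The sharing pattern is a singleton task support instead of a per-instance pool. A hedged sketch, assuming Scala 2.11 (where ForkJoinTaskSupport wraps scala.concurrent.forkjoin.ForkJoinPool); the pool size and helper names are illustrative, not Spark's actual settings.

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// A minimal sketch: one shared pool for all parallel evaluations, so
// creating many RDDs concurrently does not spawn a huge number of threads.
object SharedPartitionEval {
  private val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(8))

  def evalInParallel[A, B](items: Seq[A])(f: A => B): Seq[B] = {
    val par = items.par
    par.tasksupport = taskSupport // reuse the shared pool per invocation
    par.map(f).seq
  }
}
```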
6f02f40 [SPARK-17354] [SQL] Partitioning by dates/timestamps should work with Parquet vectorized reader This PR fixes `ColumnVectorUtils.populate` so that Parquet vectorized reader can read partitioned table with dates/timestamps. This works fine with Parquet normal reader. This is being only called within [VectorizedParquetRecordReader.java#L185](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L185). When partition column types are explicitly given to `DateType` or `TimestampType` (rather than inferring the type of partition column), this fails with the exception below: ``` 16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6) java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362) ... ``` Unit tests in `SQLQuerySuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #14919 from HyukjinKwon/SPARK-17354. (cherry picked from commit f7d2143705c8c1baeed0bc62940f9dba636e705b) Signed-off-by: Davies Liu <davies.liu@gmail.com> 09 September 2016, 21:26:25 UTC
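A hedged repro of the scenario: a Parquet table partitioned by a date column, read back with an explicit `DateType` in the schema so the vectorized reader must populate the partition column itself. The path and column names are illustrative; a SparkSession named `spark` is assumed.

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("value", IntegerType)
  .add("day", DateType) // partition column with an explicitly given type

spark.range(10).selectExpr("cast(id as int) as value", "current_date() as day")
  .write.partitionBy("day").parquet("/tmp/dated_table")

// Before the fix, this read could fail in ColumnVectorUtils.populate with
// "java.lang.Integer cannot be cast to java.sql.Date".
spark.read.schema(schema).parquet("/tmp/dated_table").show()
```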
a7f1c18 [SPARK-17456][CORE] Utility for parsing Spark versions ## What changes were proposed in this pull request? This patch adds methods for extracting major and minor versions as Int types in Scala from a Spark version string. Motivation: There are many hacks within Spark's codebase to identify and compare Spark versions. We should add a simple utility to standardize these code paths, especially since there have been mistakes made in the past. This will let us add unit tests as well. Currently, I want this functionality to check Spark versions to provide backwards compatibility for ML model persistence. ## How was this patch tested? Unit tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #15017 from jkbradley/version-parsing. (cherry picked from commit 65b814bf50e92e2e9b622d1602f18bacd217181c) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 09 September 2016, 17:42:41 UTC
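A hedged sketch of such a utility: pull major/minor Ints out of a Spark version string, tolerating patch numbers and suffixes like "-SNAPSHOT". This is an illustrative stand-in, not Spark's exact implementation.

```scala
object VersionParsing {
  private val pattern = """^(\d+)\.(\d+).*""".r

  // Returns (major, minor), e.g. (2, 0) for "2.0.1" or "2.0.1-SNAPSHOT".
  def majorMinorVersion(version: String): (Int, Int) = version match {
    case pattern(major, minor) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(s"Cannot parse Spark version: $version")
  }
}
```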
c6e0dd1 [SPARK-17442][SPARKR] Additional arguments in write.df are not passed to data source ## What changes were proposed in this pull request? Additional options were not passed down in write.df. ## How was this patch tested? Unit tests. falaki shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15010 from felixcheung/testreadoptions. (cherry picked from commit f0d21b7f90cdcce353ab6fc279b9cc376e46e536) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 08 September 2016, 15:23:08 UTC
e169085 [SPARK-16711] YarnShuffleService doesn't re-init properly on YARN rolling upgrade This is the branch-2.0 version of this patch. The differences are in how YarnShuffleService finds the location to put the DB; branch-2.0 does not use the YARN NM recovery path like master does. Tested manually on an 8-node YARN cluster and ran unit tests. Manual tests verified the DB was created properly and was found if it already existed. Verified that during a rolling upgrade, credentials were reloaded and the running application was not affected. Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com> Closes #14997 from tgravescs/SPARK-16711-branch2.0. 08 September 2016, 13:16:19 UTC
28377da [SPARK-17339][CORE][BRANCH-2.0] Do not use path to get a filesystem in hadoopFile and newHadoopFile APIs ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/14960 ## How was this patch tested? AppVeyor - https://ci.appveyor.com/project/HyukjinKwon/spark/build/86-backport-SPARK-17339-r Author: hyukjinkwon <gurwls223@gmail.com> Closes #15008 from HyukjinKwon/backport-SPARK-17339. 08 September 2016, 04:22:32 UTC
067752c [SPARK-16533][CORE] - backport driver deadlock fix to 2.0 ## What changes were proposed in this pull request? Backport changes from #14710 and #14925 to 2.0 Author: Marcelo Vanzin <vanzin@cloudera.com> Author: Angus Gerry <angolon@gmail.com> Closes #14933 from angolon/SPARK-16533-2.0. 07 September 2016, 23:43:05 UTC
078ac0e [SPARK-17370] Shuffle service files not invalidated when a slave is lost ## What changes were proposed in this pull request? DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because when shuffle service is on, the shuffle file lifetime can exceed the executor lifetime. However, it also doesn't invalidate shuffle files when the shuffle service itself is lost (due to whole slave loss). This can cause long hangs when slaves are lost since the file loss is not detected until a subsequent stage attempts to read the shuffle files. The proposed fix is to also invalidate shuffle files when an executor is lost due to a `SlaveLost` event. ## How was this patch tested? Unit tests, also verified on an actual cluster that slave loss invalidates shuffle files immediately as expected. cc mateiz Author: Eric Liang <ekl@databricks.com> Closes #14931 from ericl/sc-4439. (cherry picked from commit 649fa4bf1d6fc9271ae56b6891bc93ebf57858d1) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 07 September 2016, 20:01:18 UTC
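The decision the scheduler now makes on executor loss can be summarized in a small sketch. The types and names below are illustrative stand-ins, not the DAGScheduler's actual internals.

```scala
// A hedged sketch: with the external shuffle service enabled, shuffle files
// normally outlive the executor -- unless the whole slave (and thus the
// service and its files) is gone.
sealed trait LossReason
case class SlaveLost(message: String) extends LossReason
case class ExecutorExited(exitCode: Int) extends LossReason

def shouldUnregisterShuffleOutputs(reason: LossReason,
                                   externalShuffleService: Boolean): Boolean =
  reason match {
    case SlaveLost(_) => true // host is gone: its shuffle files are gone too
    case _            => !externalShuffleService // service keeps files alive
  }
```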
e6caceb [MINOR][SQL] Fixing the typo in unit test ## What changes were proposed in this pull request? Fixing the typo in the unit test of CodeGenerationSuite.scala ## How was this patch tested? Ran the unit test after fixing the typo and it passes Author: Srinivasa Reddy Vundela <vsr@cloudera.com> Closes #14989 from vundela/typo_fix. (cherry picked from commit 76ad89e9241fb2dece95dd445661dd95ee4ef699) Signed-off-by: Sean Owen <sowen@cloudera.com> 07 September 2016, 11:41:14 UTC