https://github.com/apache/spark

cd0a083 Preparing Spark release v2.1.0-rc5 16 December 2016, 01:57:04 UTC
b23220f [MINOR] Handle fact that mv is different on linux, mac Follow up to https://github.com/apache/spark/commit/ae853e8f3bdbd16427e6f1ffade4f63abaf74abb as `mv` throws an error on the Jenkins machines if the source and destination are the same. Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #16302 from shivaram/sparkr-no-mv-fix. (cherry picked from commit 5a44f18a2a114bdd37b6714d81f88cb68148f0c9) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 16 December 2016, 01:13:43 UTC
62a6577 Preparing development version 2.1.1-SNAPSHOT 16 December 2016, 00:18:29 UTC
ec31726 Preparing Spark release v2.1.0-rc4 16 December 2016, 00:18:20 UTC
ae853e8 [MINOR] Only rename SparkR tar.gz if names mismatch ## What changes were proposed in this pull request? For release builds the R_PACKAGE_VERSION and VERSION are the same (e.g., 2.1.0). Thus `cp` throws an error which causes the build to fail. ## How was this patch tested? Manually by executing the following script ``` set -o pipefail set -e set -x touch a R_PACKAGE_VERSION=2.1.0 VERSION=2.1.0 if [ "$R_PACKAGE_VERSION" != "$VERSION" ]; then cp a a fi ``` Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #16299 from shivaram/sparkr-cp-fix. (cherry picked from commit 9634018c4d6d5a4f2c909f7227d91e637107b7f4) Signed-off-by: Reynold Xin <rxin@databricks.com> 16 December 2016, 00:15:56 UTC
08e4272 [SPARK-18868][FLAKY-TEST] Deflake StreamingQueryListenerSuite: single listener, check trigger... ## What changes were proposed in this pull request? Use `recentProgress` instead of `lastProgress` and filter out last non-zero value. Also add eventually to the latest assertQuery similar to first `assertQuery` ## How was this patch tested? Ran test 1000 times Author: Burak Yavuz <brkyvz@gmail.com> Closes #16287 from brkyvz/SPARK-18868. (cherry picked from commit 9c7f83b0289ba4550b156e6af31cf7c44580eb12) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 15 December 2016, 23:46:10 UTC
a7364a8 Preparing development version 2.1.1-SNAPSHOT 15 December 2016, 22:46:09 UTC
ef2ccf9 Preparing Spark release v2.1.0-rc3 15 December 2016, 22:46:00 UTC
b6a81f4 [SPARK-18888] partitionBy in DataStreamWriter in Python throws _to_seq not defined ## What changes were proposed in this pull request? `_to_seq` wasn't imported. ## How was this patch tested? Added partitionBy to existing write path unit test Author: Burak Yavuz <brkyvz@gmail.com> Closes #16297 from brkyvz/SPARK-18888. 15 December 2016, 22:28:29 UTC
900ce55 [SPARK-18826][SS] Add 'latestFirst' option to FileStreamSource ## What changes were proposed in this pull request? When starting a stream with a lot of backfill and maxFilesPerTrigger, the user could often want to start with most recent files first. This would let you keep low latency for recent data and slowly backfill historical data. This PR adds a new option `latestFirst` to control this behavior. When it's true, `FileStreamSource` will sort the files by the modified time from latest to oldest, and take the first `maxFilesPerTrigger` files as a new batch. ## How was this patch tested? The added test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #16251 from zsxwing/newest-first. (cherry picked from commit 68a6dc974b25e6eddef109f6fd23ae4e9775ceca) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 15 December 2016, 21:18:06 UTC
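A minimal sketch of how the `latestFirst` option described above combines with `maxFilesPerTrigger` when draining a large backfill; the input path and schema below are invented for illustration, not taken from the patch:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructType, TimestampType}

val spark = SparkSession.builder().appName("latest-first-sketch").getOrCreate()
val eventSchema = new StructType().add("id", LongType).add("ts", TimestampType)

val recentFirst = spark.readStream
  .schema(eventSchema)                 // file sources need an explicit schema
  .option("latestFirst", "true")       // newest files are taken first
  .option("maxFilesPerTrigger", "10")  // cap on files per micro-batch
  .json("/data/events")                // hypothetical backfill directory
```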
e430915 [SPARK-18870] Disallowed Distinct Aggregations on Streaming Datasets ## What changes were proposed in this pull request? Check whether Aggregation operators on a streaming subplan have aggregate expressions with isDistinct = true. ## How was this patch tested? Added unit test Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16289 from tdas/SPARK-18870. (cherry picked from commit 4f7292c87512a7da3542998d0e5aa21c27a511e9) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 15 December 2016, 19:54:47 UTC
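A hedged sketch of the query shape that this check now rejects on a streaming Dataset; `events` and the column names are hypothetical, and the console sink is used only to keep the example short:
```scala
import org.apache.spark.sql.functions.countDistinct

// `events` stands for a streaming DataFrame created with spark.readStream.
val distinctPages = events.groupBy("user").agg(countDistinct("page"))

// With this change, starting the query fails analysis instead of producing wrong results.
distinctPages.writeStream.outputMode("complete").format("console").start()
```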
2a8de2e [SPARK-18849][ML][SPARKR][DOC] vignettes final check update ## What changes were proposed in this pull request? doc cleanup ## How was this patch tested? ~~vignettes is not building for me. I'm going to kick off a full clean build and try again and attach output here for review.~~ Output html here: https://felixcheung.github.io/sparkr-vignettes.html Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16286 from felixcheung/rvignettespass. (cherry picked from commit 7d858bc5ce870a28a559f4e81dcfc54cbd128cb7) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 15 December 2016, 05:52:01 UTC
d399a29 [SPARK-18875][SPARKR][DOCS] Fix R API doc generation by adding `DESCRIPTION` file ## What changes were proposed in this pull request? Since Apache Spark 1.4.0, R API document page has a broken link on `DESCRIPTION file` because Jekyll plugin script doesn't copy the file. This PR aims to fix that. - Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html - Apache Spark 2.1.0-rc2: http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html ## How was this patch tested? Manual. ```bash cd docs SKIP_SCALADOC=1 jekyll build ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16292 from dongjoon-hyun/SPARK-18875. (cherry picked from commit ec0eae486331c3977505d261676b77a33c334216) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 15 December 2016, 05:29:30 UTC
b14fc39 [SPARK-18869][SQL] Add TreeNode.p that returns BaseType ## What changes were proposed in this pull request? After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather than a more specific type. It would be easier for interactive debugging to introduce a function that returns the BaseType. ## How was this patch tested? N/A - this is a developer only feature used for interactive debugging. As long as it compiles, it should be good to go. I tested this in spark-shell. Author: Reynold Xin <rxin@databricks.com> Closes #16288 from rxin/SPARK-18869. (cherry picked from commit 5d510c693aca8c3fd3364b4453160bc8585ffc8e) Signed-off-by: Reynold Xin <rxin@databricks.com> 15 December 2016, 05:08:51 UTC
cb2c842 [SPARK-18856][SQL] non-empty partitioned table should not report zero size ## What changes were proposed in this pull request? In `DataSource`, if the table is not analyzed, we will use 0 as the default value for table size. This is dangerous, we may broadcast a large table and cause OOM. We should use `defaultSizeInBytes` instead. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #16280 from cloud-fan/bug. (cherry picked from commit d6f11a12a146a863553c5a5e2023d79d4375ef3f) Signed-off-by: Reynold Xin <rxin@databricks.com> 15 December 2016, 05:04:06 UTC
0d94201 [SPARK-18865][SPARKR] SparkR vignettes MLP and LDA updates ## What changes were proposed in this pull request? When do the QA work, I found that the following issues: 1). `spark.mlp` doesn't include an example; 2). `spark.mlp` and `spark.lda` have redundant parameter explanations; 3). `spark.lda` document misses default values for some parameters. I also changed the `spark.logit` regParam in the examples, as we discussed in #16222. ## How was this patch tested? Manual test Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16284 from wangmiao1981/ks. (cherry picked from commit 324388531648de20ee61bd42518a068d4789925c) Signed-off-by: Felix Cheung <felixcheung@apache.org> 15 December 2016, 01:07:39 UTC
280c35a [SPARK-18854][SQL] numberedTreeString and apply(i) inconsistent for subqueries ## What changes were proposed in this pull request? This is a bug introduced by subquery handling. numberedTreeString (which uses generateTreeString under the hood) numbers trees including innerChildren (used to print subqueries), but apply (which uses getNodeNumbered) ignores innerChildren. As a result, apply(i) would return the wrong plan node if there are subqueries. This patch fixes the bug. ## How was this patch tested? Added a test case in SubquerySuite.scala to test both the depth-first traversal of numbering as well as making sure the two methods are consistent. Author: Reynold Xin <rxin@databricks.com> Closes #16277 from rxin/SPARK-18854. (cherry picked from commit ffdd1fcd1e8f4f6453d5b0517c0ce82766b8e75f) Signed-off-by: Reynold Xin <rxin@databricks.com> 15 December 2016, 00:12:20 UTC
d0d9c57 [SPARK-18795][ML][SPARKR][DOC] Added KSTest section to SparkR vignettes ## What changes were proposed in this pull request? Added short section for KSTest. Also added logreg model to list of ML models in vignette. (This will be reorganized under SPARK-18849) ![screen shot 2016-12-14 at 1 37 31 pm](https://cloud.githubusercontent.com/assets/5084283/21202140/7f24e240-c202-11e6-9362-458208bb9159.png) ## How was this patch tested? Manually tested example locally. Built vignettes locally. Author: Joseph K. Bradley <joseph@databricks.com> Closes #16283 from jkbradley/ksTest-vignette. (cherry picked from commit 78627425708a0afbe113efdf449e8622b43b652d) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 14 December 2016, 22:10:48 UTC
c4de90f [SPARK-18852][SS] StreamingQuery.lastProgress should be null when recentProgress is empty ## What changes were proposed in this pull request? Right now `StreamingQuery.lastProgress` throws NoSuchElementException and it's hard to be used in Python since Python user will just see Py4jError. This PR just makes it return null instead. ## How was this patch tested? `test("lastProgress should be null when recentProgress is empty")` Author: Shixiong Zhu <shixiong@databricks.com> Closes #16273 from zsxwing/SPARK-18852. (cherry picked from commit 1ac6567bdb03d7cc5c5f3473827a102280cb1030) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 14 December 2016, 21:36:55 UTC
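With `lastProgress` now returning null instead of throwing, the caller-side pattern becomes a plain null check; a minimal sketch, with `query` standing in for any running StreamingQuery:
```scala
val progress = query.lastProgress   // null until at least one trigger has reported progress
if (progress != null) {
  println(progress.prettyJson)
} else {
  println("no progress reported yet")
}
```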
e8866f9 [SPARK-18853][SQL] Project (UnaryNode) is way too aggressive in estimating statistics ## What changes were proposed in this pull request? This patch reduces the default number element estimation for arrays and maps from 100 to 1. The issue with the 100 number is that when nested (e.g. an array of map), 100 * 100 would be used as the default size. This sounds like just an overestimation which doesn't seem that bad (since it is usually better to overestimate than underestimate). However, due to the way we assume the size output for Project (new estimated column size / old estimated column size), this overestimation can become underestimation. It is actually in general in this case safer to assume 1 default element. ## How was this patch tested? This should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #16274 from rxin/SPARK-18853. (cherry picked from commit 5d799473696a15fddd54ec71a93b6f8cb169810c) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 14 December 2016, 20:23:01 UTC
af12a21 [SPARK-18753][SQL] Keep pushed-down null literal as a filter in Spark-side post-filter for FileFormat datasources ## What changes were proposed in this pull request? Currently, `FileSourceStrategy` does not handle the case when the pushed-down filter is `Literal(null)` and removes it at the post-filter in Spark-side. For example, the codes below: ```scala val df = Seq(Tuple1(Some(true)), Tuple1(None), Tuple1(Some(false))).toDF() df.filter($"_1" === "true").explain(true) ``` shows it keeps `null` properly. ``` == Parsed Logical Plan == 'Filter ('_1 = true) +- LocalRelation [_1#17] == Analyzed Logical Plan == _1: boolean Filter (cast(_1#17 as double) = cast(true as double)) +- LocalRelation [_1#17] == Optimized Logical Plan == Filter (isnotnull(_1#17) && null) +- LocalRelation [_1#17] == Physical Plan == *Filter (isnotnull(_1#17) && null) << Here `null` is there +- LocalTableScan [_1#17] ``` However, when we read it back from Parquet, ```scala val path = "/tmp/testfile" df.write.parquet(path) spark.read.parquet(path).filter($"_1" === "true").explain(true) ``` `null` is removed at the post-filter. ``` == Parsed Logical Plan == 'Filter ('_1 = true) +- Relation[_1#11] parquet == Analyzed Logical Plan == _1: boolean Filter (cast(_1#11 as double) = cast(true as double)) +- Relation[_1#11] parquet == Optimized Logical Plan == Filter (isnotnull(_1#11) && null) +- Relation[_1#11] parquet == Physical Plan == *Project [_1#11] +- *Filter isnotnull(_1#11) << Here `null` is missing +- *FileScan parquet [_1#11] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/tmp/testfile], PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean> ``` This PR fixes it to keep it properly. In more details, ```scala val partitionKeyFilters = ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet))) ``` This keeps this `null` in `partitionKeyFilters` as `Literal` always don't have `children` and `references` is being empty which is always the subset of `partitionSet`. And then in ```scala val afterScanFilters = filterSet -- partitionKeyFilters ``` `null` is always removed from the post filter. So, if the referenced fields are empty, it should be applied into data columns too. After this PR, it becomes as below: ``` == Parsed Logical Plan == 'Filter ('_1 = true) +- Relation[_1#276] parquet == Analyzed Logical Plan == _1: boolean Filter (cast(_1#276 as double) = cast(true as double)) +- Relation[_1#276] parquet == Optimized Logical Plan == Filter (isnotnull(_1#276) && null) +- Relation[_1#276] parquet == Physical Plan == *Project [_1#276] +- *Filter (isnotnull(_1#276) && null) +- *FileScan parquet [_1#276] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-a5d59bdb-5b..., PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean> ``` ## How was this patch tested? Unit test in `FileSourceStrategySuite` Author: hyukjinkwon <gurwls223@gmail.com> Closes #16184 from HyukjinKwon/SPARK-18753. (cherry picked from commit 89ae26dcdb73266fbc3a8b6da9f5dff30dc4ec95) Signed-off-by: Cheng Lian <lian@databricks.com> 14 December 2016, 19:29:23 UTC
16d4bd4 [SPARK-18730] Post Jenkins test report page instead of the full console output page to GitHub ## What changes were proposed in this pull request? Currently, the full console output page of a Spark Jenkins PR build can be as large as several megabytes. It takes a relatively long time to load and may even freeze the browser for quite a while. This PR makes the build script to post the test report page link to GitHub instead. The test report page is way more concise and is usually the first page I'd like to check when investigating a Jenkins build failure. Note that for builds that a test report is not available (ongoing builds and builds that fail before test execution), the test report link automatically redirects to the build page. ## How was this patch tested? N/A. Author: Cheng Lian <lian@databricks.com> Closes #16163 from liancheng/jenkins-test-report. (cherry picked from commit ba4aab9b85688141d3d0c185165ec7a402c9fbba) Signed-off-by: Reynold Xin <rxin@databricks.com> 14 December 2016, 18:57:08 UTC
f999312 [SPARK-18814][SQL] CheckAnalysis rejects TPCDS query 32 ## What changes were proposed in this pull request? Move the checking of GROUP BY column in correlated scalar subquery from CheckAnalysis to Analysis to fix a regression caused by SPARK-18504. This problem can be reproduced with a simple script now. Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p") Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c") sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 where c1.ck = p.pk)").show The requirements are: 1. We need to reference the same table twice in both the parent and the subquery. Here is the table c. 2. We need to have a correlated predicate but to a different table. Here is from c (as c1) in the subquery to p in the parent. 3. We will then "deduplicate" c1.ck in the subquery to `ck#<n1>#<n2>` at `Project` above `Aggregate` of `avg`. Then when we compare `ck#<n1>#<n2>` and the original group by column `ck#<n1>` by their canonicalized form, which is #<n2> != #<n1>. That's how we trigger the exception added in SPARK-18504. ## How was this patch tested? SubquerySuite and a simplified version of TPCDS-Q32 Author: Nattavut Sutyanyong <nsy.can@gmail.com> Closes #16246 from nsyca/18814. (cherry picked from commit cccd64393ea633e29d4a505fb0a7c01b51a79af8) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 14 December 2016, 10:09:45 UTC
8ef0059 [MINOR][SPARKR] fix kstest example error and add unit test ## What changes were proposed in this pull request? While adding vignettes for kstest, I found some errors in the example: 1. There is a typo of kstest; 2. print.summary.KStest doesn't work with the example; Fix the example errors; Add a new unit test for print.summary.KStest; ## How was this patch tested? Manual test; Add new unit test; Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16259 from wangmiao1981/ks. (cherry picked from commit f2ddabfa09fda26ff0391d026dd67545dab33e01) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 14 December 2016, 02:52:22 UTC
019d1fa [SPARK-18588][TESTS] Ignore KafkaSourceStressForDontFailOnDataLossSuite ## What changes were proposed in this pull request? Disable KafkaSourceStressForDontFailOnDataLossSuite for now. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16275 from zsxwing/ignore-flaky-test. (cherry picked from commit e104e55c16e229e521c517393b8163cbc3bbf85a) Signed-off-by: Reynold Xin <rxin@databricks.com> 14 December 2016, 02:36:42 UTC
5693ac8 [SPARK-18793][SPARK-18794][R] add spark.randomForest/spark.gbt to vignettes ## What changes were proposed in this pull request? Mention `spark.randomForest` and `spark.gbt` in vignettes. Keep the content minimal since users can type `?spark.randomForest` to see the full doc. cc: jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #16264 from mengxr/SPARK-18793. (cherry picked from commit 594b14f1ebd0b3db9f630e504be92228f11b4d9f) Signed-off-by: Xiangrui Meng <meng@databricks.com> 14 December 2016, 00:59:15 UTC
25b9758 [SPARK-18834][SS] Expose event time stats through StreamingQueryProgress ## What changes were proposed in this pull request? - Changed `StreamingQueryProgress.watermark` to `StreamingQueryProgress.queryTimestamps` which is a `Map[String, String]` containing the following keys: "eventTime.max", "eventTime.min", "eventTime.avg", "processingTime", "watermark". All of them UTC formatted strings. - Renamed `StreamingQuery.timestamp` to `StreamingQueryProgress.triggerTimestamp` to differentiate from `queryTimestamps`. It has the timestamp of when the trigger was started. ## How was this patch tested? Updated tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16258 from tdas/SPARK-18834. (cherry picked from commit c68fb426d4ac05414fb402aa1f30f4c98df103ad) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 13 December 2016, 22:15:15 UTC
f672bfd [SPARK-18843][CORE] Fix timeout in awaitResultInForkJoinSafely (branch 2.1, 2.0) ## What changes were proposed in this pull request? This PR fixes the timeout value in `awaitResultInForkJoinSafely` for 2.1 and 2.0. Master has been fixed by https://github.com/apache/spark/pull/16230. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16268 from zsxwing/SPARK-18843. 13 December 2016, 22:09:25 UTC
292a37f [SPARK-18816][WEB UI] Executors Logs column only ran visibility check on initial table load ## What changes were proposed in this pull request? When I added a visibility check for the logs column on the executors page in #14382 the method I used only ran the check on the initial DataTable creation and not subsequent page loads. I moved the check out of the table definition and instead it runs on each page load. The jQuery DataTable functionality used is the same. ## How was this patch tested? Tested Manually No visible UI changes to screenshot. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #16256 from ajbozarth/spark18816. (cherry picked from commit aebf44e50b6b04b848829adbbe08b0f74f31eb32) Signed-off-by: Sean Owen <sowen@cloudera.com> 13 December 2016, 21:38:04 UTC
d5c4a5d [SPARK-18840][YARN] Avoid throw exception when getting token renewal interval in non HDFS security environment ## What changes were proposed in this pull request? Fix `java.util.NoSuchElementException` when running Spark in non-hdfs security environment. In the current code, we assume `HDFS_DELEGATION_KIND` token will be found in Credentials. But in some cloud environments, HDFS is not required, so we should avoid this exception. ## How was this patch tested? Manually verified in local environment. Author: jerryshao <sshao@hortonworks.com> Closes #16265 from jerryshao/SPARK-18840. (cherry picked from commit 43298d157d58d5d03ffab818f8cdfc6eac783c55) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 13 December 2016, 18:37:56 UTC
207107b [SPARK-18835][SQL] Don't expose Guava types in the JavaTypeInference API. This avoids issues during maven tests because of shading. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #16260 from vanzin/SPARK-18835. (cherry picked from commit f280ccf449f62a00eb4042dfbcf7a0715850fd4c) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 13 December 2016, 18:02:29 UTC
9f0e3be [SPARK-18797][SPARKR] Update spark.logit in sparkr-vignettes ## What changes were proposed in this pull request? spark.logit is added in 2.1. We need to update spark-vignettes to reflect the changes. This is part of SparkR QA work. ## How was this patch tested? Manual build html. Please see attached image for the result. ![test](https://cloud.githubusercontent.com/assets/5033592/21032237/01b565fe-bd5d-11e6-8b59-4de4b6ef611d.jpeg) Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16222 from wangmiao1981/veg. (cherry picked from commit 2aa16d03db79a642cbe21f387441c34fc51a8236) Signed-off-by: Xiangrui Meng <meng@databricks.com> 13 December 2016, 06:41:20 UTC
9dc5fa5 [SPARK-18796][SS] StreamingQueryManager should not block when starting a query ## What changes were proposed in this pull request? Major change in this PR: - Add `pendingQueryNames` and `pendingQueryIds` to track that are going to start but not yet put into `activeQueries` so that we don't need to hold a lock when starting a query. Minor changes: - Fix a potential NPE when the user sets `checkpointLocation` using SQLConf but doesn't specify a query name. - Add missing docs in `StreamingQueryListener` ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16220 from zsxwing/SPARK-18796. (cherry picked from commit 417e45c58484a6b984ad2ce9ba8f47aa0a9983fd) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 13 December 2016, 06:31:42 UTC
1aeb7f4 [SPARK-18810][SPARKR] SparkR install.spark does not work for RCs, snapshots ## What changes were proposed in this pull request? Support overriding the download url (include version directory) in an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL` ## How was this patch tested? unit test, manually testing - snapshot build url - download when spark jar not cached - when spark jar is cached - RC build url - download when spark jar not cached - when spark jar is cached - multiple cached spark versions - starting with sparkR shell To use this, ``` SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R ``` then in R, ``` library(SparkR) # or specify lib.loc sparkR.session() ``` Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16248 from felixcheung/rinstallurl. (cherry picked from commit 8a51cfdcad5f8397558ed2e245eb03650f37ce66) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 12 December 2016, 22:40:52 UTC
523071f [SPARK-18681][SQL] Fix filtering to compatible with partition keys of type int ## What changes were proposed in this pull request? Cloudera put `/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml` as the configuration file for the Hive Metastore Server, where `hive.metastore.try.direct.sql=false`. But Spark isn't reading this configuration file and get default value `hive.metastore.try.direct.sql=true`. As mallman said, we should use `getMetaConf` method to obtain the original configuration from Hive Metastore Server. I have tested this method few times and the return value is always consistent with Hive Metastore Server. ## How was this patch tested? The existing tests. Author: Yuming Wang <wgyumg@gmail.com> Closes #16122 from wangyum/SPARK-18681. (cherry picked from commit 90abfd15f4b3f612a7b0ff65f03bf319c78a0243) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 12 December 2016, 22:39:13 UTC
3501160 [DOCS][MINOR] Clarify Where AccumulatorV2s are Displayed ## What changes were proposed in this pull request? This PR clarifies where accumulators will be displayed. ## How was this patch tested? No testing. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Bill Chambers <bill@databricks.com> Author: anabranch <wac.chambers@gmail.com> Author: Bill Chambers <wchambers@ischool.berkeley.edu> Closes #16180 from anabranch/improve-acc-docs. (cherry picked from commit 70ffff21f769b149bee787fe5901d9844a4d97b8) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 December 2016, 13:33:32 UTC
63693c1 [SPARK-18790][SS] Keep a general offset history of stream batches ## What changes were proposed in this pull request? Instead of only keeping the minimum number of offsets around, we should keep enough information to allow us to roll back n batches and re-execute the stream starting from a given point. In particular, we should create a config in SQLConf, spark.sql.streaming.retainedBatches, that defaults to 100 and ensure that we keep enough log files in the following places to roll back the specified number of batches: the offsets that are present in each batch, versions of the state store, the file lists stored for the FileStreamSource, and the metadata log stored by the FileStreamSink. marmbrus zsxwing ## How was this patch tested? The following tests were added. ### StreamExecution offset metadata Test added to StreamingQuerySuite that ensures offset metadata is garbage collected according to minBatchesRetain ### CompactibleFileStreamLog Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged starting before the first compaction file that precedes the current batch id - minBatchesToRetain. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Tyson Condie <tcondie@gmail.com> Closes #16219 from tcondie/offset_hist. (cherry picked from commit 83a42897ae90d84a54373db386a985e3e2d5903a) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 12 December 2016, 07:38:57 UTC
d5f1416 [SPARK-18628][ML] Update Scala param and Python param to have quotes ## What changes were proposed in this pull request? Updated Scala param and Python param to have quotes around the options making it easier for users to read. ## How was this patch tested? Manually checked the docstrings Author: krishnakalyan3 <krishnakalyan3@gmail.com> Closes #16242 from krishnakalyan3/doc-string. (cherry picked from commit c802ad87182520662be51eb611ea1c64f4874c4e) Signed-off-by: Sean Owen <sowen@cloudera.com> 11 December 2016, 09:28:26 UTC
d4c03f8 [SQL][MINOR] simplify a test to fix the maven tests ## What changes were proposed in this pull request? After https://github.com/apache/spark/pull/15620 , all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in https://github.com/apache/spark/pull/15620#discussion_r91829129 , it seems that the regression test is an overkill and may hit constants pool size limitation, which is a known issue and hasn't been fixed yet. Since #15620 only fix the code size limitation problem, we can simplify the test to avoid hitting constants pool size limitation. ## How was this patch tested? test only change Author: Wenchen Fan <wenchen@databricks.com> Closes #16244 from cloud-fan/minor. (cherry picked from commit 9abd05b6b94eda31c47bce1f913af988c35f1cb1) Signed-off-by: Sean Owen <sowen@cloudera.com> 11 December 2016, 09:12:55 UTC
de21ca4 [SPARK-18815][SQL] Fix NPE when collecting column stats for string/binary column having only null values ## What changes were proposed in this pull request? During column stats collection, average and max length will be null if a column of string/binary type has only null values. To fix this, I use default size when avg/max length is null. ## How was this patch tested? Add a test for handling null columns Author: wangzhenhua <wangzhenhua@huawei.com> Closes #16243 from wzhfy/nullStats. (cherry picked from commit a29ee55aaadfe43ac9abb0eaf8b022b1e6d7babb) Signed-off-by: Reynold Xin <rxin@databricks.com> 11 December 2016, 05:25:35 UTC
5151daf [SPARK-3359][DOCS] Fix greater-than symbols in Javadoc to allow building with Java 8 ## What changes were proposed in this pull request? The API documentation build was failing when using Java 8 due to incorrect character `>` in Javadoc. Replace `>` with literals in Javadoc to allow the build to pass. ## How was this patch tested? Documentation was built and inspected manually to ensure it still displays correctly in the browser ``` cd docs && jekyll serve ``` Author: Michal Senkyr <mike.senkyr@gmail.com> Closes #16201 from michalsenkyr/javadoc8-gt-fix. (cherry picked from commit 114324832abce1fbb2c5f5b84a66d39dd2d4398a) Signed-off-by: Sean Owen <sowen@cloudera.com> 10 December 2016, 19:54:17 UTC
83822df [MINOR][DOCS] Remove Apache Spark Wiki address ## What changes were proposed in this pull request? According to the notice of the following Wiki front page, we can remove the obsolete wiki pointer safely in `README.md` and `docs/index.md`, too. These two lines are the last occurrence of that links. ``` All current wiki content has been merged into pages at http://spark.apache.org as of November 2016. Each page links to the new location of its information on the Spark web site. Obsolete wiki content is still hosted here, but carries a notice that it is no longer current. ``` ## How was this patch tested? Manual. - `README.md`: https://github.com/dongjoon-hyun/spark/tree/remove_wiki_from_readme - `docs/index.md`: ``` cd docs SKIP_API=1 jekyll build ``` ![screen shot 2016-12-09 at 2 53 29 pm](https://cloud.githubusercontent.com/assets/9700541/21067323/517252e2-be1f-11e6-85b1-2a4471131c5d.png) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16239 from dongjoon-hyun/remove_wiki_from_readme. (cherry picked from commit f3a3fed76cb74ecd0f46031f337576ce60f54fb2) Signed-off-by: Sean Owen <sowen@cloudera.com> 10 December 2016, 16:40:25 UTC
2b36f49 [SPARK-17460][SQL] Make sure sizeInBytes in Statistics will not overflow ## What changes were proposed in this pull request? 1. In SparkStrategies.canBroadcast, I will add the check plan.statistics.sizeInBytes >= 0 2. In LocalRelations.statistics, when calculate the statistics, I will change the size to BigInt so it won't overflow. ## How was this patch tested? I will add a test case to make sure the statistics.sizeInBytes won't overflow. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #16175 from huaxingao/spark-17460. (cherry picked from commit c5172568b59b4cf1d3dc7ed8c17a9bea2ea2ab79) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 December 2016, 14:42:11 UTC
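A standalone illustration (not the Spark code itself) of the Long overflow that motivates switching `sizeInBytes` to BigInt; the sizes are deliberately exaggerated:
```scala
val bytesPerRow = 1L << 40
val rowCount    = 1L << 30

println(bytesPerRow * rowCount)                 // 2^70 wraps around in a Long and prints 64
println(BigInt(bytesPerRow) * BigInt(rowCount)) // exact product, so the statistic stays non-negative
```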
b020ce4 [SPARK-18811] StreamSource resolution should happen in stream execution thread ## What changes were proposed in this pull request? When you start a stream, if we are trying to resolve the source of the stream, for example if we need to resolve partition columns, this could take a long time. This long execution time should not block the main thread where `query.start()` was called on. It should happen in the stream execution thread possibly before starting any triggers. ## How was this patch tested? Unit test added. Made sure test fails with no code changes. Author: Burak Yavuz <brkyvz@gmail.com> Closes #16238 from brkyvz/SPARK-18811. (cherry picked from commit 63c9159870ee274c68e24360594ca01d476b9ace) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 10 December 2016, 06:50:10 UTC
8bf56cc [SPARK-18807][SPARKR] Should suppress output print for calls to JVM methods with void return values ## What changes were proposed in this pull request? Several SparkR API calling into JVM methods that have void return values are getting printed out, especially when running in a REPL or IDE. example: ``` > setLogLevel("WARN") NULL ``` We should fix this to make the result more clear. Also found a small change to return value of dropTempView in 2.1 - adding doc and test for it. ## How was this patch tested? manually - I didn't find a expect_*() method in testthat for this Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16237 from felixcheung/rinvis. (cherry picked from commit 3e11d5bfef2f05bd6d42c4d6188eae6d63c963ef) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 10 December 2016, 03:06:28 UTC
e45345d [SPARK-18812][MLLIB] explain "Spark ML" ## What changes were proposed in this pull request? There has been some confusion around "Spark ML" vs. "MLlib". This PR adds some FAQ-like entries to the MLlib user guide to explain "Spark ML" and reduce the confusion. I check the [Spark FAQ page](http://spark.apache.org/faq.html), which seems too high-level for the content here. So I added it to the MLlib user guide instead. cc: mateiz Author: Xiangrui Meng <meng@databricks.com> Closes #16241 from mengxr/SPARK-18812. (cherry picked from commit d2493a203e852adf63dde4e1fc993e8d11efec3d) Signed-off-by: Xiangrui Meng <meng@databricks.com> 10 December 2016, 01:34:58 UTC
562507e [SPARK-18745][SQL] Fix signed integer overflow due to toInt cast ## What changes were proposed in this pull request? This PR avoids that a result of a cast `toInt` is negative due to signed integer overflow (e.g. 0x0000_0000_1???????L.toInt < 0 ). This PR performs casts after we can ensure the value is within range of signed integer (the result of `max(array.length, ???)` is always integer). ## How was this patch tested? Manually executed query68 of TPC-DS with 100TB Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #16235 from kiszk/SPARK-18745. (cherry picked from commit d60ab5fd9b6af9aa5080a2d13b3589d8b79c5c5c) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 09 December 2016, 22:13:50 UTC
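A standalone illustration (not the query68 code) of the signed-overflow class described above, plus the fix pattern of casting only after the value is known to fit in an Int:
```scala
val n: Long = 0x1FFFFFFFFL                       // larger than Int.MaxValue
println(n.toInt)                                 // -1: the narrowing cast silently overflows
println(math.min(n, Int.MaxValue.toLong).toInt)  // bound the value first, then cast safely
```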
eb2d9bf [MINOR][SPARKR] Fix SparkR regex in copy command Fix SparkR package copy regex. The existing code leads to ``` Copying release tarballs to /home/****/public_html/spark-nightly/spark-branch-2.1-bin/spark-2.1.1-SNAPSHOT-2016_12_08_22_38-e8f351f-bin mput: SparkR-*: no files found ``` Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #16231 from shivaram/typo-sparkr-build. (cherry picked from commit be5fc6ef72c7eb586b184b0f42ac50ef32843208) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 09 December 2016, 18:13:05 UTC
0c6415a [SPARK-17822][R] Make JVMObjectTracker a member variable of RBackend ## What changes were proposed in this pull request? * This PR changes `JVMObjectTracker` from `object` to `class` and let its instance associated with each RBackend. So we can manage the lifecycle of JVM objects when there are multiple `RBackend` sessions. `RBackend.close` will clear the object tracker explicitly. * I assume that `SQLUtils` and `RRunner` do not need to track JVM instances, which could be wrong. * Small refactor of `SerDe.sqlSerDe` to increase readability. ## How was this patch tested? * Added unit tests for `JVMObjectTracker`. * Wait for Jenkins to run full tests. Author: Xiangrui Meng <meng@databricks.com> Closes #16154 from mengxr/SPARK-17822. (cherry picked from commit fd48d80a6145ea94f03e7fc6e4d724a0fbccac58) Signed-off-by: Xiangrui Meng <meng@databricks.com> 09 December 2016, 15:51:58 UTC
b226f10 [MINOR][CORE][SQL][DOCS] Typo fixes ## What changes were proposed in this pull request? Typo fixes ## How was this patch tested? Local build. Awaiting the official build. Author: Jacek Laskowski <jacek@japila.pl> Closes #16144 from jaceklaskowski/typo-fixes. (cherry picked from commit b162cc0c2810c1a9fa2eee8e664ffae84f9eea11) Signed-off-by: Sean Owen <sowen@cloudera.com> 09 December 2016, 10:46:32 UTC
72bf519 [SPARK-18637][SQL] Stateful UDF should be considered as nondeterministic Make stateful udf as nondeterministic Add new test cases with both Stateful and Stateless UDF. Without the patch, the test cases will throw exception: 1 did not equal 10 ScalaTestFailureLocation: org.apache.spark.sql.hive.execution.HiveUDFSuite$$anonfun$21 at (HiveUDFSuite.scala:501) org.scalatest.exceptions.TestFailedException: 1 did not equal 10 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) ... Author: Zhan Zhang <zhanzhang@fb.com> Closes #16068 from zhzhan/state. (cherry picked from commit 67587d961d5f94a8639c20cb80127c86bf79d5a8) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 December 2016, 08:36:35 UTC
2c88e1d Copy pyspark and SparkR packages to latest release dir too ## What changes were proposed in this pull request? Copy pyspark and SparkR packages to latest release dir, as per comment [here](https://github.com/apache/spark/pull/16226#discussion_r91664822) Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16227 from felixcheung/pyrftp. (cherry picked from commit c074c96dc57bf18b28fafdcac0c768d75c642cba) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 09 December 2016, 06:53:02 UTC
e8f351f Copy the SparkR source package with LFTP This PR adds a line in release-build.sh to copy the SparkR source archive using LFTP Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #16226 from shivaram/fix-sparkr-copy-build. (cherry picked from commit 934035ae7cb648fe61665d8efe0b7aa2bbe4ca47) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 09 December 2016, 06:21:36 UTC
4ceed95 [SPARK-18349][SPARKR] Update R API documentation on ml model summary ## What changes were proposed in this pull request? In this PR, the document of `summary` method is improved in the format: returns summary information of the fitted model, which is a list. The list includes ....... Since `summary` in R is mainly about the model, which is not the same as `summary` object on scala side, if there is one, the scala API doc is not pointed here. In current document, some `return` have `.` and some don't have. `.` is added to missed ones. Since spark.logit `summary` has a big refactoring, this PR doesn't include this one. It will be changed when the `spark.logit` PR is merged. ## How was this patch tested? Manual build. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16150 from wangmiao1981/audit2. (cherry picked from commit 86a96034ccb47c5bba2cd739d793240afcfc25f6) Signed-off-by: Felix Cheung <felixcheung@apache.org> 09 December 2016, 06:08:51 UTC
ef5646b [SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip tar.gz from distribution ## What changes were proposed in this pull request? Fixes name of R source package so that the `cp` in release-build.sh works correctly. Issue discussed in https://github.com/apache/spark/pull/16014#issuecomment-265867125 Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #16221 from shivaram/fix-sparkr-release-build-name. (cherry picked from commit 4ac8b20bf2f962d9b8b6b209468896758d49efe3) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 09 December 2016, 02:27:05 UTC
1cafc76 [SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles is enabled (branch 2.1) ## What changes were proposed in this pull request? Backport #16203 to branch 2.1. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16216 from zsxwing/SPARK-18774-2.1. 09 December 2016, 01:58:44 UTC
fcd22e5 [SPARK-18776][SS] Make Offset for FileStreamSource correctly formatted in json ## What changes were proposed in this pull request? - Changed FileStreamSource to use the new FileStreamSourceOffset rather than LongOffset. The field is named `logOffset` to make it more clear that this is an offset in the file stream log. - Fixed a bug in FileStreamSourceLog: the field endId in FileStreamSourceLog.get(startId, endId) was not being used at all. No test caught it earlier; only my updated tests caught it. Other minor changes - Don't use batchId in the FileStreamSource, as calling it batch id is extremely misleading. With multiple sources, it may happen that a new batch has no new data from a file source, so the offset of FileStreamSource != batchId after that batch. ## How was this patch tested? Updated unit test. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16205 from tdas/SPARK-18776. (cherry picked from commit 458fa3325e5f8c21c50e406ac8059d6236f93a9c) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 09 December 2016, 01:53:45 UTC
e43209f [SPARK-18590][SPARKR] Change the R source build to Hadoop 2.6 This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without hadoop profile which leads to an error as discussed in https://github.com/apache/spark/pull/16014#issuecomment-265843991 Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #16218 from shivaram/fix-sparkr-release-build. (cherry picked from commit 202fcd21ce01393fa6dfaa1c2126e18e9b85ee96) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 08 December 2016, 21:01:54 UTC
9483242 [SPARK-18760][SQL] Consistent format specification for FileFormats ## What changes were proposed in this pull request? This patch fixes the format specification in explain for file sources (Parquet and Text formats are the only two that are different from the rest): Before: ``` scala> spark.read.text("test.text").explain() == Physical Plan == *FileScan text [value#15] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormatxyz, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string> ``` After: ``` scala> spark.read.text("test.text").explain() == Physical Plan == *FileScan text [value#15] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string> ``` Also closes #14680. ## How was this patch tested? Verified in spark-shell. Author: Reynold Xin <rxin@databricks.com> Closes #16187 from rxin/SPARK-18760. (cherry picked from commit 5f894d23a54ea99f75f8b722e111e5270f7f80cf) Signed-off-by: Reynold Xin <rxin@databricks.com> 08 December 2016, 20:52:21 UTC
a035644 [SPARK-18751][CORE] Fix deadlock when SparkContext.stop is called in Utils.tryOrStopSparkContext ## What changes were proposed in this pull request? When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit. - ContextCleaner.keepCleaning - LiveListenerBus.listenerThread.run - TaskSchedulerImpl.start This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in #15775 since they are not necessary now. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16178 from zsxwing/fix-stop-deadlock. (cherry picked from commit 26432df9cc6ffe569583aa628c6ecd7050b38316) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 08 December 2016, 19:54:10 UTC
d69df90 [SPARK-18590][SPARKR] build R source package when making distribution This PR has 2 key changes. One, we are building source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not) But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below. This PR also includes a few minor fixes. These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) on what's going to a CRAN release, which is now run during make-distribution.sh. 1. package needs to be installed because the first code block in vignettes is `library(SparkR)` without lib path 2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation) 3. `R CMD check` on the source package will install package and build vignettes again (this time from source packaged) - this is a key step required to release R package on CRAN (will skip tests here but tests will need to pass for CRAN release process to success - ideally, during release signoff we should install from the R source package and run tests) 4. `R CMD Install` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step # 1) (the output of this step is what we package into Spark dist and sparkr.zip) Alternatively, R CMD build should already be installing the package in a temp directory though it might just be finding this location and set it to lib.loc parameter; another approach is perhaps we could try calling `R CMD INSTALL --build pkg` instead. But in any case, despite installing the package multiple times this is relatively fast. Building vignettes takes a while though. Manually, CI. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16014 from felixcheung/rdist. (cherry picked from commit c3d3a9d0e85b834abef87069e4edd27db87fc607) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 08 December 2016, 19:31:24 UTC
e0173f1 [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records ## What changes were proposed in this pull request? Fixes a bug in the python implementation of rdd cartesian product related to batching that showed up in repeated cartesian products with seemingly random results. The root cause being multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching. `CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks. `PairDeserializer` no longer extends `CartesianDeserializer` as it was not really proper. If wanted a new common super class could be added. Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization. ## How was this patch tested? Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip. Author: Andrew Ray <ray.andrew@gmail.com> Closes #16121 from aray/fix-cartesian. (cherry picked from commit 3c68944b229aaaeeaee3efcbae3e3be9a2914855) Signed-off-by: Davies Liu <davies.liu@gmail.com> 08 December 2016, 19:08:27 UTC
726217e [SPARK-18667][PYSPARK][SQL] Change the way to group row in BatchEvalPythonExec so input_file_name function can work with UDF in pyspark ## What changes were proposed in this pull request? `input_file_name` doesn't return filename when working with UDF in PySpark. An example shows the problem: from pyspark.sql.functions import * from pyspark.sql.types import * def filename(path): return path sourceFile = udf(filename, StringType()) spark.read.json("tmp.json").select(sourceFile(input_file_name())).show() +---------------------------+ |filename(input_file_name())| +---------------------------+ | | +---------------------------+ The cause of this issue is, we group rows in `BatchEvalPythonExec` for batching processing of PythonUDF. Currently we group rows first and then evaluate expressions on the rows. If the data is less than the required number of rows for a group, the iterator will be consumed to the end before the evaluation. However, once the iterator reaches the end, we will unset input filename. So the input_file_name expression can't return correct filename. This patch fixes the approach to group the batch of rows. We evaluate the expression first and then group evaluated results to batch. ## How was this patch tested? Added unit test to PySpark. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #16115 from viirya/fix-py-udf-input-filename. (cherry picked from commit 6a5a7254dc37952505989e9e580a14543adb730c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 December 2016, 15:22:40 UTC
9095c15 [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide ## What changes were proposed in this pull request? * Add all R examples for ML wrappers which were added during 2.1 release cycle. * Split the whole ```ml.R``` example file into individual example for each algorithm, which will be convenient for users to rerun them. * Add corresponding examples to ML user guide. * Update ML section of SparkR user guide. Note: MLlib Scala/Java/Python examples will be consistent, however, SparkR examples may different from them, since R users may use the algorithms in a different way, for example, using R ```formula``` to specify ```featuresCol``` and ```labelCol```. ## How was this patch tested? Run all examples manually. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16148 from yanboliang/spark-18325. (cherry picked from commit 9bf8f3cd4f62f921c32fb50b8abf49576a80874f) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 08 December 2016, 14:20:28 UTC
48aa677 Preparing development version 2.1.1-SNAPSHOT 08 December 2016, 06:29:55 UTC
0807174 Preparing Spark release v2.1.0-rc2 08 December 2016, 06:29:49 UTC
1c3f1da [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1 ## What changes were proposed in this pull request? Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues: * Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```. Since it was used when making prediction and should be an argument of ```predict```, and we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle. * Fix ```spark.als``` params to make it consistent with MLlib. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16169 from yanboliang/spark-18326. (cherry picked from commit 97255497d885f0f8ccfc808e868bc8aa5e4d1063) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 08 December 2016, 04:23:45 UTC
ab865cf [SPARK-18705][ML][DOC] Update user guide to reflect one pass solver for L1 and elastic-net ## What changes were proposed in this pull request? WeightedLeastSquares now supports L1 and elastic net penalties and has an additional solver option: QuasiNewton. The docs are updated to reflect this change. ## How was this patch tested? Docs only. Generated documentation to make sure Latex looks ok. Author: sethah <seth.hendrickson16@gmail.com> Closes #16139 from sethah/SPARK-18705. (cherry picked from commit 82253617f5b3cdbd418c48f94e748651ee80077e) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 08 December 2016, 03:42:06 UTC
617ce3b [SPARK-18758][SS] StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query ## What changes were proposed in this pull request? Listeners added with `sparkSession.streams.addListener(l)` are added to a SparkSession. So events only from queries in the same session as a listener should be posted to the listener. Currently, all the events gets rerouted through the Spark's main listener bus, that is, - StreamingQuery posts event to StreamingQueryListenerBus. Only the queries associated with the same session as the bus posts events to it. - StreamingQueryListenerBus posts event to Spark's main LiveListenerBus as a SparkEvent. - StreamingQueryListenerBus also subscribes to LiveListenerBus events thus getting back the posted event in a different thread. - The received is posted to the registered listeners. The problem is that *all StreamingQueryListenerBuses in all sessions* gets the events and posts them to their listeners. This is wrong. In this PR, I solve it by making StreamingQueryListenerBus track active queries (by their runIds) when a query posts the QueryStarted event to the bus. This allows the rerouted events to be filtered using the tracked queries. Note that this list needs to be maintained separately from the `StreamingQueryManager.activeQueries` because a terminated query is cleared from `StreamingQueryManager.activeQueries` as soon as it is stopped, but the this ListenerBus must clear a query only after the termination event of that query has been posted lazily, much after the query has been terminated. Credit goes to zsxwing for coming up with the initial idea. ## How was this patch tested? Updated test harness code to use the correct session, and added new unit test. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16186 from tdas/SPARK-18758. (cherry picked from commit 9ab725eabbb4ad515a663b395bd2f91bb5853a23) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 08 December 2016, 03:23:41 UTC
839c2eb [SPARK-18633][ML][EXAMPLE] Add multiclass logistic regression summary python example and document ## What changes were proposed in this pull request? Logistic Regression summary is added in Python API. We need to add example and document for summary. The newly added example is consistent with Scala and Java examples. ## How was this patch tested? Manually tests: Run the example with spark-submit; copy & paste code into pyspark; build document and check the document. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16064 from wangmiao1981/py. (cherry picked from commit aad11209eb4db585f991ba09d08d90576f315bb4) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> 08 December 2016, 02:13:04 UTC
1c64197 [SPARK-18754][SS] Rename recentProgresses to recentProgress Based on an informal survey, users find this option easier to understand / remember. Author: Michael Armbrust <michael@databricks.com> Closes #16182 from marmbrus/renameRecentProgress. (cherry picked from commit 70b2bf717d367d598c5a238d569d62c777e63fde) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 07 December 2016, 23:36:39 UTC
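A one-line hedged sketch of the renamed accessor, with `query` standing in for any running StreamingQuery:
```scala
query.recentProgress.foreach(p => println(p.json))   // previously spelled `recentProgresses`
```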
e9b3afa [SPARK-18588][TESTS] Fix flaky test: KafkaSourceStressForDontFailOnDataLossSuite ## What changes were proposed in this pull request? Fixed the following failures: ``` org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 3745 times over 1.0000790851666665 minutes. Last failure message: assertion failed: failOnDataLoss-0 not deleted after timeout. ``` ``` sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: Query query-66 terminated with exception: null at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:146) Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null at java.util.ArrayList.addAll(ArrayList.java:577) at org.apache.kafka.clients.Metadata.getClusterForCurrentTopics(Metadata.java:257) at org.apache.kafka.clients.Metadata.update(Metadata.java:177) at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleResponse(NetworkClient.java:605) at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeHandleCompletedReceive(NetworkClient.java:582) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:450) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:269) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitPendingRequests(ConsumerNetworkClient.java:260) at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222) at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:366) at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:978) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938) at ... ``` ## How was this patch tested? Tested in #16048 by running many times. Author: Shixiong Zhu <shixiong@databricks.com> Closes #16109 from zsxwing/fix-kafka-flaky-test. (cherry picked from commit edc87e18922b98be47c298cdc3daa2b049a737e9) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 07 December 2016, 21:47:54 UTC
76e1f16 [SPARK-18762][WEBUI] Web UI should be http:4040 instead of https:4040 ## What changes were proposed in this pull request? When SSL is enabled, the Spark shell shows: ``` Spark context Web UI available at https://192.168.99.1:4040 ``` This is wrong because port 4040 serves http, not https; it only redirects to the https port. More importantly, this introduces several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481. CC: mengxr liancheng I manually tested by accessing MasterPage, WorkerPage and HistoryServer with SSL enabled. Author: sarutak <sarutak@oss.nttdata.co.jp> Closes #16190 from sarutak/SPARK-18761. (cherry picked from commit bb94f61a7ac97bf904ec0e8d5a4ab69a4142443f) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 07 December 2016, 19:41:46 UTC
acb6ac5 [SPARK-18764][CORE] Add a warning log when skipping a corrupted file ## What changes were proposed in this pull request? It's better to add a warning log when skipping a corrupted file. It will be helpful when we want to finish the job first, then find them in the log and fix these files. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16192 from zsxwing/SPARK-18764. (cherry picked from commit dbf3e298a1a35c0243f087814ddf88034ff96d66) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 07 December 2016, 18:30:15 UTC
5dbcd4f [SPARK-17760][SQL] AnalysisException with dataframe pivot when groupBy column is not attribute ## What changes were proposed in this pull request? Fixes the AnalysisException for pivot queries whose group-by columns are expressions rather than attributes, by substituting the expression's output attribute in the second aggregation and the final projection. ## How was this patch tested? Existing and additional unit tests. Author: Andrew Ray <ray.andrew@gmail.com> Closes #16177 from aray/SPARK-17760. (cherry picked from commit f1fca81b165c5a673f7d86b268e04ea42a6c267e) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 07 December 2016, 12:44:25 UTC
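To make the fixed query shape concrete, a hedged PySpark sketch where the groupBy key is an expression rather than a bare column; the data and column names are made up for illustration.
```python
# Hedged sketch: pivot with a groupBy key that is an expression (year + 1),
# not an existing attribute -- the query shape this fix targets.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(2012, "dotNET", 10000), (2012, "Java", 20000),
     (2013, "dotNET", 5000),  (2013, "Java", 30000)],
    ["year", "course", "earnings"])

result = (df.groupBy((F.col("year") + 1).alias("next_year"))
            .pivot("course")
            .sum("earnings"))
result.show()
```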
4432a2a [SPARK-18208][SHUFFLE] Executor OOM due to a growing LongArray in BytesToBytesMap ## What changes were proposed in this pull request? BytesToBytesMap currently does not release its in-memory storage (the longArray variable) after it spills to disk. This is typically not a problem during aggregation because the longArray should be much smaller than the pages, and because we grow the longArray at a conservative rate. However, it can lead to an OOM when an already running task is allocated more than its fair share, which can happen because of a scheduling delay. In this case the longArray can grow beyond the fair share of memory for the task. This becomes problematic when the task spills and the longArray is not freed, which causes subsequent memory allocation requests to be denied by the memory manager, resulting in an OOM. This PR fixes the issue by freeing the longArray when the BytesToBytesMap spills. ## How was this patch tested? Existing tests, plus testing on real-world workloads. Author: Jie Xiong <jiexiong@fb.com> Author: jiexiong <jiexiong@gmail.com> Closes #15722 from jiexiong/jie_oom_fix. (cherry picked from commit c496d03b5289f7c604661a12af86f6accddcf125) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 07 December 2016, 12:33:50 UTC
51754d6 [SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils ## What changes were proposed in this pull request? Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after the l-th element instead of k/l, which matters for small k. ## How was this patch tested? Existing test plus a new test case. Author: Sean Owen <sowen@cloudera.com> Closes #16129 from srowen/SPARK-18678. (cherry picked from commit 79f5f281bb69cb2de9f64006180abd753e8ae427) Signed-off-by: Sean Owen <sowen@cloudera.com> 07 December 2016, 09:34:57 UTC
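For reference, a small standalone Python sketch (not the SamplingUtils code) of reservoir sampling with the corrected replacement probability, k/l for the l-th element rather than k/(l-1).
```python
# Standalone illustration of reservoir sampling (algorithm R). For the l-th
# element (1-based, l > k), it replaces a uniformly chosen reservoir slot with
# probability k/l -- the denominator is l, not l - 1, which is the off-by-one
# being fixed.
import random

def reservoir_sample(stream, k, rng=random):
    reservoir = []
    for l, item in enumerate(stream, start=1):
        if l <= k:
            reservoir.append(item)
        else:
            j = rng.randrange(l)   # uniform in [0, l); P(j < k) = k / l
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(100), 5))
```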
99c293e [SPARK-18701][ML] Fix Poisson GLM failure due to wrong initialization Poisson GLM fails for many standard data sets (see example in test or JIRA). The issue is incorrect initialization leading to almost zero probability and weights. Specifically, the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close to zero probability and weights in the weighted least squares. Fix and test are included in the commits. ## What changes were proposed in this pull request? Update initialization in Poisson GLM ## How was this patch tested? Add test in GeneralizedLinearRegressionSuite srowen sethah yanboliang HyukjinKwon mengxr Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #16131 from actuaryzhang/master. (cherry picked from commit b8280271396eb74638da6546d76bbb2d06c7011b) Signed-off-by: Sean Owen <sowen@cloudera.com> 07 December 2016, 08:37:37 UTC
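A hedged PySpark sketch of the configuration affected, a GLM with Poisson family and log link; the toy counts (including zeros in the response, the case that triggered the bad initialization) are illustrative.
```python
# Hedged sketch: Poisson GLM with log link, the setup whose initialization
# this fix corrects. Assumes a SparkSession `spark`; data is illustrative,
# with zero counts in the response on purpose.
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(0.0, Vectors.dense(1.0, 0.0)),
     (1.0, Vectors.dense(0.0, 1.0)),
     (3.0, Vectors.dense(1.0, 2.0)),
     (0.0, Vectors.dense(2.0, 1.0))],
    ["label", "features"])

glr = GeneralizedLinearRegression(family="poisson", link="log", maxIter=25)
model = glr.fit(df)
print(model.coefficients, model.intercept)
```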
340e9ae [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit. ## What changes were proposed in this pull request? Several cleanup and improvements for ```spark.logit```: * ```summary``` should return coefficients matrix, and should output labels for each class if the model is multinomial logistic regression model. * ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrame which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0) which will be changed in later Spark version. In case it will introduce breaking changes, we do not expose them currently. * SparkR test improvement: comparing the training result with native R glmnet. * Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param(related with Spark architecture and job execution) that would be used rarely by R users. ## How was this patch tested? Unit tests. The ```summary``` output after this change: multinomial logistic regression: ``` > df <- suppressWarnings(createDataFrame(iris)) > model <- spark.logit(df, Species ~ ., regParam = 0.5) > summary(model) $coefficients versicolor virginica setosa (Intercept) 1.514031 -2.609108 1.095077 Sepal_Length 0.02511006 0.2649821 -0.2900921 Sepal_Width -0.5291215 -0.02016446 0.549286 Petal_Length 0.03647411 0.1544119 -0.190886 Petal_Width 0.000236092 0.4195804 -0.4198165 ``` binomial logistic regression: ``` > df <- suppressWarnings(createDataFrame(iris)) > training <- df[df$Species %in% c("versicolor", "virginica"), ] > model <- spark.logit(training, Species ~ ., regParam = 0.5) > summary(model) $coefficients Estimate (Intercept) -6.053815 Sepal_Length 0.2449379 Sepal_Width 0.1648321 Petal_Length 0.4730718 Petal_Width 1.031947 ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #16117 from yanboliang/spark-18686. (cherry picked from commit 90b59d1bf262b41c3a5f780697f504030f9d079c) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 07 December 2016, 08:32:32 UTC
3750c6e [SPARK-18671][SS][TEST-MAVEN] Follow up PR to fix test for Maven ## What changes were proposed in this pull request? Maven compilation does not seem to allow resources in sql/test to be easily referred to from the kafka-0-10-sql tests, so the kafka-source-offset-version-2.1.0 file was moved from the sql test resources to the kafka-0-10-sql test resources. ## How was this patch tested? Manually ran the Maven tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16183 from tdas/SPARK-18671-1. (cherry picked from commit 5c6bcdbda4dd23bbd112a7395cd9d1cfd04cf4bb) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 07 December 2016, 05:51:52 UTC
9b5bc2a [SPARK-18734][SS] Represent timestamp in StreamingQueryProgress as formatted string instead of millis ## What changes were proposed in this pull request? Easier to read while debugging as a formatted string (in ISO8601 format) than in millis ## How was this patch tested? Updated unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16166 from tdas/SPARK-18734. (cherry picked from commit 539bb3cf9573be5cd86e7e6502523ce89c0de170) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 07 December 2016, 01:04:34 UTC
65f5331 [SPARK-18652][PYTHON] Include the example data and third-party licenses in pyspark package. ## What changes were proposed in this pull request? Since we already include the python examples in the pyspark package, we should include the example data with it as well. We should also include the third-party licences since we distribute their jars with the pyspark package. ## How was this patch tested? Manually tested with python2.7 and python3.4 ```sh $ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean package $ cd python $ python setup.py sdist $ pip install dist/pyspark-2.1.0.dev0.tar.gz $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/ graphx mllib streaming $ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/ 600K /usr/local/lib/python2.7/dist-packages/pyspark/data/ $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/licenses/|head -5 LICENSE-AnchorJS.txt LICENSE-DPark.txt LICENSE-Mockito.txt LICENSE-SnapTree.txt LICENSE-antlr.txt ``` Author: Shuai Lin <linshuai2012@gmail.com> Closes #16082 from lins05/include-data-in-pyspark-dist. (cherry picked from commit bd9a4a5ac3abcc48131d1249df55e7d68266343a) Signed-off-by: Sean Owen <sowen@cloudera.com> 06 December 2016, 22:09:50 UTC
d20e0d6 [SPARK-18671][SS][TEST] Added tests to ensure stability of all Structured Streaming log formats ## What changes were proposed in this pull request? To be able to restart StreamingQueries across Spark versions, we have already made the logs (offset log, file source log, file sink log) use json. We should add tests with actual json files in Spark so that any incompatible change in reading the logs is caught immediately. This PR adds tests for FileStreamSourceLog, FileStreamSinkLog, and OffsetSeqLog. ## How was this patch tested? New unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16128 from tdas/SPARK-18671. (cherry picked from commit 1ef6b296d7cd2d93cdfd5f54940842d6bb915ce0) Signed-off-by: Shixiong Zhu <shixiong@databricks.com> 06 December 2016, 21:05:31 UTC
ace4079 [SPARK-18714][SQL] Add a simple time function to SparkSession ## What changes were proposed in this pull request? Many Spark developers often want to test the runtime of some function in interactive debugging and testing. This patch adds a simple time function to SparkSession: ``` scala> spark.time { spark.range(1000).count() } Time taken: 77 ms res1: Long = 1000 ``` ## How was this patch tested? I tested this interactively in spark-shell. Author: Reynold Xin <rxin@databricks.com> Closes #16140 from rxin/SPARK-18714. (cherry picked from commit cb1f10b468e7771af75cb2288d375a87ab66d316) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 06 December 2016, 19:48:21 UTC
e362d99 [SPARK-18634][SQL][TRIVIAL] Touch-up Generate ## What changes were proposed in this pull request? I jumped the gun on merging https://github.com/apache/spark/pull/16120, and missed a tiny potential problem. This PR fixes that by changing a val into a def; this should prevent potential serialization/initialization weirdness from happening. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #16170 from hvanhovell/SPARK-18634. (cherry picked from commit 381ef4ea76b0920e05c81adb44b1fef88bee5d25) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 06 December 2016, 13:51:55 UTC
655297b [SPARK-18721][SS] Fix ForeachSink with watermark + append ## What changes were proposed in this pull request? Right now ForeachSink creates a new physical plan, so StreamExecution cannot retrieve metrics and the watermark. This PR changes ForeachSink to manually convert InternalRows to objects without creating a new plan. ## How was this patch tested? `test("foreach with watermark: append")`. Author: Shixiong Zhu <shixiong@databricks.com> Closes #16160 from zsxwing/SPARK-18721. (cherry picked from commit 7863c623791d088684107f833fdecb4b5fdab4ec) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 06 December 2016, 04:35:34 UTC
8ca6a82 [SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog` (Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572) ## What changes were proposed in this pull request? Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table. To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows: Spark at bdc8153, `SHOW PARTITIONS table1`, times in seconds: 7.901 3.983 4.018 4.331 4.261 Spark at bdc8153, `SHOW PARTITIONS table2` (Timed out after 10 minutes with a `SocketTimeoutException`.) Spark at this PR, `SHOW PARTITIONS table1`, times in seconds: 3.801 0.449 0.395 0.348 0.336 Spark at this PR, `SHOW PARTITIONS table2`, times in seconds: 5.184 1.63 1.474 1.519 1.41 Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master. This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x. ## How was this patch tested? I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions. Author: Michael Allman <michael@videoamp.com> Closes #15998 from mallman/spark-18572-list_partition_names. (cherry picked from commit 772ddbeaa6fe5abf189d01246f57d295f9346fa3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 December 2016, 03:33:55 UTC
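For context, a hedged sketch of the command whose metadata path this change speeds up; the table name is a placeholder and Hive support is assumed to be enabled on the session.
```python
# Hedged sketch: time SHOW PARTITIONS on a large partitioned Hive table.
# `some_partitioned_table` is a placeholder; requires Hive support on `spark`.
import time

start = time.time()
names = spark.sql("SHOW PARTITIONS some_partitioned_table").collect()
print(len(names), "partitions listed in", round(time.time() - start, 3), "s")
```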
d458816 [SPARK-18722][SS] Move no data rate limit from StreamExecution to ProgressReporter ## What changes were proposed in this pull request? Move no data rate limit from StreamExecution to ProgressReporter to make `recentProgresses` and listener events consistent. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16155 from zsxwing/SPARK-18722. (cherry picked from commit 4af142f55771affa5fc7f2abbbf5e47766194e6e) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 06 December 2016, 02:51:22 UTC
1946854 [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and not auto-generate StreamingQuery.name Here are the major changes in this PR. - Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`. - Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`). - Removed auto-generation of `StreamingQuery.name`. The purpose of name was to have the ability to define an identifier across restarts, but since id is precisely that, there is no need for a auto-generated name. This means name becomes purely cosmetic, and is null by default. - Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`. Implementation details - Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`). - Added the `id` as the new `StreamMetadata`. - When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`. - All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name` TODO - [x] Test handling of name=null in json generation of StreamingQueryProgress - [x] Test handling of name=null in json generation of StreamingQueryListener events - [x] Test python API of runId Updated unit tests and new unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16113 from tdas/SPARK-18657. (cherry picked from commit bb57bfe97d9fb077885065b8e804b85d4c493faf) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 06 December 2016, 02:19:15 UTC
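To illustrate the resulting API, a hedged PySpark sketch: `id` is tied to the checkpoint location and survives restarts, `runId` is regenerated for every (re)start, and `name` stays unset unless given explicitly; the source options and checkpoint path are placeholders.
```python
# Hedged sketch of the id/runId/name semantics described above. Assumes a
# SparkSession `spark`; source options and the checkpoint path are placeholders.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

query = (lines.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/demo-checkpoint")
         .start())

print(query.id)     # recovered from the checkpoint, stable across restarts
print(query.runId)  # unique to this particular run
print(query.name)   # None unless .queryName(...) was set on the writer

query.stop()
```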
6c4c336 [SPARK-18729][SS] Move DataFrame.collect out of synchronized block in MemorySink ## What changes were proposed in this pull request? Move DataFrame.collect out of synchronized block so that we can query content in MemorySink when `DataFrame.collect` is running. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16162 from zsxwing/SPARK-18729. (cherry picked from commit 1b2785c3d0a40da2fca923af78066060dbfbcf0a) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 06 December 2016, 02:16:07 UTC
fecd23d [SPARK-18634][PYSPARK][SQL] Corruption and Correctness issues with exploding Python UDFs ## What changes were proposed in this pull request? As reported in the Jira, there are some weird issues with exploding Python UDFs in SparkSQL. The following test code can reproduce it. Notice: the following test code is reported to return wrong results in the Jira. However, as I tested on master branch, it causes exception and so can't return any result. >>> from pyspark.sql.functions import * >>> from pyspark.sql.types import * >>> >>> df = spark.range(10) >>> >>> def return_range(value): ... return [(i, str(i)) for i in range(value - 1, value + 1)] ... >>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()), ... StructField("string_val", StringType())]))) >>> >>> df.select("id", explode(range_udf(df.id))).show() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/spark/python/pyspark/sql/dataframe.py", line 318, in show print(self._jdf.showString(n, 20)) File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120) at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57) The cause of this issue is, in `ExtractPythonUDFs` we insert `BatchEvalPythonExec` to run PythonUDFs in batch. `BatchEvalPythonExec` will add extra outputs (e.g., `pythonUDF0`) to original plan. In above case, the original `Range` only has one output `id`. After `ExtractPythonUDFs`, the added `BatchEvalPythonExec` has two outputs `id` and `pythonUDF0`. Because the output of `GenerateExec` is given after analysis phase, in above case, it is the combination of `id`, i.e., the output of `Range`, and `col`. But in planning phase, we change `GenerateExec`'s child plan to `BatchEvalPythonExec` with additional output attributes. It will cause no problem in non wholestage codegen. Because when evaluating the additional attributes are projected out the final output of `GenerateExec`. However, as `GenerateExec` now supports wholestage codegen, the framework will input all the outputs of the child plan to `GenerateExec`. Then when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes is different to the output variables in wholestage codegen. To solve this issue, this patch only gives the generator's output to `GenerateExec` after analysis phase. `GenerateExec`'s output is the combination of its child plan's output and the generator's output. So when we change `GenerateExec`'s child, its output is still correct. ## How was this patch tested? Added test cases to PySpark. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #16120 from viirya/fix-py-udf-with-generator. (cherry picked from commit 3ba69b64852ccbf6d4ec05a021bc20616a09f574) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 06 December 2016, 01:50:55 UTC
c6a4e3d [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix StreamingQueryException (branch 2.1) ## What changes were proposed in this pull request? Backport #16125 to branch 2.1. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16153 from zsxwing/SPARK-18694-2.1. 05 December 2016, 22:59:42 UTC
39759ff [DOCS][MINOR] Update location of Spark YARN shuffle jar Looking at the distributions provided on spark.apache.org, I see that the Spark YARN shuffle jar is under `yarn/` and not `lib/`. This change is so minor I'm not sure it needs a JIRA. But let me know if so and I'll create one. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #16130 from nchammas/yarn-doc-fix. (cherry picked from commit 5a92dc76ab431d73275a2bdfbc2c0a8ceb0d75d1) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 05 December 2016, 20:58:21 UTC
e23c8cf [SPARK-18711][SQL] should disable subexpression elimination for LambdaVariable ## What changes were proposed in this pull request? This is a long-standing bug; it was hidden until https://github.com/apache/spark/pull/15780, which may add `AssertNotNull` on top of `LambdaVariable` and thus enables subexpression elimination. However, subexpression elimination evaluates the common expressions at the beginning, which is invalid for `LambdaVariable`. `LambdaVariable` usually represents a loop variable, which can't be evaluated ahead of the loop. This PR skips expressions containing `LambdaVariable` when doing subexpression elimination. ## How was this patch tested? Updated test in `DatasetAggregatorSuite`. Author: Wenchen Fan <wenchen@databricks.com> Closes #16143 from cloud-fan/aggregator. (cherry picked from commit 01a7d33d0851d82fd1bb477a58d9925fe8d727d8) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 05 December 2016, 19:37:24 UTC
30c0743 Revert "[SPARK-18284][SQL] Make ExpressionEncoder.serializer.nullable precise" This reverts commit fce1be6cc81b1fe3991a4df91128f4fcd14ff615 from branch-2.1. 05 December 2016, 18:49:22 UTC
afd2321 [MINOR][DOC] Use SparkR `TRUE` value and add default values for `StructField` in SQL Guide. ## What changes were proposed in this pull request? In `SQL Programming Guide`, this PR uses `TRUE` instead of `True` in SparkR and adds default values of `nullable` for `StructField` in Scala/Python/R (i.e., "Note: The default value of nullable is true."). In Java API, `nullable` is not optional. **BEFORE** * SPARK 2.1.0 RC1 http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/sql-programming-guide.html#data-types **AFTER** * R <img width="916" alt="screen shot 2016-12-04 at 11 58 19 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877443/abba19a6-ba7d-11e6-8984-afbe00333fb0.png"> * Scala <img width="914" alt="screen shot 2016-12-04 at 11 57 37 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877433/99ce734a-ba7d-11e6-8bb5-e8619041b09b.png"> * Python <img width="914" alt="screen shot 2016-12-04 at 11 58 04 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877440/a5c89338-ba7d-11e6-8f92-6c0ae9388d7e.png"> ## How was this patch tested? Manual. ``` cd docs SKIP_API=1 jekyll build open _site/index.html ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16141 from dongjoon-hyun/SPARK-SQL-GUIDE. (cherry picked from commit 410b7898661f77e748564aaee6a5ab7747ce34ad) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 05 December 2016, 18:36:26 UTC
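A small PySpark sketch of the default being documented: `nullable` is true when omitted, and must be set explicitly to override.
```python
# Sketch of the documented default: StructField's `nullable` defaults to True.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType()),                  # nullable defaults to True
    StructField("age", IntegerType(), nullable=False),  # opt out explicitly
])
print(schema)
```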
1821cbe [SPARK-18279][DOC][ML][SPARKR] Add R examples to ML programming guide. ## What changes were proposed in this pull request? Add R examples to ML programming guide for the following algorithms as POC: * spark.glm * spark.survreg * spark.naiveBayes * spark.kmeans The four algorithms were added to SparkR since 2.0.0, more docs for algorithms added during 2.1 release cycle will be addressed in a separate follow-up PR. ## How was this patch tested? This is the screenshots of generated ML programming guide for ```GeneralizedLinearRegression```: ![image](https://cloud.githubusercontent.com/assets/1962026/20866403/babad856-b9e1-11e6-9984-62747801e8c4.png) Author: Yanbo Liang <ybliang8@gmail.com> Closes #16136 from yanboliang/spark-18279. (cherry picked from commit eb8dd68132998aa00902dfeb935db1358781e1c1) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 05 December 2016, 08:40:33 UTC
88e07ef [SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and setPredictionCol ## What changes were proposed in this pull request? add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel` ## How was this patch tested? added tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #16059 from zhengruifeng/ovrm_setCol. (cherry picked from commit bdfe7f67468ecfd9927a1fec60d6605dd05ebe3f) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 05 December 2016, 08:33:21 UTC
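For orientation, a hedged PySpark sketch of the one-vs-rest workflow; the patch itself adds the setters on the Scala OneVsRestModel, while here the column names are configured on the estimator (whose params the fitted model reuses). Data and column names are illustrative.
```python
# Hedged sketch: one-vs-rest with custom features/prediction column names.
# The new Scala-side setters live on OneVsRestModel; this Python sketch sets
# the columns on the estimator instead. Assumes a SparkSession `spark`.
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.0)),
     (1.0, Vectors.dense(1.0, 0.5)),
     (2.0, Vectors.dense(2.0, 1.5))],
    ["label", "feats"])

ovr = OneVsRest(classifier=LogisticRegression(maxIter=10),
                featuresCol="feats", predictionCol="pred")
model = ovr.fit(df)
model.transform(df).select("feats", "pred").show()
```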
c13c293 [SPARK-18643][SPARKR] SparkR hangs at session start when installed as a package without Spark ## What changes were proposed in this pull request? If SparkR is running as a package and it has previously downloaded the Spark jar, it should be able to run as before without having to set SPARK_HOME. With this bug, the automatic Spark install only works in the first session, which seems to be a regression from the earlier behavior. The fix is to always try to install, or check for the cached Spark, when running in an interactive session. As discussed before, we should probably only install Spark if running in an interactive session (R shell, RStudio, etc.). ## How was this patch tested? Manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16077 from felixcheung/rsessioninteractive. (cherry picked from commit b019b3a8ac49336e657f5e093fa2fba77f8d12d2) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 05 December 2016, 04:25:21 UTC
41d698e [SPARK-18661][SQL] Creating a partitioned datasource table should not scan all files for table ## What changes were proposed in this pull request? Even though in 2.1 creating a partitioned datasource table will not populate the partition data by default (until the user issues MSCK REPAIR TABLE), it seems we still scan the filesystem for no good reason. We should avoid doing this when the user specifies a schema. ## How was this patch tested? Perf stat tests. Author: Eric Liang <ekl@databricks.com> Closes #16090 from ericl/spark-18661. (cherry picked from commit d9eb4c7215f26dd05527c0b9980af35087ab9d64) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 December 2016, 12:44:16 UTC
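A hedged SQL-through-PySpark sketch of the workflow described: define a partitioned datasource table over existing files with an explicit schema (which should no longer trigger a full file scan at creation time), then populate partition metadata on demand. The table name, path, and exact DDL options are placeholders and may vary by version.
```python
# Hedged sketch: create a partitioned datasource table with a user-specified
# schema, then fill in partition metadata later. Names/paths are placeholders.
spark.sql("""
  CREATE TABLE events (id BIGINT, payload STRING, dt STRING)
  USING parquet
  OPTIONS (path '/data/events')
  PARTITIONED BY (dt)
""")

spark.sql("MSCK REPAIR TABLE events")                     # populate partitions on demand
spark.sql("SHOW PARTITIONS events").show(truncate=False)
```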