https://github.com/apache/spark

Revision Message Commit Date
0a4c03f Preparing Spark release v2.4.0-rc5 29 October 2018, 06:15:29 UTC
7f4fce4 [SPARK-25179][PYTHON][DOCS] Document BinaryType support in Arrow conversion ## What changes were proposed in this pull request? This PR targets to document binary type in "Apache Arrow in Spark". ## How was this patch tested? Manually built the documentation and checked. Closes #22871 from HyukjinKwon/SPARK-25179. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit fbaf150507a289ec0ac02fdbf4009c42cd9bc164) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 October 2018, 06:02:09 UTC
b6ba0dd [DOC] Fix doc for spark.sql.parquet.recordLevelFilter.enabled ## What changes were proposed in this pull request? Updated the doc string value for spark.sql.parquet.recordLevelFilter.enabled to indicate that spark.sql.parquet.enableVectorizedReader must be disabled. The code in ParquetFileFormat uses spark.sql.parquet.recordLevelFilter.enabled only after falling back to parquet-mr (see else for this if statement): https://github.com/apache/spark/blob/d5573c578a1eea9ee04886d9df37c7178e67bb30/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L412 https://github.com/apache/spark/blob/d5573c578a1eea9ee04886d9df37c7178e67bb30/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L427-L430 Tests also bear this out. ## How was this patch tested? This is just a doc string fix: I built Spark and ran a single test. Closes #22865 from bersprockets/confdocfix. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4e990d9dd2407dc257712c4b12b507f0990ca4e9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 October 2018, 05:45:23 UTC
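A minimal configuration sketch of what the corrected doc string describes, assuming a `SparkSession` named `spark` and placeholder path/column names: record-level filtering only takes effect on the parquet-mr fallback path, so the vectorized reader must be disabled and filter pushdown enabled.

```scala
// Sketch only: record-level filtering is applied on the parquet-mr path,
// which Spark falls back to only when the vectorized reader is off.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.parquet.recordLevelFilter.enabled", "true")

spark.read.parquet("/tmp/parquet-data").filter("id > 10").show()
```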
00771dc [SPARK-25816][SQL] Fix attribute resolution in nested extractors ## What changes were proposed in this pull request? Extractors are made of 2 expressions: one of them defines the value to be extracted from (called `child`) and the other defines the way of extraction (called `extraction`). In this sense extractors have 2 children, so they shouldn't be `UnaryExpression`s. `ResolveReferences` was changed in this commit: https://github.com/apache/spark/commit/36b826f5d17ae7be89135cb2c43ff797f9e7fe48 which resulted in a regression with nested extractors. An extractor needs to define its children as the set of both `child` and `extraction`, and should try to resolve both in `ResolveReferences`. This PR changes `UnresolvedExtractValue` to a `BinaryExpression`. ## How was this patch tested? Added a UT. Closes #22817 from peter-toth/SPARK-25816. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit ca2fca143277deaff58a69b7f1e0360cfc70561f) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 October 2018, 00:51:53 UTC
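A hypothetical repro sketch of the nested-extractor shape the fix covers, assuming `spark.implicits._` is imported: the extraction argument is itself an attribute that `ResolveReferences` has to resolve.

```scala
// The outer extractor pulls a value out of the map column, and its key
// (the "extraction" expression) is another column that must also be resolved.
val df = Seq((1, Map(1 -> "a")), (2, Map(2 -> "b"))).toDF("key", "map")
df.select(df("map")(df("key"))).show()
```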
0f74bac [SPARK-24709][SQL][2.4] use str instead of basestring in isinstance ## What changes were proposed in this pull request? After backporting https://github.com/apache/spark/pull/22775 to 2.4, the 2.4 sbt Jenkins QA job is broken; see https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.4-test-sbt-hadoop-2.7/147/console This PR adds `if sys.version >= '3': basestring = str` which only exists in master. ## How was this patch tested? Existing test. Closes #22858 from cloud-fan/python. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> 28 October 2018, 02:50:46 UTC
f575616 [SPARK-25859][ML] add scala/java/python example and doc for PrefixSpan ## What changes were proposed in this pull request? add scala/java/python example and doc for PrefixSpan in branch 2.4 ## How was this patch tested? Manually tested Author: Huaxin Gao <huaxing@us.ibm.com> Closes #22863 from huaxingao/mydocbranch. 27 October 2018, 22:14:29 UTC
313a1f0 [SPARK-25854][BUILD] fix `build/mvn` not to fail during Zinc server shutdown ## What changes were proposed in this pull request? The final line in the mvn helper script in build/ attempts to shut down the Zinc server. Because the Zinc server is set up with a 30-minute timeout, by the time the mvn test instantiation finishes, the server times out. This means that when the mvn script tries to shut down Zinc, it returns with an exit code of 1, which then automatically fails the entire build (even if the build passes). ## How was this patch tested? I set up a test build: https://amplab.cs.berkeley.edu/jenkins/job/sknapp-testing-spark-branch-2.4-test-maven-hadoop-2.7/ Closes #22854 from shaneknapp/fix-mvn-helper-script. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 6aa506394958bfb30cd2a9085a5e8e8be927de51) Signed-off-by: Sean Owen <sean.owen@databricks.com> 26 October 2018, 21:37:50 UTC
cb2827d Preparing development version 2.4.1-SNAPSHOT 26 October 2018, 16:47:05 UTC
4a7ead4 Preparing Spark release v2.4.0-rc5 26 October 2018, 16:47:00 UTC
1757a60 HOT-FIX pyspark import 26 October 2018, 16:43:16 UTC
d868dc2 Preparing development version 2.4.1-SNAPSHOT 26 October 2018, 16:26:36 UTC
075447b Preparing Spark release v2.4.0-rc5 26 October 2018, 16:26:31 UTC
40ed093 [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's input json as literal only The main purpose of `schema_of_json` is to be used in combination with `from_json` (to make up for the lack of schema inference), which takes its schema only as a literal; however, `schema_of_json` currently allows JSON input as non-literal expressions (e.g., a column). This was mistakenly allowed - we don't have to take usages other than the main purpose into account for now. This follow-up only allows literals for `schema_of_json`'s JSON input. We can allow non-literal expressions later when it's needed or there are use cases for it. Unit tests were added. Closes #22775 from HyukjinKwon/SPARK-25447-followup. Lead-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 33e337c1180a12edf1ae97f0221e389f23192461) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 October 2018, 16:06:11 UTC
26e1d3e [SPARK-25835][K8S] Create kubernetes-tests profile and use the detected SCALA_VERSION - Fixes the Scala version propagation issue. - Disables the tests under the k8s profile; now we will run them manually. Adds a test-specific profile, since tests will not run if we just remove the module from the kubernetes profile (quickest solution I can think of). Tested manually by running the tests with different versions of Scala. Closes #22838 from skonto/propagate-scala2.12. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 7d44bc26408b2189804fd305797afcefb7b2b0e0) Signed-off-by: Sean Owen <sean.owen@databricks.com> 26 October 2018, 13:54:04 UTC
b47b827 [SPARK-25797][SQL][DOCS] Add migration doc for solving issues caused by view canonicalization approach change ## What changes were proposed in this pull request? Since Spark 2.2, view definitions are stored in a different way from prior versions. This may leave Spark unable to read views created by prior versions. See [SPARK-25797](https://issues.apache.org/jira/browse/SPARK-25797) for more details. Basically, we have 2 options. 1) Make Spark 2.2+ able to get older view definitions back. Since the expanded text is buggy and unusable, we have to use the original text (this is possible with [SPARK-25459](https://issues.apache.org/jira/browse/SPARK-25459)). However, because older Spark versions don't save the context for the database, we cannot always get correct view definitions without the view's default database. 2) Recreate the views by `ALTER VIEW AS` or `CREATE OR REPLACE VIEW AS`. This PR aims to add a migration doc to help users troubleshoot this issue via option 2 above. ## How was this patch tested? N/A. Docs are generated and checked locally ``` cd docs SKIP_API=1 jekyll serve --watch ``` Closes #22846 from seancxmao/SPARK-25797. Authored-by: seancxmao <seancxmao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6fd5ff3951ed9ac7c0b20f2666d8bc39929bfb5c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 October 2018, 10:54:26 UTC
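A sketch of option 2 from the migration note, with placeholder view and table names and assuming a `SparkSession` named `spark`:

```scala
// Re-create the view so Spark 2.2+ stores its definition in the new format.
spark.sql("CREATE OR REPLACE VIEW legacy_view AS SELECT id, name FROM some_table")
// Or keep the existing view object and only rewrite its definition:
spark.sql("ALTER VIEW legacy_view AS SELECT id, name FROM some_table")
```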
f37bcea [SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608 ## What changes were proposed in this pull request? See the detailed information at https://issues.apache.org/jira/browse/SPARK-25841 on why these APIs should be deprecated and redesigned. This patch also reverts https://github.com/apache/spark/commit/8acb51f08b448628b65e90af3b268994f9550e45 which applies to 2.4. ## How was this patch tested? Only deprecation and doc changes. Closes #22841 from rxin/SPARK-25842. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 89d748b33c8636a1b1411c505921b0a585e1e6cb) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 October 2018, 05:17:50 UTC
eff1c50 [SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker ## What changes were proposed in this pull request? There is a race condition when releasing a Python worker. If `ReaderIterator.handleEndOfDataSection` is not running in the task thread, when a task is early terminated (such as `take(N)`), the task completion listener may close the worker but "handleEndOfDataSection" can still put the worker into the worker pool for reuse. https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7 is a patch to reproduce this issue. I also found a user reported this in the mail list: http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=H+YLUEpd23nwvq13Ms5hOStkhX3ao4f4zQV6sgO5zM-xAmail.gmail.com%3E This PR fixes the issue by using `compareAndSet` to make sure we will never return a closed worker to the worker pool. ## How was this patch tested? Jenkins. Closes #22816 from zsxwing/fix-socket-closed. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit 86d469aeaa492c0642db09b27bb0879ead5d7166) Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 26 October 2018, 04:54:14 UTC
adfd105 [MINOR][TEST][BRANCH-2.4] Regenerate golden file `datetime.sql.out` ## What changes were proposed in this pull request? `datetime.sql.out` is a generated golden file, but it's a little bit broken during manual [reverting](https://github.com/dongjoon-hyun/spark/commit/5d744499667fcd08825bca0ac6d5d90d6e110ebc#diff-79dd276be45ede6f34e24ad7005b0a7cR87). This doesn't cause a test failure because the difference is inside `comments` and blank lines. We had better fix this minor issue before RC5. ## How was this patch tested? Pass the Jenkins. Closes #22837 from dongjoon-hyun/fix_datetime_sql_out. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 October 2018, 03:37:07 UTC
b739fb0 [SPARK-25840][BUILD] `make-distribution.sh` should not fail due to missing LICENSE-binary ## What changes were proposed in this pull request? We vote for the artifacts. All releases are in the form of the source materials needed to make changes to the software being released. (http://www.apache.org/legal/release-policy.html#artifacts) From Spark 2.4.0, the source artifact and binary artifact start to contain their own proper LICENSE files (LICENSE, LICENSE-binary). It's great to have them. However, unfortunately, `dev/make-distribution.sh` inside source artifacts starts to fail because it expects `LICENSE-binary`, and the source artifact has only the LICENSE file. https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz `dev/make-distribution.sh` is used during the voting phase because we are voting on that source artifact instead of the GitHub repository. Individual contributors usually don't have the downstream repository and start by trying to build the voting source artifacts to help verify the source artifact during the voting phase. (Personally, I did before.) This PR aims to make that script work in either case. This doesn't aim for source artifacts to reproduce the compiled artifacts. ## How was this patch tested? Manual. ``` $ rm LICENSE-binary $ dev/make-distribution.sh ``` Closes #22840 from dongjoon-hyun/SPARK-25840. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 79f3babcc6e189d7405464b9ac1eb1c017e51f5d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 October 2018, 03:26:26 UTC
39e108f [SPARK-25793][ML] call SaveLoadV2_0.load for classNameV2_0 ## What changes were proposed in this pull request? The following code in BisectingKMeansModel.load calls the wrong version of load. ``` case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) => val model = SaveLoadV1_0.load(sc, path) ``` Closes #22790 from huaxingao/spark-25793. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit dc9b320807881403ca9f1e2e6d01de4b52db3975) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 October 2018, 03:08:51 UTC
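A usage sketch that exercises the corrected branch, assuming a `SparkContext` named `sc` and a placeholder model path; the internal fix simply makes the V2.0 metadata dispatch to `SaveLoadV2_0.load(sc, path)` instead of `SaveLoadV1_0.load`.

```scala
import org.apache.spark.mllib.clustering.BisectingKMeansModel

// Loading a model persisted with the V2.0 save format goes through the fixed branch.
val model = BisectingKMeansModel.load(sc, "/tmp/bisecting-kmeans-model")
```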
a9f200e [SPARK-25832][SQL][BRANCH-2.4] Revert newly added map related functions ## What changes were proposed in this pull request? - Revert [SPARK-23935][SQL] Adding map_entries function: https://github.com/apache/spark/pull/21236 - Revert [SPARK-23937][SQL] Add map_filter SQL function: https://github.com/apache/spark/pull/21986 - Revert [SPARK-23940][SQL] Add transform_values SQL function: https://github.com/apache/spark/pull/22045 - Revert [SPARK-23939][SQL] Add transform_keys function: https://github.com/apache/spark/pull/22013 - Revert [SPARK-23938][SQL] Add map_zip_with function: https://github.com/apache/spark/pull/22017 - Revert the changes of map_entries in [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR: https://github.com/apache/spark/pull/21434/ ## How was this patch tested? The existing tests. Closes #22827 from gatorsmile/revertMap2.4. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 October 2018, 23:38:55 UTC
db121a2 [SPARK-25656][SQL][DOC][EXAMPLE][BRANCH-2.4] Add a doc and examples about extra data source options ## What changes were proposed in this pull request? Our current doc does not explain how we are passing the data source specific options to the underlying data source. According to [the review comment](https://github.com/apache/spark/pull/22622#discussion_r222911529), this PR aims to add more detailed information and examples. This is a backport of #22801. `orc.column.encoding.direct` is removed since it's not supported in ORC 1.5.2. ## How was this patch tested? Manual. Closes #22839 from dongjoon-hyun/SPARK-25656-2.4. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 25 October 2018, 21:15:03 UTC
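Roughly the shape of the added example, assuming a `SparkSession` named `spark`; the ORC option keys below are illustrative and are simply forwarded to the underlying data source.

```scala
// Extra, source-specific options are passed through untouched to the data source.
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.write
  .format("orc")
  .option("orc.bloom.filter.columns", "favorite_color")
  .option("orc.dictionary.key.threshold", "1.0")
  .save("users_with_options.orc")
```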
1b075f2 [SPARK-24787][CORE] Revert hsync in EventLoggingListener and make FsHistoryProvider to read lastBlockBeingWritten data for logs ## What changes were proposed in this pull request? `hsync` was added as part of SPARK-19531 to get the latest data in the history server UI, but it causes performance overhead and also leads to dropping many history log events. `hsync` uses `FileChannel.force` to sync the data to the disk, and since this happens on the data pipeline, it is a costly operation that makes the application face overhead and drop events. I think getting the latest data in the history server can be done in a different way (with no impact on the application while writing events): there is an API, `DFSInputStream.getFileLength()`, which gives the file length including the `lastBlockBeingWrittenLength` (different from `FileStatus.getLen()`). This API can be used when the file status length and the previously cached length are equal, to verify whether any new data has been written or not; if there is any update in data length, then the history server can update the in-progress history log. I also made this change configurable with a default value of false; it can be enabled for the history server if users want to see the updated data in the UI. ## How was this patch tested? Added a new test and verified manually: with the added conf `spark.history.fs.inProgressAbsoluteLengthCheck.enabled=true`, the history server reads the logs including the last block data that is being written and updates the Web UI with the latest data. Closes #22752 from devaraj-kavali/SPARK-24787. Authored-by: Devaraj K <devaraj@apache.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 46d2d2c74d9aaf30e158aeda58a189f6c8e48b9c) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 October 2018, 20:16:18 UTC
45ed76d [SPARK-25803][K8S] Fix docker-image-tool.sh -n option ## What changes were proposed in this pull request? docker-image-tool.sh uses getopts in which a colon signifies that an option takes an argument. Since -n does not take an argument it should not have a colon. ## How was this patch tested? Following the reproduction in [JIRA](https://issues.apache.org/jira/browse/SPARK-25803):- 0. Created a custom Dockerfile to use for the spark-r container image. In each of the steps below the path to this Dockerfile is passed with the '-R' option. (spark-r is used here simply as an example, the bug applies to all options) 1. Built container images without '-n'. The [result](https://gist.github.com/sel/59f0911bb1a6a485c2487cf7ca770f9d) is that the '-R' option is honoured and the hello-world image is built for spark-r, as expected. 2. Built container images with '-n' to reproduce the issue The [result](https://gist.github.com/sel/e5cabb9f3bdad5d087349e7fbed75141) is that the '-R' option is ignored and the default container image for spark-r is built 3. Applied the patch and re-built container images with '-n' and did not reproduce the issue The [result](https://gist.github.com/sel/6af14b95012ba8ff267a4fce6e3bd3bf) is that the '-R' option is honoured and the hello-world image is built for spark-r, as expected. Closes #22798 from sel/fix-docker-image-tool-nocache. Authored-by: Steve <sel@users.noreply.github.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 9b98d9166ee2c130ba38a09e8c0aa12e29676b76) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 October 2018, 20:01:10 UTC
a20660b [SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide ## What changes were proposed in this pull request? Spark datasource for image/libsvm user guide ## How was this patch tested? Scala: <img width="1022" alt="1" src="https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png"> Java: <img width="1019" alt="2" src="https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png"> Python: <img width="1022" alt="3" src="https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png"> R: <img width="1024" alt="4" src="https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png"> Closes #22675 from WeichenXu123/add_image_source_doc. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6540c2f8f31bbde4df57e48698f46bb1815740ff) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 October 2018, 15:04:06 UTC
d5e6948 [SPARK-25805][SQL][TEST] Fix test for SPARK-25159 The original test would sometimes fail if the listener bus did not keep up, so just wait till the listener bus is empty. Tested by adding a sleep in the listener, which made the test consistently fail without the fix, but pass consistently after the fix. Closes #22799 from squito/SPARK-25805. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 78c8bd2e68a77ee3c12c233289a8804e339bd71d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 October 2018, 06:20:55 UTC
4099565 [SPARK-24499][SQL][DOC][FOLLOW-UP] Fix spelling in doc ## What changes were proposed in this pull request? This PR replaces `turing` with `tuning` in files and a file name. Currently, in the left side menu, `Turing` is shown. [This page](https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-docs/_site/sql-performance-turing.html) is one of examples. ![image](https://user-images.githubusercontent.com/1315079/47332714-20a96180-d6bb-11e8-9a5a-0a8dad292626.png) ## How was this patch tested? `grep -rin turing docs` && `find docs -name "*turing*"` Closes #22800 from kiszk/SPARK-24499-follow. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c391dc65efb21357bdd80b28fba3851773759bc6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 October 2018, 04:20:00 UTC
b9b594a [SPARK-25795][R][EXAMPLE] Fix CSV SparkR SQL Example ## What changes were proposed in this pull request? This PR aims to fix the following SparkR example in Spark 2.3.0 ~ 2.4.0. ```r > df <- read.df("examples/src/main/resources/people.csv", "csv") > namesAndAges <- select(df, "name", "age") ... Caused by: org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_c0];; 'Project ['name, 'age] +- AnalysisBarrier +- Relation[_c0#97] csv ``` - https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/_site/sql-programming-guide.html#manually-specifying-options - http://spark.apache.org/docs/2.3.2/sql-programming-guide.html#manually-specifying-options - http://spark.apache.org/docs/2.3.1/sql-programming-guide.html#manually-specifying-options - http://spark.apache.org/docs/2.3.0/sql-programming-guide.html#manually-specifying-options ## How was this patch tested? Manual test in SparkR. (Please note that `RSparkSQLExample.R` fails at the last JDBC example) ```r > df <- read.df("examples/src/main/resources/people.csv", "csv", sep=";", inferSchema=T, header=T) > namesAndAges <- select(df, "name", "age") ``` Closes #22791 from dongjoon-hyun/SPARK-25795. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3b4556745e90a13f4ae7ebae4ab682617de25c38) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 October 2018, 23:34:48 UTC
f33d888 Preparing development version 2.4.1-SNAPSHOT 22 October 2018, 14:50:55 UTC
e69e2bf Preparing Spark release v2.4.0-rc4 22 October 2018, 14:50:51 UTC
c21d7e1 fix security issue of zinc (simpler version) 22 October 2018, 04:19:24 UTC
0239277 [DOC][MINOR] Fix minor error in the code of graphx guide ## What changes were proposed in this pull request? Fix minor error in the code "sketch of pregel implementation" of GraphX guide. This fixed error relates to `[SPARK-12995][GraphX] Remove deprecate APIs from Pregel` ## How was this patch tested? N/A Closes #22780 from WeichenXu123/minor_doc_update1. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3b4f35f568eb3844d2a789c8a409bc705477df6b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 October 2018, 17:32:23 UTC
869242c [MINOR][DOC] Update the building doc to use Maven 3.5.4 and Java 8 only ## What changes were proposed in this pull request? Since we didn't test Java 9 ~ 11 up to now in the community, fix the document to describe Java 8 only. ## How was this patch tested? N/A (This is a document only change.) Closes #22781 from dongjoon-hyun/SPARK-JDK-DOC. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit fc9ba9dcc6ad47fbd05f093b94e7e13580000d5f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 October 2018, 06:56:53 UTC
d6a02c5 [SPARK-24499][SQL][DOC][FOLLOWUP] Fix some broken links ## What changes were proposed in this pull request? Fix some broken links in the new document. I have clicked through all the links. Hopefully i haven't missed any :-) ## How was this patch tested? Built using jekyll and verified the links. Closes #22772 from dilipbiswal/doc_check. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit ed9d0aac905136375444c1e00a2a9a0822b264aa) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 20 October 2018, 06:55:34 UTC
e3a60b0 Revert "[SPARK-25764][ML][EXAMPLES] Update BisectingKMeans example to use ClusteringEvaluator" This reverts commit 36307b1e4b42ce22b07e7a3fc2679c4b5e7c34c8. 20 October 2018, 01:30:12 UTC
432697c Revert "[SPARK-25758][ML] Deprecate computeCost on BisectingKMeans" This reverts commit c2962546d9a5900a5628a31b83d2c4b22c3a7936. 19 October 2018, 21:57:52 UTC
1001d23 [SPARK-25704][CORE] Allocate a bit less than Int.MaxValue JVMs don't let you allocate arrays of length exactly Int.MaxValue, so leave a little extra room. This is necessary when reading blocks >2GB off the network (for remote reads or for cache replication). Unit tests via Jenkins; ran a test with blocks over 2GB on a cluster Closes #22705 from squito/SPARK-25704. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> 19 October 2018, 17:54:08 UTC
9c0c6d4 Preparing development version 2.4.1-SNAPSHOT 19 October 2018, 14:22:04 UTC
1ff8dd4 Preparing Spark release v2.4.0-rc4 19 October 2018, 14:22:00 UTC
8926c4a fix security issue of zinc 19 October 2018, 13:34:35 UTC
6a06b8c [SPARK-25768][SQL] fix constant argument expecting UDAFs ## What changes were proposed in this pull request? Without this PR some UDAFs like `GenericUDAFPercentileApprox` can throw an exception because they expect a constant parameter (object inspector) as a particular argument. The exception is thrown because the `toPrettySQL` call in the `ResolveAliases` analyzer rule transforms a `Literal` parameter to a `PrettyAttribute` which is then transformed to an `ObjectInspector` instead of a `ConstantObjectInspector`. The exception comes from the `getEvaluator` method of `GenericUDAFPercentileApprox`, which actually shouldn't be called during the `toPrettySQL` transformation. The reason it is called is the non-lazy fields in `HiveUDAFFunction`. This PR makes all fields of `HiveUDAFFunction` lazy. ## How was this patch tested? Added a new UT. Closes #22766 from peter-toth/SPARK-25768. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f38594fc561208e17af80d17acf8da362b91fca4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 October 2018, 13:17:49 UTC
df60d9f [SPARK-25044][FOLLOW-UP] Change ScalaUDF constructor signature ## What changes were proposed in this pull request? This is a follow-up PR for #22259. The extra field added in `ScalaUDF` with the original PR was declared optional, but should be indeed required, otherwise callers of `ScalaUDF`'s constructor could ignore this new field and cause the result to be incorrect. This PR makes the new field required and changes its name to `handleNullForInputs`. #22259 breaks the previous behavior for null-handling of primitive-type input parameters. For example, for `val f = udf({(x: Int, y: Any) => x})`, `f(null, "str")` should return `null` but would return `0` after #22259. In this PR, all UDF methods except `def udf(f: AnyRef, dataType: DataType): UserDefinedFunction` have been restored with the original behavior. The only exception is documented in the Spark SQL migration guide. In addition, now that we have this extra field indicating if a null-test should be applied on the corresponding input value, we can also make use of this flag to avoid the rule `HandleNullInputsForUDF` being applied infinitely. ## How was this patch tested? Added UT in UDFSuite Passed affected existing UTs: AnalysisSuite UDFSuite Closes #22732 from maryannxue/spark-25044-followup. Lead-authored-by: maryannxue <maryannxue@apache.org> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e8167768cfebfdb11acd8e0a06fe34ca43c14648) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 October 2018, 13:04:33 UTC
9ed2e42 [MINOR][DOC] Spacing items in migration guide for readability and consistency ## What changes were proposed in this pull request? Currently, migration guide has no space between each item which looks too compact and hard to read. Some of items already had some spaces between them in the migration guide. This PR suggest to format them consistently for readability. Before: ![screen shot 2018-10-18 at 10 00 04 am](https://user-images.githubusercontent.com/6477701/47126768-9e84fb80-d2bc-11e8-9211-84703486c553.png) After: ![screen shot 2018-10-18 at 9 53 55 am](https://user-images.githubusercontent.com/6477701/47126708-4fd76180-d2bc-11e8-9aa5-546f0622ca20.png) ## How was this patch tested? Manually tested: Closes #22761 from HyukjinKwon/minor-migration-doc. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit c8f7691c64a28174a54e8faa159b50a3836a7225) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 19 October 2018, 05:55:43 UTC
36307b1 [SPARK-25764][ML][EXAMPLES] Update BisectingKMeans example to use ClusteringEvaluator ## What changes were proposed in this pull request? The PR updates the examples for `BisectingKMeans` so that they don't use the deprecated method `computeCost` (see SPARK-25758). ## How was this patch tested? running examples Closes #22763 from mgaido91/SPARK-25764. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d0ecff28545ac81f5ba7ac06957ced65b6e3ebcd) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 October 2018, 01:34:25 UTC
fd5b247 [SPARK-24499][DOC][FOLLOW-UP] Split the page of sql-programming-guide.html to multiple separate pages ## What changes were proposed in this pull request? Forgot to remove the link for `Upgrading From Spark SQL 2.4 to 3.0` when merging to 2.4 ## How was this patch tested? N/A Closes #22769 from gatorsmile/test2.4. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 October 2018, 20:51:13 UTC
7153551 [SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages 1. Split the main page of sql-programming-guide into 7 parts: - Getting Started - Data Sources - Performance Turing - Distributed SQL Engine - PySpark Usage Guide for Pandas with Apache Arrow - Migration Guide - Reference 2. Add left menu for sql-programming-guide, keep first level index for each part in the menu. ![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png) Local test with jekyll build/serve. Closes #22746 from xuanyuanking/SPARK-24499. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 987f386588de7311b066cf0f62f0eed64d4aa7d7) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 18 October 2018, 19:12:05 UTC
71a6a9c [SPARK-25758][ML] Deprecate computeCost on BisectingKMeans ## What changes were proposed in this pull request? The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in favor of the adoption of `ClusteringEvaluator` in order to evaluate the clustering. ## How was this patch tested? NA Closes #22756 from mgaido91/SPARK-25758. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c2962546d9a5900a5628a31b83d2c4b22c3a7936) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 October 2018, 17:32:37 UTC
ac9a6f0 [SPARK-25741][WEBUI] Long URLs are not rendered properly in web UI ## What changes were proposed in this pull request? When the URL for description column in the table of job/stage page is long, WebUI doesn't render it properly. ![beforefix](https://user-images.githubusercontent.com/1097932/47009242-9323ba00-d16e-11e8-8262-0848d814442a.jpeg) Both job and stage page are using the class `name-link` for the description URL, so change the style of `a.name-link` to fix it. ## How was this patch tested? Manual test on my local: ![afterfix](https://user-images.githubusercontent.com/1097932/47009269-a46cc680-d16e-11e8-9ff5-0318a20db634.jpeg) Closes #22744 from gengliangwang/fixUILink. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 1901f06211661c19d70f231db235cca3cdb1f2dd) Signed-off-by: Sean Owen <sean.owen@databricks.com> 17 October 2018, 14:53:07 UTC
b698bd4 [SPARK-21402][SQL] Fix java array of structs deserialization When deserializing values of ArrayType with struct elements in java beans, fields of structs get mixed up. I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans. ## What changes were proposed in this pull request? MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order. I used UnresolvedMapObjects instead of MapObjects, which allows to provide element type for MapObjects during analysis based on the resolved input data, not on the java bean. ## How was this patch tested? Added a test case. Built complete project on travis. michalsenkyr cloud-fan marmbrus liancheng Closes #22708 from vofque/SPARK-21402. Lead-authored-by: Vladimir Kuriatkov <vofque@gmail.com> Co-authored-by: Vladimir Kuriatkov <Vladimir_Kuriatkov@epam.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e5b8136f47a947356e74c8d4bf9d03139f455a2f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 October 2018, 14:14:01 UTC
362103b [SPARK-25754][DOC] Change CDN for MathJax ## What changes were proposed in this pull request? Currently when we open our doc site: https://spark.apache.org/docs/latest/index.html , there is one warning ![image](https://user-images.githubusercontent.com/1097932/47065926-2b757980-d217-11e8-868f-02ce73f513ae.png) This PR is to change the CDN as per the migration tips: https://www.mathjax.org/cdn-shutting-down/ This is very very trivial. But it would be good to follow the suggestion from MathJax team and remove the warning, in case one day the original CDN is no longer available. ## How was this patch tested? Manual check. Closes #22753 from gengliangwang/migrateMathJax. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 2ab4473bed44828cee5a47759b5c36fc81dd5d71) Signed-off-by: Sean Owen <sean.owen@databricks.com> 17 October 2018, 11:52:17 UTC
3591bd2 [SQL][CATALYST][MINOR] update some error comments ## What changes were proposed in this pull request? This PR corrects some comment errors: 1. changes "as low a possible" to "as low as possible" in RewriteDistinctAggregates.scala 2. deletes the redundant word “with” in HiveTableScanExec’s doExecute() method ## How was this patch tested? Existing unit tests. Closes #22694 from CarolinePeng/update_comment. Authored-by: 彭灿00244106 <00244106@zte.intra> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit e9332f600eb4f275b3bff368863a68c2a4349182) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 17 October 2018, 04:45:30 UTC
144cb94 [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates ## What changes were proposed in this pull request? This PR aims to fix an ORC performance regression at Spark 2.4.0 RCs from Spark 2.3.2. Currently, for column names with `.`, the pushed predicates are ignored. **Test Data** ```scala scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot") scala> df.write.mode("overwrite").orc("/tmp/orc") ``` **Spark 2.3.2** ```scala scala> spark.sql("set spark.sql.orc.impl=native") scala> spark.sql("set spark.sql.orc.filterPushdown=true") scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ |col.with.dot| +------------+ | 5| | 7| | 8| +------------+ Time taken: 1542 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ |col.with.dot| +------------+ | 5| | 7| | 8| +------------+ Time taken: 152 ms ``` **Spark 2.4.0 RC3** ```scala scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ |col.with.dot| +------------+ | 5| | 7| | 8| +------------+ Time taken: 4074 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ |col.with.dot| +------------+ | 5| | 7| | 8| +------------+ Time taken: 1771 ms ``` ## How was this patch tested? Pass the Jenkins with a newly added test case. Closes #22597 from dongjoon-hyun/SPARK-25579. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 2c664edc060a41340eb374fd44b5d32c3c06a15c) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 16 October 2018, 12:30:40 UTC
77156f8 [SPARK-25736][SQL][TEST] add tests to verify the behavior of multi-column count ## What changes were proposed in this pull request? AFAIK multi-column count is not widely supported by the mainstream databases(postgres doesn't support), and the SQL standard doesn't define it clearly, as near as I can tell. Since Spark supports it, we should clearly document the current behavior and add tests to verify it. ## How was this patch tested? N/A Closes #22728 from cloud-fan/doc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit e028fd3aed9e5e4c478f307f0a467b54b73ff0d5) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 16 October 2018, 07:13:19 UTC
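A sketch of the behavior the new tests pin down, assuming a `SparkSession` named `spark` with implicits imported; as I understand it, a multi-column `count` only counts rows where every listed column is non-null.

```scala
import spark.implicits._

Seq((Some(1), Some("x")), (Some(2), None), (None, Some("y")))
  .toDF("a", "b")
  .createOrReplaceTempView("t")

// count(a, b) counts only the rows where both a and b are non-null -> 1
spark.sql("SELECT count(a, b) FROM t").show()
```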
8bc7ab0 [SPARK-25674][FOLLOW-UP] Update the stats for each ColumnarBatch ## What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/22594 . This alternative can avoid the unneeded computation in the hot code path. - For row-based scan, we keep the original way. - For the columnar scan, we just need to update the stats after each batch. ## How was this patch tested? N/A Closes #22731 from gatorsmile/udpateStatsFileScanRDD. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4cee191c04f14d7272347e4b29201763c6cfb6bf) Signed-off-by: Sean Owen <sean.owen@databricks.com> 16 October 2018, 02:20:49 UTC
d64b355 [SPARK-25738][SQL] Fix LOAD DATA INPATH for hdfs port ## What changes were proposed in this pull request? LOAD DATA INPATH didn't work if the defaultFS included a port for hdfs. Handling this just requires a small change to use the correct URI constructor. ## How was this patch tested? Added a unit test, ran all tests via jenkins Closes #22733 from squito/SPARK-25738. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit fdaa99897ac8755938d031896ae0eefb46ce7107) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 16 October 2018, 01:34:47 UTC
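A sketch of the statement shape that the fix makes work, assuming a Hive-enabled session and placeholder host, port, path, and table names:

```scala
// Before the fix, a defaultFS with an explicit port (e.g. hdfs://namenode:8020)
// broke LOAD DATA INPATH; the URI is now built with the correct constructor.
spark.sql("CREATE TABLE IF NOT EXISTS events (id INT, payload STRING)")
spark.sql("LOAD DATA INPATH 'hdfs://namenode:8020/tmp/events.txt' INTO TABLE events")
```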
b6e4aca [SPARK-25700][SQL][BRANCH-2.4] Partially revert append mode support in Data Source V2 ## What changes were proposed in this pull request? This PR proposes to partially revert https://github.com/apache/spark/commit/5fef6e3513d6023a837c427d183006d153c7102b so that it does not create a read support and read the schema when it writes in branch-2.4, since that is too breaking a change. https://github.com/apache/spark/commit/5fef6e3513d6023a837c427d183006d153c7102b happened to create a read support in the write path, which ended up reading the schema from the read support at write time. For instance, this breaks the `spark.range(1).format("source").write.save("non-existent-path")` case since there's no way to read the schema from "non-existent-path". See also https://github.com/apache/spark/pull/22009#discussion_r223982672 See also https://github.com/apache/spark/pull/22688 See also http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html ## How was this patch tested? Unit test and manual tests. Closes #22697 from HyukjinKwon/append-revert-2.4. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 October 2018, 02:46:10 UTC
3e776d7 [SPARK-25727][SQL] Add outputOrdering to otherCopyArgs in InMemoryRelation ## What changes were proposed in this pull request? Add `outputOrdering ` to `otherCopyArgs` in InMemoryRelation so that this field will be copied when we doing the tree transformation. ``` val data = Seq(100).toDF("count").cache() data.queryExecution.optimizedPlan.toJSON ``` The above code can generate the following error: ``` assertion failed: InMemoryRelation fields: output, cacheBuilder, statsOfPlanToCache, outputOrdering, values: List(count#178), CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) Project [value#176 AS count#178] +- LocalTableScan [value#176] ,None), Statistics(sizeInBytes=12.0 B, hints=none) java.lang.AssertionError: assertion failed: InMemoryRelation fields: output, cacheBuilder, statsOfPlanToCache, outputOrdering, values: List(count#178), CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),*(1) Project [value#176 AS count#178] +- LocalTableScan [value#176] ,None), Statistics(sizeInBytes=12.0 B, hints=none) at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.catalyst.trees.TreeNode.jsonFields(TreeNode.scala:611) at org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$collectJsonValue$1(TreeNode.scala:599) at org.apache.spark.sql.catalyst.trees.TreeNode.jsonValue(TreeNode.scala:604) at org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:590) ``` ## How was this patch tested? Added a test Closes #22715 from gatorsmile/copyArgs1. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6c3f2c6a6aa69f80de5504961cfd61b9a61ea7ce) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 October 2018, 05:10:57 UTC
883ca3f [SPARK-25726][SQL][TEST] Fix flaky test in SaveIntoDataSourceCommandSuite ## What changes were proposed in this pull request? [SPARK-22479](https://github.com/apache/spark/pull/19708/files#diff-5c22ac5160d3c9d81225c5dd86265d27R31) adds a test case which sometimes fails because the used password string `123` matches `41230802`. This PR aims to fix the flakiness. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97343/consoleFull ```scala SaveIntoDataSourceCommandSuite: - simpleString is redacted *** FAILED *** "SaveIntoDataSourceCommand .org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider41230802, Map(password -> *********(redacted), url -> *********(redacted), driver -> mydriver), ErrorIfExists +- Range (0, 1, step=1, splits=Some(2)) " contained "123" (SaveIntoDataSourceCommandSuite.scala:42) ``` ## How was this patch tested? Pass the Jenkins with the updated test case Closes #22716 from dongjoon-hyun/SPARK-25726. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6bbceb9fefe815d18001c6dd84f9ea2883d17a88) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 October 2018, 01:01:42 UTC
c4efcf1 [SPARK-25714][SQL][FOLLOWUP] improve the comment inside BooleanSimplification rule ## What changes were proposed in this pull request? improve the code comment added in https://github.com/apache/spark/pull/22702/files ## How was this patch tested? N/A Closes #22711 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit b73f76beb3c33feef0cb451726da50740ffed689) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 13 October 2018, 23:43:29 UTC
6634819 [SPARK-25718][SQL] Detect recursive reference in Avro schema and throw exception ## What changes were proposed in this pull request? Avro schema allows recursive reference, e.g. the schema for linked-list in https://avro.apache.org/docs/1.8.2/spec.html#schema_record ``` { "type": "record", "name": "LongList", "aliases": ["LinkedLongs"], // old name for this "fields" : [ {"name": "value", "type": "long"}, // each element has a long {"name": "next", "type": ["null", "LongList"]} // optional next element ] } ``` In current Spark SQL, it is impossible to convert the schema as `StructType` . Run `SchemaConverters.toSqlType(avroSchema)` and we will get stack overflow exception. We should detect the recursive reference and throw exception for it. ## How was this patch tested? New unit test case. Closes #22709 from gengliangwang/avroRecursiveRef. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 2eaf0587883ac3c65e77d01ffbb39f64c6152f87) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 October 2018, 06:50:33 UTC
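A sketch of the failing conversion described above, with the schema trimmed from the Avro spec example; it assumes the `SchemaConverters` helper in the external Avro module is reachable from your code (inside Spark's own tests it is).

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters

val avroSchema = new Schema.Parser().parse(
  """{
    |  "type": "record",
    |  "name": "LongList",
    |  "fields": [
    |    {"name": "value", "type": "long"},
    |    {"name": "next", "type": ["null", "LongList"]}
    |  ]
    |}""".stripMargin)

// Previously this recursed until the stack overflowed; with the fix it throws
// an exception reporting the unsupported recursive reference.
SchemaConverters.toSqlType(avroSchema)
```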
765cbca [MINOR] Fix code comment in BooleanSimplification. 13 October 2018, 06:03:06 UTC
5554a33 [SPARK-25714] Fix Null Handling in the Optimizer rule BooleanSimplification ## What changes were proposed in this pull request? ```Scala val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2") df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1") val df2 = spark.read.parquet("/tmp/test1") df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show() ``` Before the PR, it returns both rows. After the fix, it returns `Row ("abc", 1))`. This is to fix the bug in NULL handling in BooleanSimplification. This is a bug introduced in Spark 1.6 release. ## How was this patch tested? Added test cases Closes #22702 from gatorsmile/fixBooleanSimplify2. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit c9ba59d38e2be17b802156b49d374a726e66c6b9) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 13 October 2018, 04:02:53 UTC
0f58b98 [STREAMING][DOC] Fix typo & formatting for JavaDoc ## What changes were proposed in this pull request? - Fixed typo for function outputMode - OutputMode.Complete(), changed `these is some updates` to `there are some updates` - Replaced hyphenized list by HTML unordered list tags in comments to fix the Javadoc documentation. Current render from most recent [Spark API Docs](https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/streaming/DataStreamWriter.html): #### outputMode(OutputMode) - List formatted as a prose. ![image](https://user-images.githubusercontent.com/2295469/46250648-11086700-c3f4-11e8-8a5a-d88b079c165d.png) #### outputMode(String) - List formatted as a prose. ![image](https://user-images.githubusercontent.com/2295469/46250651-24b3cd80-c3f4-11e8-9dac-ae37599afbce.png) #### partitionBy(String*) - List formatted as a prose. ![image](https://user-images.githubusercontent.com/2295469/46250655-36957080-c3f4-11e8-990b-47bd612d3c51.png) ## How was this patch tested? This PR contains a document patch ergo no functional testing is required. Closes #22593 from niofire/fix-typo-datastreamwriter. Authored-by: Mathieu St-Louis <mastloui@microsoft.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 4e141a416082cb978396ffbd6bf529b168652b9d) Signed-off-by: Sean Owen <sean.owen@databricks.com> 12 October 2018, 19:09:24 UTC
1a33544 [SPARK-25660][SQL] Fix for the backward slash as CSV fields delimiter ## What changes were proposed in this pull request? The PR addresses the exception raised on accessing chars out of delimiter string. In particular, the backward slash `\` as the CSV fields delimiter causes the following exception on reading `abc\1`: ```Scala String index out of range: 1 java.lang.StringIndexOutOfBoundsException: String index out of range: 1 at java.lang.String.charAt(String.java:658) ``` because `str.charAt(1)` tries to access a char out of `str` in `CSVUtils.toChar` ## How was this patch tested? Added tests for empty string and string containing the backward slash to `CSVUtilsSuite`. Besides of that I added an end-to-end test to check how the backward slash is handled in reading CSV string with it. Closes #22654 from MaxGekk/csv-slash-delim. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit c7eadb5e6652468f9d5cd714c112ba1de187eea8) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 12 October 2018, 19:04:16 UTC
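A sketch of the case from the description, assuming a `SparkSession` named `spark` with implicits imported; the single backslash in the Scala literals below is written as `\\`.

```scala
import spark.implicits._

// A lone backslash as the field delimiter used to hit
// StringIndexOutOfBoundsException in CSVUtils.toChar; it now parses.
val ds = Seq("abc\\1", "def\\2").toDS()
spark.read
  .option("sep", "\\")
  .schema("word STRING, num INT")
  .csv(ds)
  .show()
```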
bb211cf [SPARK-25697][CORE] When zstd compression enabled, InProgress application is throwing Error in the history webui ## What changes were proposed in this pull request? When we enable event log compression with the compression codec 'zstd', we are unable to open the web UI of a running application from the history server page. The reason is that the replay listener was unable to read from the zstd-compressed event log because the zstd frame was not finished yet. This causes a truncated error while reading the event log, so when we try to open the web UI from the history server page, it throws a "truncated error", and we are never able to open a running application in the web UI when zstd compression is enabled. In this PR, when the IO exception happens and it is a running application, we log the error "Failed to read Spark event log: evetLogDirAppName.inprogress" instead of throwing the exception. ## How was this patch tested? Test steps: 1) spark.eventLog.compress = true 2) spark.io.compression.codec = zstd 3) restart history server 4) launch bin/spark-shell 5) run some queries 6) Open history server page 7) click on the application **Before fix:** ![screenshot from 2018-10-10 23-52-12](https://user-images.githubusercontent.com/23054875/46757387-9b4fa580-cce7-11e8-96ad-8938400483ed.png) ![screenshot from 2018-10-10 23-52-28](https://user-images.githubusercontent.com/23054875/46757393-a0145980-cce7-11e8-8cb0-44b583dde648.png) **After fix:** ![screenshot from 2018-10-10 23-43-49](https://user-images.githubusercontent.com/23054875/46756971-6858e200-cce6-11e8-946c-0bffebb2cfba.png) ![screenshot from 2018-10-10 23-44-05](https://user-images.githubusercontent.com/23054875/46756981-6d1d9600-cce6-11e8-95ea-ff8339a2fdfd.png) Closes #22689 from shahidki31/SPARK-25697. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 8e039a75548e91b0a8799d9d72c6797b066ddd62) Signed-off-by: Sean Owen <sean.owen@databricks.com> 12 October 2018, 17:57:25 UTC
3dba5d4 [SPARK-25708][SQL] HAVING without GROUP BY means global aggregate According to the SQL standard, when a query contains `HAVING`, it indicates an aggregate operator. For more details please refer to https://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/ However, in the Spark SQL parser, we treat HAVING as a normal filter when there is no GROUP BY, which breaks SQL semantics and leads to wrong results. This PR fixes the parser. New test. Closes #22696 from cloud-fan/having. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 78e133141ce8131c60181f947346802864b0951a) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 12 October 2018, 07:25:28 UTC
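A sketch of the semantic difference, with placeholder table and column names and assuming a `SparkSession` named `spark`:

```scala
import spark.implicits._

Seq(1, 2, 3).toDF("v").createOrReplaceTempView("t")

// With the fix, HAVING without GROUP BY is a global aggregate: the whole table is
// aggregated first and the predicate is applied to that single aggregated row.
spark.sql("SELECT sum(v) FROM t HAVING sum(v) > 0").show()  // one row: 6
// Previously the HAVING clause was parsed as an ordinary filter, giving a wrong result.
```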
1961f8e [SPARK-25690][SQL] Analyzer rule HandleNullInputsForUDF does not stabilize and can be applied infinitely ## What changes were proposed in this pull request? The HandleNullInputsForUDF rule can generate new If node infinitely, thus causing problems like match of SQL cache missed. This was fixed in SPARK-24891 and was then broken by SPARK-25044. The unit test in `AnalysisSuite` added in SPARK-24891 should have failed but didn't because it wasn't properly updated after the `ScalaUDF` constructor signature change. So this PR also updates the test accordingly based on the new `ScalaUDF` constructor. ## How was this patch tested? Updated the original UT. This should be justified as the original UT became invalid after SPARK-25044. Closes #22701 from maryannxue/spark-25690. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 368513048198efcee8c9a35678b608be0cb9ad48) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 12 October 2018, 03:45:24 UTC
e80ab13 [SPARK-25674][SQL] If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated ## What changes were proposed in this pull request? If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated, because it might skip over the count that is an exact multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS. This PR just checks whether the increment causes the value to exceed a higher multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS. ## How was this patch tested? Existing unit tests. Closes #22594 from 10110346/inputMetrics. Authored-by: liuxian <liu.xian3@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 69f5e9cce14632a1f912c3632243a4e20b275365) Signed-off-by: Sean Owen <sean.owen@databricks.com> 11 October 2018, 21:24:31 UTC
cd40655 [SPARK-25636][CORE] spark-submit cuts off the failure reason when there is an error connecting to master ## What changes were proposed in this pull request? Cause of the error is wrapped with SparkException, now finding the cause from the wrapped exception and throwing the cause instead of the wrapped exception. ## How was this patch tested? Verified it manually by checking the cause of the error, it gives the error as shown below. ### Without the PR change ``` [apache-spark]$ ./bin/spark-submit --verbose --master spark://****** .... Error: Exception thrown in awaitResult: Run with --help for usage help or --verbose for debug output ``` ### With the PR change ``` [apache-spark]$ ./bin/spark-submit --verbose --master spark://****** .... Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) .... at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.IOException: Failed to connect to devaraj-pc1/10.3.66.65:7077 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245) .... at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: devaraj-pc1/10.3.66.65:7077 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) .... at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) ... 1 more Caused by: java.net.ConnectException: Connection refused ... 11 more ``` Closes #22623 from devaraj-kavali/SPARK-25636. Authored-by: Devaraj K <devaraj@apache.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 8a7872dc254710f9b29fdfdb2915a949ef606871) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 10 October 2018, 16:24:50 UTC
71b8739 Preparing development version 2.4.1-SNAPSHOT 10 October 2018, 13:26:16 UTC
8e4a99b Preparing Spark release v2.4.0-rc3 10 October 2018, 13:26:12 UTC
404c840 [SPARK-25669][SQL] Check CSV header only when it exists ## What changes were proposed in this pull request? Currently the first row of a dataset of CSV strings is compared to the field names of the user-specified or inferred schema regardless of whether a CSV header is present. It causes false-positive error messages. For example, parsing `"1,2"` outputs the error: ```java java.lang.IllegalArgumentException: CSV header does not conform to the schema. Header: 1, 2 Schema: _c0, _c1 Expected: _c0 but found: 1 ``` In the PR, I propose: - Checking the CSV header only when it exists - Filtering the header from the input dataset only if it exists ## How was this patch tested? Added a test to `CSVSuite` which reproduces the issue. Closes #22656 from MaxGekk/inferred-header-check. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 46fe40838aa682a7073dd6f1373518b0c8498a94) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 09 October 2018, 06:36:33 UTC
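A sketch of the false-positive case, assuming a `SparkSession` named `spark` with implicits imported: headerless data read with an explicit schema should not have its first row validated as a header.

```scala
import spark.implicits._

// No header option is set, so "1,2" is data; before the fix the first row was
// still compared against the schema's field names and rejected.
val csvData = Seq("1,2", "3,4").toDS()
spark.read.schema("_c0 INT, _c1 INT").csv(csvData).show()
```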
4baa4d4 [SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks ## What changes were proposed in this pull request? Added - Python foreach - Scala, Java and Python foreachBatch - Multiple watermark policy - The semantics of what changes are allowed to the streaming between restarts. ## How was this patch tested? No tests Closes #22627 from tdas/SPARK-25639. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> (cherry picked from commit f9935a3f85f46deef2cb7b213c1c02c8ff627a8c) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 08 October 2018, 21:32:18 UTC
193ce77 [SPARK-25677][DOC] spark.io.compression.codec = org.apache.spark.io.ZstdCompressionCodec throwing IllegalArgumentException ## What changes were proposed in this pull request? The documentation is updated with the correct class name, org.apache.spark.io.ZStdCompressionCodec. ## How was this patch tested? We set spark.io.compression.codec = org.apache.spark.io.ZStdCompressionCodec and verified the logs. Closes #22669 from shivusondur/CompressionIssue. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 1a6815cd9f421a106f8d96a36a53042a00f02386) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 08 October 2018, 07:43:35 UTC
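For reference, a small sketch of setting the corrected class name when configuring the codec (the short alias in the comment is the one Spark's configuration docs list for zstd):

```scala
import org.apache.spark.SparkConf

object ZstdCodecConfSketch {
  // Note the capital "S": ZStdCompressionCodec, not ZstdCompressionCodec.
  val conf = new SparkConf()
    .set("spark.io.compression.codec", "org.apache.spark.io.ZStdCompressionCodec")
  // Equivalent short form: .set("spark.io.compression.codec", "zstd")
}
```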
692ddb3 [SPARK-25591][PYSPARK][SQL] Avoid overwriting deserialized accumulator ## What changes were proposed in this pull request? If we use accumulators in more than one UDF, it is possible to overwrite deserialized accumulators and their values. We should check whether an accumulator was deserialized before overwriting it in the accumulator registry. ## How was this patch tested? Added test. Closes #22635 from viirya/SPARK-25591. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit cb90617f894fd51a092710271823ec7d1cd3a668) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 08 October 2018, 07:18:27 UTC
4214ddd [SPARK-25673][BUILD] Remove Travis CI which enables Java lint check ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/12980 added the Travis CI file mainly for the linter, because we had disabled the Java lint check in Jenkins. It's enabled as of https://github.com/apache/spark/pull/21399 and SBT now runs it, so it looks like we can remove the file added before. ## How was this patch tested? N/A Closes #22665 Closes #22667 from HyukjinKwon/SPARK-25673. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 219922422003e59cc8b3bece60778536759fa669) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 08 October 2018, 07:07:35 UTC
c8b9409 [SPARK-25671] Build external/spark-ganglia-lgpl in Jenkins Test ## What changes were proposed in this pull request? Currently, we do not build external/spark-ganglia-lgpl in Jenkins tests when the code is changed. ## How was this patch tested? N/A Closes #22658 from gatorsmile/buildGanglia. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 8bb242902760535d12c6c40c5d8481a98fdc11e0) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 06 October 2018, 22:49:55 UTC
48e2e6f [SPARK-25644][SS][FOLLOWUP][BUILD] Fix Scala 2.12 build error due to foreachBatch ## What changes were proposed in this pull request? This PR fixes the Scala-2.12 build error due to ambiguity in `foreachBatch` test cases. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/428/console ```scala [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/ForeachBatchSinkSuite.scala:102: ambiguous reference to overloaded definition, [error] both method foreachBatch in class DataStreamWriter of type (function: org.apache.spark.api.java.function.VoidFunction2[org.apache.spark.sql.Dataset[Int],Long])org.apache.spark.sql.streaming.DataStreamWriter[Int] [error] and method foreachBatch in class DataStreamWriter of type (function: (org.apache.spark.sql.Dataset[Int], Long) => Unit)org.apache.spark.sql.streaming.DataStreamWriter[Int] [error] match argument types ((org.apache.spark.sql.Dataset[Int], Any) => Unit) [error] ds.writeStream.foreachBatch((_, _) => {}).trigger(Trigger.Continuous("1 second")).start() [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/ForeachBatchSinkSuite.scala:106: ambiguous reference to overloaded definition, [error] both method foreachBatch in class DataStreamWriter of type (function: org.apache.spark.api.java.function.VoidFunction2[org.apache.spark.sql.Dataset[Int],Long])org.apache.spark.sql.streaming.DataStreamWriter[Int] [error] and method foreachBatch in class DataStreamWriter of type (function: (org.apache.spark.sql.Dataset[Int], Long) => Unit)org.apache.spark.sql.streaming.DataStreamWriter[Int] [error] match argument types ((org.apache.spark.sql.Dataset[Int], Any) => Unit) [error] ds.writeStream.foreachBatch((_, _) => {}).partitionBy("value").start() [error] ^ ``` ## How was this patch tested? Manual. Since this failure occurs in Scala-2.12 profile and test cases, Jenkins will not test this. We need to build with Scala-2.12 and run the tests. Closes #22649 from dongjoon-hyun/SPARK-SCALA212. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 9cbf105ab1256d65f027115ba5505842ce8fffe3) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 October 2018, 16:40:54 UTC
a2991d2 [SPARK-25646][K8S] Fix docker-image-tool.sh on dev build. The docker file was referencing a path that only existed in the distribution tarball; it needs to be parameterized so that the right path can be used in a dev build. Tested on local dev build. Closes #22634 from vanzin/SPARK-25646. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 58287a39864db463eeef17d1152d664be021d9ef) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 October 2018, 04:18:12 UTC
0a70afd [SPARK-25644][SS] Fix java foreachBatch in DataStreamWriter ## What changes were proposed in this pull request? The Java `foreachBatch` API in `DataStreamWriter` should accept `java.lang.Long` rather than `scala.Long`. ## How was this patch tested? New Java test. Closes #22633 from zsxwing/fix-java-foreachbatch. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 05 October 2018, 18:18:49 UTC
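A sketch of what a Java-friendly overload along these lines looks like; the trait and method names here are illustrative, not the actual DataStreamWriter source:

```scala
import org.apache.spark.api.java.function.VoidFunction2
import org.apache.spark.sql.Dataset

// Illustrative trait: the Java overload takes java.lang.Long for the batch id so that
// Java lambdas type-check, and simply delegates to the Scala overload.
trait ForeachBatchSketch[T] {
  def foreachBatchScala(function: (Dataset[T], Long) => Unit): this.type

  def foreachBatchJava(function: VoidFunction2[Dataset[T], java.lang.Long]): this.type =
    foreachBatchScala((batch, batchId) => function.call(batch, batchId))
}
```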
2c700ee [SPARK-25521][SQL] Job id showing null in the logs when an insert into command job is finished. ## What changes were proposed in this pull request? As part of the insert command in FileFormatWriter, a job context is created to handle the write operation. While initializing the job context with the setupJob() API in HadoopMapReduceCommitProtocol, we set the job id in the JobContext configuration. Since FileFormatWriter gets the job id directly from the MapReduce JobContext, the job id comes out as null when it is added to the log. As a solution, we should get the job id from the configuration of the MapReduce JobContext. ## How was this patch tested? Manually, verified the logs after the changes. ![spark-25521 1](https://user-images.githubusercontent.com/12999161/46164933-e95ab700-c2ac-11e8-88e9-49fa5100b872.PNG) Closes #22572 from sujith71955/master_log_issue. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 459700727fadf3f35a211eab2ffc8d68a4a1c39a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 October 2018, 08:51:59 UTC
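A hedged sketch of the idea: read the job id back out of the JobContext configuration rather than from getJobID. The Hadoop config key used here is an assumption for illustration, not confirmed by the commit message.

```scala
import org.apache.hadoop.mapreduce.JobContext

object JobIdLogSketch {
  // Read the job id that the commit protocol stored in the JobContext configuration
  // during setupJob(), rather than jobContext.getJobID, which can be null at this
  // point. The "mapreduce.job.id" key is an assumption in this sketch.
  def jobIdForLogging(jobContext: JobContext): String =
    Option(jobContext.getConfiguration.get("mapreduce.job.id")).getOrElse("unknown")
}
```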
c9bb83a [SPARK-25602][SQL] SparkPlan.getByteArrayRdd should not consume the input when not necessary ## What changes were proposed in this pull request? In `SparkPlan.getByteArrayRdd`, we should only call `iter.hasNext` when the limit is not hit, as `iter.hasNext` may produce one row, buffer it, and cause wrong metrics. ## How was this patch tested? New tests. Closes #22621 from cloud-fan/range. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 71c24aad36ae6b3f50447a019bf893490dcf1cf4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 October 2018, 12:16:32 UTC
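A plain-Scala analogue of the fix (not the SparkPlan code itself): check the limit before calling hasNext, so a lazily computed iterator never produces and buffers an extra row once the limit is reached.

```scala
import scala.collection.mutable.ArrayBuffer

object LimitedDrainSketch {
  // Drain at most `n` elements; a negative `n` means "no limit".
  // The limit check comes first so hasNext is never called after the limit is hit,
  // which is what would otherwise inflate metrics.
  def takeAtMost[A](iter: Iterator[A], n: Long): Seq[A] = {
    val buf = ArrayBuffer.empty[A]
    var count = 0L
    while ((n < 0 || count < n) && iter.hasNext) {
      buf += iter.next()
      count += 1
    }
    buf
  }
}
```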
0763b75 [SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement ## What changes were proposed in this pull request? This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance: ```python from pyspark.sql.functions import pandas_udf, PandasUDFType pandas_udf("integer", PandasUDFType.GROUPED_AGG) def sum_udf(v): return v.sum() spark.udf.register("sum_udf", sum_udf) q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2" spark.sql(q).show() ``` ``` +---+-----------+ | v2|sum_udf(v1)| +---+-----------+ | 1| 1| | 0| 5| +---+-----------+ ``` ## How was this patch tested? Manual test and unit test. Closes #22620 from HyukjinKwon/SPARK-25601. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org> 04 October 2018, 01:43:42 UTC
443d12d [SPARK-25538][SQL] Zero-out all bytes when writing decimal ## What changes were proposed in this pull request? In #20850 when writing non-null decimals, instead of zero-ing all the 16 allocated bytes, we zero-out only the padding bytes. Since we always allocate 16 bytes, if the number of bytes needed for a decimal is lower than 9, then this means that the bytes between 8 and 16 are not zero-ed. I see 2 solutions here: - we can zero-out all the bytes in advance as it was done before #20850 (safer solution IMHO); - we can allocate only the needed bytes (may be a bit more efficient in terms of memory used, but I have not investigated the feasibility of this option). Hence I propose here the first solution in order to fix the correctness issue. We can eventually switch to the second if we think is more efficient later. ## How was this patch tested? Running the test attached in the JIRA + added UT Closes #22602 from mgaido91/SPARK-25582. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit d7ae36a810bfcbedfe7360eb2cdbbc3ca970e4d0) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 03 October 2018, 14:28:48 UTC
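A simplified sketch of the first option chosen above, working on a plain byte array rather than UnsafeRowWriter's internal buffer: zero the whole 16-byte slot before copying the decimal's bytes in, so stale bytes from a previously written, longer value cannot leak into the row.

```scala
import java.util.Arrays

object DecimalWriteSketch {
  def writeDecimalBytes(slot: Array[Byte], decimalBytes: Array[Byte]): Unit = {
    require(slot.length == 16 && decimalBytes.length <= 16)
    Arrays.fill(slot, 0.toByte)                               // zero all 16 bytes up front
    System.arraycopy(decimalBytes, 0, slot, 0, decimalBytes.length)
  }
}
```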
ea4068a [SPARK-25583][DOC] Add history-server related configuration in the documentation. ## What changes were proposed in this pull request? Add history-server related configuration to the documentation. Some of the history server related configurations were missing from the documentation, such as 'spark.history.store.maxDiskUsage', 'spark.ui.liveUpdate.period', etc. ## How was this patch tested? ![screenshot from 2018-10-01 20-58-26](https://user-images.githubusercontent.com/23054875/46298568-04833a80-c5bd-11e8-95b8-54c9d6582fd2.png) ![screenshot from 2018-10-01 20-59-31](https://user-images.githubusercontent.com/23054875/46298591-11a02980-c5bd-11e8-93d0-892afdfd4f9a.png) ![screenshot from 2018-10-01 20-59-45](https://user-images.githubusercontent.com/23054875/46298601-1533b080-c5bd-11e8-9689-e9b39882a7b5.png) Closes #22601 from shahidki31/historyConf. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 71876633f3af706408355b5fb561b58dbc593360) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 02 October 2018, 15:06:32 UTC
ad7b3f6 [SPARK-25578][BUILD] Update to Scala 2.12.7 ## What changes were proposed in this pull request? Update to Scala 2.12.7. See https://issues.apache.org/jira/browse/SPARK-25578 for why. ## How was this patch tested? Existing tests. Closes #22600 from srowen/SPARK-25578. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 5114db5781967c1e8046296905d97560187479fb) Signed-off-by: Sean Owen <sean.owen@databricks.com> 02 October 2018, 02:35:26 UTC
426c2bd [SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests ## What changes were proposed in this pull request? Add more data types for Pandas UDF Tests for PySpark SQL ## How was this patch tested? manual tests Closes #22568 from AlexanderKoryagin/new_types_for_pandas_udf_tests. Lead-authored-by: Aleksandr Koriagin <aleksandr_koriagin@epam.com> Co-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: Alexander Koryagin <AlexanderKoryagin@users.noreply.github.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 30f5d0f2ddfe56266ea81e4255f9b4f373dab237) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 01 October 2018, 09:19:00 UTC
82990e5 [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] ## What changes were proposed in this pull request? This PR aims to fix the failed test of `OracleIntegrationSuite`. ## How was this patch tested? Existing integration tests. Closes #22461 from seancxmao/SPARK-25453. Authored-by: seancxmao <seancxmao@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 21f0b73dbcd94f9eea8cbc06a024b0e899edaf4c) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 01 October 2018, 05:49:27 UTC
7b1094b [SPARK-25505][SQL][FOLLOWUP] Fix for attributes cosmetically different in Pivot clause ## What changes were proposed in this pull request? #22519 introduced a bug when the attributes in the pivot clause are cosmetically different from the output ones (eg. different case). In particular, the problem is that the PR used a `Set[Attribute]` instead of an `AttributeSet`. ## How was this patch tested? added UT Closes #22582 from mgaido91/SPARK-25505_followup. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit fb8f4c05657595e089b6812d97dbfee246fce06f) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 01 October 2018, 05:08:19 UTC
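A plain-Scala analogue of the distinction described above (not catalyst code; the `Attr` case class is hypothetical): matching attributes by the whole object misses cosmetically different references, while matching by a stable id, which is essentially what `AttributeSet` does, still finds them.

```scala
object AttributeSetAnalogy {
  // An attribute identified by a stable id; its name may differ cosmetically
  // (e.g. in case) between references to the same attribute.
  final case class Attr(id: Long, name: String)

  def main(args: Array[String]): Unit = {
    val a      = Attr(1L, "a")
    val aUpper = Attr(1L, "A")        // same attribute, cosmetically different

    val byWholeObject = Set(a)        // ~ Set[Attribute]: compares the name too
    val byId          = Set(a.id)     // ~ AttributeSet: compares by expression id

    println(byWholeObject.contains(aUpper)) // false -> the reported bug
    println(byId.contains(aUpper.id))       // true  -> lookup still succeeds
  }
}
```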
c886f05 [SPARK-25543][K8S] Print debug message iff execIdsRemovedInThisRound is not empty. ## What changes were proposed in this pull request? Spurious debug logs like the following are printed about once per second: 2018-09-26 09:33:57 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors with ids from Spark that were either found to be deleted or non-existent in the cluster. 2018-09-26 09:33:58 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors with ids from Spark that were either found to be deleted or non-existent in the cluster. 2018-09-26 09:33:59 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors with ids from Spark that were either found to be deleted or non-existent in the cluster. 2018-09-26 09:34:00 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors with ids from Spark that were either found to be deleted or non-existent in the cluster. The fix is simple: first check whether any executors were actually removed before producing the log message. ## How was this patch tested? Tested by manually deploying to a minikube cluster. Closes #22565 from ScrapCodes/spark-25543/k8s/debug-log-spurious-warning. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 4da541a5d23b039eb549dd849cf121bdc8676e59) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 September 2018, 21:28:39 UTC
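A sketch of the guard, with the logger wiring simplified to a plain function parameter: the debug line is only emitted when something was actually removed in this round.

```scala
object LifecycleLogSketch {
  def logRemoved(execIdsRemovedInThisRound: Set[Long], logDebug: String => Unit): Unit = {
    // Skip the message entirely when the removal set is empty.
    if (execIdsRemovedInThisRound.nonEmpty) {
      logDebug(s"Removed executors with ids ${execIdsRemovedInThisRound.mkString(",")} " +
        "from Spark that were either found to be deleted or non-existent in the cluster.")
    }
  }
}
```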
8e6fb47 [CORE][MINOR] Fix an obvious error and compilation for Scala 2.12.7 ## What changes were proposed in this pull request? Fix an obvious error. ## How was this patch tested? Existing tests. Closes #22577 from sadhen/minor_fix. Authored-by: Darcy Shen <sadhen@zoho.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 40e6ed89405828ff312eca0abd43cfba4b9185b2) Signed-off-by: Sean Owen <sean.owen@databricks.com> 30 September 2018, 14:00:34 UTC
6f510c6 [SPARK-25568][CORE] Continue to update the remaining accumulators when failing to update one accumulator ## What changes were proposed in this pull request? Since we don't fail a job when `AccumulatorV2.merge` fails, we should try to update the remaining accumulators so that they can still report correct values. ## How was this patch tested? The new unit test. Closes #22586 from zsxwing/SPARK-25568. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit b6b8a6632e2b6e5482aaf4bfa093700752a9df80) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 30 September 2018, 01:10:22 UTC
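A simplified sketch of the behaviour (generic types and a plain logging function in place of Spark's internals): attempt to merge every accumulator and log failures individually, so one bad merge does not stop the remaining updates.

```scala
import scala.util.control.NonFatal

object AccumulatorMergeSketch {
  def mergeAll[A](accums: Seq[A])(merge: A => Unit)(logError: (String, Throwable) => Unit): Unit =
    accums.foreach { acc =>
      try merge(acc)
      catch {
        // Log and keep going so the other accumulators still report correct values.
        case NonFatal(e) => logError(s"Failed to update accumulator $acc", e)
      }
    }
}
```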
fef3027 [SPARK-25572][SPARKR] test only if not cran ## What changes were proposed in this pull request? CRAN doesn't seem to respect the system requirements when running tests - we have seen cases where SparkR is run on Java 10, on which Spark unfortunately does not start. For 2.4, let's attempt skipping all tests. ## How was this patch tested? Manual, Jenkins, AppVeyor. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #22589 from felixcheung/ralltests. (cherry picked from commit f4b138082ff91be74b0f5bbe19cdb90dd9e5f131) Signed-off-by: Felix Cheung <felixcheung@apache.org> 29 September 2018, 21:48:51 UTC
a14306b [SPARK-25262][DOC][FOLLOWUP] Fix link tags in html table ## What changes were proposed in this pull request? Markdown links are not working inside html table. We should use html link tag. ## How was this patch tested? Verified in IntelliJ IDEA's markdown editor and online markdown editor. Closes #22588 from viirya/SPARK-25262-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit dcb9a97f3e16d4645529ac619c3197fcba1c9806) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 29 September 2018, 10:18:52 UTC
ec2c17a [SPARK-25570][SQL][TEST] Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? This PR aims to prevent test slowdowns at `HiveExternalCatalogVersionsSuite` by using the latest Apache Spark 2.3.2 link because the Apache mirrors will remove the old Spark 2.3.1 binaries eventually. `HiveExternalCatalogVersionsSuite` will not fail because [SPARK-24813](https://issues.apache.org/jira/browse/SPARK-24813) implements a fallback logic. However, it will cause many trials and fallbacks in all builds over `branch-2.3/branch-2.4/master`. We had better fix this issue. ## How was this patch tested? Pass the Jenkins with the updated version. Closes #22587 from dongjoon-hyun/SPARK-25570. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 1e437835e96c4417117f44c29eba5ebc0112926f) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 29 September 2018, 03:44:12 UTC
7614313 [SPARK-25542][CORE][TEST] Move flaky test in OpenHashMapSuite to OpenHashSetSuite and make it against OpenHashSet ## What changes were proposed in this pull request? The test for large items in OpenHashMapSuite is somewhat flaky and can throw OOM. Considering the original work #6763 that added this test, the test can instead live in OpenHashSetSuite and run against OpenHashSet. Doing so should also save memory, because OpenHashMap allocates two more arrays when growing the map/set. ## How was this patch tested? Existing tests. Closes #22569 from viirya/SPARK-25542. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b7d80349b0e367d78cab238e62c2ec353f0f12b3) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 September 2018, 21:30:12 UTC
81391c2 [SPARK-23285][DOC][FOLLOWUP] Fix missing markup tag ## What changes were proposed in this pull request? This adds a missing markup tag. This should go to `master/branch-2.4`. ## How was this patch tested? Manual via `SKIP_API=1 jekyll build`. Closes #22585 from dongjoon-hyun/SPARK-23285. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 0b33f08683a41f6f3a6ec02c327010c0722cc1d1) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 September 2018, 21:10:47 UTC
b2a1e2f [SPARK-25505][SQL] The output order of grouping columns in Pivot is different from the input order ## What changes were proposed in this pull request? The grouping columns from a Pivot query are inferred as "input columns - pivot columns - pivot aggregate columns", where input columns are the output of the child relation of Pivot. The grouping columns will be the leading columns in the pivot output and they should preserve the same order as specified by the input. For example, ``` SELECT * FROM ( SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, "x" as x, "d" as d, "w" as w FROM courseSales ) PIVOT ( sum(earnings) FOR course IN ('dotNET', 'Java') ) ``` The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, b, c, d, w, x, y, z, ..." The fix is to use the child plan's `output` instead of `outputSet` so that the order can be preserved. ## How was this patch tested? Added UT. Closes #22519 from maryannxue/spark-25505. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit e120a38c0cdfb569c9151bef4d53e98175da2b25) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 28 September 2018, 07:09:21 UTC
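A plain-Scala illustration of the ordering issue (not catalyst code; column names are taken from the example query above): filtering the ordered output sequence preserves the child's column order, while going through an unordered set does not.

```scala
object PivotGroupingOrderSketch {
  def main(args: Array[String]): Unit = {
    val childOutput     = Seq("a", "z", "b", "y", "c", "x", "d", "w", "course", "earnings")
    val pivotAndAggCols = Set("course", "earnings")

    // outputSet-style: iterating an unordered set loses the input column order.
    val viaSet = (childOutput.toSet -- pivotAndAggCols).toSeq
    // output-style (the fix): filtering the ordered sequence keeps "a, z, b, y, ...".
    val viaOutput = childOutput.filterNot(pivotAndAggCols)

    println(viaSet.mkString(", "))    // some arbitrary order
    println(viaOutput.mkString(", ")) // a, z, b, y, c, x, d, w
  }
}
```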
a43a082 [SPARK-25533][CORE][WEBUI] AppSummary should hold the information about succeeded Jobs and completed stages only Currently, in the Spark UI, when there are failed jobs or failed stages, the display message for completed jobs and completed stages is not consistent with previous versions of Spark. The reason is that AppSummary holds information about all jobs and stages, but the code below checks against completedJobs and completedStages. So AppSummary should hold only successful jobs and stages. https://github.com/apache/spark/blob/66d29870c09e6050dd846336e596faaa8b0d14ad/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala#L306 https://github.com/apache/spark/blob/66d29870c09e6050dd846336e596faaa8b0d14ad/core/src/main/scala/org/apache/spark/ui/jobs/AllStagesPage.scala#L119 Therefore, we should keep only completed job and stage information in AppSummary, to make it consistent with Spark 2.2. Test steps: bin/spark-shell ``` sc.parallelize(1 to 5, 5).collect() sc.parallelize(1 to 5, 2).map{ x => throw new RuntimeException("Fail")}.collect() ``` **Before fix:** ![screenshot from 2018-09-26 03-24-53](https://user-images.githubusercontent.com/23054875/46045669-f60bcd80-c13b-11e8-9aa6-a2e5a2038dba.png) ![screenshot from 2018-09-26 03-25-08](https://user-images.githubusercontent.com/23054875/46045699-0ae86100-c13c-11e8-94e5-ad35944c7615.png) **After fix:** ![screenshot from 2018-09-26 03-16-14](https://user-images.githubusercontent.com/23054875/46045636-d83e6880-c13b-11e8-98df-f49d15c18958.png) ![screenshot from 2018-09-26 03-16-28](https://user-images.githubusercontent.com/23054875/46045645-e1c7d080-c13b-11e8-8c9c-d32e1f663356.png) Closes #22549 from shahidki31/SPARK-25533. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 5ee21661834e837d414bc20591982a092c0aece3) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 27 September 2018, 17:24:14 UTC
0256f8a [SPARK-25546][CORE] Don't cache value of EVENT_LOG_CALLSITE_LONG_FORM. Caching the value of that config means different instances of SparkEnv will always use whatever was the first value to be read. It also breaks tests that use RDDInfo outside of the scope of a SparkContext. Since this is not a performance sensitive area, there's no advantage in caching the config value. Closes #22558 from vanzin/SPARK-25546. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5fd22d05363dd8c0e1b10f3822ccb71eb42f6db9) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 September 2018, 16:27:05 UTC
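A sketch of the change's intent: read the flag from the conf on each call instead of caching it in a val, so different SparkEnv instances (and tests with no SparkContext) each see the right value. The config key shown here is assumed from the constant's name, not quoted from the commit.

```scala
import org.apache.spark.SparkConf

object CallSiteConfSketch {
  // Read on demand rather than caching: no value is frozen from the first read.
  def longFormEnabled(conf: SparkConf): Boolean =
    conf.getBoolean("spark.eventLog.longForm.enabled", false)
}
```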