https://github.com/apache/spark

Revision Author Date Message Commit Date
cee4ecb Preparing Spark release v2.4.5-rc2 02 February 2020, 19:23:09 UTC
cb4a736 [SPARK-30704][INFRA] Use jekyll-redirect-from 0.15.0 instead of the latest This PR aims to pin the version of `jekyll-redirect-from` to 0.15.0. This is a release blocker for both Apache Spark 3.0.0 and 2.4.5. `jekyll-redirect-from` released 0.16.0 a few days ago and that requires Ruby 2.4.0. - https://github.com/jekyll/jekyll-redirect-from/releases/tag/v0.16.0 ``` $ cd dev/create-release/spark-rm/ $ docker build -t spark:test . ... ERROR: Error installing jekyll-redirect-from: jekyll-redirect-from requires Ruby version >= 2.4.0. ... ``` No. Manually do the above command to build `spark-rm` Docker image. ``` ... Successfully installed jekyll-redirect-from-0.15.0 Parsing documentation for jekyll-redirect-from-0.15.0 Installing ri documentation for jekyll-redirect-from-0.15.0 Done installing documentation for jekyll-redirect-from after 0 seconds 1 gem installed Successfully installed rouge-3.15.0 Parsing documentation for rouge-3.15.0 Installing ri documentation for rouge-3.15.0 Done installing documentation for rouge after 4 seconds 1 gem installed Removing intermediate container e0ec7c77b69f ---> 32dec37291c6 ``` Closes #27434 from dongjoon-hyun/SPARK-30704. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 1adf3520e3c753e6df8dccb752e8239de682a09a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 February 2020, 08:46:20 UTC
c7c2bda [SPARK-30065][SQL][2.4] DataFrameNaFunctions.drop should handle duplicate columns (Backport of #26700) ### What changes were proposed in this pull request? `DataFrameNaFunctions.drop` doesn't handle duplicate columns even when column names are not specified. ```Scala val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") val df = left.join(right, Seq("col1")) df.printSchema df.na.drop("any").show ``` produces ``` root |-- col1: string (nullable = true) |-- col2: string (nullable = true) |-- col2: string (nullable = true) org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.; at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) ``` The reason for the above failure is that columns are resolved by name and if there are multiple columns with the same name, it will fail due to ambiguity. This PR updates `DataFrameNaFunctions.drop` such that if the columns to drop are not specified, it will resolve ambiguity gracefully by applying `drop` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity). ### Why are the changes needed? If column names are not specified, `drop` should not fail due to ambiguity since it should still be able to apply `drop` to the eligible columns. ### Does this PR introduce any user-facing change? Yes, now all the rows with nulls are dropped in the above example: ``` scala> df.na.drop("any").show +----+----+----+ |col1|col2|col2| +----+----+----+ +----+----+----+ ``` ### How was this patch tested? Added new unit tests. Closes #27411 from imback82/backport-SPARK-30065. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 January 2020, 07:01:57 UTC
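Where the columns do need to be named explicitly (which, as noted above, still fails on ambiguity), one possible workaround is to disambiguate before the join. Below is a minimal Scala sketch of that pattern; the `col2_r` name and the local SparkSession are chosen purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("na-drop-sketch").getOrCreate()
import spark.implicits._

val left  = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")

// Rename the duplicate column on one side before joining, so both "col2" columns
// stay addressable by an unambiguous name afterwards.
val joined = left.join(right.withColumnRenamed("col2", "col2_r"), Seq("col1"))
joined.na.drop("any", Seq("col2", "col2_r")).show()
```

The same renaming approach applies to `na.fill` with an explicit column list.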
4c3c1d6 [SPARK-29890][SQL][2.4] DataFrameNaFunctions.fill should handle duplicate columns (Backport of #26593) ### What changes were proposed in this pull request? `DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified. ```Scala val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") val df = left.join(right, Seq("col1")) df.printSchema df.na.fill("hello").show ``` produces ``` root |-- col1: string (nullable = true) |-- col2: string (nullable = true) |-- col2: string (nullable = true) org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.; at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121) at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221) at org.apache.spark.sql.Dataset.col(Dataset.scala:1268) ``` The reason for the above failure is that columns are looked up with `DataSet.col()` which tries to resolve a column by name and if there are multiple columns with the same name, it will fail due to ambiguity. This PR updates `DataFrameNaFunctions.fill` such that if the columns to fill are not specified, it will resolve ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity). ### Why are the changes needed? If column names are not specified, `fill` should not fail due to ambiguity since it should still be able to apply `fill` to the eligible columns. ### Does this PR introduce any user-facing change? Yes, now the above example displays the following: ``` +----+-----+-----+ |col1| col2| col2| +----+-----+-----+ | 1|hello| 2| | 3| 4|hello| +----+-----+-----+ ``` ### How was this patch tested? Added new unit tests. Closes #27407 from imback82/backport-SPARK-29890. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 January 2020, 00:52:25 UTC
eeef0e7 [SPARK-29578][TESTS][2.4] Add "8634" as another skipped day for Kwajalein timezone due to more recent timezone updates in later JDK 8 (Backport of https://github.com/apache/spark/pull/26236) ### What changes were proposed in this pull request? Recent timezone definition changes in very new JDK 8 (and beyond) releases cause test failures. The below was observed on JDK 1.8.0_232. As before, the easy fix is to allow for these inconsequential variations in test results due to differing definitions of timezones. ### Why are the changes needed? Keeps tests passing on the latest JDK releases. ### Does this PR introduce any user-facing change? None ### How was this patch tested? Existing tests Closes #27386 from srowen/SPARK-29578.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 January 2020, 01:20:32 UTC
b93f250 [SPARK-29367][DOC][2.4] Add compatibility note for Arrow 0.15.0 to SQL guide ### What changes were proposed in this pull request? Add documentation to SQL programming guide to use PyArrow >= 0.15.0 with current versions of Spark. ### Why are the changes needed? Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility. ### Does this PR introduce any user-facing change? Yes. ![arrow](https://user-images.githubusercontent.com/9700541/73404705-faec0e80-42a6-11ea-952a-25c544a6d90b.png) ### How was this patch tested? Ran pandas_udfs tests using PyArrow 0.15.0 with environment variable set. Closes #27383 from BryanCutler/arrow-document-legacy-IPC-fix-SPARK-29367-branch24. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 January 2020, 22:54:00 UTC
c7c9f9e [SPARK-30310][CORE][2.4] Resolve missing match case in SparkUncaughtExceptionHandler and added tests ### What changes were proposed in this pull request? Backport of SPARK-30310 from master (f5f05d549efd8f9a81376bfc31474292be7aaa8a) 1) Added the missing match case to SparkUncaughtExceptionHandler, so that it would not halt the process when the exception doesn't match any of the match case statements. 2) Added a log message before halting the process. During debugging it wasn't obvious why the Worker process would become DEAD (until we set SPARK_NO_DAEMONIZE=1), because the shell scripts put the process into the background and essentially absorb the exit code. 3) Added SparkUncaughtExceptionHandlerSuite. Basically we create a Spark exception-throwing application with SparkUncaughtExceptionHandler and then check its exit code. ### Why are the changes needed? SPARK-30310, because the process would halt unexpectedly. ### How was this patch tested? All unit tests (mvn test) were run and passed. Closes #27384 from n-marion/branch-2.4_30310-backport. Authored-by: git <tinto@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 January 2020, 22:40:51 UTC
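Below is a minimal Scala sketch of the handler shape the backport describes: an explicit catch-all case so no exception falls through the match, and a log line before any deliberate halt so the exit reason survives daemonized shell scripts. The flag name and exit codes are illustrative assumptions, not the values from the patch.

```scala
// Sketch only, not the Spark source; exit codes 52/50 are placeholders.
class SketchUncaughtExceptionHandler(exitOnUncaughtException: Boolean)
    extends Thread.UncaughtExceptionHandler {

  override def uncaughtException(thread: Thread, exception: Throwable): Unit = exception match {
    case _: OutOfMemoryError =>
      System.err.println(s"OutOfMemoryError in thread ${thread.getName}; halting JVM")
      Runtime.getRuntime.halt(52)
    case _ if exitOnUncaughtException =>
      System.err.println(s"Uncaught ${exception.getClass.getName} in thread ${thread.getName}; halting JVM")
      Runtime.getRuntime.halt(50)
    case _ =>
      // Catch-all case: log but do not halt when not configured to exit.
      System.err.println(s"Uncaught ${exception.getClass.getName} in thread ${thread.getName}; continuing")
  }
}
```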
12f4492 [SPARK-30512] Added a dedicated boss event loop group ### What changes were proposed in this pull request? Adding a dedicated boss event loop group to the Netty pipeline in the External Shuffle Service to avoid the delay in channel registration. ``` EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1, conf.getModuleName() + "-boss"); EventLoopGroup workerGroup = NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server"); bootstrap = new ServerBootstrap() .group(bossGroup, workerGroup) .channel(NettyUtils.getServerChannelClass(ioMode)) .option(ChannelOption.ALLOCATOR, allocator) ``` ### Why are the changes needed? We have been seeing a large number of SASL authentication (RPC requests) timing out with the external shuffle service. ``` java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task. at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160) at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278) at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181) at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141) at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218) ``` The investigation that we have done is described here: https://github.com/netty/netty/issues/9890 After adding `LoggingHandler` to the netty pipeline, we saw that the registration of the channel was getting delay which is because the worker threads are busy with the existing channels. ### Does this PR introduce any user-facing change? No ### How was this patch tested? We have tested the patch on our clusters and with a stress testing tool. After this change, we didn't see any SASL requests timing out. Existing unit tests pass. Closes #27240 from otterc/SPARK-30512. Authored-by: Chandni Singh <chsingh@linkedin.com> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit 6b47ace27d04012bcff47951ea1eea2aa6fb7d60) Signed-off-by: Thomas Graves <tgraves@apache.org> 29 January 2020, 21:13:20 UTC
6c29070 [SPARK-23435][2.4][SPARKR][TESTS] Update testthat to >= 2.0.0 ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/27359: - Update `testthat` to >= 2.0.0 - Replace `testthat:::run_tests` with `testthat:::test_package_dir` - Add trivial assertions for tests, without any expectations, to avoid skipping. - Update related docs. ### Why are the changes needed? The `testthat` version has been frozen by [SPARK-22817](https://issues.apache.org/jira/browse/SPARK-22817) / https://github.com/apache/spark/pull/20003, but 1.0.2 is pretty old, and we shouldn't keep things in this state forever. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? - Existing CI pipeline: - Windows build on AppVeyor, R 3.6.2, testthat 2.3.1 - Linux build on Jenkins, R 3.1.x, testthat 1.0.2 - Additional builds with testthat 2.3.1 using [sparkr-build-sandbox](https://github.com/zero323/sparkr-build-sandbox) on c7ed64af9e697b3619779857dd820832176b3be3 R 3.4.4 (image digest ec9032f8cf98) ``` docker pull zero323/sparkr-build-sandbox:3.4.4 docker run zero323/sparkr-build-sandbox:3.4.4 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc ``` 3.5.3 (image digest 0b1759ee4d1d) ``` docker pull zero323/sparkr-build-sandbox:3.5.3 docker run zero323/sparkr-build-sandbox:3.5.3 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc ``` and 3.6.2 (image digest 6594c8ceb72f) ``` docker pull zero323/sparkr-build-sandbox:3.6.2 docker run zero323/sparkr-build-sandbox:3.6.2 zero323 --branch SPARK-23435 --commit c7ed64af9e697b3619779857dd820832176b3be3 --public-key https://keybase.io/zero323/pgp_keys.asc ``` Corresponding [asciicasts](https://asciinema.org/) are available as DOI 10.5281/zenodo.3629431 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3629431.svg)](https://doi.org/10.5281/zenodo.3629431) (a bit too large to burden asciinema.org, but they can be run locally via `asciinema play`). ---------------------------- Continued from #27328 Closes #27379 from HyukjinKwon/testthat-2.0. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 29 January 2020, 02:39:15 UTC
ad9f578 [SPARK-30633][SQL] Append L to seed when type is LongType ### What changes were proposed in this pull request? Allow for using longs as seed for xxHash. ### Why are the changes needed? Codegen fails when passing a seed to xxHash that is > 2^31. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests pass. Should more be added? Closes #27354 from patrickcording/fix_xxhash_seed_bug. Authored-by: Patrick Cording <patrick.cording@datarobot.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c5c580ba0d253a04a3df5bbfd5acf6b5d23cdc1c) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 January 2020, 18:32:30 UTC
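A tiny Scala illustration of the literal rule behind this fix (not the generated code itself): a seed above Int.MaxValue only compiles once the L suffix makes it a Long literal.

```scala
// A value above Int.MaxValue needs the L suffix to be a valid Long literal.
// val badSeed = 2147483648      // does not compile: "integer number too large"
val goodSeed: Long = 2147483648L // fine once the L suffix is appended
println(goodSeed)
```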
5f1cb2f Revert "[SPARK-29777][FOLLOW-UP][SPARKR] Remove no longer valid test for recursive calls" This reverts commit 81ea5a4cc1617de0dbbc61811847320724b3644f. 26 January 2020, 05:43:49 UTC
81ea5a4 [SPARK-29777][FOLLOW-UP][SPARKR] Remove no longer valid test for recursive calls ### What changes were proposed in this pull request? Disabling the test for cleaning the closure of a recursive function. ### Why are the changes needed? As of https://github.com/apache/spark/commit/9514b822a70d77a6298ece48e6c053200360302c this test is no longer valid, and recursive calls, even simple ones: ```r f <- function(x) { if(x > 0) { f(x - 1) } else { x } } ``` lead to ``` Error: node stack overflow ``` This issue is silenced when tested with `testthat` 1.x (reason unknown), but causes failures when using `testthat` 2.x (the issue can be reproduced outside the test context). The problem is known and tracked by [SPARK-30629](https://issues.apache.org/jira/browse/SPARK-30629). Therefore, keeping this test active doesn't make sense, as it will lead to continuous test failures when `testthat` is updated (https://github.com/apache/spark/pull/27359 / SPARK-23435). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. CC falaki Closes #27363 from zero323/SPARK-29777-FOLLOWUP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 January 2020, 05:41:15 UTC
1b3ddcf [SPARK-30645][SPARKR][TESTS][WINDOWS] Move Unicode test data to external file ### What changes were proposed in this pull request? Reference data for "collect() support Unicode characters" has been moved to an external file, to make test OS and locale independent. ### Why are the changes needed? As-is, embedded data is not properly encoded on Windows: ``` library(SparkR) SparkR::sparkR.session() Sys.info() # sysname release version # "Windows" "Server x64" "build 17763" # nodename machine login # "WIN-5BLT6Q610KH" "x86-64" "Administrator" # user effective_user # "Administrator" "Administrator" Sys.getlocale() # [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252" lines <- c("{\"name\":\"안녕하세요\"}", "{\"name\":\"您好\", \"age\":30}", "{\"name\":\"こんにちは\", \"age\":19}", "{\"name\":\"Xin chào\"}") system(paste0("cat ", jsonPath)) # {"name":"<U+C548><U+B155><U+D558><U+C138><U+C694>"} # {"name":"<U+60A8><U+597D>", "age":30} # {"name":"<U+3053><U+3093><U+306B><U+3061><U+306F>", "age":19} # {"name":"Xin chào"} # [1] 0 jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp") writeLines(lines, jsonPath) df <- read.df(jsonPath, "json") printSchema(df) # root # |-- _corrupt_record: string (nullable = true) # |-- age: long (nullable = true) # |-- name: string (nullable = true) head(df) # _corrupt_record age name # 1 <NA> NA <U+C548><U+B155><U+D558><U+C138><U+C694> # 2 <NA> 30 <U+60A8><U+597D> # 3 <NA> 19 <U+3053><U+3093><U+306B><U+3061><U+306F> # 4 {"name":"Xin ch<U+FFFD>o"} NA <NA> ``` This can be reproduced outside tests (Windows Server 2019, English locale), and causes failures, when `testthat` is updated to 2.x (https://github.com/apache/spark/pull/27359). Somehow problem is not picked-up when test is executed on `testthat` 1.0.2. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Running modified test, manual testing. ### Note Alternative seems to be to used bytes, but it hasn't been properly tested. ``` test_that("collect() support Unicode characters", { lines <- markUtf8(c( '{"name": "안녕하세요"}', '{"name": "您好", "age": 30}', '{"name": "こんにちは", "age": 19}', '{"name": "Xin ch\xc3\xa0o"}' )) jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp") writeLines(lines, jsonPath, useBytes = TRUE) expected <- regmatches(lines, regexec('(?<="name": ").*?(?=")', lines, perl = TRUE)) df <- read.df(jsonPath, "json") rdf <- collect(df) expect_true(is.data.frame(rdf)) rdf$name <- markUtf8(rdf$name) expect_equal(rdf$name[1], expected[[1]]) expect_equal(rdf$name[2], expected[[2]]) expect_equal(rdf$name[3], expected[[3]]) expect_equal(rdf$name[4], expected[[4]]) df1 <- createDataFrame(rdf) expect_equal( collect( where(df1, df1$name == expected[[2]]) )$name, expected[[2]] ) }) ``` Closes #27362 from zero323/SPARK-30645. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 40b1f4d87e0f24e4e7e2fd6fe37cc5398ae778f8) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 January 2020, 04:00:12 UTC
d7be535 [SPARK-30630][ML][2.4] Deprecate numTrees in GBT in 2.4.5 ### What changes were proposed in this pull request? Deprecate numTrees in GBT in 2.4.5 so it can be removed in 3.0.0 ### Why are the changes needed? Currently, GBT has ``` /** * Number of trees in ensemble */ Since("2.0.0") val getNumTrees: Int = trees.length ``` and ``` /** Number of trees in ensemble */ val numTrees: Int = trees.length ``` I think we should remove one of them. I will deprecate it in 2.4.5 and remove it in 3.0.0 ### Does this PR introduce any user-facing change? Deprecate numTrees in 2.4.5 ### How was this patch tested? Existing tests Closes #27352 from huaxingao/spark-tree. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 January 2020, 10:01:49 UTC
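A short Scala sketch of the deprecation pattern described above, shown on a stand-in class; the message text and the "since" version string are assumptions rather than the exact annotation added by the PR.

```scala
// Stand-in model class, not the real GBT classes.
class GBTEnsembleSketch(private val trees: Array[String]) {

  /** Number of trees in ensemble. */
  val getNumTrees: Int = trees.length

  /** Duplicate accessor kept for compatibility and scheduled for removal. */
  @deprecated("Use getNumTrees instead; this field will be removed in 3.0.0.", "2.4.5")
  val numTrees: Int = trees.length
}
```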
2fc562c [SPARK-30556][SQL][2.4] Copy sparkContext.localProperties to child thread in SubqueryExec.executionContext ### What changes were proposed in this pull request? In `org.apache.spark.sql.execution.SubqueryExec#relationFuture` make a copy of `org.apache.spark.SparkContext#localProperties` and pass it to the sub-execution thread in `org.apache.spark.sql.execution.SubqueryExec#executionContext` ### Why are the changes needed? Local properties set via sparkContext are not available as TaskContext properties when executing jobs while thread pools have idle threads which are reused. Explanation: In `SubqueryExec`, the relationFuture is evaluated in a separate thread. The threads inherit the `localProperties` from `sparkContext` because they are its child threads. These threads are created in the `executionContext` (thread pools). Each thread pool has a default keepAliveSeconds of 60 seconds for idle threads. In scenarios where the thread pool has idle threads that are reused for a subsequent new query, the thread-local properties will not be inherited from the spark context (thread properties are inherited only on thread creation), so the threads end up having old or no properties set. This causes taskset properties to be missing when properties are transferred by the child thread via `sparkContext.runJob/submitJob` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added UT Closes #27340 from ajithme/subquerylocalprop2. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2020, 17:00:01 UTC
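A minimal Scala sketch of the general pattern the fix applies (not the SubqueryExec code itself): snapshot the caller thread's properties on the submitting thread and hand the snapshot to the pooled task, so a reused idle thread sees the current query's values rather than stale inherited ones.

```scala
import java.util.Properties
import scala.concurrent.{ExecutionContext, Future}

// Capture a copy of the caller's properties on the submitting thread, then pass
// that snapshot into the task; the pool thread never relies on inherited state.
def submitWithCapturedProperties[T](current: Properties)(body: Properties => T)
                                   (implicit ec: ExecutionContext): Future[T] = {
  val snapshot = current.clone().asInstanceOf[Properties]
  Future(body(snapshot))
}
```

The actual patch does the equivalent inside `SubqueryExec#relationFuture`, as described above.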
c6b02cf [SPARK-30601][BUILD][2.4] Add a Google Maven Central as a primary repository ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/27307 This PR proposes to address four things. Three issues and fixes were a bit mixed so this PR sorts it out. See also http://apache-spark-developers-list.1001551.n3.nabble.com/Adding-Maven-Central-mirror-from-Google-to-the-build-td28728.html for the discussion in the mailing list. 1. Add the Google Maven Central mirror (GCS) as a primary repository. This will not only help development more stable but also in order to make Github Actions build (where it is always required to download jars) stable. In case of Jenkins PR builder, it wouldn't be affected too much as it uses the pre-downloaded jars under `.m2`. - Google Maven Central seems stable for heavy workload but not synced very quickly (e.g., new release is missing) - Maven Central (default) seems less stable but synced quickly. We already added this GCS mirror as a default additional remote repository at SPARK-29175. So I don't see an issue to add it as a repo. https://github.com/apache/spark/blob/abf759a91e01497586b8bb6b7a314dd28fd6cff1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2111-L2118 2. Currently, we have the hard-corded repository in [`sbt-pom-reader`](https://github.com/JoshRosen/sbt-pom-reader/blob/v1.0.0-spark/src/main/scala/com/typesafe/sbt/pom/MavenPomResolver.scala#L32) and this seems overwriting Maven's existing resolver by the same ID `central` with `http://` when initially the pom file is ported into SBT instance. This uses `http://` which latently Maven Central disallowed (see https://github.com/apache/spark/pull/27242) My speculation is that we just need to be able to load plugin and let it convert POM to SBT instance with another fallback repo. After that, it _seems_ using `central` with `https` properly. See also https://github.com/apache/spark/pull/27307#issuecomment-576720395. I double checked that we use `https` properly from the SBT build as well: ``` [debug] downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom ... [debug] public: downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom [debug] public: downloading https://repo1.maven.org/maven2/com/etsy/sbt-checkstyle-plugin_2.10_0.13/3.1.1/sbt-checkstyle-plugin-3.1.1.pom.sha1 ``` This was fixed by adding the same repo (https://github.com/apache/spark/pull/27281), `central_without_mirror`, which is a bit awkward. Instead, this PR adds GCS as a main repo, and community Maven central as a fallback repo. So, presumably the community Maven central repo is used when the plugin is loaded as a fallback. 3. While I am here, I fix another issue. Github Action at https://github.com/apache/spark/pull/27279 is being failed. The reason seems to be scalafmt 1.0.3 is in Maven central but not in GCS. 
``` org.apache.maven.plugin.PluginResolutionException: Plugin org.antipathy:mvn-scalafmt_2.12:1.0.3 or one of its dependencies could not be resolved: Could not find artifact org.antipathy:mvn-scalafmt_2.12:jar:1.0.3 in google-maven-central (https://maven-central.storage-download.googleapis.com/repos/central/data/) at org.apache.maven.plugin.internal.DefaultPluginDependenciesResolver.resolve (DefaultPluginDependenciesResolver.java:131) ``` `mvn-scalafmt` exists in Maven central: ```bash $ curl https://repo.maven.apache.org/maven2/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom ``` ```xml <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> <modelVersion>4.0.0</modelVersion> ... ``` whereas not in GCS mirror: ```bash $ curl https://maven-central.storage-download.googleapis.com/repos/central/data/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom ``` ```xml <?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: maven-central/repos/central/data/org/antipathy/mvn-scalafmt_2.12/1.0.3/mvn-scalafmt_2.12-1.0.3.pom</Details></Error>% ``` In this PR, simply make both repos accessible by adding to `pluginRepositories`. 4. Remove the workarounds in Github Actions to switch mirrors because now we have same repos in the same order (Google Maven Central first, and Maven Central second) ### Why are the changes needed? To make the build and Github Action more stable. ### Does this PR introduce any user-facing change? No, dev only change. ### How was this patch tested? I roughly checked local and PR against my fork (https://github.com/HyukjinKwon/spark/pull/2 and https://github.com/HyukjinKwon/spark/pull/3). Closes #27335 from HyukjinKwon/tmp. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2020, 16:56:39 UTC
0300d4b [SPARK-30553][DOCS] fix structured-streaming java example error # What changes were proposed in this pull request? Fix structured-streaming java example error. ```java Dataset<Row> windowedCounts = words .withWatermark("timestamp", "10 minutes") .groupBy( functions.window(words.col("timestamp"), "10 minutes", "5 minutes"), words.col("word")) .count(); ``` It does not clean up old state.May cause OOM > Before the fix ```scala == Physical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter48e331f0 +- *(4) HashAggregate(keys=[window#13, word#4], functions=[count(1)], output=[window#13, word#4, count#12L]) +- StateStoreSave [window#13, word#4], state info [ checkpoint = file:/C:/Users/chenhao/AppData/Local/Temp/temporary-91124080-0e20-41c0-9150-91735bdc22c0/state, runId = 5c425536-a3ae-4385-8167-5fa529e6760d, opId = 0, ver = 6, numPartitions = 1], Update, 1579530890886, 2 +- *(3) HashAggregate(keys=[window#13, word#4], functions=[merge_count(1)], output=[window#13, word#4, count#23L]) +- StateStoreRestore [window#13, word#4], state info [ checkpoint = file:/C:/Users/chenhao/AppData/Local/Temp/temporary-91124080-0e20-41c0-9150-91735bdc22c0/state, runId = 5c425536-a3ae-4385-8167-5fa529e6760d, opId = 0, ver = 6, numPartitions = 1], 2 +- *(2) HashAggregate(keys=[window#13, word#4], functions=[merge_count(1)], output=[window#13, word#4, count#23L]) +- Exchange hashpartitioning(window#13, word#4, 1) +- *(1) HashAggregate(keys=[window#13, word#4], functions=[partial_count(1)], output=[window#13, word#4, count#23L]) +- *(1) Project [window#13, word#4] +- *(1) Filter (((isnotnull(timestamp#5) && isnotnull(window#13)) && (timestamp#5 >= window#13.start)) && (timestamp#5 < window#13.end)) +- *(1) Expand [List(named_struct(start, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) as double) = (cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) THEN (CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) END + 0) - 2) * 300000000) + 0), LongType, TimestampType), end, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) as double) = (cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) THEN (CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) END + 0) - 2) * 300000000) + 600000000), LongType, TimestampType)), word#4, timestamp#5-T600000ms), List(named_struct(start, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) as double) = (cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) THEN (CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) END + 1) - 2) * 300000000) + 0), LongType, TimestampType), end, precisetimestampconversion(((((CASE WHEN 
(cast(CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) as double) = (cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) THEN (CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#5, TimestampType, LongType) - 0) as double) / 3.0E8)) END + 1) - 2) * 300000000) + 600000000), LongType, TimestampType)), word#4, timestamp#5-T600000ms)], [window#13, word#4, timestamp#5-T600000ms] +- EventTimeWatermark timestamp#5: timestamp, interval 10 minutes +- LocalTableScan <empty>, [word#4, timestamp#5] ``` > After the fix ```scala == Physical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter1df12a96 +- *(4) HashAggregate(keys=[window#13-T600000ms, word#4], functions=[count(1)], output=[window#8-T600000ms, word#4, count#12L]) +- StateStoreSave [window#13-T600000ms, word#4], state info [ checkpoint = file:/C:/Users/chenhao/AppData/Local/Temp/temporary-95ac74cc-aca6-42eb-827d-7586aa69bcd3/state, runId = 91fa311d-d47e-4726-9d0a-f21ef268d9d0, opId = 0, ver = 4, numPartitions = 1], Update, 1579529975342, 2 +- *(3) HashAggregate(keys=[window#13-T600000ms, word#4], functions=[merge_count(1)], output=[window#13-T600000ms, word#4, count#23L]) +- StateStoreRestore [window#13-T600000ms, word#4], state info [ checkpoint = file:/C:/Users/chenhao/AppData/Local/Temp/temporary-95ac74cc-aca6-42eb-827d-7586aa69bcd3/state, runId = 91fa311d-d47e-4726-9d0a-f21ef268d9d0, opId = 0, ver = 4, numPartitions = 1], 2 +- *(2) HashAggregate(keys=[window#13-T600000ms, word#4], functions=[merge_count(1)], output=[window#13-T600000ms, word#4, count#23L]) +- Exchange hashpartitioning(window#13-T600000ms, word#4, 1) +- *(1) HashAggregate(keys=[window#13-T600000ms, word#4], functions=[partial_count(1)], output=[window#13-T600000ms, word#4, count#23L]) +- *(1) Project [window#13-T600000ms, word#4] +- *(1) Filter (((isnotnull(timestamp#5-T600000ms) && isnotnull(window#13-T600000ms)) && (timestamp#5-T600000ms >= window#13-T600000ms.start)) && (timestamp#5-T600000ms < window#13-T600000ms.end)) +- *(1) Expand [List(named_struct(start, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) as double) = (cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) THEN (CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) END + 0) - 2) * 300000000) + 0), LongType, TimestampType), end, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) as double) = (cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) THEN (CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) END + 0) - 2) * 300000000) + 600000000), LongType, TimestampType)), word#4, timestamp#5-T600000ms), List(named_struct(start, precisetimestampconversion(((((CASE WHEN 
(cast(CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) as double) = (cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) THEN (CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) END + 1) - 2) * 300000000) + 0), LongType, TimestampType), end, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) as double) = (cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) THEN (CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) + 1) ELSE CEIL((cast((precisetimestampconversion(timestamp#5-T600000ms, TimestampType, LongType) - 0) as double) / 3.0E8)) END + 1) - 2) * 300000000) + 600000000), LongType, TimestampType)), word#4, timestamp#5-T600000ms)], [window#13-T600000ms, word#4, timestamp#5-T600000ms] +- EventTimeWatermark timestamp#5: timestamp, interval 10 minutes +- LocalTableScan <empty>, [word#4, timestamp#5] ``` ### Why are the changes needed? If we write the code according to the documentation.It does not clean up old state.May cause OOM ### Does this PR introduce any user-facing change? No ### How was this patch tested? ```java SparkSession spark = SparkSession.builder().appName("test").master("local[*]") .config("spark.sql.shuffle.partitions", 1) .getOrCreate(); Dataset<Row> lines = spark.readStream().format("socket") .option("host", "skynet") .option("includeTimestamp", true) .option("port", 8888).load(); Dataset<Row> words = lines.toDF("word", "timestamp"); Dataset<Row> windowedCounts = words .withWatermark("timestamp", "10 minutes") .groupBy( window(col("timestamp"), "10 minutes", "5 minutes"), col("word")) .count(); StreamingQuery start = windowedCounts.writeStream() .outputMode("update") .format("console").start(); start.awaitTermination(); ``` We can write an example like this.And input some date 1. see the matrics `stateOnCurrentVersionSizeBytes` in log.Is it increasing all the time? 2. see the Physical Plan.Whether it contains things like `HashAggregate(keys=[window#11-T10000ms, value#39]` 3. We can debug in `storeManager.remove(store, keyRow)`.Whether it will remove the old state. Closes #27268 from bettermouse/spark-30553. Authored-by: bettermouse <qq5375631> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 3c4e61918fc8266368bd33ea0612144de77571e6) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 January 2020, 05:37:40 UTC
90dea83 Revert "[SPARK-30534][INFRA] Use mvn in `dev/scalastyle`" This reverts commit ac541568d69388a3fd6467e8792900840573a590. 22 January 2020, 00:45:05 UTC
ebaa6fe [SPARK-30572][BUILD] Add a fallback Maven repository ### What changes were proposed in this pull request? This PR aims to add a fallback Maven repository when a mirror to `central` fail. ### Why are the changes needed? We use `Google Maven Central` in GitHub Action as a mirror of `central`. However, `Google Maven Central` sometimes doesn't have newly published artifacts and there is no guarantee when we get the newly published artifacts. By duplicating `Maven Central` with a new ID, we can add a fallback Maven repository which is not mirrored by `Google Maven Central`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually testing with the new `Twitter` chill artifacts by switching `chill.version` from `0.9.3` to `0.9.5`. ``` $ rm -rf ~/.m2/repository/com/twitter/chill* $ mvn compile | grep chill Downloading from google-maven-central: https://maven-central.storage-download.googleapis.com/repos/central/data/com/twitter/chill_2.12/0.9.5/chill_2.12-0.9.5.pom Downloading from central_without_mirror: https://repo.maven.apache.org/maven2/com/twitter/chill_2.12/0.9.5/chill_2.12-0.9.5.pom Downloaded from central_without_mirror: https://repo.maven.apache.org/maven2/com/twitter/chill_2.12/0.9.5/chill_2.12-0.9.5.pom (2.8 kB at 11 kB/s) ``` Closes #27281 from dongjoon-hyun/SPARK-30572. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c992716a33d78ce0a9aa78b26f5bdb45c26c2a01) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 January 2020, 01:42:48 UTC
fbd95b1 [MINOR][HIVE] Pick up HIVE-22708 HTTP transport fix Pick up the HTTP fix from https://issues.apache.org/jira/browse/HIVE-22708 This is a small but important fix to digest handling we should pick up from Hive. No. Existing tests Closes #27273 from srowen/Hive22708. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 January 2020, 19:55:47 UTC
9da2bc4 [MINOR][DOCS] Remove note about -T for parallel build ### What changes were proposed in this pull request? Removes suggestion to use -T for parallel Maven build. ### Why are the changes needed? Parallel builds don't necessarily work in the build right now. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #27274 from srowen/ParallelBuild. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit ef1af43c9f82ad0ff2ff4e196b1017285bfa7da4) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 January 2020, 19:49:08 UTC
3ff5021 [SPARK-28152][DOCS][FOLLOWUP] Add a migration guide for MsSQLServer JDBC dialect This PR adds a migration guide for MsSQLServer JDBC dialect for Apache Spark 2.4.4 and 2.4.5. Apache Spark 2.4.4 updates the type mapping correctly according to MS SQL Server, but missed to mention that in the migration guide. In addition, 2.4.4 adds a configuration for the legacy behavior. Yes. This is a documentation change. ![screenshot](https://user-images.githubusercontent.com/9700541/72649944-d6517780-3933-11ea-92be-9d4bf38e2eda.png) Manually generate and see the doc. Closes #27270 from dongjoon-hyun/SPARK-28152-DOC. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 505693c282d94ebb0f763477309f0bba90b5acbc) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 January 2020, 01:22:44 UTC
4df509f [SPARK-30312][DOCS][FOLLOWUP] Add a migration guide This is a followup of https://github.com/apache/spark/pull/26956 to add a migration document for 2.4.5. New legacy configuration will restore the previous behavior safely. This PR updates the doc. <img width="763" alt="screenshot" src="https://user-images.githubusercontent.com/9700541/72639939-9da5a400-391b-11ea-87b1-14bca15db5a6.png"> Build the document and see the change manually. Closes #27269 from dongjoon-hyun/SPARK-30312. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit fdbded3f71b54baee187392089705f1b619019cc) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 January 2020, 21:43:14 UTC
7d56dc7 [SPARK-30333][BUILD][FOLLOWUP][2.4] Update sbt build together ### What changes were proposed in this pull request? This PR aims to update `SparkBuild.scala` as a follow-up of [SPARK-30333 Upgrade jackson-databind to 2.6.7.3](https://github.com/apache/spark/pull/26986). ### Why are the changes needed? Since SPARK-29781, we override SBT Jackson dependency like Maven. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins. Closes #27256 from dongjoon-hyun/SPARK-30333. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 January 2020, 16:03:27 UTC
ac54156 [SPARK-30534][INFRA] Use mvn in `dev/scalastyle` This PR aims to use `mvn` instead of `sbt` in `dev/scalastyle` to recover GitHub Action. As of now, Apache Spark sbt build is broken by the Maven Central repository policy. https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an > Effective January 15, 2020, The Central Maven Repository no longer supports insecure > communication over plain HTTP and requires that all requests to the repository are > encrypted over HTTPS. We can reproduce this locally by the following. ``` $ rm -rf ~/.m2/repository/org/apache/apache/18/ $ build/sbt clean ``` And, in GitHub Action, `lint-scala` is the only one which is using `sbt`. No. First of all, GitHub Action should be recovered. Also, manually, do the following. **Without Scalastyle violation** ``` $ dev/scalastyle OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=384m; support was removed in 8.0 Using `mvn` from path: /usr/local/bin/mvn Scalastyle checks passed. ``` **With Scalastyle violation** ``` $ dev/scalastyle OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=384m; support was removed in 8.0 Using `mvn` from path: /usr/local/bin/mvn Scalastyle checks failed at following occurrences: error file=/Users/dongjoon/PRS/SPARK-HTTP-501/core/src/main/scala/org/apache/spark/SparkConf.scala message=There should be no empty line separating imports in the same group. line=22 column=0 error file=/Users/dongjoon/PRS/SPARK-HTTP-501/core/src/test/scala/org/apache/spark/resource/ResourceProfileSuite.scala message=There should be no empty line separating imports in the same group. line=22 column=0 ``` Closes #27242 from dongjoon-hyun/SPARK-30534. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 384899944b25cb0abf5e71f9cc2610fecad4e8f5) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 January 2020, 00:08:02 UTC
60a908e [SPARK-29708][SQL][2.4] Correct aggregated values when grouping sets are duplicated ### What changes were proposed in this pull request? This pr intends to fix wrong aggregated values in `GROUPING SETS` when there are duplicated grouping sets in a query (e.g., `GROUPING SETS ((k1),(k1))`). For example; ``` scala> spark.table("t").show() +---+---+---+ | k1| k2| v| +---+---+---+ | 0| 0| 3| +---+---+---+ scala> sql("""select grouping_id(), k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2))""").show() +-------------+---+----+------+ |grouping_id()| k1| k2|sum(v)| +-------------+---+----+------+ | 0| 0| 0| 9| <---- wrong aggregate value and the correct answer is `3` | 1| 0|null| 3| +-------------+---+----+------+ // PostgreSQL case postgres=# select k1, k2, sum(v) from t group by grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2)); k1 | k2 | sum ----+------+----- 0 | 0 | 3 0 | 0 | 3 0 | 0 | 3 0 | NULL | 3 (4 rows) // Hive case hive> select GROUPING__ID, k1, k2, sum(v) from t group by k1, k2 grouping sets ((k1),(k1,k2),(k2,k1),(k1,k2)); 1 0 NULL 3 0 0 0 3 ``` [MS SQL Server has the same behaviour with PostgreSQL](https://github.com/apache/spark/pull/26961#issuecomment-573638442). This pr follows the behaviour of PostgreSQL/SQL server; it adds one more virtual attribute in `Expand` for avoiding wrongly grouping rows with the same grouping ID. This is the #26961 backport for `branch-2.4` ### Why are the changes needed? To fix bugs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? The existing tests. Closes #27229 from maropu/SPARK-29708-BRANCHC2.4. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 January 2020, 20:23:07 UTC
d6261a1 [SPARK-29450][SS][2.4] Measure the number of output rows for streaming aggregation with append mode ### What changes were proposed in this pull request? This patch addresses missing metric, the number of output rows for streaming aggregation with append mode. Other modes are correctly measuring it. ### Why are the changes needed? Without the patch, the value for such metric is always 0. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test added. Also manually tested with below query: > query ``` import spark.implicits._ spark.conf.set("spark.sql.shuffle.partitions", "5") val df = spark.readStream .format("rate") .option("rowsPerSecond", 1000) .load() .withWatermark("timestamp", "5 seconds") .selectExpr("timestamp", "mod(value, 100) as mod", "value") .groupBy(window($"timestamp", "10 seconds"), $"mod") .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value")) val query = df .writeStream .format("memory") .option("queryName", "test") .outputMode("append") .start() query.awaitTermination() ``` > before the patch ![screenshot-before-SPARK-29450](https://user-images.githubusercontent.com/1317309/69023217-58d7bc80-0a01-11ea-8cac-40f1cced6d16.png) > after the patch ![screenshot-after-SPARK-29450](https://user-images.githubusercontent.com/1317309/69023221-5c6b4380-0a01-11ea-8a66-7bf1b7d09fc7.png) Closes #27209 from HeartSaVioR/SPARK-29450-branch-2.4. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 January 2020, 07:26:48 UTC
94b5d3f Revert "[SPARK-27868][CORE] Better default value and documentation for socket server backlog." This reverts commit 84bd8085c74b5292cdccaa287023e39703062b07. 16 January 2020, 04:27:18 UTC
47a73b2 [SPARK-30491][INFRA][2.4] Enable dependency audit files to tell dependency classifier ### What changes were proposed in this pull request? Enable dependency audit files to tell the value of the artifact id, version, and classifier of a dependency. For example, `avro-mapred-1.8.2-hadoop2.jar` should be expanded to `avro-mapred/1.8.2/hadoop2/avro-mapred-1.8.2-hadoop2.jar` where `avro-mapred` is the artifact id, `1.8.2` is the version, and `hadoop2` is the classifier. ### Why are the changes needed? Dependency audit files are expected to be consumed by automated tests or downstream tools. However, the current dependency audit files under `dev/deps` only show jar names, and there isn't a simple rule on how to parse a jar name to get the values of the different fields. For example, `hadoop2` is the classifier of `avro-mapred-1.8.2-hadoop2.jar`; in contrast, `incubating` is the version of `htrace-core-3.1.0-incubating.jar`. Reference: there is a good example of a downstream tool that would be enabled, as yhuai suggested: > Say we have a Spark application that depends on a third-party dependency `foo`, which pulls in `jackson` as a transitive dependency. Unfortunately, `foo` depends on a different version of `jackson` than Spark. So, in the pom of this Spark application, we use the dependency management section to pin the version of `jackson`. By doing this, we are lifting `jackson` to the top-level dependency of my application and I want to have a way to keep tracking what Spark uses. What we can do is to cross-check my Spark application's classpath with what Spark uses. Then, with a test written in my code base, whenever my application bumps the Spark version, this test will check what we define in the application and what Spark has, and then remind us to change our application's pom if needed. In my case, I am fine to directly access git to get these audit files. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Code changes are verified by the generated dependency audit files naturally. Thus, there are no tests added. Closes #27178 from mengCareers/depsOptimize2.4. Lead-authored-by: Xinrong Meng <meng.careers@gmail.com> Co-authored-by: mengCareers <meng.careers@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 January 2020, 04:24:41 UTC
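A hypothetical Scala sketch of how a downstream check, like the one described in the quote above, might consume the new audit entry format, where each entry expands to artifactId/version/classifier/fileName:

```scala
// Example entry taken from the commit description above.
val entry = "avro-mapred/1.8.2/hadoop2/avro-mapred-1.8.2-hadoop2.jar"

// With the expanded format, the fields split unambiguously on "/".
val Array(artifactId, version, classifier, fileName) = entry.split("/")
assert(artifactId == "avro-mapred" && version == "1.8.2" && classifier == "hadoop2")
println(s"$artifactId:$version:$classifier -> $fileName")
```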
830a4ec [SPARK-30312][SQL][FOLLOWUP] Rename conf by adding `.enabled` Based on the [comment](https://github.com/apache/spark/pull/26956#discussion_r366680558), this patch changes the SQL config name from `spark.sql.truncateTable.ignorePermissionAcl` to `spark.sql.truncateTable.ignorePermissionAcl.enabled`. This makes the config consistent with other SQL configs. No. Unit test. Closes #27210 from viirya/truncate-table-permission-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit be4d825872b41e04e190066e550217362b82061e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 January 2020, 04:13:21 UTC
036568d [SPARK-30325][CORE][2.4] markPartitionCompleted cause task status inconsistent ### **What changes were proposed in this pull request?** Fix the inconsistent task status in `executorLost`, which is caused by `markPartitionCompleted`. ### **Why are the changes needed?** The inconsistency will cause the app to hang. The bug occurs in the corner case as follows: 1. The issue occurs during stage retry: the scheduler resubmits a new stage with the unfinished tasks. 2. If unfinished tasks in the origin stage finish while the same tasks on the new retry stage haven't, the task partitions are marked as successful on the current retry stage in the TSM `successful` array variable. 3. If the executor crashes while running tasks which have already succeeded via the origin stage, the TSM runs `executorLost` to reschedule the tasks on that executor, and it decrements the partition's running count in `copiesRunning` twice, down to -1. 4. `dequeueTaskFromList` uses `copiesRunning` equal to 0 as the basis for rescheduling tasks; since it is now -1, the task can't be rescheduled and the app hangs forever. ### **Does this PR introduce any user-facing change?** No ### **How was this patch tested?** Closes #27211 from seayoun/branch-2.4. Authored-by: you@example.com <you@example.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 January 2020, 03:32:40 UTC
390917b [SPARK-30246][CORE] OneForOneStreamManager might leak memory in connectionTerminated Ensure that all StreamStates are removed from OneForOneStreamManager memory map even if there's an error trying to release buffers OneForOneStreamManager may not remove all StreamStates from memory map when a connection is terminated. A RuntimeException might be thrown in StreamState$buffers.next() by one of ExternalShuffleBlockResolver$getBlockData... **breaking the loop through streams.entrySet(), keeping StreamStates in memory forever leaking memory.** That may happen when an application is terminated abruptly and executors removed before the connection is terminated or if shuffleIndexCache fails to get ShuffleIndexInformation References: https://github.com/apache/spark/blob/ee050ddbc6eb6bc08c7751a0eb00e7a05b011b52/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java#L319 https://github.com/apache/spark/blob/ee050ddbc6eb6bc08c7751a0eb00e7a05b011b52/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java#L357 https://github.com/apache/spark/blob/ee050ddbc6eb6bc08c7751a0eb00e7a05b011b52/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L195 https://github.com/apache/spark/blob/ee050ddbc6eb6bc08c7751a0eb00e7a05b011b52/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L208 https://github.com/apache/spark/blob/ee050ddbc6eb6bc08c7751a0eb00e7a05b011b52/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L330 No Unit test added Closes #27064 from hensg/SPARK-30246. Lead-authored-by: Henrique Goulart <henriquedsg89@gmail.com> Co-authored-by: Henrique Goulart <henrique.goulart@trivago.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit d42cf4566a9d4438fd1cae88674f0d02f3dbf5c9) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 15 January 2020, 21:33:36 UTC
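An illustrative Scala sketch of the robustness pattern the fix calls for (the real OneForOneStreamManager is Java): remove every stream state owned by the terminated channel from the map even if releasing its buffers throws, so one failure cannot leak the remaining entries. The names below are stand-ins, not the actual fields.

```scala
import scala.collection.concurrent.TrieMap

final case class StreamStateSketch(channelId: String, releaseBuffers: () => Unit)

val streams = TrieMap.empty[Long, StreamStateSketch]

def connectionTerminated(channelId: String): Unit = {
  streams.foreach { case (streamId, state) =>
    if (state.channelId == channelId) {
      streams.remove(streamId) // remove first, so a failing release below cannot leak the entry
      try state.releaseBuffers()
      catch { case e: RuntimeException => Console.err.println(s"release failed for $streamId: $e") }
    }
  }
}
```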
b149046 Preparing development version 2.4.6-SNAPSHOT 13 January 2020, 10:06:46 UTC
33bd2be Preparing Spark release v2.4.5-rc1 13 January 2020, 10:06:41 UTC
69de7f3 [SPARK-28152][SQL][FOLLOWUP] Add a legacy conf for old MsSqlServerDialect numeric mapping This is a follow-up for https://github.com/apache/spark/pull/25248. The new behavior cannot access existing tables which were created by the old behavior. This PR provides a way for existing users to avoid the new behavior. Yes. This will fix the broken behavior on the existing tables. Pass the Jenkins and manually run the JDBC integration test. ``` build/mvn install -DskipTests build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test ``` Closes #27184 from dongjoon-hyun/SPARK-28152-CONF. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 28fc0437ce6d2f6fbcd83be38aafb8a491c1a67d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 13 January 2020, 07:11:49 UTC
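A hedged usage sketch for restoring the old numeric mapping through the legacy flag this follow-up adds; the exact configuration key below is an assumption based on SPARK-28152 and is not quoted in the commit message above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("legacy-conf-sketch").getOrCreate()

// Key name is assumed from SPARK-28152; verify it against the SQLConf of your Spark build.
spark.conf.set("spark.sql.legacy.mssqlserver.numericMapping.enabled", "true")
```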
d1b527f [SPARK-30458][WEBUI] Fix Wrong Executor Computing Time in Time Line of Stage Page ### What changes were proposed in this pull request? The Executor Computing Time in the Timeline of the Stage Page will now be correct. ### Why are the changes needed? The Executor Computing Time in the Timeline of the Stage Page is wrong: it includes the Scheduler Delay time, while the Proportion excludes the Scheduler Delay. <img width="1467" alt="Snipaste_2020-01-08_19-04-33" src="https://user-images.githubusercontent.com/3488126/71976714-f2795880-3251-11ea-869a-43ca6e0cf96a.png"> The right executor computing time is 1ms, but the number in the UI is 3ms (including 2ms of scheduler delay); the proportion is right. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manual Closes #27135 from sddyljsx/SPARK-30458. Lead-authored-by: Neal Song <neal_song@126.com> Co-authored-by: neal_song <neal_song@126.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 65b603d597f683bd3180b5241b1b24663722d950) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 January 2020, 04:09:00 UTC
b249634 [SPARK-30478][CORE][DOCS] Fix Memory Package documentation ### What changes were proposed in this pull request? Update the doc of the memory package. ### Why are the changes needed? Since Spark 2.0, storage memory also uses off-heap memory. We update the doc here. ![memory manager](https://user-images.githubusercontent.com/3488126/72124682-9b35ce00-33a0-11ea-8cf9-301494974ef4.png) ### Does this PR introduce any user-facing change? No ### How was this patch tested? No Tests Needed Closes #27160 from sddyljsx/SPARK-30478. Lead-authored-by: Neal Song <neal_song@126.com> Co-authored-by: neal_song <neal_song@126.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 26ad8f8f34a6effb7dbe1e555fbd9340ed0d2b31) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 January 2020, 03:52:07 UTC
d84a96c [SPARK-30312][SQL][FOLLOWUP] Use inequality check instead to be robust ### What changes were proposed in this pull request? This is a followup to fix a brittle assert in a test case. ### Why are the changes needed? Original assert assumes that default permission is `rwxr-xr-x`, but in jenkins [env](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/6/testReport/junit/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/SPARK_30312__truncate_table___keep_acl_permission/) it could be `rwxrwxr-x`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #27175 from viirya/hot-fix. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b04407169b8165fe634c9c2214c0f54e45642fa6) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 January 2020, 21:19:22 UTC
9b53480 [SPARK-30312][SQL][2.4] Preserve path permission and acl when truncate table ### What changes were proposed in this pull request? This patch proposes to preserve existing permission/acls of paths when truncate table/partition. Note that this is backport of #26956 to branch-2.4. ### Why are the changes needed? When Spark SQL truncates table, it deletes the paths of table/partitions, then re-create new ones. If permission/acls were set on the paths, the existing permission/acls will be deleted. We should preserve the permission/acls if possible. ### Does this PR introduce any user-facing change? Yes. When truncate table/partition, Spark will keep permission/acls of paths. ### How was this patch tested? Unit test and manual test as shown in #26956. Closes #27173 from viirya/truncate-table-permission-2.4. Authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 January 2020, 18:14:16 UTC
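A minimal sketch, assuming a Hadoop `FileSystem` handle, of the preserve-and-restore pattern this backport describes: capture the path's permission and ACL entries before the delete, then re-apply them after re-creating the path. It is not the patched Spark code, and the ACL calls are wrapped because some file systems do not support them.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

def truncatePreservingAcl(fs: FileSystem, path: Path): Unit = {
  // Capture existing permission and (if supported) ACL entries before deletion.
  val permission = fs.getFileStatus(path).getPermission
  val aclEntries =
    try Some(fs.getAclStatus(path).getEntries)
    catch { case _: UnsupportedOperationException => None }

  // Delete and re-create the path, as truncate does today.
  fs.delete(path, true)
  fs.mkdirs(path)

  // Restore what was captured so the new path keeps the old permission/ACLs.
  fs.setPermission(path, permission)
  aclEntries.filter(!_.isEmpty).foreach(fs.setAcl(path, _))
}
```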
3b029d9 [SPARK-30489][BUILD] Make build delete pyspark.zip file properly ### What changes were proposed in this pull request? A small fix to the Maven build file under the `assembly` module, switching the "dir" attribute to "file". ### Why are the changes needed? To make the `<delete>` task properly delete an existing zip file. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Ran a build with the change and confirmed that a corrupted zip file was replaced with the correct one. Closes #27171 from jeff303/SPARK-30489. Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 582509b7ae76bc298c31a68bcfd7011c1b9e23a7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 January 2020, 01:00:08 UTC
0a5757e [SPARK-30447][SQL][2.4] Constant propagation nullability issue ## What changes were proposed in this pull request? This PR fixes the `ConstantPropagation` rule, as the current implementation produces incorrect results in some cases. E.g. ``` SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1) ``` returns the rows where `c` is null, due to the propagated `1 + 1 = 1`, but it shouldn't. ## Why are the changes needed? To fix a bug. ## Does this PR introduce any user-facing change? Yes, fixes a bug. ## How was this patch tested? New UTs. Closes #27167 from peter-toth/SPARK-30447-constant-propagation-nullability-issue-v2.4. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 January 2020, 18:57:54 UTC
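The nullability issue boils down to SQL's three-valued logic. A self-contained reproduction sketch (the object name, session setup, and sample data are assumptions) shows why rows with a null `c` must not survive the filter:

```scala
import org.apache.spark.sql.SparkSession

// With c = NULL, NOT(c = 1 AND c + 1 = 1) evaluates to NOT(NULL AND NULL) = NULL, so the
// row must be filtered out. Naive constant propagation rewrites the predicate to
// NOT(c = 1 AND 1 + 1 = 1) = NOT(c = 1 AND false) = NOT(false) = true, wrongly keeping it.
object ConstantPropagationRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("SPARK-30447-repro").getOrCreate()
    import spark.implicits._
    Seq(Some(1), Some(2), None).toDF("c").createOrReplaceTempView("t")
    spark.sql("SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1)").show()
    // Expected with the fix: only the row with c = 2; the NULL row must not appear.
    spark.stop()
  }
}
```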
6ac3659 [SPARK-30410][SQL][2.4] Calculating size of table with large number of partitions causes flooding logs ### What changes were proposed in this pull request? Backported from [pr#27079](https://github.com/apache/spark/pull/27079). For a partitioned table, if the number of partitions is very large, e.g. tens of thousands or even larger, calculating its total size floods the logs. The flooding happens in: 1. `calculateLocationSize` prints the start and end of calculating the location size, and it is called per partition; 2. `bulkListLeafFiles` prints all partition paths. This PR simplifies the logging when calculating the size of a partitioned table. ### How was this patch tested? not related Closes #27143 from wzhfy/improve_log-2.4. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 09 January 2020, 06:01:05 UTC
e52ae4e [SPARK-30450][INFRA][FOLLOWUP][2.4] Fix git folder regex for windows file separator ### What changes were proposed in this pull request? The regex is to exclude the .git folder for the python linter, but bash escaping caused only one forward slash to be included. This adds the necessary second slash. ### Why are the changes needed? This is necessary to properly match the file separator character. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually. Added File dev/something.git.py and ran `dev/lint-python` ```dev/lint-python pycodestyle checks failed. *** Error compiling './dev/something.git.py'... File "./dev/something.git.py", line 1 mport asdf2 ^ SyntaxError: invalid syntax``` Closes #27139 from ericfchang/SPARK-30450. Authored-by: Eric Chang <eric.chang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 January 2020, 00:38:19 UTC
d1fa3ed [SPARK-30450][2.4][INFRA] Exclude .git folder for python linter ### What changes were proposed in this pull request? This excludes the .git folder when the python linter runs. We want to exclude it because there may be files in .git from other branches that could cause the linter to fail. ### Why are the changes needed? I ran into a case where there was a branch name that ended with a ".py" suffix, so there were git ref files in the .git folder under .git/logs/refs and .git/refs/remotes. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual. ``` $ git branch 3.py $ git checkout 3.py Switched to branch '3.py' $ dev/lint-python starting python compilation test... Python compilation failed with the following errors: *** Error compiling './.git/logs/refs/heads/3.py'... File "./.git/logs/refs/heads/3.py", line 1 0000000000000000000000000000000000000000 895e572b73ca2796cbc3c468bb2c21abed5b22f1 Dongjoon Hyun <dhyunapple.com> 1578438255 -0800 branch: Created from master ^ SyntaxError: invalid syntax *** Error compiling './.git/refs/heads/3.py'... File "./.git/refs/heads/3.py", line 1 895e572b73ca2796cbc3c468bb2c21abed5b22f1 ^ SyntaxError: invalid syntax ``` Closes #27121 from ericfchang/SPARK-30450. Authored-by: Eric Chang <eric.chang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 January 2020, 02:56:38 UTC
f935e4d [SPARK-30225][CORE] Correct read() behavior past EOF in NioBufferedFileInputStream This bug manifested itself when another stream would potentially make a call to NioBufferedFileInputStream.read() after it had reached EOF in the wrapped stream. In that case, the refill() code would clear the output buffer the first time EOF was found, leaving it in a readable state for subsequent read() calls. If any of those calls were made, bad data would be returned. By flipping the buffer before returning, even in the EOF case, you get the correct behavior in subsequent calls. I picked that approach to avoid keeping more state in this class, although it means calling the underlying stream even after EOF (which is fine, but perhaps a little more expensive). This showed up (at least) when using encryption, because the commons-crypto StreamInput class does not track EOF internally, leaving it for the wrapped stream to behave correctly. Tested with added unit test + slightly modified test case attached to SPARK-18105. Closes #27084 from vanzin/SPARK-30225. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 January 2020, 09:40:27 UTC
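A simplified, self-contained sketch of the flip-even-at-EOF pattern described above; this stands in for, and is not, the actual NioBufferedFileInputStream code:

```scala
import java.io.InputStream
import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

// Simplified stand-in for the wrapped-stream scenario: flip the buffer after every refill
// attempt, including the EOF case, so it is left empty rather than "cleared and readable",
// and repeated read() calls past EOF keep returning -1 instead of stale bytes.
class SimpleBufferedStream(ch: ReadableByteChannel, bufSize: Int = 8192) extends InputStream {
  private val buf = ByteBuffer.allocate(bufSize)
  buf.flip() // start fully consumed

  private def refill(): Boolean = {
    if (!buf.hasRemaining) {
      buf.clear()
      var n = 0
      while (n == 0) { n = ch.read(buf) }
      buf.flip() // flip unconditionally, also when n == -1 (EOF)
      n > 0
    } else {
      true
    }
  }

  override def read(): Int = if (!refill()) -1 else buf.get() & 0xFF
}
```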
16f8fae [SPARK-30285][CORE] Fix deadlock between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError There is a deadlock between `LiveListenerBus#stop` and `AsyncEventQueue#removeListenerOnError`. We can reproduce it as follows: 1. Post some events to `LiveListenerBus` 2. Call `LiveListenerBus#stop` and hold the synchronized lock of `bus`(https://github.com/apache/spark/blob/5e92301723464d0876b5a7eec59c15fed0c5b98c/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L229), waiting until all the events are processed by listeners, then remove all the queues 3. The event queue drains out events by posting them to its listeners. If a listener is interrupted, it calls `AsyncEventQueue#removeListenerOnError`, which in turn calls `bus.removeListener`(https://github.com/apache/spark/blob/7b1b60c7583faca70aeab2659f06d4e491efa5c0/core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala#L207), trying to acquire the synchronized lock of the bus, resulting in a deadlock. This PR removes the `synchronized` from `LiveListenerBus.stop` because the underlying data structures themselves are thread-safe. This fixes the deadlock. No user-facing change. Tested with a new UT. Closes #26924 from wangshuo128/event-queue-race-condition. Authored-by: Wang Shuo <wangshuo128@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 10cae04108c375a7f5ca7685fea593bd7f49f7a6) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 03 January 2020, 00:44:26 UTC
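The lock cycle can be sketched abstractly (the classes below are illustrative, not Spark's real ones):

```scala
// Illustrative classes only, not Spark's real ones: Bus.stop() holds the bus lock while
// waiting for the queue to drain, and the queue's dispatch thread calls back into the bus
// (removeListenerOnError -> removeListener), which needs that same lock -> deadlock.
object ListenerBusDeadlockSketch {
  class Bus {
    private val listeners = scala.collection.mutable.Buffer[String]()
    def removeListener(l: String): Unit = synchronized { listeners -= l } // needs the bus lock
    def stop(queue: Queue): Unit = synchronized {                         // holds the bus lock...
      queue.waitUntilDrained()                                            // ...while waiting for the dispatch thread
    }
  }
  class Queue(bus: Bus) {
    def waitUntilDrained(): Unit = { /* blocks until the dispatch thread finishes */ }
    // Runs on the dispatch thread when a listener throws: blocks on the bus lock held by stop().
    def removeListenerOnError(l: String): Unit = bus.removeListener(l)
  }
  // The fix described above drops the synchronized block around the wait in stop(),
  // relying on the thread-safe collections underneath instead.
}
```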
f989a35 [SPARK-26560][SQL][2.4] Spark should be able to run Hive UDF using jar regardless of current thread context classloader ### What changes were proposed in this pull request? This patch is based on #23921 but revised to be simpler, as well as adds UT to test the behavior. (This patch contains the commit from #23921 to retain credit.) Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into jar classloader in shared state, and changes current thread's context classloader to jar classloader as many parts of remaining codes rely on current thread's context classloader. This would work if the further queries will run in same thread and there's no change on context classloader for the thread, but once the context classloader of current thread is switched back by various reason, Spark fails to create instance of class for the function. This bug mostly affects spark-shell, as spark-shell will roll back current thread's context classloader at every prompt. But it may also affects the case of job-server, where the queries may be running in multiple threads. This patch fixes the issue via switching the context classloader to the classloader which loads the class. Hopefully FunctionBuilder created by `makeFunctionBuilder` has the information of Class as a part of closure, hence the Class itself can be provided regardless of current thread's context classloader. ### Why are the changes needed? Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New UT. Closes #27075 from HeartSaVioR/SPARK-26560-branch-2.4. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 January 2020, 13:28:52 UTC
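The general pattern of the fix, assuming hypothetical helper names, is to switch the thread's context classloader to the loader that loaded the UDF class for the duration of the instantiation:

```scala
// Made-up helper names; the point is the pattern: run the instantiation with the context
// classloader set to the loader that actually loaded the UDF class, so the calling thread's
// current context classloader (e.g. reset by the spark-shell prompt) no longer matters.
object ClassLoaderSwitchSketch {
  def withContextClassLoader[T](cl: ClassLoader)(body: => T): T = {
    val thread = Thread.currentThread()
    val old = thread.getContextClassLoader
    thread.setContextClassLoader(cl)
    try body finally thread.setContextClassLoader(old)
  }

  // e.g. when building the Hive UDF instance from a Class[_] that was loaded from the added jar:
  def instantiate(clazz: Class[_]): AnyRef =
    withContextClassLoader(clazz.getClassLoader) {
      clazz.getConstructor().newInstance().asInstanceOf[AnyRef]
    }
}
```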
03cea11 [SPARK-30339][SQL][2.4] Avoid to fail twice in function lookup ### What changes were proposed in this pull request? Backported from [pr#26994](https://github.com/apache/spark/pull/26994). Currently if function lookup fails, spark will give it a second chance by casting decimal type to double type. But for cases where decimal type doesn't exist, it's meaningless to lookup again and cause extra cost like unnecessary metastore access. We should throw exceptions directly in these cases. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Covered by existing tests. Closes #27054 from wzhfy/avoid_udf_fail_twice-2.4. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Zhenhua Wang <wzh_zju@163.com> 31 December 2019, 16:44:42 UTC
db32408 [SPARK-30333][CORE][BUILD] Upgrade jackson-databind to 2.6.7.3 ### What changes were proposed in this pull request? Upgrade jackson-databind to 2.6.7.3 to address the following CVEs: CVE-2018-14718 - CVE-2018-14721 (https://github.com/FasterXML/jackson-databind/issues/2097) and CVE-2018-19360, CVE-2018-19361, CVE-2018-19362 (https://github.com/FasterXML/jackson-databind/issues/2186). tag: https://github.com/FasterXML/jackson-databind/commits/jackson-databind-2.6.7.3 ### Why are the changes needed? To fix CVE-2018-14718, CVE-2018-14719, CVE-2018-14720, CVE-2018-14721, CVE-2018-19360, CVE-2018-19361, and CVE-2018-19362. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT Closes #26986 from sandeep-katta/jacksonUpgrade. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 24 December 2019, 04:53:07 UTC
bfa851c [SPARK-30269][SQL][2.4] Should use old partition stats to decide whether to update stats when analyzing partition ### What changes were proposed in this pull request? This is a backport for [pr#26908](https://github.com/apache/spark/pull/26908). It's an obvious bug: currently when analyzing partition stats, we use old table stats to compare with newly computed stats to decide whether it should update stats or not. ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? no ### How was this patch tested? add new tests Closes #26963 from wzhfy/failto_update_part_stats_2.4. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 23 December 2019, 01:24:14 UTC
d10e414 [SPARK-30318][CORE] Upgrade jetty to 9.3.27.v20190418 ### What changes were proposed in this pull request? Upgrade jetty to 9.3.27.v20190418 to fix below CVE https://nvd.nist.gov/vuln/detail/CVE-2019-10247 https://nvd.nist.gov/vuln/detail/CVE-2019-10241 tag: https://github.com/eclipse/jetty.project/releases/tag/jetty-9.3.27.v20190418 ### Why are the changes needed? To fix CVE-2019-10247 and CVE-2019-10241 ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing Test Closes #26967 from sandeep-katta/jettyUpgrade. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> 21 December 2019, 16:55:46 UTC
871aacc [SPARK-29918][SQL][FOLLOWUP][TEST] Fix arrayOffset in `RecordBinaryComparatorSuite` ### What changes were proposed in this pull request? As mentioned in https://github.com/apache/spark/pull/26548#pullrequestreview-334345333, some test cases in `RecordBinaryComparatorSuite` use a fixed arrayOffset when writing to long arrays, which could lead to weird behavior including crashing with a SIGSEGV. This PR fixes the problem by computing the arrayOffset based on `Platform.LONG_ARRAY_OFFSET`. ### How was this patch tested? Tested locally. Previously, when we tried to add `System.gc()` between writing into the long array and comparing by RecordBinaryComparator, there was a chance to hit a JVM crash with SIGSEGV like: ``` # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00007efc66970bcb, pid=11831, tid=0x00007efc0f9f9700 # # JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10) # Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode linux-amd64 compressed oops) # Problematic frame: # V [libjvm.so+0x5fbbcb] # # Core dump written. Default location: /home/jenkins/workspace/sql/core/core or core.11831 # # An error report file with more information is saved as: # /home/jenkins/workspace/sql/core/hs_err_pid11831.log # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # ``` After the fix, those test cases didn't crash the JVM anymore. Closes #26939 from jiangxb1987/rbc. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 December 2019, 15:34:06 UTC
07caebf [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table When querying a partitioned table with format `org.apache.hive.hcatalog.data.JsonSerDe` and more than one task runs in each executor concurrently, the following exception is encountered: `java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord` The exception occurs in `HadoopTableReader.fillObject`. `org.apache.hive.hcatalog.data.JsonSerDe#initialize` populates a `cachedObjectInspector` field by calling `HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector`, which is not thread-safe; this `cachedObjectInspector` is returned by `JsonSerDe#getObjectInspector`. We protect against this Hive bug by synchronizing on an object when we need to call `initialize` on `org.apache.hadoop.hive.serde2.Deserializer` instances (which may be `JsonSerDe` instances). By doing so, the `ObjectInspector` for the `Deserializer` of the partitions of the JSON table and that of the table `SerDe` are the same cached `ObjectInspector` and `HadoopTableReader.fillObject` then works correctly. (If the `ObjectInspector`s are different, then a bug in `HCatRecordObjectInspector` causes an `ArrayList` to be created instead of an `HCatRecord`, resulting in the `ClassCastException` that is seen.) To avoid HIVE-15773 / HIVE-21752. No. Tested manually on a cluster with a partitioned JSON table and running a query using more than one core per executor. Before this change, the ClassCastException happens consistently. With this change it does not happen. Closes #26895 from wypoon/SPARK-17398. Authored-by: Wing Yew Poon <wypoon@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit c72f88b0ba20727e831ba9755d9628d0347ee3cb) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 20 December 2019, 18:56:10 UTC
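A minimal sketch of the described workaround, under the assumption that a single shared lock serializing `Deserializer.initialize` calls within an executor is sufficient (the helper object is hypothetical):

```scala
import java.util.Properties
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.Deserializer

// Hypothetical helper: funnel every Deserializer initialization through one lock so the
// non-thread-safe HCatRecordObjectInspectorFactory lookup inside JsonSerDe.initialize is
// never run concurrently by multiple tasks in the same executor.
object SerDeInitLockSketch {
  private val lock = new Object

  def initialize(deserializer: Deserializer, conf: Configuration, props: Properties): Unit =
    lock.synchronized {
      deserializer.initialize(conf, props)
    }
}
```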
bbeb191 [SPARK-30236][SQL][DOCS][2.4] Clarify date and time patterns supported in the related functions ### What changes were proposed in this pull request? Link to appropriate Java Class with list of date/time patterns supported ### Why are the changes needed? Avoid confusion on the end-user's side of things, as seen in questions like [this](https://stackoverflow.com/questions/54496878/date-format-conversion-is-adding-1-year-to-the-border-dates) on StackOverflow ### Does this PR introduce any user-facing change? Yes, Docs are updated. ### How was this patch tested? Built docs: ![image](https://user-images.githubusercontent.com/2394761/70722498-4b0b1380-1cef-11ea-81ee-66bd6b1906c2.png) Closes #26867 from johnhany97/SPARK-30236-2.4. Authored-by: John Ayad <johnhany97@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 19 December 2019, 20:39:10 UTC
185ab4b [SPARK-30274][CORE] Avoid BytesToBytesMap lookup hang forever when holding keys reaching max capacity ### What changes were proposed in this pull request? We should not append keys to BytesToBytesMap all the way up to its max capacity. ### Why are the changes needed? BytesToBytesMap.append allows appending keys until the number of keys reaches MAX_CAPACITY. But once the pointer array in the map holds MAX_CAPACITY keys, the next call of lookup will hang forever. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested by: ```java @Test public void testCapacity() { TestMemoryManager memoryManager2 = new TestMemoryManager( new SparkConf() .set(package$.MODULE$.MEMORY_OFFHEAP_ENABLED(), true) .set(package$.MODULE$.MEMORY_OFFHEAP_SIZE(), 25600 * 1024 * 1024L) .set(package$.MODULE$.SHUFFLE_SPILL_COMPRESS(), false) .set(package$.MODULE$.SHUFFLE_COMPRESS(), false)); TaskMemoryManager taskMemoryManager2 = new TaskMemoryManager(memoryManager2, 0); final long pageSizeBytes = 8000000 + 8; // 8 bytes for end-of-page marker final BytesToBytesMap map = new BytesToBytesMap(taskMemoryManager2, 1024, pageSizeBytes); try { for (long i = 0; i < BytesToBytesMap.MAX_CAPACITY + 1; i++) { final long[] value = new long[]{i}; boolean succeed = map.lookup(value, Platform.LONG_ARRAY_OFFSET, 8).append( value, Platform.LONG_ARRAY_OFFSET, 8, value, Platform.LONG_ARRAY_OFFSET, 8); } map.free(); } finally { map.free(); } } ``` Once the map has been appended up to 536870912 keys (MAX_CAPACITY), the next lookup will hang. Closes #26914 from viirya/fix-bytemap2. Authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b2baaa2fccceaa69f69a76f534cfbc50e6471cbe) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 December 2019, 19:37:35 UTC
6d90298 [SPARK-25392][CORE][WEBUI] Prevent error page when accessing pools page from history server ### What changes were proposed in this pull request? ### Why are the changes needed? Currently, from the history server we are not able to access the pool info, as we don't write pool information to the event log other than the pool name. Spark already hides the pool table when accessed from the history server, but the pool column in the stage table still redirects to the pools table, and that throws an error when the pools page is accessed. To prevent the error page, we need to hide the pool column in the stage table as well. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manual test Before change: ![Screenshot 2019-11-21 at 6 49 40 AM](https://user-images.githubusercontent.com/23054875/69293868-219b2280-0c30-11ea-9b9a-17140d024d3a.png) ![Screenshot 2019-11-21 at 6 48 51 AM](https://user-images.githubusercontent.com/23054875/69293834-147e3380-0c30-11ea-9dec-d5f67665486d.png) After change: ![Screenshot 2019-11-21 at 7 29 01 AM](https://user-images.githubusercontent.com/23054875/69293991-9cfcd400-0c30-11ea-98a0-7a6268a4e5ab.png) Closes #26616 from shahidki31/poolHistory. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit dd217e10fc0408831c2c658fc3f52d2917f1a6a2) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 16 December 2019, 23:02:55 UTC
cd6a0c4 [MINOR][DOCS] Fix documentation for slide function ### What changes were proposed in this pull request? This PR proposes to fix documentation for slide function. Fixed the spacing issue and added some parameter related info. ### Why are the changes needed? Documentation improvement ### Does this PR introduce any user-facing change? No (doc-only change). ### How was this patch tested? Manually tested by documentation build. Closes #26896 from bboutkov/pyspark_doc_fix. Authored-by: Boris Boutkov <boris.boutkov@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 3bf5498b4a58ebf39662ee717d3538af8b838e2c) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 16 December 2019, 07:29:30 UTC
0bf0e44 Revert "[SPARK-29152][2.4][CORE] Executor Plugin shutdown when dynamic allocation is enabled" This reverts commit 74d8cf133743ff4a28e39f2ef4c63677e0fb7c08. 15 December 2019, 23:47:02 UTC
f0d6989 [SPARK-30263][CORE] Don't log potentially sensitive value of non-Spark properties ignored in spark-submit ### What changes were proposed in this pull request? The value of non-Spark config properties ignored in spark-submit is no longer logged. ### Why are the changes needed? The value isn't really needed in the logs, and could contain potentially sensitive info. While we can redact the values selectively too, I figured it's more robust to just not log them at all here, as the values aren't important in this log statement. ### Does this PR introduce any user-facing change? Other than the change to logging above, no. ### How was this patch tested? Existing tests Closes #26893 from srowen/SPARK-30263. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 46e950bea883b98cd3beb7bd637bffe522656435) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 14 December 2019, 21:14:09 UTC
c411269 [SPARK-30238][SQL][2.4] hive partition pruning can only support string and integral types backport https://github.com/apache/spark/pull/26871 ----- Check the partition column data type and only allow string and integral types in hive partition pruning. Currently we only support string and integral types in hive partition pruning, but the check is done for literals. If the predicate is `InSet`, then there is no literal and we may pass an unsupported partition predicate to Hive and cause problems. Closes #26876 from cloud-fan/backport. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 13 December 2019, 07:47:22 UTC
f4e95ca [MINOR][SS][DOC] Fix the ss-kafka doc for availability of 'minPartitions' option ### What changes were proposed in this pull request? This patch fixes the availability of `minPartitions` option for Kafka source, as it is only supported by micro-batch for now. There's a WIP PR for batch (#25436) as well but there's no progress on the PR so far, so safer to fix the doc first, and let it be added later when we address it with batch case as well. ### Why are the changes needed? The doc is wrong and misleading. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Just a doc change. Closes #26849 from HeartSaVioR/MINOR-FIX-minPartition-availability-doc. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit e39bb4c9fdeba05ee16c363f2183421fa49578c2) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 December 2019, 00:30:21 UTC
b0d5eb0 [SPARK-30198][CORE] BytesToBytesMap does not grow internal long array as expected ### What changes were proposed in this pull request? This patch changes the condition to check if BytesToBytesMap should grow up its internal array. Specifically, it changes to compare by the capacity of the array, instead of its size. ### Why are the changes needed? One Spark job on our cluster hangs forever at BytesToBytesMap.safeLookup. After inspecting, the long array size is 536870912. Currently in BytesToBytesMap.append, we only grow the internal array if the size of the array is less than its MAX_CAPACITY that is 536870912. So in above case, the array can not be grown up, and safeLookup can not find an empty slot forever. But it is wrong because we use two array entries per key, so the array size is twice the capacity. We should compare the current capacity of the array, instead of its size. ### Does this PR introduce any user-facing change? No ### How was this patch tested? This issue only happens when loading big number of values into BytesToBytesMap, so it is hard to do unit test. This is tested manually with internal Spark job. Closes #26828 from viirya/fix-bytemap. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b4aeaf906fe1ece886a730ae7291384e297a3bfb) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 December 2019, 22:58:41 UTC
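The corrected condition can be illustrated with a tiny self-contained sketch; the constant mirrors the number quoted above, and the real BytesToBytesMap code is of course more involved:

```scala
// Self-contained sketch of the corrected check. The map uses two long slots per key
// (key address and hash code), so the number of keys the array can hold is size / 2, and it
// is that capacity, not the raw array size, that must be compared against MAX_CAPACITY.
object GrowthCheckSketch {
  val MaxCapacity: Int = 1 << 29 // 536870912 keys, the number quoted above

  def canGrow(longArraySize: Long): Boolean = (longArraySize / 2) < MaxCapacity

  def main(args: Array[String]): Unit = {
    // A 536870912-slot array only holds 268435456 keys, so growth must still be allowed:
    println(canGrow(536870912L)) // true with the capacity-based check; the old size-based check said no
  }
}
```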
74d8cf1 [SPARK-29152][2.4][CORE] Executor Plugin shutdown when dynamic allocation is enabled ### What changes were proposed in this pull request? Added a `shutdownHook` for the shutdown method of the executor plugin. This ensures that the shutdown method will always be called. ### Why are the changes needed? Whenever executors do not go down gracefully, i.e. they get killed due to idle time or are killed forcefully, the shutdown method of the executor plugin is not called. The shutdown method can be used to release any resources that the plugin acquired during its initialisation. So it's important to make sure that every time an executor goes down, the shutdown method of the plugin gets called. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Tested Manually Closes #26841 from iRakson/SPARK-29152_2.4. Authored-by: root1 <raksonrakesh@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 December 2019, 02:52:07 UTC
0884766 [SPARK-30163][INFRA][FOLLOWUP] Make `.m2` directory for cold start without cache ### What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/26793 and aims to initialize `~/.m2` directory. ### Why are the changes needed? In case of cache reset, `~/.m2` directory doesn't exist. It causes a failure. - `master` branch has a cache as of now. So, we missed this. - `branch-2.4` has no cache as of now, and we hit this failure. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This PR is tested against personal `branch-2.4`. - https://github.com/dongjoon-hyun/spark/pull/12 Closes #26794 from dongjoon-hyun/SPARK-30163-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 16f1b23d75c0b44aac61111bfb2ae9bb0f3fab68) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 December 2019, 20:58:34 UTC
89d0f7c [SPARK-30163][INFRA] Use Google Maven mirror in GitHub Action This PR aims to use the [Google Maven mirror](https://cloudplatform.googleblog.com/2015/11/faster-builds-for-Java-developers-with-Maven-Central-mirror.html) in `GitHub Action` jobs to improve stability. ```xml <settings> <mirrors> <mirror> <id>google-maven-central</id> <name>GCS Maven Central mirror</name> <url>https://maven-central.storage-download.googleapis.com/repos/central/data/</url> <mirrorOf>central</mirrorOf> </mirror> </mirrors> </settings> ``` Although we added a Maven cache inside `GitHub Action`, timeouts happen too frequently while accessing the `artifact descriptor`. ``` [ERROR] Failed to execute goal on project spark-mllib_2.12: ... Failed to read artifact descriptor for ... ... Connection timed out (Read failed) -> [Help 1] ``` No. This PR is irrelevant to Jenkins. This was tested on a personal repository first. The `GitHub Action` run of this PR should pass. - https://github.com/dongjoon-hyun/spark/pull/11 Closes #26793 from dongjoon-hyun/SPARK-30163. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 1068b8b24910eec8122bf7fa4748a101becf0d2b) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 December 2019, 20:06:09 UTC
6dff114 [SPARK-24666][ML] Fix infinity vectors produced by Word2Vec when numIterations are large ### What changes were proposed in this pull request? This patch adds normalization to word vectors when fitting dataset in Word2Vec. ### Why are the changes needed? Running Word2Vec on some datasets, when numIterations is large, can produce infinity word vectors. ### Does this PR introduce any user-facing change? Yes. After this patch, Word2Vec won't produce infinity word vectors. ### How was this patch tested? Manually. This issue is not always reproducible on any dataset. The dataset known to reproduce it is too large (925M) to upload. ```scala case class Sentences(name: String, words: Array[String]) val dataset = spark.read .option("header", "true").option("sep", "\t") .option("quote", "").option("nullValue", "\\N") .csv("/tmp/title.akas.tsv") .filter("region = 'US' or language = 'en'") .select("title") .as[String] .map(s => Sentences(s, s.split(' '))) .persist() println("Training model...") val word2Vec = new Word2Vec() .setInputCol("words") .setOutputCol("vector") .setVectorSize(64) .setWindowSize(4) .setNumPartitions(50) .setMinCount(5) .setMaxIter(30) val model = word2Vec.fit(dataset) model.getVectors.show() ``` Before: ``` Training model... +-------------+--------------------+ | word| vector| +-------------+--------------------+ | Unspoken|[-Infinity,-Infin...| | Talent|[-Infinity,Infini...| | Hourglass|[2.02805806500023...| |Nickelodeon's|[-4.2918617120906...| | Priests|[-1.3570403355926...| | Religion:|[-6.7049072282803...| | Bu|[5.05591774315586...| | Totoro:|[-1.0539840178632...| | Trouble,|[-3.5363592836003...| | Hatter|[4.90413981352826...| | '79|[7.50436471285412...| | Vile|[-2.9147142985312...| | 9/11|[-Infinity,Infini...| | Santino|[1.30005911270850...| | Motives|[-1.2538958306253...| | '13|[-4.5040152427657...| | Fierce|[Infinity,Infinit...| | Stover|[-2.6326895394029...| | 'It|[1.66574533864436...| | Butts|[Infinity,Infinit...| +-------------+--------------------+ only showing top 20 rows ``` After: ``` Training model... +-------------+--------------------+ | word| vector| +-------------+--------------------+ | Unspoken|[-0.0454501919448...| | Talent|[-0.2657704949378...| | Hourglass|[-0.1399687677621...| |Nickelodeon's|[-0.1767119318246...| | Priests|[-0.0047509293071...| | Religion:|[-0.0411605164408...| | Bu|[0.11837736517190...| | Totoro:|[0.05258282646536...| | Trouble,|[0.09482011198997...| | Hatter|[0.06040831282734...| | '79|[0.04783720895648...| | Vile|[-0.0017210749210...| | 9/11|[-0.0713915303349...| | Santino|[-0.0412711687386...| | Motives|[-0.0492418706417...| | '13|[-0.0073119504377...| | Fierce|[-0.0565455369651...| | Stover|[0.06938160210847...| | 'It|[0.01117012929171...| | Butts|[0.05374567210674...| +-------------+--------------------+ only showing top 20 rows ``` Closes #26722 from viirya/SPARK-24666-2. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com> (cherry picked from commit 755d8894485396b0a21304568c8ec5a55030f2fd) Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com> 06 December 2019, 00:33:19 UTC
3b64e2f [SPARK-30129][CORE][2.4] Set client's id in TransportClient after successful auth The new auth code was missing this bit, so it was not possible to know which app a client belonged to when auth was on. I also refactored the SASL test that checks for this so it also checks the new protocol (test failed before the fix, passes now). Closes #26764 from vanzin/SPARK-30129-2.4. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 December 2019, 17:02:10 UTC
663441f [SPARK-30082][SQL][2.4] Do not replace Zeros when replacing NaNs ### What changes were proposed in this pull request? Do not cast `NaN` to an `Integer`, `Long`, `Short` or `Byte`. This is because casting `NaN` to those types results in a `0` which erroneously replaces `0`s while only `NaN`s should be replaced. ### Why are the changes needed? This Scala code snippet: ``` import scala.math; println(Double.NaN.toLong) ``` returns `0` which is problematic as if you run the following Spark code, `0`s get replaced as well: ``` >>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value")) >>> df.show() +-----+-----+ |index|value| +-----+-----+ | 1.0| 0| | 0.0| 3| | NaN| 0| +-----+-----+ >>> df.replace(float('nan'), 2).show() +-----+-----+ |index|value| +-----+-----+ | 1.0| 2| | 0.0| 3| | 2.0| 2| +-----+-----+ ``` ### Does this PR introduce any user-facing change? Yes, after the PR, running the same above code snippet returns the correct expected results: ``` >>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value")) >>> df.show() +-----+-----+ |index|value| +-----+-----+ | 1.0| 0| | 0.0| 3| | NaN| 0| +-----+-----+ >>> df.replace(float('nan'), 2).show() +-----+-----+ |index|value| +-----+-----+ | 1.0| 0| | 0.0| 3| | 2.0| 0| +-----+-----+ ``` ### How was this patch tested? Added unit tests to verify replacing `NaN` only affects columns of type `Float` and `Double` Closes #26749 from johnhany97/SPARK-30082-2.4. Authored-by: John Ayad <johnhany97@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 December 2019, 05:25:50 UTC
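The root cause is plain JVM cast semantics, which a short sketch makes visible; the column names and the eligibility check at the end are illustrative assumptions mirroring the described guard:

```scala
// Column names are made up; the point is the JVM cast semantics behind the bug and the
// guard that restricts the NaN replacement to fractional columns.
object NaNCastSketch {
  def main(args: Array[String]): Unit = {
    println(Double.NaN.toLong) // 0: casting NaN to an integral type silently yields 0
    println(Double.NaN.toInt)  // 0: which is why a naive replace(NaN, x) also rewrote real zeros

    // Only fractional columns should be eligible for a NaN -> value replacement:
    val columns = Seq("index" -> "double", "value" -> "int")
    val eligible = columns.collect { case (name, dt) if dt == "double" || dt == "float" => name }
    println(eligible) // List(index)
  }
}
```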
76576b6 [SPARK-30050][SQL] analyze table and rename table should not erase hive table bucketing info ### What changes were proposed in this pull request? This patch adds the Hive provider into the table metadata in `HiveExternalCatalog.alterTableStats`. When we call `HiveClient.alterTable`, `alterTable` will erase the bucketing info if it cannot find the Hive provider in the given table metadata. Rename table also has this issue. ### Why are the changes needed? Because running `ANALYZE TABLE` on a Hive table that has bucketing info erases the existing bucket info. ### Does this PR introduce any user-facing change? Yes. After this PR, running `ANALYZE TABLE` on a Hive table won't erase existing bucketing info. ### How was this patch tested? Unit test. Closes #26685 from viirya/fix-hive-bucket. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 85cb388ae3f25b0e6a7fc1a2d78fd1c3ec03f341) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 December 2019, 05:42:48 UTC
cc86b72 Revert "Revert "[SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect"" This reverts commit 00b61e36958118e98c6dbfa0515c11c8672a62ac. 29 November 2019, 02:15:46 UTC
00b61e3 Revert "[SPARK-28152][SQL][2.4] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect" This reverts commit 98ba2f643bc9bba667203e4af6d72a8d2e4fcf5d. 29 November 2019, 01:44:25 UTC
0dc22df [SPARK-30030][INFRA] Use RegexChecker instead of TokenChecker to check `org.apache.commons.lang.` This PR replaces `TokenChecker` with `RegexChecker` in `scalastyle` and fixes the missed instances. This removes the old `commons-lang` 2 dependency from the `core` module **BEFORE** ``` $ dev/scalastyle Scalastyle checks failed at following occurrences: [error] /Users/dongjoon/PRS/SPARK-SerializationUtils/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:7: Use Commons Lang 3 classes (package org.apache.commons.lang3.*) instead [error] of Commons Lang 2 (package org.apache.commons.lang.*) [error] Total time: 23 s, completed Nov 25, 2019 11:47:44 AM ``` **AFTER** ``` $ dev/scalastyle Scalastyle checks passed. ``` No. Pass the GitHub Action linter. Closes #26666 from dongjoon-hyun/SPARK-29081-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 38240a74dc047796e9f239e44d9bc0bbc66e1f7f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 November 2019, 20:24:12 UTC
94ddc2a [SPARK-29971][CORE][2.4] Fix buffer leaks in `TransportFrameDecoder/TransportCipher` ### What changes were proposed in this pull request? - Correctly release `ByteBuf` in `TransportCipher` in all cases - Move closing / releasing logic to `handlerRemoved(...)` so we are guaranteed that it is always called. - Correctly release `frameBuf` when it is not null and the handler is removed (and so also when the channel becomes inactive) ### Why are the changes needed? We need to carefully manage the ownership / lifecycle of `ByteBuf` instances so we don't leak any of these. We did not correctly do this in all cases: - when we end up in an invalid cipher state. - when partial data was received and the channel is closed before the full frame is decoded Fixes https://github.com/netty/netty/issues/9784. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the newly added UTs. Closes #26660 from normanmaurer/leak_2_4. Authored-by: Norman Maurer <norman_maurer@apple.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 November 2019, 18:11:48 UTC
6880ccd [MINOR][INFRA] Use GitHub Action Cache for `build` ### What changes were proposed in this pull request? This PR adds `GitHub Action Cache` task on `build` directory. ### Why are the changes needed? This will replace the Maven downloading with the cache. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually check the GitHub Action log of this PR. Closes #26652 from dongjoon-hyun/SPARK-MAVEN-CACHE. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit cb68e58f88e8481e76b358f46fd4356d656e8277) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 November 2019, 20:36:13 UTC
eb87e53 [SPARK-29970][WEBUI] Preserve open/close state of Timelineview ### What changes were proposed in this pull request? Fix a bug where Timelineview does not preserve its open/close state properly. ### Why are the changes needed? Preserving the open/close state is originally intended but it doesn't work. ### Does this PR introduce any user-facing change? Yes. The open/close state of Timelineview is now preserved, so if you open Timelineview on the Stage page, go to another page, and then go back to the Stage page, Timelineview stays open. ### How was this patch tested? Manual test. Closes #26607 from sarutak/fix-timeline-view-state. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6cd6d5f57ed53aed234b169ad5be3e133dab608f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 November 2019, 18:17:12 UTC
12ec338 [SPARK-27558][CORE] Gracefully cleanup task when it fails with OOM exception ### What changes were proposed in this pull request? When a task fails with an OOM exception, UnsafeInMemorySorter.array can be null. Meanwhile, cleanupResources() on task completion calls UnsafeInMemorySorter.getMemoryUsage in turn, and that leads to another NPE being thrown. ### Why are the changes needed? Checking whether the array is null in UnsafeInMemorySorter.getMemoryUsage helps to avoid the NPE. ### Does this PR introduce any user-facing change? No ### How was this patch tested? It was tested manually. Closes #26606 from ayudovin/fix-npe-in-listener-2.4. Authored-by: yudovin <artsiom.yudovin@profitero.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 21 November 2019, 00:18:29 UTC
a936522 [SPARK-29758][SQL][2.4] Fix truncation of requested string fields in `json_tuple` ### What changes were proposed in this pull request? In the PR, I propose to remove an optimization in `json_tuple` which causes truncation of results for large requested string fields. ### Why are the changes needed? Spark 2.4 uses Jackson Core 2.6.7 which has a bug in copying string. This bug may lead to truncation of results in some cases. The bug has been already fixed by the commit https://github.com/FasterXML/jackson-core/commit/554f8db0f940b2a53f974852a2af194739d65200 which is a part of Jackson Core since the version 2.7.7. Upgrading Jackson Core up to 2.7.7 or later version is risky. That's why this PR propose to avoid using the buggy methods of Jackson Core 2.6.7. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By new test added to `JsonFunctionsSuite` Closes #26563 from MaxGekk/fix-truncation-by-json_tuple-2.4. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 November 2019, 07:32:28 UTC
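A hedged sketch of the safer copy path, using Jackson's `getText()` rather than the raw char-buffer accessors; the field name and surrounding plumbing are assumptions, and this is an illustration rather than the actual `jsonExpressions.scala` change:

```scala
import java.io.StringWriter
import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

// Copy the requested field with getText(), which materializes the whole string, instead of
// the getTextCharacters()/getTextOffset()/getTextLength() path that the Jackson 2.6.x copy
// bug can truncate for long values.
object JsonTupleCopySketch {
  def main(args: Array[String]): Unit = {
    val factory = new JsonFactory()
    val parser = factory.createParser("""{"f1": "some potentially very long value"}""")
    val out = new StringWriter()
    val generator = factory.createGenerator(out)

    while (parser.nextToken() != null) {
      if (parser.getCurrentToken == JsonToken.VALUE_STRING && parser.getCurrentName == "f1") {
        generator.writeString(parser.getText()) // full value, no truncation
      }
    }
    generator.flush()
    println(out.toString) // the quoted JSON string "some potentially very long value"
  }
}
```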
1a26c8e [SPARK-29964][BUILD] lintr github workflows failed due to buggy GnuPG ### What changes were proposed in this pull request? Linter (R) github workflows failed sometimes like: https://github.com/apache/spark/pull/26509/checks?check_run_id=310718016 Failed message: ``` Executing: /tmp/apt-key-gpghome.8r74rQNEjj/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 gpg: connecting dirmngr at '/tmp/apt-key-gpghome.8r74rQNEjj/S.dirmngr' failed: IPC connect call failed gpg: keyserver receive failed: No dirmngr ##[error]Process completed with exit code 2. ``` It is due to a buggy GnuPG. Context: https://github.com/sbt/website/pull/825 https://github.com/sbt/sbt/issues/4261 https://github.com/microsoft/WSL/issues/3286 ### Why are the changes needed? Make lint-r github workflows work. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass github workflows. Closes #26602 from viirya/SPARK-29964. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit e753aa30e659706c3fa3414bf38566a79e0af8d6) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 19 November 2019, 23:57:12 UTC
47cb1f3 [SPARK-29949][SQL][2.4] Fix formatting of timestamps by JSON/CSV datasources ### What changes were proposed in this pull request? In the PR, I propose to use the `format()` method of `FastDateFormat` which accepts an instance of the `Calendar` type. This allows adjusting the `MILLISECOND` field of the calendar directly before formatting. I added a new method `format()` to `DateTimeUtils.TimestampParser`. This method splits the input timestamp into a part truncated to seconds and the seconds fractional part. The calendar is initialized by the first part in the normal way, and the second part is converted to a form that `FastDateFormat` can format correctly as the seconds fraction up to microsecond precision. I refactored `MicrosCalendar` by passing the number of digits from the fraction pattern as a parameter to the default constructor because it is used by the existing `getMicros()` and the new `setMicros()`. `setMicros()` is used to set the seconds fraction in the calendar's `MILLISECOND` field directly before formatting. This PR supports various patterns for seconds fractions from `S` up to `SSSSSS`. If the pattern has more than 6 `S` characters, the first 6 digits reflect the milliseconds and microseconds of the input timestamp but the remaining digits are set to `0`. ### Why are the changes needed? This fixes a bug of incorrectly formatting timestamps in microsecond precision. For example: ```scala Seq(java.sql.Timestamp.valueOf("2019-11-18 11:56:00.123456")).toDF("t") .select(to_json(struct($"t"), Map("timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSSSS")).as("json")) .show(false) +----------------------------------+ |json | +----------------------------------+ |{"t":"2019-11-18 11:56:00.000123"}| +----------------------------------+ ``` ### Does this PR introduce any user-facing change? Yes. The example above outputs: ```scala +----------------------------------+ |json | +----------------------------------+ |{"t":"2019-11-18 11:56:00.123456"}| +----------------------------------+ ``` ### How was this patch tested? - By new tests for formatting by different patterns from `S` to `SSSSSS` in `DateTimeUtilsSuite` - A test for `to_json()` in `JsonFunctionsSuite` - A roundtrip test for writing and reading back a timestamp in a CSV file. Closes #26582 from MaxGekk/micros-format-2.4. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 November 2019, 09:10:16 UTC
dc2abe5 [SPARK-29918][SQL] RecordBinaryComparator should check endianness when compared by long ### What changes were proposed in this pull request? This PR tries to make sure that the results of `compared by 8 bytes at a time` and `compared byte-wise` in RecordBinaryComparator are *consistent*, by reversing the long's bytes if the platform is little-endian and using Long.compareUnsigned. ### Why are the changes needed? If the architecture supports unaligned access or the offset is 8-byte aligned, `RecordBinaryComparator` compares 8 bytes at a time by reading 8 bytes as a long. The related code is ``` if (Platform.unaligned() || (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) { while (i <= leftLen - 8) { final long v1 = Platform.getLong(leftObj, leftOff + i); final long v2 = Platform.getLong(rightObj, rightOff + i); if (v1 != v2) { return v1 > v2 ? 1 : -1; } i += 8; } } ``` Otherwise, it compares byte by byte. The related code is ``` while (i < leftLen) { final int v1 = Platform.getByte(leftObj, leftOff + i) & 0xff; final int v2 = Platform.getByte(rightObj, rightOff + i) & 0xff; if (v1 != v2) { return v1 > v2 ? 1 : -1; } i += 1; } ``` However, on a little-endian machine, the result of *compared by a long value* and *compared byte by byte* may be different. For two identical records, their offsets may vary between the first run and the second run, which leads to comparing them with either the long comparison or the byte-by-byte comparison, and the results may be different. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added new test cases in RecordBinaryComparatorSuite Closes #26548 from WangGuangxin/binary_comparator. Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ffc97530371433bc0221e06d8c1d11af8d92bd94) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 November 2019, 08:12:30 UTC
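The mismatch, and the remedy the description names (reverse the bytes on little-endian hosts and compare unsigned), can be demonstrated with a self-contained sketch; the printed results assume a little-endian machine such as x86:

```scala
import java.nio.ByteOrder

// On a little-endian host, reverse the byte order of each 8-byte word and compare unsigned,
// so the result agrees with a byte-by-byte unsigned comparison of the underlying memory.
object WordCompareSketch {
  private val isLittleEndian = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN

  def compareAsBytesWould(v1: Long, v2: Long): Int = {
    val (a, b) =
      if (isLittleEndian) (java.lang.Long.reverseBytes(v1), java.lang.Long.reverseBytes(v2))
      else (v1, v2)
    java.lang.Long.compareUnsigned(a, b)
  }

  def main(args: Array[String]): Unit = {
    // In little-endian memory, 256L is stored as 00 01 00 ... and 1L as 01 00 00 ..., so a
    // byte-by-byte comparison ranks 256L's bytes before 1L's, while the plain signed long
    // comparison used by the 8-bytes-at-a-time path says the opposite.
    val v1 = 256L
    val v2 = 1L
    println(v1 > v2)                     // true: what the old long-based comparison concluded
    println(compareAsBytesWould(v1, v2)) // negative: matches the byte-wise comparison
  }
}
```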
568fa69 Revert "[SPARK-29644][SQL][2.4] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils" This reverts commit eda6360354b3d32453672ccdb368053516c8c285. 18 November 2019, 17:13:16 UTC
7f2c88d [MINOR][TESTS] Ignore GitHub Action and AppVeyor file changes in testing ### What changes were proposed in this pull request? This PR aims to ignore `GitHub Action` and `AppVeyor` file changes. When we touch these files, Jenkins job should not trigger a full testing. ### Why are the changes needed? Currently, these files are categorized to `root` and trigger the full testing and ends up wasting the Jenkins resources. - https://github.com/apache/spark/pull/26555 ``` [info] Using build tool sbt with Hadoop profile hadoop2.7 under environment amplab_jenkins From https://github.com/apache/spark * [new branch] master -> master [info] Found the following changed modules: sparkr, root [info] Setup the following environment variables for tests: ``` ### Does this PR introduce any user-facing change? No. (Jenkins testing only). ### How was this patch tested? Manually. ``` $ dev/run-tests.py -h -v ... Trying: [x.name for x in determine_modules_for_files([".github/workflows/master.yml", "appveyor.xml"])] Expecting: [] ... ``` Closes #26556 from dongjoon-hyun/SPARK-IGNORE-APPVEYOR. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d0470d639412ecbe6e126f8d8abf5a5819b9e278) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 November 2019, 07:46:01 UTC
b4e7e50 [SPARK-29936][R][2.4] Fix SparkR lint errors and add lint-r GitHub Action ### What changes were proposed in this pull request? This PR fixes SparkR lint errors and adds `lint-r` GitHub Action to protect the branch. ### Why are the changes needed? It turns out that we currently don't run it. It's recovered yesterday. However, after that, our Jenkins linter jobs (`master`/`branch-2.4`) has been broken on `lint-r` tasks. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the GitHub Action on this PR in addition to Jenkins R. Closes #26568 from dongjoon-hyun/SPARK-29936-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 18 November 2019, 07:05:07 UTC
ee6693e [SPARK-29932][R][TESTS] lint-r should do non-zero exit in case of errors ### What changes were proposed in this pull request? This PR aims to make `lint-r` exits with non-zero in case of errors. Please note that `lint-r` works correctly when everything are installed correctly. ### Why are the changes needed? There are two cases which hide errors from Jenkins/AppVeyor/GitHubAction. 1. `lint-r` exits with zero if there is no R installation. ```bash $ dev/lint-r dev/lint-r: line 25: type: Rscript: not found ERROR: You should install R $ echo $? 0 ``` 2. `lint-r` exits with zero if we didn't do `R/install-dev.sh`. ```bash $ dev/lint-r Error: You should install SparkR in a local directory with `R/install-dev.sh`. In addition: Warning message: In library(SparkR, lib.loc = LOCAL_LIB_LOC, logical.return = TRUE) : no library trees found in 'lib.loc' Execution halted lintr checks passed. // <=== Please note here $ echo $? 0 ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually check the above two cases. Closes #26561 from dongjoon-hyun/SPARK-29932. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit e1fc38b3e409e8a2c65d0cc1fc2ec63da527bbc6) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 November 2019, 18:10:04 UTC
eda6360 [SPARK-29644][SQL][2.4] Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils This is a port of SPARK-29644 to 2.4. ### What changes were proposed in this pull request? Corrected the ShortType and ByteType mapping to SmallInt and TinyInt, and corrected the setter methods to set ShortType and ByteType via setShort() and setByte(). Changes are in JDBCUtils.scala. Fixed unit test cases where applicable and added new E2E test cases to test table read/write using ShortType and ByteType. Problems - In master, lines 547 and 551 of JDBCUtils.scala have a problem where ShortType and ByteType are set as Integers rather than as Short and Byte respectively. ``` case ShortType => (stmt: PreparedStatement, row: Row, pos: Int) => stmt.setInt(pos + 1, row.getShort(pos)) The issue was pointed out by maropu case ByteType => (stmt: PreparedStatement, row: Row, pos: Int) => stmt.setInt(pos + 1, row.getByte(pos)) ``` - Also, at line 247 of JDBCUtils.scala, TinyInt is wrongly interpreted as IntegerType in getCatalystType() ``` case java.sql.Types.TINYINT => IntegerType ``` - At line 172, ShortType was wrongly interpreted as IntegerType ``` case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT)) ``` - Throughout the tests, ShortType and ByteType were being interpreted as IntegerTypes. ### Why are the changes needed? A given type should be set using the right type. ### Does this PR introduce any user-facing change? Yes. - Users will now be able to create tables where the dataframe contains ByteType when using the JDBC connector in overwrite mode. - Users will see that SQL-side tables are created with the right data types. ShortType in Spark translates to SMALLINT and ByteType to TINYINT on the SQL side. This results in smaller DB tables where applicable. ### How was this patch tested? Corrected unit test cases where applicable. Validated in CI/CD. Added/fixed test cases in MsSqlServerIntegrationSuite.scala, PostgresIntegrationSuite.scala, MySQLIntegrationSuite.scala to write/read tables from a dataframe with columns of ShortType and ByteType. Validated manually as follows. ``` ./build/mvn install -DskipTests ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 ``` Closes #26549 from shivsood/port_29644_2.4. Authored-by: shivsood <shivsood@microsoft.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 17 November 2019, 13:06:50 UTC
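A simplified sketch of the corrected setter dispatch; the type alias and signature are assumptions and do not match the real JdbcUtils code exactly:

```scala
import java.sql.PreparedStatement
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ByteType, DataType, ShortType}

// Simplified dispatch, not the exact JdbcUtils code: bind ShortType/ByteType with
// setShort/setByte so the driver sees SMALLINT/TINYINT-sized values instead of INTEGER.
object JdbcSetterSketch {
  type JDBCValueSetter = (PreparedStatement, Row, Int) => Unit

  def makeSetter(dt: DataType): JDBCValueSetter = dt match {
    case ShortType => (stmt, row, pos) => stmt.setShort(pos + 1, row.getShort(pos))
    case ByteType  => (stmt, row, pos) => stmt.setByte(pos + 1, row.getByte(pos))
    case other     => throw new IllegalArgumentException(s"Type not covered by this sketch: $other")
  }
}
```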
9c7e8be [SPARK-29904][SQL][2.4] Parse timestamps in microsecond precision by JSON/CSV datasources ### What changes were proposed in this pull request? In the PR, I propose parsing timestamp strings up to microsecond precision. To achieve that, I added a sub-class of `GregorianCalendar` to get access to `protected` field `fields` which contains non-normalized parsed fields immediately after parsing. In particular, I assume that the `MILLISECOND` field contains entire seconds fraction converted to `int`. By knowing the expected digits in the fractional part, the parsed field is converted to a fraction up to the microsecond precision. This PR supports additional patterns for seconds fractions from `S` to `SSSSSS` in JSON/CSV options. ### Why are the changes needed? To improve user experience with JSON and CSV datasources, and to allow parsing timestamp strings up to microsecond precision. ### Does this PR introduce any user-facing change? No, the PR extends the set of supported timestamp patterns for the seconds fraction by `S`, `SS`, `SSSS`, `SSSSS`, and `SSSSSS`. ### How was this patch tested? By existing test suites `JsonExpressionSuite`, `JsonFunctionsSuite`, `JsonSuite`, `CsvSuite`, `UnivocityParserSuite`, and added new tests to `DateTimeUtilsSuite`, `JsonFunctionsSuite` for `from_json()` and to `CSVSuite`. Closes #26507 from MaxGekk/fastdateformat-micros. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 November 2019, 23:40:43 UTC
1cff055 [SPARK-26499][SQL][2.4] JdbcUtils.makeGetter does not handle ByteType ## What changes were proposed in this pull request? This is a port of SPARK-26499 to 2.4. Modified JdbcUtils.makeGetter to handle ByteType. ## How was this patch tested? Added a new test to JDBCSuite that maps ```TINYINT``` to ```ByteType```. Closes #23400 from twdsilva/tiny_int_support. Authored-by: Thomas D'Silva <tdsilvaapache.org> Signed-off-by: Hyukjin Kwon <gurwls223apache.org> ### Why are the changes needed? The changes are required to port PR #26301 (SPARK-29644) to 2.4. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Yes, tested on 2.4 using the docker integration tests for MySQL, MsSQLServer, Postgres ```sh ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MySQLIntegrationSuite ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.MsSqlServerIntegrationSuite ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.PostgresIntegrationSuite ``` Closes #26531 from shivsood/pr_26499_2.4_port. Lead-authored-by: shivsood <shivsood@microsoft.com> Co-authored-by: Thomas D'Silva <tdsilva@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 November 2019, 03:12:30 UTC
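The getter side can be sketched in the same spirit; everything below is a simplified assumption rather than the actual makeGetter code:

```scala
import java.sql.ResultSet

// The row is simplified to an Array[Any] rather than Spark's InternalRow, so this only
// illustrates the shape of the added ByteType case: read TINYINT columns with getByte.
object JdbcByteGetterSketch {
  type JDBCValueGetter = (ResultSet, Array[Any], Int) => Unit

  val byteGetter: JDBCValueGetter =
    (rs, row, pos) => row(pos) = rs.getByte(pos + 1)
}
```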
7bdc76f [SPARK-29682][SQL] Resolve conflicting attributes in Expand correctly ### What changes were proposed in this pull request? This PR addresses issues where conflicting attributes in `Expand` are not correctly handled. ### Why are the changes needed? ```Scala val numsDF = Seq(1, 2, 3, 4, 5, 6).toDF("nums") val cubeDF = numsDF.cube("nums").agg(max(lit(0)).as("agcol")) cubeDF.join(cubeDF, "nums").show ``` fails with the following exception: ``` org.apache.spark.sql.AnalysisException: Failure when resolving conflicting references in Join: 'Join Inner :- Aggregate [nums#38, spark_grouping_id#36], [nums#38, max(0) AS agcol#35] : +- Expand [List(nums#3, nums#37, 0), List(nums#3, null, 1)], [nums#3, nums#38, spark_grouping_id#36] : +- Project [nums#3, nums#3 AS nums#37] : +- Project [value#1 AS nums#3] : +- LocalRelation [value#1] +- Aggregate [nums#38, spark_grouping_id#36], [nums#38, max(0) AS agcol#58] +- Expand [List(nums#3, nums#37, 0), List(nums#3, null, 1)], [nums#3, nums#38, spark_grouping_id#36] ^^^^^^^ +- Project [nums#3, nums#3 AS nums#37] +- Project [value#1 AS nums#3] +- LocalRelation [value#1] Conflicting attributes: nums#38 ``` As you can see from the above plan, `num#38`, the output of `Expand` on the right side of `Join`, should have been handled to produce new attribute. Since the conflict is not resolved in `Expand`, the failure is happening upstream at `Aggregate`. This PR addresses handling conflicting attributes in `Expand`. ### Does this PR introduce any user-facing change? Yes, the previous example now shows the following output: ``` +----+-----+-----+ |nums|agcol|agcol| +----+-----+-----+ | 1| 0| 0| | 6| 0| 0| | 4| 0| 0| | 2| 0| 0| | 5| 0| 0| | 3| 0| 0| +----+-----+-----+ ``` ### How was this patch tested? Added new unit test. Closes #26441 from imback82/spark-29682. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e46e487b0831b39afa12ef9cff9b9133f111921b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 November 2019, 06:47:50 UTC
e9df8b6 [SPARK-29875][PYTHON][SQL][2.4] Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x ### What changes were proposed in this pull request? This PR proposes to avoid using the deprecated PyArrow API `open_stream`. It was deprecated as of 0.12.0 (https://github.com/apache/arrow/pull/3244). The root cause is that we use separate forked processes and each process emits the warning (printed once via the "default" filter of the `warnings` package); however, we should avoid using deprecated APIs anyway. In the current master, this was fixed when we upgraded PyArrow from 0.10.0 to 0.12.0 at https://github.com/apache/spark/commit/16990f929921b3f784a85f3afbe1a22fbe77d895 ### Why are the changes needed? If we use a PyArrow version higher than 0.12.0, Spark 2.4.x shows a bunch of annoying warnings, as below: ``` from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP def add_one(x): return x + 1 spark.range(100).select(add_one("id")).collect() ``` ``` /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " ... (the same warning is repeated once per forked worker process) ... ``` ### Does this PR introduce any user-facing change? It removes the annoying warning messages. ### How was this patch tested? Manually tested. Closes #26501 from HyukjinKwon/SPARK-29875. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 13 November 2019, 17:50:37 UTC
7459353 [SPARK-29850][SQL] sort-merge-join an empty table should not memory leak When whole-stage codegen compiles `HashAggregateExec`, create the hash map only when we begin to process inputs. Sort-merge join completes directly if the left side table is empty. If there is an aggregate on the right side, that aggregate will not be triggered at all, but its hash map is created during codegen and can't be released. No. Added a new test. Closes #26471 from cloud-fan/memory. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 414cade01112bc86a9e66a5928399dc78495b6e4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 November 2019, 17:06:13 UTC
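A rough sketch (not the test added by the PR) of the query shape involved: a sort-merge join whose left side is empty and whose right side contains an aggregate:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("smj-empty-sketch").getOrCreate()

// Disable broadcast joins so the example actually plans a sort-merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val emptyLeft = spark.range(0).toDF("id")                                // empty left side
val aggregatedRight = spark.range(100).toDF("id").groupBy("id").count()  // aggregate on the right side

// Before the fix, the aggregate's hash map was created during codegen even
// though the join finishes immediately, and that memory was never released.
emptyLeft.join(aggregatedRight, "id").collect()
```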
2cc56e0 [SPARK-28939][SQL][2.4] Propagate SQLConf for plans executed by toRdd ### What changes were proposed in this pull request? The PR proposes to create a custom `RDD` which enables propagating `SQLConf` also in cases not tracked by SQL execution, as happens when a `Dataset` is converted to an RDD either using `.rdd` or `.queryExecution.toRdd` and the returned RDD is then used to invoke actions on it. In this way, SQL configs are effective also in these cases, while earlier they were ignored. ### Why are the changes needed? Without this patch, every time `.rdd` or `.queryExecution.toRdd` is used, all the SQL configs that were set are ignored. An example of a reproducer can be: ``` withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key, "false") { val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*) df.createOrReplaceTempView("spark64kb") val data = spark.sql("select * from spark64kb limit 10") // Subexpression elimination is used here, despite it should have been disabled data.describe() } ``` ### Does this PR introduce any user-facing change? When a user calls `.queryExecution.toRdd`, a `SQLExecutionRDD` is returned wrapping the `RDD` of the executed plan. When `.rdd` is used, an additional `SQLExecutionRDD` is present in the hierarchy. ### How was this patch tested? Added a unit test. Closes #25734 from mgaido91/SPARK-28939_2.4. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 November 2019, 22:00:09 UTC
a7b5746 [SPARK-29820][INFRA][FOLLOWUP][2.4] Use scala version instead of java version ### What changes were proposed in this pull request? This is a followup to fix the cache key to use `scala` version instead of `java` version in `branch-2.4`. ### Why are the changes needed? In `branch-2.4`, we have different build combinations. **Before** ``` Cache not found for input keys: -hadoop-2.6-maven-com-73f091c178d11734dd9a888d0c5d208d72850eefd9db3821d806518289c61371, -hadoop-2.6-maven-com-. ``` **After** ``` Cache not found for input keys: 2.11-hadoop-2.6-maven-com-73f091c178d11734dd9a888d0c5d208d72850eefd9db3821d806518289c61371, 2.11-hadoop-2.6-maven-com-. ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually check the log of GitHub Action of this PR. Closes #26460 from dongjoon-hyun/SPARK-29820-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 November 2019, 19:31:59 UTC
f42e741 [SPARK-29820][INFRA] Use GitHub Action Cache for `./.m2/repository/[com|org]` This PR aims to enable [GitHub Action Cache on Maven local repository](https://github.com/actions/cache/blob/master/examples.md#java---maven) for the following goals. 1. To reduce the chance of failure due to the Maven download flakiness. 2. To speed up the build a little bit. Unfortunately, due to the GitHub Action Cache limitation, it seems that we cannot put all into a single cache. It's ignored like the following. - **.m2/repository is 680777194 bytes** ``` /bin/tar -cz -f /home/runner/work/_temp/01f162c3-0c78-4772-b3de-b619bb5d7721/cache.tgz -C /home/runner/.m2/repository . 3 ``` Not only for the speed up, but also for reducing the Maven download flakiness, we had better enable caching on local maven repository. The followings are the failure examples in these days. - https://github.com/apache/spark/runs/295869450 ``` [ERROR] Failed to execute goal on project spark-streaming-kafka-0-10_2.12: Could not resolve dependencies for project org.apache.spark:spark-streaming-kafka-0-10_2.12:jar:spark-367716: Could not transfer artifact com.fasterxml.jackson.datatype:jackson-datatype-jdk8:jar:2.10.0 from/to central (https://repo.maven.apache.org/maven2): Failed to transfer file https://repo.maven.apache.org/maven2/com/fasterxml/jackson/datatype/ jackson-datatype-jdk8/2.10.0/jackson-datatype-jdk8-2.10.0.jar with status code 503 -> [Help 1] ... [ERROR] mvn <args> -rf :spark-streaming-kafka-0-10_2.12 ``` ``` [ERROR] Failed to execute goal on project spark-tools_2.12: Could not resolve dependencies for project org.apache.spark:spark-tools_2.12:jar:3.0.0-SNAPSHOT: Failed to collect dependencies at org.clapper:classutil_2.12:jar:1.5.1: Failed to read artifact descriptor for org.clapper:classutil_2.12:jar:1.5.1: Could not transfer artifact org.clapper:classutil_2.12:pom:1.5.1 from/to central (https://repo.maven.apache.org/maven2): Connection timed out (Read failed) -> [Help 1] ``` No. Manually check the GitHub Action log of this PR. ``` Cache restored from key: 1.8-hadoop-2.7-maven-com-5b4a9fb13c5f5ff78e65a20003a3810796e4d1fde5f24d397dfe6e5153960ce4 Cache restored from key: 1.8-hadoop-2.7-maven-org-5b4a9fb13c5f5ff78e65a20003a3810796e4d1fde5f24d397dfe6e5153960ce4 ``` Closes #26456 from dongjoon-hyun/SPARK-29820. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 4b71ad6ffb049219ccb04b31aa8d75595831c662) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 November 2019, 19:06:09 UTC
bef7c0f [SPARK-29790][DOC] Note required port for Kube API It adds a note about the required port in a Kubernetes master URL. Currently a port needs to be specified for the Kubernetes API, even when the API is hosted on the standard HTTPS port; otherwise the driver might fail with the error described at https://medium.com/kidane.weldemariam_75349/thanks-james-on-issuing-spark-submit-i-run-into-this-error-cc507d4f8f0d Yes, a change to the "Running on Kubernetes" guide. None - documentation change only. Closes #26426 from Tapped/patch-1. Authored-by: Emil Sandstø <emilalexer@hotmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 0bdadba5e3810f8e3f5da13e2a598071cbadab94) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 08 November 2019, 17:33:32 UTC
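For example (a hedged sketch; the API server host below is fictional, and Kubernetes jobs are normally launched through spark-submit), the master URL should carry an explicit port even when the API server listens on the standard HTTPS port:

```scala
import org.apache.spark.sql.SparkSession

// The port (here 443) is stated explicitly in the k8s master URL; omitting it
// can lead to the driver failure described in the linked post.
val spark = SparkSession.builder()
  .master("k8s://https://kubernetes.example.com:443")
  .appName("k8s-port-sketch")
  .getOrCreate()
```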
e59370a [SPARK-29781][BUILD][2.4] Override SBT Jackson dependency like Maven ### What changes were proposed in this pull request? This PR aims to override the SBT `jackson-databind` dependency like Maven in `branch-2.4`. ### Why are the changes needed? Without this, SBT and Maven work differently in `branch-2.4`. We had better fix this in `branch-2.4` because it's our LTS branch. ``` $ build/sbt -Phadoop-3.1 "core/testOnly *.ShuffleDependencySuite" [info] ShuffleDependencySuite: [info] org.apache.spark.shuffle.ShuffleDependencySuite *** ABORTED *** ... [info] Cause: com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.7.8 ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually run `build/sbt -Phadoop-3.1 "core/testOnly *.ShuffleDependencySuite"`. Closes #26417 from dongjoon-hyun/SPARK-29781. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 November 2019, 21:24:59 UTC
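For reference, a minimal build.sbt-style sketch of what such an override typically looks like; the exact wiring in Spark's `SparkBuild.scala` differs, and the version string below is illustrative rather than the one pinned in `branch-2.4`:

```scala
// Pin jackson-databind so SBT resolves the same version as the Maven build.
// The version is illustrative; check the Maven pom for the real value.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.7.1"
```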
465a596 [SPARK-29796][SQL][TESTS] `HiveExternalCatalogVersionsSuite` should ignore preview release ### What changes were proposed in this pull request? This aims to exclude the `preview` release to recover `HiveExternalCatalogVersionsSuite`. Currently, new preview release breaks `branch-2.4` PRBuilder since yesterday. New release (especially `preview`) should not affect `branch-2.4`. - https://github.com/apache/spark/pull/26417 (Failed 4 times) ### Why are the changes needed? **BEFORE** ```scala scala> scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains("""<li><a href="spark-""")).map("""<a href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1)) java.util.NoSuchElementException: None.get ``` **AFTER** ```scala scala> scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains("""<li><a href="spark-""")).filterNot(_.contains("preview")).map("""<a href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1)) res5: Array[String] = Array(2.3.4, 2.4.4) ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This should pass the PRBuilder. Closes #26428 from dongjoon-hyun/SPARK-HiveExternalCatalogVersionsSuite. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit da848b1897a52b79bde8111c09e92c0c88b2f914) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 November 2019, 18:28:48 UTC
0e558be [MINOR][INFRA] Change the Github Actions build command to `mvn install` This PR changes the Github Actions build command from `mvn package` to `mvn install` to build the Scaladoc jars. Sometimes the build fails with an error like `not found: type ClassName...`; more details: https://github.com/apache/spark/pull/24628#issuecomment-495655747 No. N/A Closes #26414 from wangyum/github-action-install. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit e5c176a243b76b3953cc03b28e6c281658da93c8) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 November 2019, 17:17:54 UTC
19e55c3 [SPARK-29743][SQL] sample should set needCopyResult to true if its child is `SampleExec` has a bug that it sets `needCopyResult` to false as long as the `withReplacement` parameter is false. This causes problems if its child needs to copy the result, e.g. a join. to fix a correctness issue Yes, the result will be corrected. a new test Closes #26387 from cloud-fan/sample-bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 326b7893401b6bd57cf11f657386c0f9da00902a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 November 2019, 18:57:38 UTC
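A hedged sketch of the affected query shape: a sample without replacement placed directly on top of a join, which is an operator that needs its results copied:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sample-copy-sketch").getOrCreate()

val a = spark.range(1000).toDF("id")
val b = spark.range(1000).toDF("id")

// SampleExec is the parent of the join here; before the fix it skipped copying
// rows whenever withReplacement = false, which could corrupt the sampled output.
val sampled = a.join(b, "id").sample(withReplacement = false, fraction = 0.5, seed = 42L)
sampled.count()
```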
8180364 [MINOR][DOCS][2.4] Fix pyspark documentation Backports commit 1e1b7302f482a3b81e1fcd7060b4849a488376bf from master. ### What changes were proposed in this pull request? I propose that we change the example code documentation to call the proper function. For example, under the `foreachBatch` function, the example code was calling the `foreach()` function by mistake. ### Why are the changes needed? I suppose it could confuse some people, and it is a typo. ### Does this PR introduce any user-facing change? No, there is no "meaningful" code being changed, simply the documentation. ### How was this patch tested? I made the changes on a fork, and had pushed to master earlier. Closes #26363 from mstill3/patch-2. Authored-by: Matt Stillwell <18670089+mstill3@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 November 2019, 23:42:53 UTC
4d2ed47 Revert "[SPARK-24152][R][TESTS] Disable check-cran from run-tests.sh" ### What changes were proposed in this pull request? This reverts commit 91d990162f13acde546d01e1163ed3e898cbf9a7. ### Why are the changes needed? CRAN check is pretty important for R package, we should enable it. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests. Closes #26381 from viirya/revert-SPARK-24152. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit e7263242bd9b9b2207147af4ab3ae4ec2ff3c718) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 November 2019, 23:15:56 UTC