https://github.com/apache/spark

de351e3 Preparing Spark release v3.1.2-rc1 24 May 2021, 04:10:33 UTC
f153ffd [SPARK-35495][R] Change SparkR maintainer for CRAN ### What changes were proposed in this pull request? As discussed, update SparkR maintainer for future release. ### Why are the changes needed? Shivaram will not be able to work with this in the future, so we would like to migrate off the maintainer contact email. shivaram Closes #32642 from felixcheung/sparkr-maintainer. Authored-by: Felix Cheung <felixcheung@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 1530876615a64b009fff261da1ba064d03778fca) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 May 2021, 02:09:19 UTC
f37d077 [SPARK-35171][R] Declare the markdown package as a dependency of the SparkR package ### What changes were proposed in this pull request? Declare the markdown package as a dependency of the SparkR package ### Why are the changes needed? If we didn't install pandoc locally, running make-distribution.sh will fail with the following message: ``` — re-building ‘sparkr-vignettes.Rmd’ using rmarkdown Warning in engine$weave(file, quiet = quiet, encoding = enc) : Pandoc (>= 1.12.3) not available. Falling back to R Markdown v1. Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics: The 'markdown' package should be declared as a dependency of the 'SparkR' package (e.g., in the 'Suggests' field of DESCRIPTION), because the latter contains vignette(s) built with the 'markdown' package. Please see https://github.com/yihui/knitr/issues/1864 for more information. — failed re-building ‘sparkr-vignettes.Rmd’ ``` ### Does this PR introduce _any_ user-facing change? Yes. Workaround for R packaging. ### How was this patch tested? Manually test. After the fix, the command `sh dev/make-distribution.sh -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn` in the environment without pandoc will pass. Closes #32270 from xuanyuanking/SPARK-35171. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 8e9e70045b02532a5bfc900581154cbc84f5ad47) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 May 2021, 19:59:35 UTC
5649e37 [SPARK-35493][K8S] Make `spark.blockManager.port` a fallback for `spark.driver.blockManager.port`, the same as in other cluster managers ### What changes were proposed in this pull request? `spark.blockManager.port` does not currently work for K8s driver pods; we should make it work as it does in other cluster managers. ### Why are the changes needed? `spark.blockManager.port` should be able to work for the Spark driver pod. ### Does this PR introduce _any_ user-facing change? Yes, `spark.blockManager.port` will be respected if and only if it is present and `spark.driver.blockManager.port` is absent. ### How was this patch tested? New tests. Closes #32639 from yaooqinn/SPARK-35493. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 96b0548ab6d5fe36833812f7b6424c984f75c6dd) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 May 2021, 15:08:08 UTC
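The fallback described above mirrors how other cluster managers resolve the driver's block manager port: prefer `spark.driver.blockManager.port`, fall back to `spark.blockManager.port`, and default to 0 (random). A minimal sketch of that resolution order using `SparkConf` (the helper name is hypothetical; the actual change lives in the K8s driver pod feature step):

```scala
import org.apache.spark.SparkConf

// Hypothetical helper illustrating the intended fallback chain for the driver pod:
// spark.driver.blockManager.port -> spark.blockManager.port -> 0 (random port).
def resolveDriverBlockManagerPort(conf: SparkConf): Int =
  conf.getOption("spark.driver.blockManager.port")
    .orElse(conf.getOption("spark.blockManager.port"))
    .map(_.toInt)
    .getOrElse(0)

val conf = new SparkConf().set("spark.blockManager.port", "7079")
// Falls back to the generic key when the driver-specific key is absent.
assert(resolveDriverBlockManagerPort(conf) == 7079)
```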
def4e52 [SPARK-35226][SQL][FOLLOWUP] Fix test added in SPARK-35226 for DB2KrbIntegrationSuite ### What changes were proposed in this pull request? This PR fixes an test added in SPARK-35226 (#32344). ### Why are the changes needed? `SELECT 1` seems non-valid query for DB2. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? DB2KrbIntegrationSuite passes on my laptop. I also confirmed all the KrbIntegrationSuites pass with the following command. ``` build/sbt -Phive -Phive-thriftserver -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.*KrbIntegrationSuite" ``` Closes #32632 from sarutak/followup-SPARK-35226. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 1a43415d8de1d6462d64a6b6f8daa505a3a81970) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 May 2021, 05:31:57 UTC
01c7705 [SPARK-35463][BUILD][FOLLOWUP] Redirect output for skipping checksum check ### What changes were proposed in this pull request? This patch is a followup of SPARK-35463. In SPARK-35463, we output a message to stdout and now we redirect it to stderr. ### Why are the changes needed? All `echo` statements in `build/mvn` should redirect to stderr if it is not followed by `exit`. It is because we use `build/mvn` to get stdout output by other scripts. If we don't redirect it, we can get invalid output, e.g. got "Skipping checksum because shasum is not installed." as `commons-cli` version. ### Does this PR introduce _any_ user-facing change? No. Dev only. ### How was this patch tested? Manually test on internal system. Closes #32637 from viirya/fix-build. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 594ffd2db224c89f5375645a7a249d4befca6163) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 May 2021, 02:13:45 UTC
d1607d6 Revert "[SPARK-35480][SQL] Make percentile_approx work with pivot" This reverts commit 612e3b59e81c52f36754ee962e91dc2df98fb941. 23 May 2021, 01:33:24 UTC
612e3b5 [SPARK-35480][SQL] Make percentile_approx work with pivot ### What changes were proposed in this pull request? This PR proposes to avoid wrapping if-else to the constant literals for `percentage` and `accuracy` in `percentile_approx`. They are expected to be literals (or foldable expressions). Pivot works by two phrase aggregations, and it works with manipulating the input to `null` for non-matched values (pivot column and value). Note that pivot supports an optimized version without such logic with changing input to `null` for some types (non-nested types basically). So the issue fixed by this PR is only for complex types. ```scala val df = Seq( ("a", -1.0), ("a", 5.5), ("a", 2.5), ("b", 3.0), ("b", 5.2)).toDF("type", "value") .groupBy().pivot("type", Seq("a", "b")).agg( percentile_approx(col("value"), array(lit(0.5)), lit(10000))) df.show() ``` **Before:** ``` org.apache.spark.sql.AnalysisException: cannot resolve 'percentile_approx((IF((type <=> CAST('a' AS STRING)), value, CAST(NULL AS DOUBLE))), (IF((type <=> CAST('a' AS STRING)), array(0.5D), NULL)), (IF((type <=> CAST('a' AS STRING)), 10000, CAST(NULL AS INT))))' due to data type mismatch: The accuracy or percentage provided must be a constant literal; 'Aggregate [percentile_approx(if ((type#7 <=> cast(a as string))) value#8 else cast(null as double), if ((type#7 <=> cast(a as string))) array(0.5) else cast(null as array<double>), if ((type#7 <=> cast(a as string))) 10000 else cast(null as int), 0, 0) AS a#16, percentile_approx(if ((type#7 <=> cast(b as string))) value#8 else cast(null as double), if ((type#7 <=> cast(b as string))) array(0.5) else cast(null as array<double>), if ((type#7 <=> cast(b as string))) 10000 else cast(null as int), 0, 0) AS b#18] +- Project [_1#2 AS type#7, _2#3 AS value#8] +- LocalRelation [_1#2, _2#3] ``` **After:** ``` +-----+-----+ | a| b| +-----+-----+ |[2.5]|[3.0]| +-----+-----+ ``` ### Why are the changes needed? To make percentile_approx work with pivot as expected ### Does this PR introduce _any_ user-facing change? Yes. It threw an exception but now it returns a correct result as shown above. ### How was this patch tested? Manually tested and unit test was added. Closes #32619 from HyukjinKwon/SPARK-35480. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1d9f09decb280beb313f3ff0384364a40c47c4b4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 May 2021, 22:35:54 UTC
56bc241 Revert "[SPARK-35454][SQL] One LogicalPlan can match multiple dataset ids" This reverts commit 4c1477005ae852cf692d0bac116b7a3e098afe51. 21 May 2021, 15:46:14 UTC
ed77f14 [SPARK-35482][K8S] Use `spark.blockManager.port` not the wrong `spark.blockmanager.port` in BasicExecutorFeatureStep ### What changes were proposed in this pull request? most spark conf keys are case sensitive, including `spark.blockManager.port`, we can not get the correct port number with `spark.blockmanager.port`. This PR changes the wrong key to `spark.blockManager.port` in `BasicExecutorFeatureStep`. This PR also ensures a fast fail when the port value is invalid for executor containers. When 0 is specified(it is valid as random port, but invalid as a k8s request), it should not be put in the `containerPort` field of executor pod desc. We do not expect executor pods to continuously fail to create because of invalid requests. ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #32621 from yaooqinn/SPARK-35482. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d957426351149dd1b4e1106d1230f395934f61d2) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 May 2021, 15:28:02 UTC
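For context, Spark configuration keys are case sensitive, which is why the lowercase `spark.blockmanager.port` key never matched anything. A small illustrative check with `SparkConf`:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.blockManager.port", "7079")

// Keys are case sensitive: the misspelled lowercase key is simply not found.
assert(conf.getOption("spark.blockManager.port") == Some("7079"))
assert(conf.getOption("spark.blockmanager.port").isEmpty)
```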
4c14770 [SPARK-35454][SQL] One LogicalPlan can match multiple dataset ids ### What changes were proposed in this pull request? Change the type of `DATASET_ID_TAG` from `Long` to `HashSet[Long]` to allow the logical plan to match multiple datasets. ### Why are the changes needed? During the transformation from one Dataset to another Dataset, the DATASET_ID_TAG of logical plan won't change if the plan itself doesn't change: https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L234-L237 However, dataset id always changes even if the logical plan doesn't change: https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L207-L208 And this can lead to the mismatch between dataset's id and col's __dataset_id. E.g., ```scala test("SPARK-28344: fail ambiguous self join - Dataset.colRegex as column ref") { // The test can fail if we change it to: // val df1 = spark.range(3).toDF() // val df2 = df1.filter($"id" > 0).toDF() val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { assertAmbiguousSelfJoin(df1.join(df2, df1.colRegex("id") > df2.colRegex("id"))) } } ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit tests. Closes #32616 from Ngone51/fix-ambiguous-join. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 34284c06496cd621792c0f9dfc90435da0ab9eb5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 May 2021, 08:15:34 UTC
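A rough sketch of the tag-type change described above, using a simplified stand-in for Spark's `TreeNodeTag` machinery (names and structure are illustrative, not the actual internals):

```scala
import scala.collection.mutable

// Minimal stand-in for a plan node carrying the dataset-id tag. Spark's real
// TreeNode/TreeNodeTag machinery is richer; this only illustrates the change
// from a single Long to a HashSet[Long].
class PlanNode {
  val datasetIds: mutable.HashSet[Long] = mutable.HashSet.empty
}

val plan = new PlanNode
plan.datasetIds += 1L // id of the Dataset that created the plan
plan.datasetIds += 2L // id of a derived Dataset whose logical plan is unchanged

// The ambiguous self-join check can now ask "does this plan belong to dataset 2?"
// even though the plan object was first tagged by dataset 1.
assert(plan.datasetIds.contains(2L))
```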
72cdebd [SPARK-35463][BUILD] Skip checking checksum on a system without `shasum` ### What changes were proposed in this pull request? Not every build system has `shasum`. This PR aims to skip checksum checks on a system without `shasum`. ### Why are the changes needed? **PREPARE** ``` $ docker run -it --rm -v $PWD:/spark openjdk:11-slim /bin/bash roota0e001a6e50f:/# cd /spark/ roota0e001a6e50f:/spark# apt-get update roota0e001a6e50f:/spark# apt-get install curl roota0e001a6e50f:/spark# build/mvn clean ``` **BEFORE (Failure due to `command not found`)** ``` roota0e001a6e50f:/spark# build/mvn clean exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.10/scala-2.12.10.tgz exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz?action=download exec: curl --silent --show-error -L https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz.sha512 Veryfing checksum from /spark/build/apache-maven-3.6.3-bin.tar.gz.sha512 build/mvn: line 81: shasum: command not found Bad checksum from https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz.sha512 ``` **AFTER** ``` roota0e001a6e50f:/spark# build/mvn clean exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.10/scala-2.12.10.tgz Skipping checksum because shasum is not installed. exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz?action=download exec: curl --silent --show-error -L https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz.sha512 Skipping checksum because shasum is not installed. Using `mvn` from path: /spark/build/apache-maven-3.6.3/bin/mvn ``` ### Does this PR introduce _any_ user-facing change? Yes, this will recover the build. ### How was this patch tested? Manually with the above process. Closes #32613 from dongjoon-hyun/SPARK-35463. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 8e13b8c3d233b910a6bebbb89fb58fcc6b299e9f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 May 2021, 21:37:06 UTC
c3aba52 [SPARK-35458][BUILD] Use ` > /dev/null` to replace `-q` in shasum ## What changes were proposed in this pull request? Use ` > /dev/null` to replace `-q` in shasum validation. ### Why are the changes needed? In PR https://github.com/apache/spark/pull/32505 , added the shasum check on maven. The `shasum -a 512 -q -c xxx.sha` is used to validate checksum, the `-q` args is for "don't print OK for each successfully verified file", but `-q` arg is introduce in shasum 6.x version. So we got the `Unknown option: q`. ``` ➜ ~ uname -a Darwin MacBook.local 19.6.0 Darwin Kernel Version 19.6.0: Mon Apr 12 20:57:45 PDT 2021; root:xnu-6153.141.28.1~1/RELEASE_X86_64 x86_64 ➜ ~ shasum -v 5.84 ➜ ~ shasum -q Unknown option: q Type shasum -h for help ``` it makes ARM CI failed: [1] https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `shasum -a 512 -c wrong.sha > /dev/null` return code 1 without print `shasum -a 512 -c right.sha > /dev/null` return code 0 without print e2e test: ``` rm -f build/apache-maven-3.6.3-bin.tar.gz rm -r build/apache-maven-3.6.3-bin mvn -v ``` Closes #32604 from Yikun/patch-5. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 38fbc0b4f77dbdd1ed9a69ca7ab170c7ccee04bc) Signed-off-by: Sean Owen <srowen@gmail.com> 20 May 2021, 20:59:22 UTC
aac4274 [SPARK-35373][BUILD][FOLLOWUP] Fix "binary operator expected" error on build/mvn ### What changes were proposed in this pull request? Change $(command -v curl) to "$(command -v curl)". ### Why are the changes needed? We need to quote $(command -v curl) as "$(command -v curl)" to make sure the check works when `curl` or `wget` is not installed; otherwise it raises: `build/mvn: line 56: [: /root/spark/build/apache-maven-3.6.3-bin.tar.gz: binary operator expected` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ``` apt remove curl rm -f build/apache-maven-3.6.3-bin.tar.gz rm -r build/apache-maven-3.6.3-bin mvn -v ``` Closes #32608 from Yikun/patch-6. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 3c3533d845bd121a9e094ac6c17a0fb3684e269c) Signed-off-by: Sean Owen <srowen@gmail.com> 20 May 2021, 17:25:31 UTC
50edf12 [SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation ### What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/32505 to fix `zinc` installation. ### Why are the changes needed? Currently, branch-3.1/3.0 is broken. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the GitHub Action. Closes #32591 from dongjoon-hyun/SPARK-35373. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 19 May 2021, 16:51:58 UTC
5d4de77 [SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn ### What changes were proposed in this pull request? `./build/mvn` now downloads the .sha512 checksum of Maven artifacts it downloads, and checks the checksum after download. ### Why are the changes needed? This ensures the integrity of the Maven artifact during a user's build, which may come from several non-ASF mirrors. ### Does this PR introduce _any_ user-facing change? Should not affect anything about Spark per se, just the build. ### How was this patch tested? Manual testing wherein I forced Maven/Scala download, verified checksums are downloaded and checked, and verified it fails on error with a corrupted checksum. Closes #32505 from srowen/SPARK-35373. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> 19 May 2021, 14:02:40 UTC
73abceb [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use ### What changes were proposed in this pull request? AQE has an optimization where it attempts to reuse compatible exchanges but it does not take into account whether the exchanges are columnar or not, resulting in incorrect reuse under some circumstances. This PR simply changes the key used to lookup cached stages. It now uses the canonicalized form of the new query stage (potentially created by a plugin) rather than using the canonicalized form of the original exchange. ### Why are the changes needed? When using the [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) we sometimes see a new query stage correctly create a row-based exchange and then Spark replaces it with a cached columnar exchange, which is not compatible, and this causes queries to fail. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The patch has been tested with the query that highlighted this issue. I looked at writing unit tests for this but it would involve implementing a mock columnar exchange in the tests so would be quite a bit of work. If anyone has ideas on other ways to test this I am happy to hear them. Closes #32195 from andygrove/SPARK-35093. Authored-by: Andy Grove <andygrove73@gmail.com> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit 52e3cf9ff50b4209e29cb06df09b1ef3a18bc83b) Signed-off-by: Thomas Graves <tgraves@apache.org> 19 May 2021, 12:46:12 UTC
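A schematic sketch of the reuse-cache change described above, with toy plan types standing in for Spark's physical plans (illustrative only; the real logic lives in AQE's query stage creation):

```scala
import scala.collection.mutable

// Toy model: an exchange plan that may or may not be columnar, and a reuse cache.
case class ExchangePlan(output: Seq[String], columnar: Boolean) {
  // Stand-in for Spark's canonicalization; here the whole value is the canonical form.
  def canonicalized: ExchangePlan = this
}
case class QueryStage(plan: ExchangePlan)

val stageCache = mutable.HashMap.empty[ExchangePlan, QueryStage]

// Keying on the canonicalized plan of the stage that was actually created (which a
// plugin may have rewritten to be columnar) keeps row-based and columnar stages apart.
def reuseOrRegister(newStage: QueryStage): QueryStage =
  stageCache.getOrElseUpdate(newStage.plan.canonicalized, newStage)

val rowStage = reuseOrRegister(QueryStage(ExchangePlan(Seq("a"), columnar = false)))
val colStage = reuseOrRegister(QueryStage(ExchangePlan(Seq("a"), columnar = true)))
assert(rowStage != colStage) // no longer conflated, unlike keying on the original exchange
```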
484db68 [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist ### What changes were proposed in this pull request? 1. In HadoopMapReduceCommitProtocol, create parent directory before renaming custom partition path staging files 2. In InMemoryCatalog and HiveExternalCatalog, create new partition directory before renaming old partition path 3. Check return value of FileSystem#rename, if false, throw exception to avoid silent data loss cause by rename failure 4. Change DebugFilesystem#rename behavior to make it match HDFS's behavior (return false without rename when dst parent directory not exist) ### Why are the changes needed? Depends on FileSystem#rename implementation, when destination directory does not exist, file system may 1. return false without renaming file nor throwing exception (e.g. HDFS), or 2. create destination directory, rename files, and return true (e.g. LocalFileSystem) In the first case above, renames in HadoopMapReduceCommitProtocol for custom partition path will fail silently if the destination partition path does not exist. Failed renames can happen when 1. dynamicPartitionOverwrite == true, the custom partition path directories are deleted by the job before the rename; or 2. the custom partition path directories do not exist before the job; or 3. something else is wrong when file system handle `rename` The renames in MemoryCatalog and HiveExternalCatalog for partition renaming also have similar issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modified DebugFilesystem#rename, and added new unit tests. Without the fix in src code, five InsertSuite tests and one AlterTableRenamePartitionSuite test failed: InsertSuite.SPARK-20236: dynamic partition overwrite with custom partition path (existing test with modified FS) ``` == Results == !== Correct Answer - 1 == == Spark Answer - 0 == struct<> struct<> ![2,1,1] ``` InsertSuite.SPARK-35106: insert overwrite with custom partition path ``` == Results == !== Correct Answer - 1 == == Spark Answer - 0 == struct<> struct<> ![2,1,1] ``` InsertSuite.SPARK-35106: dynamic partition overwrite with custom partition path ``` == Results == !== Correct Answer - 2 == == Spark Answer - 1 == !struct<> struct<i:int,part1:int,part2:int> [1,1,1] [1,1,1] ![1,1,2] ``` InsertSuite.SPARK-35106: Throw exception when rename custom partition paths returns false ``` Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown ``` InsertSuite.SPARK-35106: Throw exception when rename dynamic partition paths returns false ``` Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown ``` AlterTableRenamePartitionSuite.ALTER TABLE .. RENAME PARTITION V1: multi part partition (existing test with modified FS) ``` == Results == !== Correct Answer - 1 == == Spark Answer - 0 == struct<> struct<> ![3,123,3] ``` Closes #32530 from YuzhouSun/SPARK-35106. Authored-by: Yuzhou Sun <yuzhosun@amazon.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a72d05c7e632fbb0d8a6082c3cacdf61f36518b4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 May 2021, 07:46:46 UTC
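The defensive pattern behind items 1-3 above can be sketched against the Hadoop `FileSystem` API roughly as follows (an illustrative helper, not the exact committer or catalog code):

```scala
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: make sure the destination's parent directory exists before renaming, and
// treat a `false` return from FileSystem#rename as a hard failure instead of
// silently losing data (HDFS returns false when the destination parent is missing).
def renameOrFail(fs: FileSystem, src: Path, dst: Path): Unit = {
  val parent = dst.getParent
  if (parent != null && !fs.exists(parent)) {
    fs.mkdirs(parent)
  }
  if (!fs.rename(src, dst)) {
    throw new IOException(s"Failed to rename $src to $dst")
  }
}
```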
3699a67 [SPARK-35425][BUILD][3.1] Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md ### What changes were proposed in this pull request? This PR backports SPARK-35425 (#32573). The following two things are done in this PR. * Add note about Jinja2 as a required dependency for document build. * Add Jinja2 dependency for the document build to `spark-rm/Dockerfile` ### Why are the changes needed? SPARK-35375(#32509) confined the version of Jinja to <3.0.0. So it's good to note about it in `docs/README.md` and add the dependency to `spark-rm/Dockerfile`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I confimed that `make html` succeed under `python/docs` with dependencies installed by both of the following commands. ``` pip install sphinx==3.0.4 mkdocs==1.1.2 numpy==1.19.4 pydata_sphinx_theme==0.4.1 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==2.11.3 pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme ipython nbsphinx numpydoc 'jinja2<3.0.0' ``` Closes #32580 from sarutak/backport-SPARK-35425-branch-3.1. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 May 2021, 16:37:41 UTC
38808c2 [SPARK-35411][SQL] Add essential information while serializing TreeNode to json ### What changes were proposed in this pull request? Write out Seq of product objects which contain TreeNode, to avoid the cases as described in https://issues.apache.org/jira/browse/SPARK-35411 that essential information will be ignored and just written out as null values. These information are necessary to understand the query plans. ### Why are the changes needed? Information like cteRelations in With node, and branches in CaseWhen expression are necessary to understand the query plans, they should be written out to the result json string. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT case added. Closes #32557 from ivoson/plan-json-fix. Authored-by: Tengfei Huang <tengfei.h@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9804f07c17af6d8e789f729d5872b85740cc3186) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 May 2021, 15:20:29 UTC
9dba27c [SPARK-35431][SQL][TESTS] Sort elements generated by collect_set in SQLQueryTestSuite ### What changes were proposed in this pull request? To pass `subquery/scalar-subquery/scalar-subquery-select.sql` (`SQLQueryTestSuite`) in Scala v2.13, this PR proposes to change the aggregate expr of a test query in the file from `collect_set(...)` to `sort_array(collect_set(...))` because `collect_set` depends on the `mutable.HashSet` implementation and elements in the set are printed in a different order in Scala v2.12/v2.13. ### Why are the changes needed? To pass the test in Scala v2.13. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually checked. Closes #32578 from maropu/FixSQLTestIssueInScala213. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 3b859a16c03fe0caaf8683d9cbc1d1c65551105a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 May 2021, 06:11:05 UTC
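For reference, the deterministic form of the aggregate used in the rewritten query looks like this in the Scala API, assuming a spark-shell session (illustrative, not the exact test query):

```scala
import org.apache.spark.sql.functions.{collect_set, sort_array}

// collect_set gives no ordering guarantee (and its iteration order differs between
// Scala 2.12 and 2.13), so wrap it in sort_array for stable golden-file output.
val df = spark.range(5).selectExpr("id % 2 AS k", "id AS v")
df.groupBy($"k")
  .agg(sort_array(collect_set($"v")).as("vs"))
  .show()
```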
aa5c72f [SPARK-35359][SQL] Insert data with char/varchar datatype will fail when data length exceed length limitation ### What changes were proposed in this pull request? This PR is used to fix this bug: ``` set spark.sql.legacy.charVarcharAsString=true; create table chartb01(a char(3)); insert into chartb01 select 'aaaaa'; ``` here we expect the data of table chartb01 is 'aaa', but it runs failed. ### Why are the changes needed? Improve backward compatibility ``` spark-sql> > create table tchar01(col char(2)) using parquet; Time taken: 0.767 seconds spark-sql> > insert into tchar01 select 'aaa'; ERROR | Executor task launch worker for task 0.0 in stage 0.0 (TID 0) | Aborting task | org.apache.spark.util.Utils.logError(Logging.scala:94) java.lang.RuntimeException: Exceeds char/varchar type length limitation: 2 at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.trimTrailingSpaces(CharVarcharCodegenUtils.java:31) at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.charTypeWriteSideCheck(CharVarcharCodegenUtils.java:44) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:279) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1500) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:288) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:212) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1466) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` ### Does this PR introduce _any_ user-facing change? No (the legacy config is false by default). ### How was this patch tested? Added unit tests. Closes #32501 from fhygh/master. Authored-by: fhygh <283452027@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3a3f8ca6f421b9bc51e0059c954262489aa41f5d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 May 2021, 16:14:01 UTC
f9a396c [SPARK-35413][INFRA] Use the SHA of the latest commit when checking out databricks/tpcds-kit ### What changes were proposed in this pull request? This PR proposes to use the SHA of the latest commit ([2a5078a782192ddb6efbcead8de9973d6ab4f069](https://github.com/databricks/tpcds-kit/commit/2a5078a782192ddb6efbcead8de9973d6ab4f069)) when checking out `databricks/tpcds-kit`. This can prevent the test workflow from breaking accidentally if the repository changes drastically. ### Why are the changes needed? For better test workflow. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? GA passed. Closes #32561 from maropu/UseRefInCheckout. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 2390b9dbcbc0b0377d694d2c3c2c0fa78179cbd6) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 17 May 2021, 00:26:27 UTC
c0f9eda [SPARK-35405][DOC] Submitting Applications documentation has outdated information about K8s client mode support ### What changes were proposed in this pull request? [Submitting Applications doc](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) has outdated information about K8s client mode support. It still says "Client mode is currently unsupported and will be supported in future releases". ![image](https://user-images.githubusercontent.com/31073930/118268920-b5b51580-b4c6-11eb-8eed-975be8d37964.png) Whereas it's already supported and [Running Spark on Kubernetes doc](https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode) says that it's supported started from 2.4.0 and has all needed information. ![image](https://user-images.githubusercontent.com/31073930/118268947-bd74ba00-b4c6-11eb-98d5-37961327642f.png) Changes: ![image](https://user-images.githubusercontent.com/31073930/118269179-12b0cb80-b4c7-11eb-8a37-d9d301bbda53.png) JIRA: https://issues.apache.org/jira/browse/SPARK-35405 ### Why are the changes needed? Outdated information in the doc is misleading ### Does this PR introduce _any_ user-facing change? Documentation changes ### How was this patch tested? Documentation changes Closes #32551 from o-shevchenko/SPARK-35405. Authored-by: Oleksandr Shevchenko <oleksandr.shevchenko@datarobot.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d2fbf0dce43cbc5bde99d572817011841d6a3d41) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 14 May 2021, 18:27:03 UTC
e8d70d2 [SPARK-35393][PYTHON][INFRA][TESTS] Recover pip packaging test in Github Actions Currently pip packaging test is being skipped: ``` ======================================================================== Running PySpark packaging tests ======================================================================== Constructing virtual env for testing Missing virtualenv & conda, skipping pip installability tests Cleaning up temporary directory - /tmp/tmp.iILYWISPXW ``` See https://github.com/apache/spark/runs/2568923639?check_suite_focus=true GitHub Actions's image has its default Conda installed at `/usr/share/miniconda` but seems like the image we're using for PySpark does not have it (which is legitimate). This PR proposes to install Conda to use in pip packaging tests in GitHub Actions. To recover the test coverage. No, dev-only. It was tested in my fork: https://github.com/HyukjinKwon/spark/runs/2575126882?check_suite_focus=true ``` ======================================================================== Running PySpark packaging tests ======================================================================== Constructing virtual env for testing Using conda virtual environments Testing pip installation with python 3.6 Using /tmp/tmp.qPjTenqfGn for virtualenv Collecting package metadata (current_repodata.json): ...working... done Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source. Collecting package metadata (repodata.json): ...working... done Solving environment: ...working... done environment location: /tmp/tmp.qPjTenqfGn/3.6 added / updated specs: - numpy - pandas - pip - python=3.6 - setuptools ... Successfully ran pip sanity check ``` Closes #32537 from HyukjinKwon/SPARK-35393. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 7d371d27f2a974b682ffa16b71576e61e9338c34) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 May 2021, 01:46:37 UTC
67e4c94 [SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs ### What changes were proposed in this pull request? This PR fixes the same issue as #32424. ```py from pyspark.sql.functions import flatten, struct, transform df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters") df.select(flatten( transform( "numbers", lambda number: transform( "letters", lambda letter: struct(number.alias("n"), letter.alias("l")) ) ) ).alias("zipped")).show(truncate=False) ``` **Before:** ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| +------------------------------------------------------------------------+ ``` **After:** ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| +------------------------------------------------------------------------+ ``` ### Why are the changes needed? To produce the correct results. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the results to be correct as mentioned above. ### How was this patch tested? Added a unit test as well as manually. Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 17b59a9970a0079ac9225de52247a1de4772c1fa) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 May 2021, 05:58:14 UTC
82e461a [SPARK-35381][R] Fix lambda variable name issues in nested higher order functions at R APIs This PR fixes the same issue as https://github.com/apache/spark/pull/32424 ```r df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters") collect(select( df, array_transform("numbers", function(number) { array_transform("letters", function(latter) { struct(alias(number, "n"), alias(latter, "l")) }) }) )) ``` **Before:** ``` ... a, a, b, b, c, c, a, a, b, b, c, c, a, a, b, b, c, c ``` **After:** ``` ... 1, a, 1, b, 1, c, 2, a, 2, b, 2, c, 3, a, 3, b, 3, c ``` To produce the correct results. Yes, it fixes the results to be correct as mentioned above. Manually tested as above, and unit test was added. Closes #32517 from HyukjinKwon/SPARK-35381. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ecb48ccb7db11f15b9420aaee57594dc4f9d448f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2021, 07:53:28 UTC
f88a522 [MINOR][DOCS] Avoid some python docs where first sentence has "e.g." or similar Avoid some python docs where first sentence has "e.g." or similar as the period causes the docs to show only half of the first sentence as the summary. See for example https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegressionModel.html?highlight=linearregressionmodel#pyspark.ml.regression.LinearRegressionModel.summary where the method description is clearly truncated. Only changes docs. Manual testing of docs. Closes #32508 from srowen/TruncatedPythonDesc. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a189be8754f474adbae477ed366750e2f85418d6) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2021, 01:39:58 UTC
f9de071 [SPARK-35375][INFRA] Use Jinja2 < 3.0.0 for Python linter dependency in GA ### What changes were proposed in this pull request? From a few hours ago, Python linter fails in GA. The latest Jinja 3.0.0 seems to cause this failure. https://pypi.org/project/Jinja2/ ``` Run ./dev/lint-python starting python compilation test... python compilation succeeded. starting pycodestyle test... pycodestyle checks passed. starting flake8 test... flake8 checks passed. starting mypy test... mypy checks passed. starting sphinx-build tests... sphinx-build checks failed: Running Sphinx v3.0.4 making output directory... done [autosummary] generating autosummary for: development/contributing.rst, development/debugging.rst, development/index.rst, development/setting_ide.rst, development/testing.rst, getting_started/index.rst, getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., reference/pyspark.ml.rst, reference/pyspark.mllib.rst, reference/pyspark.resource.rst, reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, user_guide/index.rst, user_guide/python_packaging.rst Exception occurred: File "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst", line 26, in top-level template code {% if '__init__' in methods %} jinja2.exceptions.UndefinedError: 'methods' is undefined The full traceback has been saved in /tmp/sphinx-err-ypgyi75y.log, if you want to report the issue to the developers. Please also report this if it was a user error, so that a better error message can be provided next time. A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks! make: *** [Makefile:20: html] Error 2 re-running make html to print full warning list: Running Sphinx v3.0.4 making output directory... done [autosummary] generating autosummary for: development/contributing.rst, development/debugging.rst, development/index.rst, development/setting_ide.rst, development/testing.rst, getting_started/index.rst, getting_started/install.rst, getting_started/quickstart.ipynb, index.rst, migration_guide/index.rst, ..., reference/pyspark.ml.rst, reference/pyspark.mllib.rst, reference/pyspark.resource.rst, reference/pyspark.rst, reference/pyspark.sql.rst, reference/pyspark.ss.rst, reference/pyspark.streaming.rst, user_guide/arrow_pandas.rst, user_guide/index.rst, user_guide/python_packaging.rst Exception occurred: File "/__w/spark/spark/python/docs/source/_templates/autosummary/class_with_docs.rst", line 26, in top-level template code {% if '__init__' in methods %} jinja2.exceptions.UndefinedError: 'methods' is undefined The full traceback has been saved in /tmp/sphinx-err-fvtmvvwv.log, if you want to report the issue to the developers. Please also report this if it was a user error, so that a better error message can be provided next time. A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks! make: *** [Makefile:20: html] Error 2 Error: Process completed with exit code 2. ``` ### Why are the changes needed? To recover GA build. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA. Closes #32509 from sarutak/fix-python-lint-error. 
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit af0d99cce63a10aebd8104f12937e462f2873eac) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2021, 01:13:46 UTC
9623ccb [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM ### What changes were proposed in this pull request? This patch proposes to increase the maximum heap memory setting for release build. ### Why are the changes needed? When I was cutting RCs for 2.4.8, I frequently encountered OOM during building using mvn. It happens many times until I increased the heap memory setting. I am not sure if other release managers encounter the same issue. So I propose to increase the heap memory setting and see if it looks good for others. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Manually used it during cutting RCs of 2.4.8. Closes #32487 from viirya/release-mvn-oom. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 20d32242a2574637d18771c2f7de5c9878f85bad) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 May 2021, 07:29:20 UTC
4136023 [SPARK-34795][SPARK-35192][SPARK-35293][SPARK-35327][SQL][TESTS][3.1] Adds a new job in GitHub Actions to check the output of TPC-DS queries ### What changes were proposed in this pull request? This PR proposes to add a new job in GitHub Actions to check the output of TPC-DS queries. NOTE: To generate TPC-DS table data in GA jobs, this PR includes generator code implemented in https://github.com/apache/spark/pull/32243 and https://github.com/apache/spark/pull/32460. This is the backport PR of https://github.com/apache/spark/pull/31886. ### Why are the changes needed? There are some cases where we noticed runtime-related bugs after merging commits (e.g., SPARK-33822). Therefore, I think it is worth adding a new job in GitHub Actions to check the query output of TPC-DS (sf=1). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The new test added. Closes #32462 from maropu/TPCDSQUeryTestSuite-Branch3.1. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 May 2021, 21:44:58 UTC
53a93fc Revert "[SPARK-35321][SQL][3.1] Don't register Hive permanent functions when creating Hive client" This reverts commit 6fbea6a38ddd0c95d54a71c850e0d901727ed842. 08 May 2021, 20:01:45 UTC
6fbea6a [SPARK-35321][SQL][3.1] Don't register Hive permanent functions when creating Hive client This is a backport of #32446 for branch-3.1 ### What changes were proposed in this pull request? Instantiate a new Hive client through `Hive.getWithFastCheck(conf, false)` instead of `Hive.get(conf)`. ### Why are the changes needed? [HIVE-10319](https://issues.apache.org/jira/browse/HIVE-10319) introduced a new API `get_all_functions` which is only supported in Hive 1.3.0/2.0.0 and up. As result, when Spark 3.x talks to a HMS service of version 1.2 or lower, the following error will occur: ``` Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) ... 96 more Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_all_functions(ThriftHiveMetastore.java:3845) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_all_functions(ThriftHiveMetastore.java:3833) ``` The `get_all_functions` is called only when `doRegisterAllFns` is set to true: ```java private Hive(HiveConf c, boolean doRegisterAllFns) throws HiveException { conf = c; if (doRegisterAllFns) { registerAllFunctionsOnce(); } } ``` what this does is to register all Hive permanent functions defined in HMS in Hive's `FunctionRegistry` class, via iterating through results from `get_all_functions`. To Spark, this seems unnecessary as it loads Hive permanent (not built-in) UDF via directly calling the HMS API, i.e., `get_function`. The `FunctionRegistry` is only used in loading Hive's built-in function that is not supported by Spark. At this time, it only applies to `histogram_numeric`. ### Does this PR introduce _any_ user-facing change? Yes with this fix Spark now should be able to talk to HMS server with Hive 1.2.x and lower (with HIVE-24608 too) ### How was this patch tested? Manually started a HMS server of Hive version 1.2.2, with patched Hive 2.3.8 using HIVE-24608. Without the PR it failed with the above exception. With the PR the error disappeared and I can successfully perform common operations such as create table, create database, list tables, etc. Closes #32472 from sunchao/SPARK-35321-3.1. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 May 2021, 00:40:05 UTC
a4c34ba [SPARK-35288][SQL] StaticInvoke should find the method without exact argument classes match ### What changes were proposed in this pull request? This patch proposes to make StaticInvoke able to find method with given method name even the parameter types do not exactly match to argument classes. ### Why are the changes needed? Unlike `Invoke`, `StaticInvoke` only tries to get the method with exact argument classes. If the calling method's parameter types are not exactly matched with the argument classes, `StaticInvoke` cannot find the method. `StaticInvoke` should be able to find the method under the cases too. ### Does this PR introduce _any_ user-facing change? Yes. `StaticInvoke` can find a method even the argument classes are not exactly matched. ### How was this patch tested? Unit test. Closes #32413 from viirya/static-invoke. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 33fbf5647b4a5587c78ac51339c0cbc9d70547a4) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 07 May 2021, 16:08:13 UTC
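The relaxed lookup can be pictured as scanning `getMethods` and accepting assignable parameter types rather than requiring an exact class match. A standalone reflection sketch of that idea (not Spark's actual `StaticInvoke` code, and it ignores primitive/boxed subtleties):

```scala
import java.lang.reflect.Method

// Find a public method by name whose declared parameter types can accept the given
// argument classes, instead of requiring an exact class match (the old behaviour).
def findMethod(cls: Class[_], name: String, argClasses: Seq[Class[_]]): Option[Method] =
  cls.getMethods.find { m =>
    m.getName == name &&
      m.getParameterTypes.length == argClasses.length &&
      m.getParameterTypes.zip(argClasses).forall { case (p, a) => p.isAssignableFrom(a) }
  }

// append(CharSequence) is matched even though we only know the argument is a String.
assert(findMethod(classOf[java.lang.StringBuilder], "append", Seq(classOf[String])).isDefined)
```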
6df4ec0 [SPARK-34794][SQL] Fix lambda variable name issues in nested DataFrame functions ### What changes were proposed in this pull request? To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions. This is the rework of #31887. Closes #31887. ### Why are the changes needed? This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable) For this query: ``` val df = Seq( (Seq(1,2,3), Seq("a", "b", "c")) ).toDF("numbers", "letters") df.select( f.flatten( f.transform( $"numbers", (number: Column) => { f.transform( $"letters", (letter: Column) => { f.struct( number.as("number"), letter.as("letter") ) } ) } ) ).as("zipped") ).show(10, false) ``` This is the current (incorrect) output: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| +------------------------------------------------------------------------+ ``` And this is the correct output after fix: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| +------------------------------------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added the new test in `DataFrameFunctionsSuite`. Closes #32424 from maropu/pr31887. Lead-authored-by: dsolow <dsolow@sayari.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: dmsolow <dsolow@sayarianalytics.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit f550e03b96638de93381734c4eada2ace02d9a4f) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 05 May 2021, 03:46:30 UTC
89f5ec7 [SPARK-35250][SQL][DOCS] Fix duplicated STOP_AT_DELIMITER to SKIP_VALUE at CSV's unescapedQuoteHandling option documentation ### What changes were proposed in this pull request? This is rather a followup of https://github.com/apache/spark/pull/30518 that should be ported back to `branch-3.1` too. `STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation. ### Why are the changes needed? To correctly document. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the user-facing documentation. ### How was this patch tested? I checked them via running linters. Closes #32423 from HyukjinKwon/SPARK-35250. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 8aaa9e890acfa318775627fbe6d12eaf34bb35b1) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 May 2021, 23:44:36 UTC
25002e0 [SPARK-35278][SQL] Invoke should find the method with correct number of parameters ### What changes were proposed in this pull request? This patch fixes `Invoke` expression when the target object has more than one method with the given method name. ### Why are the changes needed? `Invoke` will find out the method on the target object with given method name. If there are more than one method with the name, currently it is undeterministic which method will be used. We should add the condition of parameter number when finding the method. ### Does this PR introduce _any_ user-facing change? Yes, fixed a bug when using `Invoke` on a object where more than one method with the given method name. ### How was this patch tested? Unit test. Closes #32404 from viirya/verify-invoke-param-len. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 6ce1b161e96176777344beb610163636e7dfeb00) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 01 May 2021, 17:20:59 UTC
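The fix amounts to filtering candidate overloads by parameter count before picking one, instead of taking whichever method reflection happens to return first for the name. A small sketch of that selection rule (illustrative only; the real `Invoke` also checks types):

```scala
// An object with two overloads of the same name: the arity must disambiguate them.
object Target {
  def combine(a: Int): Int = a
  def combine(a: Int, b: Int): Int = a + b
}

val candidates = Target.getClass.getMethods.filter(_.getName == "combine")

// Old behaviour: candidates.head is whichever overload reflection lists first.
// Fixed behaviour: also require the parameter count to match the arguments we have.
val twoArg = candidates.find(_.getParameterTypes.length == 2).get
assert(twoArg.invoke(Target, Integer.valueOf(1), Integer.valueOf(2)) == 3)
```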
c245d84 [SPARK-35226][SQL] Support refreshKrb5Config option in JDBC datasources ### What changes were proposed in this pull request? This PR proposes to introduce a new JDBC option `refreshKrb5Config` which allows to reflect the change of `krb5.conf`. ### Why are the changes needed? In the current master, JDBC datasources can't accept `refreshKrb5Config` which is defined in `Krb5LoginModule`. So even if we change the `krb5.conf` after establishing a connection, the change will not be reflected. The similar issue happens when we run multiple `*KrbIntegrationSuites` at the same time. `MiniKDC` starts and stops every KerberosIntegrationSuite and different port number is recorded to `krb5.conf`. Due to `SecureConnectionProvider.JDBCConfiguration` doesn't take `refreshKrb5Config`, KerberosIntegrationSuites except the first running one see the wrong port so those suites fail. You can easily confirm with the following command. ``` build/sbt -Phive Phive-thriftserver -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.*KrbIntegrationSuite" ``` ### Does this PR introduce _any_ user-facing change? Yes. Users can set `refreshKrb5Config` to refresh krb5 relevant configuration. ### How was this patch tested? New test. Closes #32344 from sarutak/kerberos-refresh-issue. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit 529b875901a91a03caeb73d9eb7b3008b552c736) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 29 April 2021, 04:56:13 UTC
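Usage-wise, the new option is passed like any other JDBC option alongside the Kerberos credentials. A hedged example assuming an active `spark` session; the URL, principal, and keytab values are placeholders:

```scala
// Illustrative reader configuration; connection details are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com:5432/mydb")
  .option("dbtable", "public.my_table")
  .option("principal", "spark/worker@EXAMPLE.COM")
  .option("keytab", "/etc/security/keytabs/spark.keytab")
  .option("refreshKrb5Config", "true") // re-read krb5.conf when building the connection
  .load()
```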
db8204e [SPARK-35159][SQL][DOCS][3.1] Extract Hive format doc ### What changes were proposed in this pull request? Extract the common documentation about the Hive format so that `sql-ref-syntax-ddl-create-table-hiveformat.md` and `sql-ref-syntax-qry-select-transform.md` can refer to it. ![image](https://user-images.githubusercontent.com/46485123/115802193-04641800-a411-11eb-827d-d92544881842.png) ### Why are the changes needed? Improve the docs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed. Closes #32379 from AngersZhuuuu/SPARK-35159-3.1. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 28 April 2021, 10:13:33 UTC
361e684 [SPARK-33976][SQL][DOCS][3.1] Add a SQL doc page for the TRANSFORM clause ### What changes were proposed in this pull request? Add documentation about the `TRANSFORM` clause and related functions. ![image](https://user-images.githubusercontent.com/46485123/114332579-1627fe80-9b79-11eb-8fa7-131f0a20f72f.png) ### Why are the changes needed? Improve the SQL documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed. Closes #32376 from AngersZhuuuu/SPARK-33976-3.1. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 28 April 2021, 07:51:44 UTC
e58055b [SPARK-35244][SQL] Invoke should throw the original exception ### What changes were proposed in this pull request? This PR updates the interpreted code path of invoke expressions, to unwrap the `InvocationTargetException` ### Why are the changes needed? Make interpreted and codegen path consistent for invoke expressions. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new UT Closes #32370 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> 28 April 2021, 05:59:49 UTC
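The interpreted path now surfaces the underlying error the same way the generated code does, i.e. by unwrapping the reflection wrapper. A minimal standalone sketch of that pattern (not the exact Spark code):

```scala
import java.lang.reflect.InvocationTargetException

object Thrower { def boom(): Unit = throw new IllegalStateException("original error") }

val m = Thrower.getClass.getMethod("boom")
try {
  try m.invoke(Thrower)
  catch {
    // Re-throw the original exception instead of the reflective wrapper,
    // matching what the codegen path reports.
    case e: InvocationTargetException => throw e.getCause
  }
} catch {
  case e: IllegalStateException => assert(e.getMessage == "original error")
}
```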
cb1d802 [SPARK-35227][BUILD] Update the resolver for spark-packages in SparkSubmit ### What changes were proposed in this pull request? This change is to use repos.spark-packages.org instead of Bintray as the repository service for spark-packages. ### Why are the changes needed? The change is needed because Bintray will no longer be available from May 1st. ### Does this PR introduce _any_ user-facing change? This should be transparent for users who use SparkSubmit. ### How was this patch tested? Tested running spark-shell with --packages manually. Closes #32346 from bozhang2820/replace-bintray. Authored-by: Bo Zhang <bo.zhang@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit f738fe07b6fc85c880b64a1cc2f6c7cc1cc1379b) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 27 April 2021, 01:59:45 UTC
526490f [SPARK-28247][SS][TEST] Fix flaky test "query without test harness" on ContinuousSuite ### What changes were proposed in this pull request? This is another attempt to fix the flaky test "query without test harness" on ContinuousSuite. `query without test harness` is flaky because it starts a continuous query with two partitions but assumes they will run at the same speed. In this test, 0 and 2 will be written to partition 0, 1 and 3 will be written to partition 1. It assumes when we see 3, 2 should be written to the memory sink. But this is not guaranteed. We can add `if (currentValue == 2) Thread.sleep(5000)` at this line https://github.com/apache/spark/blob/b2a2b5d8206b7c09b180b8b6363f73c6c3fdb1d8/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala#L135 to reproduce the failure: `Result set Set([0], [1], [3]) are not a superset of Set(0, 1, 2, 3)!` The fix is changing `waitForRateSourceCommittedValue` to wait until all partitions reach the desired values before stopping the query. ### Why are the changes needed? Fix a flaky test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Manually verify the reproduction I mentioned above doesn't fail after this change. Closes #32316 from zsxwing/SPARK-28247-fix. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 0df3b501aeb2a88997e5a68a6a8f8e7f5196342c) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 26 April 2021, 23:07:36 UTC
c59db3d [SPARK-35224][SQL][TESTS][3.1][3.0] Fix buffer overflow in `MutableProjectionSuite` ### What changes were proposed in this pull request? In the test `"unsafe buffer with NO_CODEGEN"` of `MutableProjectionSuite`, fix the unsafe buffer size calculation so that all input fields plus the metadata fit without buffer overflow. ### Why are the changes needed? To make the test suite `MutableProjectionSuite` more stable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suite: ``` $ build/sbt "test:testOnly *MutableProjectionSuite" ``` Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit d572a859891547f73b57ae8c9a3b800c48029678) Closes #32347 from MaxGekk/fix-buffer-overflow-MutableProjectionSuite-3.1. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 26 April 2021, 09:41:15 UTC
eaceb40 [SPARK-35213][SQL] Keep the correct ordering of nested structs in chained withField operations ### What changes were proposed in this pull request? Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue. ### Why are the changes needed? Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in https://github.com/apache/spark/pull/29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema. ### Does this PR introduce _any_ user-facing change? It fixes exceptions and incorrect results for valid uses in the latest Spark release. ### How was this patch tested? Added new unit tests for these edge cases. Closes #32338 from Kimahriman/bug/optimize-with-fields. Authored-by: Adam Binford <adamq43@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 74afc68e2172cb0dc3567e12a8a2c304bb7ea138) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 26 April 2021, 06:40:10 UTC
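The affected pattern is chained `withField` updates on the same nested struct. A hedged example constructed from the description above (assuming a spark-shell session; not taken from the added tests), where after the fix the inner struct keeps its original field order so the schema and data stay in sync:

```scala
import org.apache.spark.sql.functions.{lit, struct}

// A nested struct column and two chained withField updates on the same inner struct.
val df = spark.range(1).select(
  struct(
    struct(lit(1).as("a"), lit(2).as("b")).as("inner")
  ).as("outer")
)

val updated = df.withColumn(
  "outer",
  $"outer"
    .withField("inner.b", lit(20))
    .withField("inner.a", lit(10))
)

// With the fix, "inner" still reads as struct<a,b> rather than a reordered struct.
updated.printSchema()
updated.show(truncate = false)
```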
6595db2 [SPARK-35087][UI] Some columns in table Aggregated Metrics by Executor of stage-detail page shows incorrectly. ### What changes were proposed in this pull request? Columns like 'Shuffle Read Size / Records', 'Output Size / Records' etc. in the `Aggregated Metrics by Executor` table of the stage-detail page should be sorted in numerical order instead of lexicographical order. ### Why are the changes needed? Bug fix; the sorting style should be consistent between different columns. The correspondence between the table and the index is shown below (it is defined in stagespage-template.html):
| index | column name |
| ----- | -------------------------------------- |
| 0 | Executor ID |
| 1 | Logs |
| 2 | Address |
| 3 | Task Time |
| 4 | Total Tasks |
| 5 | Failed Tasks |
| 6 | Killed Tasks |
| 7 | Succeeded Tasks |
| 8 | Excluded |
| 9 | Input Size / Records |
| 10 | Output Size / Records |
| 11 | Shuffle Read Size / Records |
| 12 | Shuffle Write Size / Records |
| 13 | Spill (Memory) |
| 14 | Spill (Disk) |
| 15 | Peak JVM Memory OnHeap / OffHeap |
| 16 | Peak Execution Memory OnHeap / OffHeap |
| 17 | Peak Storage Memory OnHeap / OffHeap |
| 18 | Peak Pool Memory Direct / Mapped |
I constructed some data to simulate the sorting results of the index columns from 9 to 18. As shown below, it can be seen that the sorting results of columns 9-12 are wrong: ![simulate-result](https://user-images.githubusercontent.com/52202080/115120775-c9fa1580-9fe1-11eb-8514-71f29db3a5eb.png) The reason is that the real data corresponding to columns 9-12 (note that it is not the data displayed on the page) are **all strings similar to `94685/131` (bytes/records), while the real data corresponding to columns 13-18 are all numbers,** so the sorting of columns 13-18 looks correct, but the results of columns 9-12 are incorrect because the strings are sorted in lexicographical order. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Only JS was modified, and the manual test result works well. **before the change:** ![looks-illegal](https://user-images.githubusercontent.com/52202080/115120812-06c60c80-9fe2-11eb-9ada-fa520fe43c4e.png) **after the change:** ![sort-result-corrent](https://user-images.githubusercontent.com/52202080/114865187-7c847980-9e24-11eb-9fbc-39ee224726d6.png) Closes #32190 from kyoty/aggregated-metrics-by-executor-sorted-incorrectly. Authored-by: kyoty <echohlne@gmail.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit 2d6467d6d1633a06b6416eacd4f8167973e88136) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 26 April 2021, 03:13:54 UTC
e053679 [SPARK-35168][SQL] mapred.reduce.tasks should be shuffle.partitions not adaptive.coalescePartitions.initialPartitionNum ### What changes were proposed in this pull request? ```sql spark-sql> set spark.sql.adaptive.coalescePartitions.initialPartitionNum=1; spark.sql.adaptive.coalescePartitions.initialPartitionNum 1 Time taken: 2.18 seconds, Fetched 1 row(s) spark-sql> set mapred.reduce.tasks; 21/04/21 14:27:11 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead. spark.sql.shuffle.partitions 1 Time taken: 0.03 seconds, Fetched 1 row(s) spark-sql> set spark.sql.shuffle.partitions; spark.sql.shuffle.partitions 200 Time taken: 0.024 seconds, Fetched 1 row(s) spark-sql> set mapred.reduce.tasks=2; 21/04/21 14:31:52 WARN SetCommand: Property mapred.reduce.tasks is deprecated, automatically converted to spark.sql.shuffle.partitions instead. spark.sql.shuffle.partitions 2 Time taken: 0.017 seconds, Fetched 1 row(s) spark-sql> set mapred.reduce.tasks; 21/04/21 14:31:55 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead. spark.sql.shuffle.partitions 1 Time taken: 0.017 seconds, Fetched 1 row(s) spark-sql> ``` `mapred.reduce.tasks` maps to `spark.sql.shuffle.partitions` on the write side, but `spark.sql.adaptive.coalescePartitions.initialPartitionNum` might take precedence over `spark.sql.shuffle.partitions`. ### Why are the changes needed? To make `mapred.reduce.tasks` round-trip correctly. ### Does this PR introduce _any_ user-facing change? Yes, `mapred.reduce.tasks` will always report `spark.sql.shuffle.partitions` whether `spark.sql.adaptive.coalescePartitions.initialPartitionNum` exists or not. ### How was this patch tested? A new test. Closes #32265 from yaooqinn/SPARK-35168. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 5b1353f690bf416fdb3a34c94741425b95f97308) Signed-off-by: Kent Yao <yao@apache.org> 25 April 2021, 12:27:37 UTC
06f0560 [SPARK-35210][BUILD][3.1] Upgrade Jetty to 9.4.40 to fix ERR_CONNECTION_RESET issue ### What changes were proposed in this pull request? This PR backports SPARK-35210 (#32318). This PR proposes to upgrade Jetty to 9.4.40. ### Why are the changes needed? SPARK-34988 (#32091) upgraded Jetty to 9.4.39 for CVE-2021-28165. But after the upgrade, Jetty 9.4.40 was released to fix the ERR_CONNECTION_RESET issue (https://github.com/eclipse/jetty.project/issues/6152). This issue seems to affect Jetty 9.4.39 when the POST method is used with SSL. For Spark, job submission using REST and ThriftServer with the HTTPS protocol can be affected. ### Does this PR introduce _any_ user-facing change? No. No released version uses Jetty 9.4.39. ### How was this patch tested? CI. Closes #32324 from sarutak/backport-3.1-SPARK-35210. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 24 April 2021, 16:39:32 UTC
26e8b47 [SPARK-34897][SQL][3.1] Support reconcile schemas based on index after nested column pruning This PR backports https://github.com/apache/spark/pull/31993 to branch-3.1. The original PR description: ### What changes were proposed in this pull request? It will remove `StructField` when [pruning nested columns](https://github.com/apache/spark/blob/0f2c0b53e8fb18c86c67b5dd679c006db93f94a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L28-L42). For example: ```scala spark.sql( """ |CREATE TABLE t1 ( | _col0 INT, | _col1 STRING, | _col2 STRUCT<c1: STRING, c2: STRING, c3: STRING, c4: BIGINT>) |USING ORC |""".stripMargin) spark.sql("INSERT INTO t1 values(1, '2', struct('a', 'b', 'c', 10L))") spark.sql("SELECT _col0, _col2.c1 FROM t1").show ``` Before this PR, the returned schema is: ``` `_col0` INT,`_col2` STRUCT<`c1`: STRING> ``` and it will throw an exception: ``` java.lang.AssertionError: assertion failed: The given data schema struct<_col0:int,_col2:struct<c1:string>> has less fields than the actual ORC physical schema, no idea which columns were dropped, fail to read. at scala.Predef$.assert(Predef.scala:223) at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.requestedColumnIds(OrcUtils.scala:160) ``` After this PR, the returned schema is: ``` `_col0` INT,`_col1` STRING,`_col2` STRUCT<`c1`: STRING> ```. The final schema is ``` `_col0` INT,`_col2` STRUCT<`c1`: STRING> ``` after the complete column pruning: https://github.com/apache/spark/blob/7a5647a93aaea9d1d78d9262e24fc8c010db04d0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L208-L213 https://github.com/apache/spark/blob/e64eb75aede71a5403a4d4436e63b1fcfdeca14d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala#L96-L97 ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #32279 from wangyum/SPARK-34897-3.1. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> 23 April 2021, 07:08:11 UTC
60a2dd4 [SPARK-35127][UI] When we switch between different stage-detail pages, the entry item in the newly-opened page may be blank ### What changes were proposed in this pull request? Make sure that pageSize is not shared between different stage pages. The screenshots of the problem are placed in the attachment of [JIRA](https://issues.apache.org/jira/browse/SPARK-35127) ### Why are the changes needed? Fix the bug. According to the reference `https://datatables.net/reference/option/lengthMenu`, `-1` means displaying all rows, but we currently use `totalTasksToShow`, which causes the select item to show as empty when we switch between different stage-detail pages. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test. It is a small UI problem, and the modification does not affect functionality; it is just an adjustment of the JS configuration. The gif below shows how the problem can be reproduced: ![reproduce](https://user-images.githubusercontent.com/52202080/115204351-f7060f80-a12a-11eb-8900-a009ad0c8870.gif) ![微信截图_20210419162849](https://user-images.githubusercontent.com/52202080/115205675-629cac80-a12c-11eb-9cb8-1939c7450e99.png) The gif below shows the result after the change: ![after_modified](https://user-images.githubusercontent.com/52202080/115204886-91fee980-a12b-11eb-9ccb-d5900a99095d.gif) Closes #32223 from kyoty/stages-task-empty-pagesize. Authored-by: kyoty <echohlne@gmail.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit 7242d7f774b821cbcb564c8afeddecaf6664d159) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 22 April 2021, 12:00:27 UTC
cc62c49 [SPARK-34674][CORE][K8S] Close SparkContext after the Main method has finished ### What changes were proposed in this pull request? Close SparkContext after the Main method has finished, to allow SparkApplication on K8S to complete. This is a fixed version of the [merged and reverted PR](https://github.com/apache/spark/pull/32081). ### Why are the changes needed? If sparkContext.stop() is not called explicitly, the Spark driver process doesn't terminate even after its Main method has completed. This behaviour is different from Spark on YARN, where manually stopping the SparkContext is not required. It looks like the problem is the use of non-daemon threads, which prevent the driver JVM process from terminating. So I have inserted code that closes the SparkContext automatically. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually on the production AWS EKS environment in my company. Closes #32283 from kotlovs/close-spark-context-on-exit-2. Authored-by: skotlov <skotlov@joom.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b17a0e6931cac98cc839c047b1b5d4ea6d052009) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 April 2021, 05:54:31 UTC
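A user-level sketch of the same pattern; the PR itself changes Spark's launcher code path, so nothing below is the actual change, just the idea of stopping whatever is still active once the job logic finishes so the driver JVM can exit:

```scala
import org.apache.spark.sql.SparkSession

object ExampleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("example").getOrCreate()
    try {
      spark.range(10).count()      // placeholder for the real job logic
    } finally {
      spark.stop()                 // stops the underlying SparkContext so the driver JVM can exit
    }
  }
}
```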
b41624b [SPARK-35178][BUILD][FOLLOWUP][3.1] Update Zinc argument ### What changes were proposed in this pull request? This is a follow-up to adjust zinc installation with the new parameter. ### Why are the changes needed? Currently, zinc is ignored. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. **BEFORE** ``` $ build/mvn clean exec: curl --silent --show-error -L https://downloads.lightbend.com/zinc/0.3.15 tar: Error opening archive: Unrecognized archive format exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.10/scala-2.12.10.tgz exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz?action=download build/mvn: line 149: /Users/dongjoon/APACHE/spark-merge/build/zinc-0.3.15/bin/zinc: No such file or directory build/mvn: line 151: /Users/dongjoon/APACHE/spark-merge/build/zinc-0.3.15/bin/zinc: No such file or directory Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn [INFO] Scanning for projects... ``` **AFTER** ``` $ build/mvn clean exec: curl --silent --show-error -L https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.10/scala-2.12.10.tgz exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz?action=download Using `mvn` from path: /Users/dongjoon/PRS/SPARK-PR-32282/build/apache-maven-3.6.3/bin/mvn [INFO] Scanning for projects... ``` Closes #32282 from dongjoon-hyun/SPARK-35178. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 April 2021, 02:51:57 UTC
3c5b3e2 [SPARK-35178][BUILD] Use new Apache 'closer.lua' syntax to obtain Maven ### What changes were proposed in this pull request? Use new Apache 'closer.lua' syntax to obtain Maven ### Why are the changes needed? The current closer.lua redirector, which redirects to download Maven from a local mirror, has a new syntax. build/mvn does not work properly otherwise now. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual testing. Closes #32277 from srowen/SPARK-35178. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6860efe4688ea1057e7a2e57b086721e62198d44) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 April 2021, 01:49:32 UTC
0208810 [SPARK-35142][PYTHON][ML] Fix incorrect return type for `rawPredictionUDF` in `OneVsRestModel` ### What changes were proposed in this pull request? Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`. ### Why are the changes needed? Bugfix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #32245 from harupy/SPARK-35142. Authored-by: harupy <17039389+harupy@users.noreply.github.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> (cherry picked from commit b6350f5bb00f99a060953850b069a419b70c329e) Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 21 April 2021, 08:29:38 UTC
2bbe0a4 [SPARK-35096][SQL] SchemaPruning should adhere spark.sql.caseSensitive config ### What changes were proposed in this pull request? As a part of SPARK-26837, pruning of nested fields from object serializers is supported, but it missed handling the case-insensitivity nature of Spark. In this PR I have resolved the column names to be pruned based on the `spark.sql.caseSensitive` config. **Exception Before Fix** ``` Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.immutable.List.map(List.scala:298) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) at ``` ### Why are the changes needed? After upgrading to Spark 3, the `foreachBatch` API throws `java.lang.ArrayIndexOutOfBoundsException`. This issue is fixed by this PR. ### Does this PR introduce _any_ user-facing change? No; in fact, it fixes the regression. ### How was this patch tested? Added tests and also verified manually. Closes #32194 from sandeep-katta/SPARK-35096. Authored-by: sandeep.katta <sandeep.katta2007@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4f309cec07b1b1e5f7cddb0f98598fb1a234c2bd) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 April 2021, 07:16:39 UTC
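A minimal sketch of the idea with a hypothetical helper (not the code in `SchemaPruning`): field-name matching has to follow `spark.sql.caseSensitive` instead of always comparing exactly.

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper: match a requested field name against the struct's actual fields,
// comparing case-insensitively unless spark.sql.caseSensitive is enabled.
def resolveField(struct: StructType, name: String, caseSensitive: Boolean): Option[StructField] =
  if (caseSensitive) struct.fields.find(_.name == name)
  else struct.fields.find(_.name.equalsIgnoreCase(name))
```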
5f48abe [SPARK-34639][SQL][3.1] RelationalGroupedDataset.alias should not create UnresolvedAlias ### What changes were proposed in this pull request? This PR partially backports https://github.com/apache/spark/pull/31758 to 3.1, to fix a backward compatibility issue caused by https://github.com/apache/spark/pull/28490 The query below has different output schemas in 3.0 and 3.1 ``` sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s")) ``` In 3.0 the output column name is `col1`, in 3.1 it's `s.col1`. This breaks existing queries. In https://github.com/apache/spark/pull/28490 , we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` will become `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to directly resolve `s.col` to `s.col1 AS col1` but after #28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed and resulting in `s.col1`, which will then be resolved in `ResolveAliases` as `s.col1 AS s.col1` #31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`. ### Why are the changes needed? Fix an unexpected query output schema change ### Does this PR introduce _any_ user-facing change? Yes as explained above. ### How was this patch tested? updated test Closes #32239 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 21 April 2021, 00:14:11 UTC
034ba76 [SPARK-35080][SQL] Only allow a subset of correlated equality predicates when a subquery is aggregated This PR updated the `foundNonEqualCorrelatedPred` logic for correlated subqueries in `CheckAnalysis` to only allow correlated equality predicates that guarantee one-to-one mapping between inner and outer attributes, instead of all equality predicates. To fix correctness bugs. Before this fix Spark can give wrong results for certain correlated subqueries that pass CheckAnalysis: Example 1: ```sql create or replace view t1(c) as values ('a'), ('b') create or replace view t2(c) as values ('ab'), ('abc'), ('bc') select c, (select count(*) from t2 where t1.c = substring(t2.c, 1, 1)) from t1 ``` Correct results: [(a, 2), (b, 1)] Spark results: ``` +---+-----------------+ |c |scalarsubquery(c)| +---+-----------------+ |a |1 | |a |1 | |b |1 | +---+-----------------+ ``` Example 2: ```sql create or replace view t1(a, b) as values (0, 6), (1, 5), (2, 4), (3, 3); create or replace view t2(c) as values (6); select c, (select count(*) from t1 where a + b = c) from t2; ``` Correct results: [(6, 4)] Spark results: ``` +---+-----------------+ |c |scalarsubquery(c)| +---+-----------------+ |6 |1 | |6 |1 | |6 |1 | |6 |1 | +---+-----------------+ ``` Yes. Users will not be able to run queries that contain unsupported correlated equality predicates. Added unit tests. Closes #32179 from allisonwang-db/spark-35080-subquery-bug. Lead-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit bad4b6f025de4946112a0897892a97d5ae6822cf) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 April 2021, 03:26:16 UTC
b436f5b [SPARK-35117][UI] Change progress bar back to highlight ratio of tasks in progress ### What changes were proposed in this pull request? Small UI update to highlight the number of tasks in progress in a stage/job instead of highlighting the whole in-progress stage/job. This was the behavior before Spark 3.1 and the Bootstrap 4 upgrade. ### Why are the changes needed? To add back functionality lost between 3.0 and 3.1. This provides a great visual cue of how much of a stage/job is currently being run. ### Does this PR introduce _any_ user-facing change? Small UI change. Before: ![image](https://user-images.githubusercontent.com/3536454/115216189-3fddaa00-a0d2-11eb-88e0-e3be925c92f0.png) After (and pre Spark 3.1): ![image](https://user-images.githubusercontent.com/3536454/115216216-48ce7b80-a0d2-11eb-9953-2adb3b377133.png) ### How was this patch tested? Updated existing UT. Closes #32214 from Kimahriman/progress-bar-started. Authored-by: Adam Binford <adamq43@gmail.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit e55ff83d775132f2c9c48a0b33e03e130abfa504) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 20 April 2021, 03:00:18 UTC
a2b5fb3 [SPARK-35136] Remove initial null value of LiveStage.info ### What changes were proposed in this pull request? To prevent potential NullPointerExceptions, this PR changes the `LiveStage` constructor to take `info` as a constructor parameter and adds a null check in `AppStatusListener.activeStages`. ### Why are the changes needed? `AppStatusListener.getOrCreateStage` would create a LiveStage object with the `info` field set to null and, right after that, set it to a specific StageInfo object. This can lead to a race condition when the `liveStages` are read in between those calls, which could then lead to a null pointer exception in, for instance, `AppStatusListener.activeStages`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Regular CI/CD tests Closes #32233 from sander-goos/SPARK-35136-livestage. Authored-by: Sander Goos <sander.goos@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 April 2021, 15:12:09 UTC
718f395 [SPARK-35045][SQL][FOLLOW-UP] Add a configuration for CSV input buffer size ### What changes were proposed in this pull request? This PR makes the input buffer configurable (as an internal configuration). This is mainly to work around the regression in uniVocity/univocity-parsers#449. This is particularly useful for SQL workloads that would otherwise require rewriting the `CREATE TABLE` with options. ### Why are the changes needed? To work around uniVocity/univocity-parsers#449. ### Does this PR introduce _any_ user-facing change? No, it's only an internal option. ### How was this patch tested? Manually tested by modifying the unit test added in https://github.com/apache/spark/pull/31858 as below: ```diff diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index fd25a79619d..705f38dbfbd 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala @@ -2456,6 +2456,7 @@ abstract class CSVSuite test("SPARK-34768: counting a long record with ignoreTrailingWhiteSpace set to true") { val bufSize = 128 val line = "X" * (bufSize - 1) + "| |" + spark.conf.set("spark.sql.csv.parser.inputBufferSize", 128) withTempPath { path => Seq(line).toDF.write.text(path.getAbsolutePath) assert(spark.read.format("csv") ``` Closes #32231 from HyukjinKwon/SPARK-35045-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 70b606ffdd28f94fe64a12e7e0eadbf7b212bd83) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 19 April 2021, 10:52:15 UTC
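A usage sketch for this knob, using the configuration name that appears in the diff above; it is an internal setting, so treat it as an implementation detail, and an active `SparkSession` named `spark` is assumed:

```scala
// Shrink the parser's input buffer as a workaround (internal setting; see the diff above).
spark.conf.set("spark.sql.csv.parser.inputBufferSize", "128")
val df = spark.read
  .format("csv")
  .option("delimiter", "|")
  .option("ignoreTrailingWhiteSpace", "true")
  .load("/path/to/data")   // path is illustrative
```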
71ab0c8 [SPARK-34225][CORE][FOLLOWUP] Replace Hadoop's Path with Utils.resolveURI to make the way to get URI simple ### What changes were proposed in this pull request? This PR proposes to replace Hadoop's `Path` with `Utils.resolveURI` to make the way to get URI simple in `SparkContext`. ### Why are the changes needed? Keep the code simple. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32164 from sarutak/followup-SPARK-34225. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 767ea86ecf60dd85a925ec5111f0b16dd931c1fe) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 April 2021, 05:24:44 UTC
3553ded [SPARK-34630][PYTHON][FOLLOWUP] Add __version__ into pyspark init __all__ ### What changes were proposed in this pull request? This patch adds `__version__` to pyspark.__init__.__all__ so that `__version__` is exported explicitly; see more in https://github.com/apache/spark/pull/32110#issuecomment-817331896 ### Why are the changes needed? 1. Make `__version__` exported explicitly. 2. Clean up `noqa: F401` on `__version__`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Python-related CI passed Closes #32125 from Yikun/SPARK-34629-Follow. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: zero323 <mszymkiewicz@gmail.com> (cherry picked from commit 31555f777971197f31279ce4cff22de11670255a) Signed-off-by: zero323 <mszymkiewicz@gmail.com> 14 April 2021, 21:37:35 UTC
ab4a1a6 [SPARK-34834][NETWORK] Fix a potential Netty memory leak in TransportResponseHandler ### What changes were proposed in this pull request? There is a potential Netty memory leak in TransportResponseHandler. ### Why are the changes needed? Fix a potential Netty memory leak in TransportResponseHandler. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? NO Closes #31942 from weixiuli/SPARK-34834. Authored-by: weixiuli <weixiuli@jd.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit bf9f3b884fcd6bd3428898581d4b5dca9bae6538) Signed-off-by: Sean Owen <srowen@gmail.com> 14 April 2021, 16:45:00 UTC
614a28a [SPARK-35002][INFRA][FOLLOW-UP] Use localhost instead of 127.0.0.1 at SPARK_LOCAL_IP in GA builds This PR replaces 127.0.0.1 with `localhost`. - https://github.com/apache/spark/pull/32096#discussion_r610349269 - https://github.com/apache/spark/pull/32096#issuecomment-816442481 No, dev-only. I didn't test it because it's a CI-specific issue. I will test it in the GitHub Actions build in this PR. Closes #32102 from HyukjinKwon/SPARK-35002. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit a3d1e00317d82bc2296a6f339b9604b5d6bc6754) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 14 April 2021, 10:45:49 UTC
06ce0ee [SPARK-35002][INFRA] Fix the java.net.BindException when testing with Github Action This PR tries to fix the `java.net.BindException` when testing with Github Action: ``` [info] org.apache.spark.sql.kafka010.producer.InternalKafkaProducerPoolSuite *** ABORTED *** (282 milliseconds) [info] java.net.BindException: Cannot assign requested address: Service 'sparkDriver' failed after 100 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address. [info] at sun.nio.ch.Net.bind0(Native Method) [info] at sun.nio.ch.Net.bind(Net.java:461) [info] at sun.nio.ch.Net.bind(Net.java:453) ``` https://github.com/apache/spark/pull/32090/checks?check_run_id=2295418529 Fix test framework. No. Test by Github Action. Closes #32096 from wangyum/SPARK_LOCAL_IP=localhost. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 9663c4061ae07634697345560c84359574a75692) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 14 April 2021, 10:42:36 UTC
23e3626 [SPARK-35002][YARN][TESTS][FOLLOW-UP] Fix java.net.BindException in MiniYARNCluster This PR fixes two tests below: https://github.com/apache/spark/runs/2320161984 ``` [info] YarnShuffleIntegrationSuite: [info] org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite *** ABORTED *** (228 milliseconds) [info] org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.webapp.WebAppException: Error starting http server [info] at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:373) [info] at org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:128) [info] at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:503) [info] at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) [info] at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) [info] at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:322) [info] at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) [info] at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:95) ... [info] Cause: java.net.BindException: Port in use: fv-az186-831:0 [info] at org.apache.hadoop.http.HttpServer2.constructBindException(HttpServer2.java:1231) [info] at org.apache.hadoop.http.HttpServer2.bindForSinglePort(HttpServer2.java:1253) [info] at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:1316) [info] at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:1167) [info] at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:449) [info] at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:1247) [info] at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1356) [info] at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) [info] at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:365) [info] at org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:128) [info] at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:503) [info] at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) [info] at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) [info] at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:322) [info] at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) [info] at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:95) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61) ... 
``` https://github.com/apache/spark/runs/2323342094 ``` [info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadSecret started [error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadSecret failed: java.lang.AssertionError: Connecting to /10.1.0.161:39895 timed out (120000 ms), took 120.081 sec [error] at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadSecret(ExternalShuffleSecuritySuite.java:85) [error] ... [info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadAppId started [error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadAppId failed: java.lang.AssertionError: Connecting to /10.1.0.198:44633 timed out (120000 ms), took 120.08 sec [error] at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadAppId(ExternalShuffleSecuritySuite.java:76) [error] ... [info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testValid started [error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testValid failed: java.io.IOException: Connecting to /10.1.0.119:43575 timed out (120000 ms), took 120.089 sec [error] at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:285) [error] at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218) [error] at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230) [error] at org.apache.spark.network.shuffle.ExternalBlockStoreClient.registerWithShuffleServer(ExternalBlockStoreClient.java:211) [error] at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.validate(ExternalShuffleSecuritySuite.java:108) [error] at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testValid(ExternalShuffleSecuritySuite.java:68) [error] ... [info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption started [error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption failed: java.io.IOException: Connecting to /10.1.0.248:35271 timed out (120000 ms), took 120.014 sec [error] at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:285) [error] at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218) [error] at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230) [error] at org.apache.spark.network.shuffle.ExternalBlockStoreClient.registerWithShuffleServer(ExternalBlockStoreClient.java:211) [error] at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.validate(ExternalShuffleSecuritySuite.java:108) [error] at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption(ExternalShu ``` For YARN cluster suites, it's difficult to fix; this PR makes them skipped if they fail to bind. For shuffle-related suites, it uses localhost. To make the tests stable. No, dev-only. It's tested in GitHub Actions: https://github.com/HyukjinKwon/spark/runs/2340210765 Closes #32126 from HyukjinKwon/SPARK-35002-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit a153efa643dcb1d8e6c2242846b3db0b2be39ae7) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 14 April 2021, 10:35:21 UTC
a93a575 [MINOR][PYTHON][DOCS] Fix docstring for pyspark.sql.DataFrameWriter.json lineSep param ### What changes were proposed in this pull request? Add a new line to the `lineSep` parameter so that the doc renders correctly. ### Why are the changes needed? > <img width="608" alt="image" src="https://user-images.githubusercontent.com/8269566/114631408-5c608900-9c71-11eb-8ded-ae1e21ae48b2.png"> The first line of the description is part of the signature and is **bolded**. ### Does this PR introduce _any_ user-facing change? Yes, it changes how the docs for `pyspark.sql.DataFrameWriter.json` are rendered. ### How was this patch tested? I didn't test it; I don't have the doc rendering tool chain on my machine, but the change is obvious. Closes #32153 from AlexMooney/patch-1. Authored-by: Alex Mooney <alexmooney@fastmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit faa928cefc8c1c6d7771aacd2ae7670162346361) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 14 April 2021, 04:14:59 UTC
bd03050 [SPARK-35045][SQL] Add an internal option to control input buffer in univocity ### What changes were proposed in this pull request? This PR makes the input buffer configurable (as an internal option). This is mainly to work around uniVocity/univocity-parsers#449. ### Why are the changes needed? To work around uniVocity/univocity-parsers#449. ### Does this PR introduce _any_ user-facing change? No, it's only internal option. ### How was this patch tested? Manually tested by modifying the unittest added in https://github.com/apache/spark/pull/31858 as below: ```diff diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index fd25a79619d..b58f0bd3661 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -2460,6 +2460,7 abstract class CSVSuite Seq(line).toDF.write.text(path.getAbsolutePath) assert(spark.read.format("csv") .option("delimiter", "|") + .option("inputBufferSize", "128") .option("ignoreTrailingWhiteSpace", "true").load(path.getAbsolutePath).count() == 1) } } ``` Closes #32145 from HyukjinKwon/SPARK-35045. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 1f562159bf61dd5e536db7841b16e74a635e7a97) Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 April 2021, 12:08:17 UTC
c71126a [SPARK-35014] Fix the PhysicalAggregation pattern to not rewrite foldable expressions ### What changes were proposed in this pull request? Fix PhysicalAggregation to not transform a foldable expression. ### Why are the changes needed? It can potentially break certain queries like the added unit test shows. ### Does this PR introduce _any_ user-facing change? Yes, it fixes undesirable errors caused by a returned TypeCheckFailure from places like RegExpReplace.checkInputDataTypes. Closes #32113 from sigmod/foldable. Authored-by: Yingyi Bu <yingyi.bu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9cd25b46b9d1de0c7cdecdabd8cf37b25ec2d78a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 April 2021, 11:58:06 UTC
4abe6d6 [SPARK-34212][SQL][FOLLOWUP] Move the added test to ParquetQuerySuite This pr moves the added test from `SQLQuerySuite` to `ParquetQuerySuite`. 1. It can be tested by `ParquetV1QuerySuite` and `ParquetV2QuerySuite`. 2. Reduce the testing time of `SQLQuerySuite`(SQLQuerySuite ~ 3 min 17 sec, ParquetV1QuerySuite ~ 27 sec). No. Unit test. Closes #32090 from wangyum/SPARK-34212. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 April 2021, 09:20:47 UTC
edb1abf [SPARK-35019][PYTHON][SQL] Fix type hints mismatches in pyspark.sql.* ### What changes were proposed in this pull request? Fix type hints mismatches in pyspark.sql.* ### Why are the changes needed? There were some mismatches in pyspark.sql.* ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? dev/lint-python passed. Closes #32122 from Yikun/SPARK-35019. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b43f7e6a974cac5aae401224c14b870c18fbded8) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 13 April 2021, 02:21:23 UTC
bfc1289 [SPARK-34926][SQL][3.1] PartitioningUtils.getPathFragment() should respect partition value is null ### What changes were proposed in this pull request? When we insert an empty DataFrame into a partition of a partitioned table, we also call `PartitioningUtils.getPathFragment()` to update that partition's metadata. When the partition value is `null`, it throws an exception like ``` [info] java.lang.NullPointerException: [info] at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51) [info] at scala.collection.immutable.StringOps.length(StringOps.scala:51) [info] at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:35) [info] at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) [info] at scala.collection.immutable.StringOps.foreach(StringOps.scala:33) [info] at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.escapePathName(ExternalCatalogUtils.scala:69) [info] at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.getPartitionValueString(ExternalCatalogUtils.scala:126) [info] at org.apache.spark.sql.execution.datasources.PartitioningUtils$.$anonfun$getPathFragment$1(PartitioningUtils.scala:354) [info] at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) [info] at scala.collection.Iterator.foreach(Iterator.scala:941) [info] at scala.collection.Iterator.foreach$(Iterator.scala:941) [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) [info] at scala.collection.IterableLike.foreach(IterableLike.scala:74) [info] at scala.collection.IterableLike.foreach$(IterableLike.scala:73) ``` `PartitioningUtils.getPathFragment()` should support a `null` value too. ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #32127 from AngersZhuuuu/SPARK-34926-3.1. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 12 April 2021, 13:47:49 UTC
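A hedged sketch of the idea behind the fix; the `escape` function below is a stand-in, not Spark's `ExternalCatalogUtils`: a `null` partition value should map to the Hive-style default partition name instead of being escaped directly, which is what triggered the NPE.

```scala
// "__HIVE_DEFAULT_PARTITION__" is the conventional placeholder for a null partition value.
val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

// `escape` stands in for real path escaping; the point is the null/empty guard
// that avoids calling the escaper on a null value.
def partitionValueString(value: String, escape: String => String): String =
  if (value == null || value.isEmpty) DefaultPartitionName else escape(value)
```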
b5adfaa [SPARK-34630][PYTHON] Add typehint for pyspark.__version__ ### What changes were proposed in this pull request? This PR adds the typehint of pyspark.__version__, which was mentioned in [SPARK-34630](https://issues.apache.org/jira/browse/SPARK-34630). ### Why are the changes needed? There was some short discussion in https://github.com/apache/spark/pull/31823#discussion_r593830911 . After further investigation of [1][2], we can see that `pyspark.__version__` is added by [setup.py](https://github.com/apache/spark/blob/c06758834e6192b1888b4a885c612a151588b390/python/setup.py#L201); it embeds `__version__` into the pyspark module, which means `__init__.pyi` is the right place to add the typehint for `__version__`. So, this patch adds the type hint `__version__` in pyspark/__init__.pyi. [1] [PEP-396 Module Version Numbers](https://www.python.org/dev/peps/pep-0396/) [2] https://packaging.python.org/guides/single-sourcing-package-version/ ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 1. Disable the ignore_error on https://github.com/apache/spark/blob/ee7bf7d9628384548d09ac20e30992dd705c20f4/python/mypy.ini#L132 2. Run mypy: - Before fix ```shell (venv) ➜ spark git:(SPARK-34629) ✗ mypy --config-file python/mypy.ini python/pyspark | grep version python/pyspark/pandas/spark/accessors.py:884: error: Module has no attribute "__version__" ``` - After fix ```shell (venv) ➜ spark git:(SPARK-34629) ✗ mypy --config-file python/mypy.ini python/pyspark | grep version ``` no output Closes #32110 from Yikun/SPARK-34629. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 4c1ccdabe8d5c0012ce481822c07beaf8e7ead0c) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 11 April 2021, 01:40:21 UTC
a4e49cf [MINOR][SS][DOC] Fix wrong Python code sample ### What changes were proposed in this pull request? This patch fixes a wrong Python code sample in the docs. ### Why are the changes needed? The sample code is wrong. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Doc only. Closes #32119 from Hisssy/ss-doc-typo-1. Authored-by: hissy <aozora@live.cn> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 214a46aa88cd682874584dc407ad130a30761884) Signed-off-by: Max Gekk <max.gekk@gmail.com> 10 April 2021, 09:33:47 UTC
abc9a4f [SPARK-34963][SQL] Fix nested column pruning for extracting case-insensitive struct field from array of struct ### What changes were proposed in this pull request? This patch proposes a fix of nested column pruning for extracting case-insensitive struct field from array of struct. ### Why are the changes needed? Under case-insensitive mode, nested column pruning rule cannot correctly push down extractor of a struct field of an array of struct, e.g., ```scala val query = spark.table("contacts").select("friends.First", "friends.MiDDle") ``` Error stack: ``` [info] java.lang.IllegalArgumentException: Field "First" does not exist. [info] Available fields: [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274) [info] at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) [info] at scala.collection.AbstractMap.getOrElse(Map.scala:59) [info] at org.apache.spark.sql.types.StructType.apply(StructType.scala:273) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:44) [info] at org.apache.spark.sql.execution.ProjectionOverSchema$$anonfun$getProjection$3.apply(ProjectionOverSchema.scala:41) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #32059 from viirya/fix-array-nested-pruning. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 364d1eaf10f51c357f507325557fb076140ced2c) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 09 April 2021, 18:53:31 UTC
c5ba6bb [SPARK-35004][TEST] Fix Incorrect assertion of `master/worker web ui available behind front-end reverseProxy` in MasterSuite ### What changes were proposed in this pull request? Line 425 in `MasterSuite` is considered as unused expression by Intellij IDE, https://github.com/apache/spark/blob/bfba7fadd2e65c853971fb2983bdea1c52d1ed7f/core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala#L421-L426 If we merge lines 424 and 425 into one as: ``` System.getProperty("spark.ui.proxyBase") should startWith (s"$reverseProxyUrl/proxy/worker-") ``` this assertion will fail: ``` - master/worker web ui available behind front-end reverseProxy *** FAILED *** The code passed to eventually never returned normally. Attempted 45 times over 5.091914027 seconds. Last failure message: "http://proxyhost:8080/path/to/spark" did not start with substring "http://proxyhost:8080/path/to/spark/proxy/worker-". (MasterSuite.scala:405) ``` `System.getProperty("spark.ui.proxyBase")` should be `reverseProxyUrl` because `Master#onStart` and `Worker#handleRegisterResponse` have not changed it. So the main purpose of this pr is to fix the condition of this assertion. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test: 1. merge lines 424 and 425 in `MasterSuite` into one to eliminate the unused expression: ``` System.getProperty("spark.ui.proxyBase") should startWith (s"$reverseProxyUrl/proxy/worker-") ``` 2. execute `mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.deploy.master.MasterSuite` **Before** ``` - master/worker web ui available behind front-end reverseProxy *** FAILED *** The code passed to eventually never returned normally. Attempted 45 times over 5.091914027 seconds. Last failure message: "http://proxyhost:8080/path/to/spark" did not start with substring "http://proxyhost:8080/path/to/spark/proxy/worker-". (MasterSuite.scala:405) Run completed in 1 minute, 14 seconds. Total number of tests run: 32 Suites: completed 2, aborted 0 Tests: succeeded 31, failed 1, canceled 0, ignored 0, pending 0 *** 1 TEST FAILED *** ``` **After** ``` Run completed in 1 minute, 11 seconds. Total number of tests run: 32 Suites: completed 2, aborted 0 Tests: succeeded 32, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #32105 from LuciferYang/SPARK-35004. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Gengliang Wang <ltnwgl@gmail.com> (cherry picked from commit c06758834e6192b1888b4a885c612a151588b390) Signed-off-by: Gengliang Wang <ltnwgl@gmail.com> 09 April 2021, 13:19:12 UTC
44ee132 Revert "[SPARK-34674][CORE][K8S] Close SparkContext after the Main method has finished" This reverts commit c625eb4f9f970108d93bf3342c7ccb7ec873dc27. 09 April 2021, 04:56:29 UTC
c625eb4 [SPARK-34674][CORE][K8S] Close SparkContext after the Main method has finished ### What changes were proposed in this pull request? Close SparkContext after the Main method has finished, to allow SparkApplication on K8S to complete ### Why are the changes needed? if I don't call the method sparkContext.stop() explicitly, then a Spark driver process doesn't terminate even after its Main method has been completed. This behaviour is different from spark on yarn, where the manual sparkContext stopping is not required. It looks like, the problem is in using non-daemon threads, which prevent the driver jvm process from terminating. So I have inserted code that closes sparkContext automatically. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually on the production AWS EKS environment in my company. Closes #32081 from kotlovs/close-spark-context-on-exit. Authored-by: skotlov <skotlov@joom.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit ab97db75b2ab3ebb4d527610e2801df89cc23e2d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 April 2021, 23:51:49 UTC
4a6b13b [SPARK-34988][CORE][3.1] Upgrade Jetty for CVE-2021-28165 ### What changes were proposed in this pull request? This PR backports #32091. This PR upgrades the version of Jetty to 9.4.39. ### Why are the changes needed? CVE-2021-28165 affects the version of Jetty that Spark uses and it seems to be a little bit serious. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28165 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32095 from sarutak/SPARK-34988-branch-3.1. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Sean Owen <srowen@gmail.com> 08 April 2021, 14:08:38 UTC
84d96e8 [SPARK-34922][SQL][3.1] Use a relative cost comparison function in the CBO ### What changes were proposed in this pull request? Changed the cost comparison function of the CBO to use the ratios of row counts and sizes in bytes. ### Why are the changes needed? In #30965 we changed the CBO cost comparison function so it would be "symmetric": `A.betterThan(B)` now implies that `!B.betterThan(A)`. With that we caused performance regressions in some queries, TPCDS q19 for example. The original cost comparison function used the ratios `relativeRows = A.rowCount / B.rowCount` and `relativeSize = A.size / B.size`. The changed function compared "absolute" cost values `costA = w*A.rowCount + (1-w)*A.size` and `costB = w*B.rowCount + (1-w)*B.size`. Given the input from wzhfy we decided to go back to the relative values, because otherwise one (size) may overwhelm the other (rowCount). But this time we avoid adding up the ratios. Originally `A.betterThan(B) => w*relativeRows + (1-w)*relativeSize < 1` was used. Besides being "non-symmetric", this can also exhibit one overwhelming the other. For `w=0.5`, if the size (bytes) of `A` is at least 2x larger than that of `B`, then no matter how many times more rows the `B` plan has, `B` will always be considered to be better: `0.5*2 + 0.5*0.00000000000001 > 1`. When working with ratios, it is better to multiply them. The proposed cost comparison function is: `A.betterThan(B) => relativeRows^w * relativeSize^(1-w) < 1`. ### Does this PR introduce _any_ user-facing change? Comparison of the changed TPCDS v1.4 query execution times at sf=10:
|   | absolute | multiplicative |   | additive |   |
| -- | -- | -- | -- | -- | -- |
| q12 | 145 | 137 | -5.52% | 141 | -2.76% |
| q13 | 264 | 271 | 2.65% | 271 | 2.65% |
| q17 | 4521 | 4243 | -6.15% | 4348 | -3.83% |
| q18 | 758 | 466 | -38.52% | 480 | -36.68% |
| q19 | 38503 | 2167 | -94.37% | 2176 | -94.35% |
| q20 | 119 | 120 | 0.84% | 126 | 5.88% |
| q24a | 16429 | 16838 | 2.49% | 17103 | 4.10% |
| q24b | 16592 | 16999 | 2.45% | 17268 | 4.07% |
| q25 | 3558 | 3556 | -0.06% | 3675 | 3.29% |
| q33 | 362 | 361 | -0.28% | 380 | 4.97% |
| q52 | 1020 | 1032 | 1.18% | 1052 | 3.14% |
| q55 | 927 | 938 | 1.19% | 961 | 3.67% |
| q72 | 24169 | 13377 | -44.65% | 24306 | 0.57% |
| q81 | 1285 | 1185 | -7.78% | 1168 | -9.11% |
| q91 | 324 | 336 | 3.70% | 337 | 4.01% |
| q98 | 126 | 129 | 2.38% | 131 | 3.97% |
All times are in ms; the change is compared to the situation in the master branch (absolute). The proposed cost function (multiplicative) significantly improves the performance on q18, q19 and q72. The original cost function (additive) has similar improvements at q18 and q19. All other changes are within the error bars and I would ignore them; perhaps q81 has also improved. ### How was this patch tested? PlanStabilitySuite Closes #32075 from tanelk/SPARK-34922_cbo_better_cost_function_3.1. Lead-authored-by: Tanel Kiis <tanel.kiis@gmail.com> Co-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 08 April 2021, 02:01:53 UTC
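A minimal sketch of the multiplicative comparison described above, with illustrative names rather than Spark's internal cost classes:

```scala
// A plan A is "better" than plan B when relativeRows^w * relativeSize^(1-w) < 1.
final case class Cost(rowCount: Double, size: Double)

def betterThan(a: Cost, b: Cost, w: Double): Boolean = {
  // Assumes b.rowCount and b.size are positive, as plan statistics normally are.
  val relativeRows = a.rowCount / b.rowCount
  val relativeSize = a.size / b.size
  math.pow(relativeRows, w) * math.pow(relativeSize, 1 - w) < 1
}
```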
f6b5c6f [SPARK-34970][SQL][SERCURITY][3.1] Redact map-type options in the output of explain() ### What changes were proposed in this pull request? The `explain()` method prints the arguments of tree nodes in logical/physical plans. The arguments could contain a map-type option that contains sensitive data. We should redact map-type options in the output of `explain()`. Otherwise, we will see sensitive data in the explain output or the Spark UI. ![image](https://user-images.githubusercontent.com/1097932/113719178-326ffb00-96a2-11eb-8a2c-28fca3e72941.png) ### Why are the changes needed? Data security. ### Does this PR introduce _any_ user-facing change? Yes, the map-type options are redacted in the output of `explain()`. ### How was this patch tested? Unit tests Closes #32079 from gengliangwang/PR_TOOL_PICK_PR_32066_BRANCH-3.1. Authored-by: Gengliang Wang <ltnwgl@gmail.com> Signed-off-by: Gengliang Wang <ltnwgl@gmail.com> 07 April 2021, 16:56:22 UTC
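A hedged sketch of the redaction idea, not the exact Spark helper: mask the values of option keys that match a redaction pattern before the options are rendered into the explain string. Spark's own redaction is driven by the `spark.redaction.regex` configuration.

```scala
import scala.util.matching.Regex

// Mask option values whose keys match the redaction pattern (e.g. "(?i)secret|password|token")
// before the options map is rendered into the explain string.
def redactOptions(options: Map[String, String], redactionPattern: Regex): Map[String, String] =
  options.map { case (key, value) =>
    if (redactionPattern.findFirstIn(key).isDefined) key -> "*********(redacted)" else key -> value
  }
```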
eefef50 [SPARK-34965][BUILD] Remove .sbtopts that duplicately sets the default memory ### What changes were proposed in this pull request? This PR removes `.sbtopts` (added in https://github.com/apache/spark/pull/29286) that duplicately sets the default memory. The default memories are set: https://github.com/apache/spark/blob/3b634f66c3e4a942178a1e322ae65ce82779625d/build/sbt-launch-lib.bash#L119-L124 ### Why are the changes needed? This file disables the memory option from the `build/sbt` script: ```bash ./build/sbt -mem 6144 ``` ``` .../jdk-11.0.3.jdk/Contents/Home as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. Error occurred during initialization of VM Initial heap size set to a larger value than the maximum heap size ``` because it adds these memory options at the last: ```bash /.../bin/java -Xms6144m -Xmx6144m -XX:ReservedCodeCacheSize=256m -Xmx4G -Xss4m -jar build/sbt-launch-1.5.0.jar ``` and Java respects the rightmost memory configurations. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually ran SBT. It will be tested in the CIs in this Pr. Closes #32062 from HyukjinKwon/SPARK-34965. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 2eda1c65ea122b21c75d19bd3a6c58c905cb7304) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 April 2021, 22:17:05 UTC
5dc6b10 [SPARK-34923][SQL] Metadata output should be empty for more plans Changes the metadata propagation framework. Previously, most `LogicalPlan`'s propagated their `children`'s `metadataOutput`. This did not make sense in cases where the `LogicalPlan` did not even propagate their `children`'s `output`. I set the metadata output for plans that do not propagate their `children`'s `output` to be `Nil`. Notably, `Project` and `View` no longer have metadata output. Previously, `SELECT m from (SELECT a from tb)` would output `m` if it were metadata. This did not make sense. Yes. Now, `SELECT m from (SELECT a from tb)` will encounter an `AnalysisException`. Added unit tests. I did not cover all cases, as they are fairly extensive. However, the new tests cover major cases (and an existing test already covers Join). Closes #32017 from karenfeng/spark-34923. Authored-by: Karen Feng <karen.feng@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3b634f66c3e4a942178a1e322ae65ce82779625d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 April 2021, 08:06:37 UTC
96f981b [SPARK-34949][CORE] Prevent BlockManager reregister when Executor is shutting down ### What changes were proposed in this pull request? This PR prevents reregistering BlockManager when a Executor is shutting down. It is achieved by checking `executorShutdown` before calling `env.blockManager.reregister()`. ### Why are the changes needed? This change is required since Spark reports executors as active, even they are removed. I was testing Dynamic Allocation on K8s with about 300 executors. While doing so, when the executors were torn down due to `spark.dynamicAllocation.executorIdleTimeout`, I noticed all the executor pods being removed from K8s, however, under the "Executors" tab in SparkUI, I could see some executors listed as alive. [spark.sparkContext.statusTracker.getExecutorInfos.length](https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L105) also returned a value greater than 1. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test. ## Logs Following are the logs of the executor(Id:303) which re-registers `BlockManager` ``` 21/04/02 21:33:28 INFO CoarseGrainedExecutorBackend: Got assigned task 1076 21/04/02 21:33:28 INFO Executor: Running task 4.0 in stage 3.0 (TID 1076) 21/04/02 21:33:28 INFO MapOutputTrackerWorker: Updating epoch to 302 and clearing cache 21/04/02 21:33:28 INFO TorrentBroadcast: Started reading broadcast variable 3 21/04/02 21:33:28 INFO TransportClientFactory: Successfully created connection to /100.100.195.227:33703 after 76 ms (62 ms spent in bootstraps) 21/04/02 21:33:28 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.4 KB, free 168.0 MB) 21/04/02 21:33:28 INFO TorrentBroadcast: Reading broadcast variable 3 took 168 ms 21/04/02 21:33:28 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.9 KB, free 168.0 MB) 21/04/02 21:33:29 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them 21/04/02 21:33:29 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTrackerda-lite-test-4-7a57e478947d206d-driver-svc.dex-app-n5ttnbmg.svc:7078) 21/04/02 21:33:29 INFO MapOutputTrackerWorker: Got the output locations 21/04/02 21:33:29 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 1 local blocks and 1 remote blocks 21/04/02 21:33:30 INFO TransportClientFactory: Successfully created connection to /100.100.80.103:40971 after 660 ms (528 ms spent in bootstraps) 21/04/02 21:33:30 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 1042 ms 21/04/02 21:33:31 INFO Executor: Finished task 4.0 in stage 3.0 (TID 1076). 1276 bytes result sent to driver . . . 21/04/02 21:34:16 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown 21/04/02 21:34:16 INFO Executor: Told to re-register on heartbeat 21/04/02 21:34:16 INFO BlockManager: BlockManager BlockManagerId(303, 100.100.122.34, 41265, None) re-registering with master 21/04/02 21:34:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(303, 100.100.122.34, 41265, None) 21/04/02 21:34:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(303, 100.100.122.34, 41265, None) 21/04/02 21:34:16 INFO BlockManager: Reporting 0 blocks to the master. 
21/04/02 21:34:16 INFO MemoryStore: MemoryStore cleared 21/04/02 21:34:16 INFO BlockManager: BlockManager stopped 21/04/02 21:34:16 INFO FileDataSink: Closing sink with output file = /tmp/safari-events/.des_analysis/safari-events/hdp_spark_monitoring_random-container-037caf27-6c77-433f-820f-03cd9c7d9b6e-spark-8a492407d60b401bbf4309a14ea02ca2_events.tsv 21/04/02 21:34:16 INFO HonestProfilerBasedThreadSnapshotProvider: Stopping agent 21/04/02 21:34:16 INFO HonestProfilerHandler: Stopping honest profiler agent 21/04/02 21:34:17 INFO ShutdownHookManager: Shutdown hook called 21/04/02 21:34:17 INFO ShutdownHookManager: Deleting directory /var/data/spark-d886588c-2a7e-491d-bbcb-4f58b3e31001/spark-4aa337a0-60c0-45da-9562-8c50eaff3cea ``` Closes #32043 from sumeetgajjar/SPARK-34949. Authored-by: Sumeet Gajjar <sumeetgajjar93@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit a9ca1978ae8ecc53e2ef9e14b4be70dc8f5d9341) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 05 April 2021, 22:33:30 UTC
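The guard described in SPARK-34949 above boils down to checking a shutdown flag before honoring the driver's re-register request. The sketch below is illustrative only; the class and member names are assumptions, not Spark's actual `Executor` internals.

```
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative stand-in for the executor-side heartbeat handling.
class HeartbeatHandler(reregisterBlockManager: () => Unit) {
  private val executorShutdown = new AtomicBoolean(false)

  def markShuttingDown(): Unit = executorShutdown.set(true)

  // Called when the driver's heartbeat response asks for re-registration.
  def onReregisterRequest(): Unit = {
    if (executorShutdown.get()) {
      // The executor is going away; re-registering now would make the driver
      // (and the UI) keep listing it as alive.
      println("Skipping BlockManager re-registration: executor is shutting down")
    } else {
      reregisterBlockManager()
    }
  }
}

object HeartbeatHandlerDemo extends App {
  val handler = new HeartbeatHandler(() => println("re-registered"))
  handler.onReregisterRequest() // re-registers
  handler.markShuttingDown()
  handler.onReregisterRequest() // skipped
}
```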
9cf137e [SPARK-34951][INFRA][PYTHON][TESTS] Set the system encoding as UTF-8 to recover the Sphinx build in GitHub Actions This PR proposes to set the system encoding to UTF-8. For some reason, GitHub Actions machines appear to have changed their default encoding to ASCII. This makes default encoding/decoding in Python use ASCII, e.g. `"a".encode()`, and Sphinx looks like it depends on that. The change is needed to recover the GitHub Actions build. No user-facing change, dev-only. Tested in https://github.com/apache/spark/pull/32046 Closes #32047 from HyukjinKwon/SPARK-34951. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 82ad2f9dff9b8d9cb8b6b166f14311f6066681c2) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 04 April 2021, 07:23:44 UTC
7375da2 [SPARK-34939][CORE] Throw fetch failure exception when unable to deserialize broadcasted map statuses ### What changes were proposed in this pull request? This patch catches the `IOException` that can be thrown while deserializing map statuses (e.g., when the broadcasted value has been destroyed). Once the `IOException` is caught, a `MetadataFetchFailedException` is thrown to let Spark handle it. ### Why are the changes needed? One customer encountered an application error. From the log, it is caused by accessing a non-existing broadcasted value; the broadcasted value is the map statuses. E.g., ``` [info] Cause: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_0_piece0 of broadcast_0 [info] at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1410) [info] at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:226) [info] at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103) [info] at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) [info] at org.apache.spark.MapOutputTracker$.$anonfun$deserializeMapStatuses$3(MapOutputTracker.scala:967) [info] at org.apache.spark.internal.Logging.logInfo(Logging.scala:57) [info] at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56) [info] at org.apache.spark.MapOutputTracker$.logInfo(MapOutputTracker.scala:887) [info] at org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:967) ``` There is a race condition. After map statuses are broadcasted, the executors obtain the serialized broadcasted map statuses. If any fetch failure happens afterwards, the Spark scheduler invalidates the cached map statuses and destroys the broadcasted value of the map statuses. Then, for any executor trying to deserialize the serialized broadcasted map statuses and access the broadcasted value, an `IOException` will be thrown. Currently we don't catch it in `MapOutputTrackerWorker`, and the above exception will fail the application. Normally we should throw a fetch failure exception for such a case, and the Spark scheduler will handle it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #32033 from viirya/fix-broadcast-master. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 571acc87fef6ddf8a6046bf710d5065dc02d76bd) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 April 2021, 01:38:00 UTC
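A minimal sketch of the catch-and-rethrow pattern from SPARK-34939, using placeholder types so it is self-contained; Spark's real code throws `MetadataFetchFailedException` from `MapOutputTracker`, which is not reproduced here.

```
import java.io.IOException

// Placeholder for Spark's MetadataFetchFailedException, defined locally so the
// sketch compiles on its own.
class MapStatusFetchFailure(msg: String, cause: Throwable) extends RuntimeException(msg, cause)

object MapStatusDeserializeSketch extends App {
  // Stand-in for reading the broadcasted, serialized map statuses; the broadcast
  // may already have been destroyed by the driver after a fetch failure.
  def readBroadcastBytes(): Array[Byte] =
    throw new IOException("Failed to get broadcast_0_piece0 of broadcast_0")

  def deserializeMapStatuses(): Seq[String] =
    try {
      val bytes = readBroadcastBytes()
      // Real code would deserialize `bytes` into map statuses here.
      Seq(s"${bytes.length} bytes")
    } catch {
      case e: IOException =>
        // Surface the problem as a fetch failure so the scheduler recomputes the
        // map outputs instead of letting the IOException fail the application.
        throw new MapStatusFetchFailure("Unable to deserialize broadcasted map statuses", e)
    }

  try deserializeMapStatuses()
  catch { case e: MapStatusFetchFailure => println(s"handled as fetch failure: ${e.getMessage}") }
}
```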
71c2f15 [SPARK-34948][K8S] Add ownerReference to executor configmap to fix leakages This PR aims to add `ownerReference` to the executor ConfigMap to fix leakage. SPARK-30985 maintains the executor config map explicitly inside Spark. However, this config map can be leaked when Spark drivers die accidentally or are killed by K8s. We need to add `ownerReference` so that K8s garbage-collects these automatically. The number of ConfigMaps is one of the resource quotas, so the leaked ConfigMaps currently cause Spark job submission failures. No user-facing change. Pass the CIs and check manually; the K8s IT is tested manually. ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - PVs with local storage - Launcher client dependencies - SPARK-33615: Launcher client archives - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python - Launcher python client dependencies using a zip file - Test basic decommissioning - Test basic decommissioning with shuffle cleanup - Test decommissioning with dynamic allocation & shuffle cleanups - Test decommissioning timeouts - Run SparkR on simple dataframe.R example Run completed in 19 minutes, 2 seconds. Total number of tests run: 27 Suites: completed 2, aborted 0 Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` **BEFORE** ``` $ k get cm spark-exec-450b417895b3b2c7-conf-map -oyaml | grep ownerReferences ``` **AFTER** ``` $ k get cm spark-exec-bb37a27895b1c26c-conf-map -oyaml | grep ownerReferences f:ownerReferences: ``` Closes #32042 from dongjoon-hyun/SPARK-34948. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a42dc93a2abf9490d68146b3586aec7fe2f9c102) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 April 2021, 07:01:53 UTC
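A sketch of what attaching the `ownerReference` looks like with the fabric8 Kubernetes client that Spark's K8s backend uses; the helper name and surrounding structure are assumptions, and builder method names may differ slightly between fabric8 versions, so treat this as illustrative rather than Spark's exact code.

```
import io.fabric8.kubernetes.api.model.{ConfigMap, OwnerReferenceBuilder, Pod}
import java.util.Collections

object OwnerReferenceSketch {
  // Make the driver pod the owner of the executor ConfigMap so that K8s
  // garbage-collects the ConfigMap when the driver pod is deleted.
  def addDriverOwnerReference(driverPod: Pod, configMap: ConfigMap): Unit = {
    val ref = new OwnerReferenceBuilder()
      .withApiVersion(driverPod.getApiVersion)
      .withKind(driverPod.getKind)
      .withName(driverPod.getMetadata.getName)
      .withUid(driverPod.getMetadata.getUid)
      .withController(true)
      .build()
    configMap.getMetadata.setOwnerReferences(Collections.singletonList(ref))
  }
}
```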
12d39d5 [SPARK-34940][SQL][TEST] Fix test of BasicWriteTaskStatsTrackerSuite ### What changes were proposed in this pull request? This is to fix the minor typo in unit test of BasicWriteTaskStatsTrackerSuite (https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala#L152 ), where it should be a new file name, e.g. `f-3-3`, because the unit test expects 3 files in statistics (https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala#L160 ). ### Why are the changes needed? Fix minor bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Changed unit test `"Three files, last one empty"` itself. Closes #32034 from c21/tracker-fix. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 280a2f359c55d8c0563ebbf54a3196ffeb65560e) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 02 April 2021, 06:51:32 UTC
3b57a66 [SPARK-34933][DOC][SQL] Remove the description that || and && can be used as logical operators from the document ### What changes were proposed in this pull request? This PR removes the description that `||` and `&&` can be used as logical operators from the migration guide. ### Why are the changes needed? At the `Compatibility with Apache Hive` section in the migration guide, it describes that `||` and `&&` can be used as logical operators. But, in fact, they cannot be used as described. AFAIK, Hive also doesn't support `&&` and `||` as logical operators. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I confirmed that `&&` and `||` cannot be used as logical operators with both Hive's interactive shell and `spark-sql`. I also built the modified document and confirmed that the modified document doesn't break layout. Closes #32023 from sarutak/modify-hive-compatibility-doc. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 8724f2b8b746f5742e8598ae33492e65fd4acc7c) Signed-off-by: Sean Owen <srowen@gmail.com> 01 April 2021, 22:14:54 UTC
8c9b942 [SPARK-24931][INFRA] Fix the GA failure related to R linter ### What changes were proposed in this pull request? This PR fixes the GA failure related to the R linter which happens on some PRs (e.g. #32023, #32025). The reason seems to be that `Rscript -e "devtools::install_github('jimhester/lintr@v2.0.0')"` fails to download `lintr@v2.0.0`. I don't know why, but I confirmed we can download `v2.0.1`. ### Why are the changes needed? To keep GA healthy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA itself. Closes #32028 from sarutak/hotfix-r. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f99a831dabfe043aa391be660f1404b657d0ca77) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 April 2021, 19:14:48 UTC
b3a27fb [SPARK-34915][INFRA] Cache Maven, SBT and Scala in all jobs that use them This PR proposes to cache Maven, SBT and Scala in all jobs that use them. For simplicity, we use the same key `build-` and just cache all SBT, Maven and Scala. The cache is not very large. To speed up the build. No, dev-only. It will be tested in this PR's GA jobs. Closes #32011 from HyukjinKwon/SPARK-34915. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Gengliang Wang <ltnwgl@gmail.com> (cherry picked from commit 48ef9bd2b3a0cb52aaa31a3fada8779b7a7b9132) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 31 March 2021, 11:21:03 UTC
557be5f [SPARK-34909][SQL] Fix conversion of negative to unsigned in conv() ### What changes were proposed in this pull request? Use `java.lang.Long.divideUnsigned()` to do integer division in `NumberConverter` to avoid a bug in `unsignedLongDiv` that produced invalid results. ### Why are the changes needed? The previous results are incorrect; the result of the query below should be 45012021522523134134555: ``` scala> spark.sql("select conv('-10', 11, 7)").show(20, 150) +-----------------------+ | conv(-10, 11, 7)| +-----------------------+ |4501202152252313413456| +-----------------------+ scala> spark.sql("select hex(conv('-10', 11, 7))").show(20, 150) +----------------------------------------------+ | hex(conv(-10, 11, 7))| +----------------------------------------------+ |3435303132303231353232353233313334313334353600| +----------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? `conv()` will produce different results because the bug is fixed. ### How was this patch tested? Added a simple unit test. Closes #32006 from timarmstrong/conv-unsigned. Authored-by: Tim Armstrong <tim.armstrong@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 13b255fefd881beb68fd8bb6741c7f88318baf9b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 March 2021, 04:58:52 UTC
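The key point of SPARK-34909 is that the division must be unsigned once a negative input has been wrapped into the unsigned 64-bit range; `java.lang.Long.divideUnsigned`/`remainderUnsigned` provide exactly that. A standalone illustration, independent of Spark's `NumberConverter`:

```
object UnsignedDivSketch extends App {
  // -10 reinterpreted as an unsigned 64-bit value is 2^64 - 10.
  val x: Long = -10L

  // Signed division is wrong for the unsigned interpretation: it yields -1.
  println(x / 7L)
  // Unsigned division yields (2^64 - 10) / 7 = 2635249153387078800.
  println(java.lang.Long.divideUnsigned(x, 7L))
  // And the unsigned remainder is 6.
  println(java.lang.Long.remainderUnsigned(x, 7L))
}
```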
20891a7 [SPARK-34900][TEST][3.1] Make sure benchmarks can run using spark-submit cmd described in the guide ### What changes were proposed in this pull request? Some of the `spark-submit` commands used to run benchmarks in the user's guide are wrong, and those commands cannot run the benchmarks successfully. The major change of this PR is to correct these wrong commands; for example, to run a benchmark which inherits from `SqlBasedBenchmark`, we must specify `--jars <spark core test jar>,<spark catalyst test jar>` because a `SqlBasedBenchmark`-based benchmark extends `BenchmarkBase` (defined in the spark core test jar) and `SQLHelper` (defined in the spark catalyst test jar). Another change of this PR is removing the `scalatest Assertions` dependency of the benchmarks, because `scalatest-*.jar` is not in the distribution package and would be troublesome to use. ### Why are the changes needed? Make sure benchmarks can run using the spark-submit commands described in the guide. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Used the corrected `spark-submit` commands to run benchmarks successfully. Closes #32002 from LuciferYang/SPARK-34900-31. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 30 March 2021, 15:00:38 UTC
e17f68f [SPARK-34845][CORE] ProcfsMetricsGetter shouldn't return partial procfs metrics ### What changes were proposed in this pull request? In ProcfsMetricsGetter.scala, propagate the IOException from addProcfsMetricsFromOneProcess to computeAllMetrics when a child pid's proc stat file is unavailable. As a result, the for-loop in computeAllMetrics() can terminate early and return an all-0 procfs metric. ### Why are the changes needed? In the case where a child pid's stat file is missing but the subsequent child pids' stat files exist, ProcfsMetricsGetter.computeAllMetrics() will return partial metrics (the sum of a subset of the child pids), which can be misleading and is undesired per the existing code comments in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214. Also, a side effect of this bug is that it can lead to verbose warning logs if many pids' stat files are missing; terminating early makes the warning logs more concise. The unit test also illustrates the bug well. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? A unit test is added. Closes #31945 from baohe-zhang/SPARK-34845. Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b2bfe985e8adf55e5df5887340fd862776033a06) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 March 2021, 14:47:13 UTC
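A hedged sketch of the all-or-nothing behavior SPARK-34845 aims for: if any child pid's stat read fails, the exception propagates and the caller returns zeroed metrics instead of a misleading partial sum. The names and types below are illustrative, not the actual `ProcfsMetricsGetter` code.

```
import java.io.IOException

case class ProcfsMetrics(rssBytes: Long, vmemBytes: Long)

object ProcfsSketch {
  // Stand-in for reading /proc/<pid>/stat; throws if the file is gone.
  def readOneProcess(pid: Int, stats: Map[Int, ProcfsMetrics]): ProcfsMetrics =
    stats.getOrElse(pid, throw new IOException(s"/proc/$pid/stat not found"))

  def computeAllMetrics(pids: Seq[Int], stats: Map[Int, ProcfsMetrics]): ProcfsMetrics =
    try {
      pids.map(readOneProcess(_, stats)).foldLeft(ProcfsMetrics(0L, 0L)) { (acc, m) =>
        ProcfsMetrics(acc.rssBytes + m.rssBytes, acc.vmemBytes + m.vmemBytes)
      }
    } catch {
      // Propagated from readOneProcess: return all-zero metrics rather than a
      // partial sum over only some of the child pids.
      case _: IOException => ProcfsMetrics(0L, 0L)
    }
}

object ProcfsSketchDemo extends App {
  val stats = Map(1 -> ProcfsMetrics(100L, 200L), 3 -> ProcfsMetrics(50L, 80L))
  println(ProcfsSketch.computeAllMetrics(Seq(1, 3), stats))    // summed metrics
  println(ProcfsSketch.computeAllMetrics(Seq(1, 2, 3), stats)) // pid 2 missing -> all zeros
}
```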
40921eb [SPARK-34814][SQL] LikeSimplification should handle NULL ### What changes were proposed in this pull request? LikeSimplification should handle NULL. UT will failed before this pr ``` test("SPARK-34814: LikeSimplification should handle NULL") { withSQLConf(SQLConf.OPTIMIZER_EXCLUDED_RULES.key -> ConstantFolding.getClass.getName.stripSuffix("$")) { checkEvaluation(Literal.create("foo", StringType) .likeAll("%foo%", Literal.create(null, StringType)), null) } } [info] - test *** FAILED *** (2 seconds, 443 milliseconds) [info] java.lang.NullPointerException: [info] at org.apache.spark.sql.catalyst.optimizer.LikeSimplification$.$anonfun$simplifyMultiLike$1(expressions.scala:697) [info] at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) [info] at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) [info] at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) [info] at scala.collection.TraversableLike.map(TraversableLike.scala:238) [info] at scala.collection.TraversableLike.map$(TraversableLike.scala:231) [info] at scala.collection.AbstractTraversable.map(Traversable.scala:108) [info] at org.apache.spark.sql.catalyst.optimizer.LikeSimplification$.org$apache$spark$sql$catalyst$optimizer$LikeSimplification$$simplifyMultiLike(expressions.scala:697) [info] at org.apache.spark.sql.catalyst.optimizer.LikeSimplification$$anonfun$apply$9.applyOrElse(expressions.scala:722) [info] at org.apache.spark.sql.catalyst.optimizer.LikeSimplification$$anonfun$apply$9.applyOrElse(expressions.scala:714) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:316) [info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:316) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:321) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:321) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsDown$1(QueryPlan.scala:94) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:116) [info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) ``` ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #31976 from AngersZhuuuu/SPARK-34814. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 2356cdd420f600f38d0e786dc50c15f2603b7ff2) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 29 March 2021, 03:05:22 UTC
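A small end-to-end way to exercise the SPARK-34814 code path from SQL is `LIKE ALL` with a NULL pattern, which `LikeSimplification` rewrites; with the fix in place the query is expected to return NULL rather than fail. This is an illustrative sketch, not the PR's unit test.

```
import org.apache.spark.sql.SparkSession

object LikeAllNullSketch extends App {
  val spark = SparkSession.builder().master("local[1]").appName("like-all-null").getOrCreate()
  // One of the patterns is NULL; the optimizer must not NPE while collecting the
  // literal patterns, and the expected result of the expression is NULL.
  spark.sql("SELECT 'foo' LIKE ALL ('%foo%', NULL)").show()
  spark.stop()
}
```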
75c8ac1 [SPARK-34876][SQL] Fill defaultResult of non-nullable aggregates ### What changes were proposed in this pull request? Filled the `defaultResult` field on non-nullable aggregates ### Why are the changes needed? The `defaultResult` defaults to `None` and in some situations (like correlated scalar subqueries) it is used for the value of the aggregation. The UT result before the fix: ``` -- !query SELECT t1a, (SELECT count(t2d) FROM t2 WHERE t2a = t1a) count_t2, (SELECT count_if(t2d > 0) FROM t2 WHERE t2a = t1a) count_if_t2, (SELECT approx_count_distinct(t2d) FROM t2 WHERE t2a = t1a) approx_count_distinct_t2, (SELECT collect_list(t2d) FROM t2 WHERE t2a = t1a) collect_list_t2, (SELECT collect_set(t2d) FROM t2 WHERE t2a = t1a) collect_set_t2, (SELECT hex(count_min_sketch(t2d, 0.5d, 0.5d, 1)) FROM t2 WHERE t2a = t1a) collect_set_t2 FROM t1 -- !query schema struct<t1a:string,count_t2:bigint,count_if_t2:bigint,approx_count_distinct_t2:bigint,collect_list_t2:array<bigint>,collect_set_t2:array<bigint>,collect_set_t2:string> -- !query output val1a 0 0 NULL NULL NULL NULL val1a 0 0 NULL NULL NULL NULL val1a 0 0 NULL NULL NULL NULL val1a 0 0 NULL NULL NULL NULL val1b 6 6 3 [19,119,319,19,19,19] [19,119,319] 0000000100000000000000060000000100000004000000005D8D6AB90000000000000000000000000000000400000000000000010000000000000001 val1c 2 2 2 [219,19] [219,19] 0000000100000000000000020000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000001 val1d 0 0 NULL NULL NULL NULL val1d 0 0 NULL NULL NULL NULL val1d 0 0 NULL NULL NULL NULL val1e 1 1 1 [19] [19] 0000000100000000000000010000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000000 val1e 1 1 1 [19] [19] 0000000100000000000000010000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000000 val1e 1 1 1 [19] [19] 0000000100000000000000010000000100000004000000005D8D6AB90000000000000000000000000000000100000000000000000000000000000000 ``` ### Does this PR introduce _any_ user-facing change? Bugfix ### How was this patch tested? UT Closes #31973 from tanelk/SPARK-34876_non_nullable_agg_subquery. Authored-by: Tanel Kiis <tanel.kiis@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 4b9e94c44412f399ba19e0ea90525d346942bf71) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 29 March 2021, 02:47:18 UTC
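A self-contained sketch of the query shape affected by SPARK-34876: a correlated scalar subquery whose aggregate sees no matching rows, so the rewrite has to fall back on the aggregate's `defaultResult`. The view names and data here are assumptions, not the PR's test fixtures.

```
import org.apache.spark.sql.SparkSession

object AggDefaultResultSketch extends App {
  val spark = SparkSession.builder().master("local[1]").appName("agg-default-result").getOrCreate()
  spark.sql("CREATE OR REPLACE TEMP VIEW t1 AS SELECT * FROM VALUES ('a'), ('b') AS t1(k)")
  spark.sql("CREATE OR REPLACE TEMP VIEW t2 AS SELECT * FROM VALUES ('a', 1) AS t2(k, v)")
  // For k = 'b' the correlated subqueries see no matching rows, so the
  // scalar-subquery rewrite relies on each aggregate's default result
  // (e.g. 0 for count) instead of an undefined value.
  spark.sql(
    """SELECT k,
      |       (SELECT count(v) FROM t2 WHERE t2.k = t1.k)        AS cnt,
      |       (SELECT collect_list(v) FROM t2 WHERE t2.k = t1.k) AS vs
      |FROM t1
      |""".stripMargin).show()
  spark.stop()
}
```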
4bf5cd6 [SPARK-34829][SQL] Fix higher order function results ### What changes were proposed in this pull request? This PR fixes a correctness issue with higher order functions. The results of function expressions need to be copied in some higher order functions, as such an expression can return internal buffers and higher order functions can call the expression multiple times. The issue was discovered with typed `ScalaUDF`s after https://github.com/apache/spark/pull/28979. ### Why are the changes needed? To fix a bug. ### Does this PR introduce _any_ user-facing change? Yes, some queries return the right results again. ### How was this patch tested? Added new UT. Closes #31955 from peter-toth/SPARK-34829-fix-scalaudf-resultconversion. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 33821903490650463f47d03d20d2b16f77360e99) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 March 2021, 17:01:22 UTC
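An illustrative sketch of the pattern SPARK-34829 affects: a typed Scala UDF evaluated once per array element by `transform()`, where each element's result must be copied instead of sharing the expression's internal buffer. This is not the PR's regression test, just the general shape of such a query.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, transform, udf}

object HigherOrderUdfSketch extends App {
  val spark = SparkSession.builder().master("local[1]").appName("hof-typed-udf").getOrCreate()
  import spark.implicits._

  // A typed UDF returning a struct-like result; transform() calls it once per
  // array element, so each per-element result must be an independent copy.
  val pair = udf((i: Int) => (i, s"value-$i"))

  val df = Seq(Seq(1, 2, 3)).toDF("xs")
  df.select(transform(col("xs"), x => pair(x)).as("pairs")).show(false)
  spark.stop()
}
```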
de4be70 [SPARK-34874][INFRA] Recover test reports for failed GA builds ### What changes were proposed in this pull request? https://github.com/dawidd6/action-download-artifact/commit/621becc6d7c440318382ce6f4cb776f27dd3fef3#r48726074 there was a behaviour change in the download artifact plugin and it disabled the test reporting in failed builds. This PR recovers it by explicitly setting the conclusion from the workflow runs to search for the artifacts to download. ### Why are the changes needed? In order to properly report the failed test cases. ### Does this PR introduce _any_ user-facing change? No, it's dev only. ### How was this patch tested? Manually tested at https://github.com/HyukjinKwon/spark/pull/30 Before: ![Screen Shot 2021-03-26 at 10 54 48 AM](https://user-images.githubusercontent.com/6477701/112566110-b7951d80-8e21-11eb-8fad-f637db9314d5.png) After: ![Screen Shot 2021-03-26 at 5 04 01 PM](https://user-images.githubusercontent.com/6477701/112606215-7588cd80-8e5b-11eb-8fdd-3afebd629f4f.png) Closes #31970 from HyukjinKwon/SPARK-34874. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit c8233f1be5c2f853f42cda367475eb135a83afd5) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 March 2021, 09:12:37 UTC
f3c1298 [SPARK-34833][SQL][FOLLOWUP] Handle outer references in all the places ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/31940 . This PR generalizes the matching of attributes and outer references, so that outer references are handled everywhere. Note that, currently correlated subquery has a lot of limitations in Spark, and the newly covered cases are not possible to happen. So this PR is a code refactor. ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #31959 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 658e95c345d5aa2a98b8d2a854e003a5c77ed581) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 26 March 2021, 00:10:19 UTC
c106e01 [SPARK-34840][SHUFFLE] Fixes cases of corruption in merged shuffle … ### What changes were proposed in this pull request? This PR fixes bugs that cause corruption of push-merged blocks when a client terminates while pushing a block. `RemoteBlockPushResolver` was introduced in #30062 (SPARK-32916). There are 2 scenarios where the merged blocks get corrupted: 1. `StreamCallback.onFailure()` is called more than once. Initially we assumed that the onFailure callback would be called just once per stream. However, we observed that it is called twice when a client connection is reset. When the client connection is reset, there are 2 events that get triggered in this order. - `exceptionCaught`. This event is propagated to `StreamInterceptor`. `StreamInterceptor.exceptionCaught()` invokes `callback.onFailure(streamId, cause)`. This is the first time StreamCallback.onFailure() will be invoked. - `channelInactive`. Since the channel closes, the `channelInactive` event gets triggered, which again is propagated to `StreamInterceptor`. `StreamInterceptor.channelInactive()` invokes `callback.onFailure(streamId, new ClosedChannelException())`. This is the second time StreamCallback.onFailure() will be invoked. 2. The flag `isWriting` is set prematurely to true. This introduces an edge case where a stream that is trying to merge a duplicate block (created because of a speculative task) may interfere with an active stream if the duplicate stream fails. This PR also adds changes that improve the code: 1. Using positional writes all the time, because this simplifies the code and microbenchmarking hasn't shown any performance impact. 2. Additional minor changes suggested by mridulm during an internal review. ### Why are the changes needed? These are bug fixes and simplify the code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests. I have also tested these changes in Linkedin's internal fork on a cluster. Co-authored-by: Chandni Singh chsinghlinkedin.com Co-authored-by: Min Shen mshenlinkedin.com Closes #31934 from otterc/SPARK-32916-followup. Lead-authored-by: Chandni Singh <singh.chandni@gmail.com> Co-authored-by: Min Shen <mshen@linkedin.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 6d88212f79a67fa070a91a0123c5bb34683ddc3a) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 25 March 2021, 17:49:15 UTC
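The first bug above is a non-idempotent failure callback. A minimal sketch of the guard (illustrative names, not the actual `RemoteBlockPushResolver` code) that makes `onFailure` effectively run once, so a duplicate invocation cannot corrupt state a later stream is using:

```
import java.nio.channels.ClosedChannelException
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative stand-in for a stream callback whose onFailure may be invoked
// twice (once for exceptionCaught, once for channelInactive); the guard makes
// the cleanup run only once.
class IdempotentStreamCallback(cleanup: Throwable => Unit) {
  private val failed = new AtomicBoolean(false)

  def onFailure(streamId: String, cause: Throwable): Unit = {
    if (failed.compareAndSet(false, true)) {
      cleanup(cause)
    } // else: already handled; ignore the duplicate callback
  }
}

object IdempotentStreamCallbackDemo extends App {
  var cleanups = 0
  val cb = new IdempotentStreamCallback(_ => cleanups += 1)
  cb.onFailure("stream-1", new RuntimeException("connection reset"))
  cb.onFailure("stream-1", new ClosedChannelException()) // duplicate, ignored
  println(s"cleanup ran $cleanups time(s)")              // prints 1
}
```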