https://github.com/apache/spark

1d550c4 Preparing Spark release v3.1.1-rc3 22 February 2021, 00:52:36 UTC
1cf4e1b [SPARK-34469][K8S] Ignore RegisterExecutor when SparkContext is stopped ### What changes were proposed in this pull request? This PR aims to make `KubernetesClusterSchedulerBackend` ignore `RegisterExecutor` message when `SparkContext` is stopped already. ### Why are the changes needed? If `SparkDriver` is terminated, the executors will be removed by K8s automatically. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the newly added test case. Closes #31587 from dongjoon-hyun/SPARK-34469. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 February 2021, 17:07:22 UTC
25a5f37 [SPARK-34487][K8S][TESTS] Use the runtime Hadoop version in K8s IT ### What changes were proposed in this pull request? This PR aims to use the runtime Hadoop version in K8s integration test. ### Why are the changes needed? SPARK-33212 upgrades Hadoop dependency from 3.2.0 to 3.2.2 and we will upgrade to 3.3.x+. We had better use the runtime Hadoop version instead of having a static string. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the K8s IT. This is tested locally like the following. ``` KubernetesSuite: ... - Launcher client dependencies - SPARK-33615: Launcher client archives - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python - Launcher python client dependencies using a zip file ... ``` Closes #31604 from dongjoon-hyun/SPARK-34487. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 9942548c37ee6b08b6e29332c1e42407f4026fd3) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 February 2021, 16:57:22 UTC
02be408 [SPARK-34384][CORE] Add missing docs for ResourceProfile APIs ### What changes were proposed in this pull request? This PR adds missing docs for ResourceProfile related APIs. Besides, it includes a few minor changes on API: * ResourceProfileBuilder.build -> ResourceProfileBuilder.builder() * Provides java specific API `allSupportedExecutorResourcesJList` * private `ResourceAllocator` since it was mistakenly exposed previously ### Why are the changes needed? Add missing API docs ### Does this PR introduce _any_ user-facing change? No, as Apache Spark 3.1 hasn't officially released. ### How was this patch tested? Updated unit tests due to the signature change of `build()`. Closes #31496 from Ngone51/resource-profile-api-cleanup. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 546d2eb5d46813a14c7bd30113fb6bb038cdd2fc) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 February 2021, 09:29:59 UTC
7ef5c1a [SPARK-34373][SQL] HiveThriftServer2 startWithContext may hang with a race issue ### What changes were proposed in this pull request? fix a race issue by interrupting the thread ### Why are the changes needed? ``` 21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error occurred during acceptance of message. org.apache.thrift.transport.TTransportException: No underlying server socket. at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126) at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) at org.apache.thrift.transport.TServerTransport.acceException in thread "Thread-15" java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) at java.io.BufferedInputStream.read(BufferedInputStream.java:336) at java.io.FilterInputStream.read(FilterInputStream.java:107) at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238) at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246) at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227) at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221) ``` when the TServer try to `serve` after `stop`, it hangs with the log above forever ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing ci Closes #31479 from yaooqinn/SPARK-34373. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 1fac706db560001411672c5ade42f6608f82989e) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 February 2021, 08:37:34 UTC
093d4e2 [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator This PR is a fix for the JLS 17.5.3 violation identified in zsxwing's [19/Feb/19 11:47 comment](https://issues.apache.org/jira/browse/SPARK-20977?focusedCommentId=16772277&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16772277) on the JIRA. ### What changes were proposed in this pull request? - Use a var field to hold the state of the collection accumulator ### Why are the changes needed? AccumulatorV2 auto-registration of accumulator during readObject doesn't work with final fields that are post-processed outside readObject. As it stands incompletely initialized objects are published to heartbeat thread. This leads to sporadic exceptions knocking out executors which increases the cost of the jobs. We observe such failures on a regular basis https://github.com/NVIDIA/spark-rapids/issues/1522. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? - this is a concurrency bug that is almost impossible to reproduce as a quick unit test. - By trial and error I crafted a command https://github.com/NVIDIA/spark-rapids/pull/1688 that reproduces the issue on my dev box several times per hour, with the first occurrence often within a few minutes. After the patch, these Exceptions have not shown up after running overnight for 10+ hours - existing unit tests in *`AccumulatorV2Suite` and *`LiveEntitySuite` Closes #31540 from gerashegalov/SPARK-20977. Authored-by: Gera Shegalov <gera@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit fadd0f5d9bff79cbd785631aa2962b9eda644ab8) Signed-off-by: Sean Owen <srowen@gmail.com> 21 February 2021, 02:57:25 UTC
62d2810 [SPARK-34471][SS][DOCS] Document Streaming Table APIs in Structured Streaming Programming Guide ### What changes were proposed in this pull request? This change is to document the newly added streaming table APIs in Structured Streaming Programming Guide. ### Why are the changes needed? This will help our users when they try to use the new APIs. ### Does this PR introduce _any_ user-facing change? Yes. Users will see the changes in the programming guide. ### How was this patch tested? Built the HTML page and verified. Attached is a screenshot of the section added: ![Table APIs Section - Scala](https://user-images.githubusercontent.com/44179472/108581923-1ff86700-736b-11eb-8fcd-efa04ac936de.png) Closes #31590 from bozhang2820/table-api-doc. Lead-authored-by: Bo Zhang <bo.zhang@databricks.com> Co-authored-by: Bo Zhang <bozhang2820@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> (cherry picked from commit 489d32aa9bb9ef9446ac8df19deb0693f305b092) Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> 20 February 2021, 06:55:01 UTC
41b22c4 [SPARK-34424][SQL][TESTS][3.1][3.0] Fix failures of HiveOrcHadoopFsRelationSuite ### What changes were proposed in this pull request? Modify `RandomDataGenerator.forType()` to allow generation of dates/timestamps that are valid in both Julian and Proleptic Gregorian calendars. Currently, the function can produce a date (for example `1582-10-06`) which is valid in the Proleptic Gregorian calendar. Though it cannot be saved to ORC files AS IS since ORC format (ORC libs in fact) assumes Julian calendar. So, Spark shifts `1582-10-06` to the next valid date `1582-10-15` while saving it to ORC files. And as a consequence of that, the test fails because it compares original date `1582-10-06` and the date `1582-10-15` loaded back from the ORC files. In this PR, I propose to generate valid dates/timestamps in both calendars for ORC datasource till SPARK-34440 is resolved. ### Why are the changes needed? The changes fix failures of `HiveOrcHadoopFsRelationSuite`. For instance, the test "test all data types" fails with the seed **610710213676**: ``` == Results == !== Correct Answer - 20 == == Spark Answer - 20 == struct<index:int,col:date> struct<index:int,col:date> ... ![9,1582-10-06] [9,1582-10-15] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suite: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveOrcHadoopFsRelationSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: HyukjinKwon <gurwls223apache.org> (cherry picked from commit 03161055de0c132070354407160553363175c4d7) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #31589 from MaxGekk/fix-HiveOrcHadoopFsRelationSuite-3.1. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 February 2021, 13:26:37 UTC
fdedd73 [SPARK-34421][SQL] Resolve temporary functions and views in views with CTEs ### What changes were proposed in this pull request? This PR: - Fixes a bug that prevents analysis of: ``` CREATE TEMPORARY VIEW temp_view AS WITH cte AS (SELECT temp_func(0)) SELECT * FROM cte; SELECT * FROM temp_view ``` by throwing: ``` Undefined function: 'temp_func'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'. ``` - and doesn't report analysis error when it should: ``` CREATE TEMPORARY VIEW temp_view AS SELECT 0; CREATE VIEW view_on_temp_view AS WITH cte AS (SELECT * FROM temp_view) SELECT * FROM cte ``` by properly collecting temporary objects from VIEW definitions with CTEs. - Minor refactor to make the affected code more readable. ### Why are the changes needed? To fix a bug introduced with https://github.com/apache/spark/pull/30567 ### Does this PR introduce _any_ user-facing change? Yes, the query works again. ### How was this patch tested? Added new UT + existing ones. Closes #31550 from peter-toth/SPARK-34421-temp-functions-in-views-with-cte. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 27abb6ab5674b8663440dc738a0ba79c185fb063) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 February 2021, 10:15:11 UTC
510f04e [MINOR][SQL][DOCS] Fix the comments in the example at window function `functions.scala` window function has a comment error in the field name. The column should be `time` per `timestamp:TimestampType`. To deliver the correct documentation and examples. Yes, it fixes the user-facing docs. CI builds in this PR should test the documentation build. Closes #31582 from yzjg/yzjg-patch-1. Authored-by: yzjg <785246661@qq.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 26548edfa2445b009f63bbdbe810cdb6c289c18d) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 19 February 2021, 01:45:56 UTC
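For context, a minimal self-contained sketch of the documented pattern the comment fix refers to — the grouping column is `time` (a TimestampType column), not `timestamp`; the data and durations below are illustrative, not from the PR:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, sum}

val spark = SparkSession.builder().master("local[*]").appName("window-doc-sketch").getOrCreate()
import spark.implicits._

// Illustrative events with a `time: TimestampType` column, matching the fixed comment.
val events = Seq(("2021-02-19 01:00:00", 1L), ("2021-02-19 01:07:00", 2L))
  .toDF("time", "value")
  .withColumn("time", $"time".cast("timestamp"))

events
  .groupBy(window($"time", "10 minutes", "5 minutes"))
  .agg(sum($"value").as("total"))
  .show(truncate = false)
```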
691d266 [SPARK-34449][BUILD] Upgrade Jetty to fix CVE-2020-27218 This PR upgrades Jetty from `9.4.34` to `9.4.36`. CVE-2020-27218 affects currently used Jetty 9.4.34. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-27218 No. Modified existing test and new test which comply with the new version of Jetty. Closes #31574 from sarutak/upgrade-jetty-9.4.36. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 51672281728164db731f3f607818bffea0334eb0) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 18 February 2021, 09:06:00 UTC
dd6b286 [SPARK-34446][SS][DOCS] Update doc for stream-stream join (full outer + left semi) ### What changes were proposed in this pull request? Per discussion in https://issues.apache.org/jira/browse/SPARK-32883?focusedCommentId=17285057&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17285057, we should add documentation for added new features of full outer and left semi joins into SS programming guide. * Reworded the section for "Outer Joins with Watermarking", to make it work for full outer join. Updated the code snippet to show up full outer and left semi join. * Added one section for "Semi Joins with Watermarking", similar to "Outer Joins with Watermarking". * Updated "Support matrix for joins in streaming queries" to reflect latest fact for full outer and left semi join. ### Why are the changes needed? Good for users and developers to follow guide to try out these two new features. ### Does this PR introduce _any_ user-facing change? Yes. They will see the corresponding updated guide. ### How was this patch tested? No, just documentation change. Previewed the markdown file in browser. Also attached here for the change to the "Support matrix for joins in streaming queries" table. <img width="896" alt="Screen Shot 2021-02-16 at 8 12 07 PM" src="https://user-images.githubusercontent.com/4629931/108155275-73c92e80-7093-11eb-9f0b-c8b4bb7321e5.png"> Closes #31572 from c21/ss-doc. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit a575e805a18a515f8707f74cf2b22777474f2f06) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 18 February 2021, 00:34:47 UTC
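A hedged sketch of the newly documented join types, assuming an active SparkSession `spark` (e.g. spark-shell); the rate source, column names, and watermark/interval values are illustrative, not taken from the guide:
```scala
import org.apache.spark.sql.functions.expr

// Two illustrative streaming inputs (the rate source only provides `timestamp`/`value`).
val impressions = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "impressionAdId")
  .withColumnRenamed("timestamp", "impressionTime")
val clicks = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "clickAdId")
  .withColumnRenamed("timestamp", "clickTime")

// Left semi stream-stream join with watermarks and a time-range condition;
// "fullOuter" is accepted the same way in Spark 3.1+.
val matchedImpressions = impressions.withWatermark("impressionTime", "2 hours").join(
  clicks.withWatermark("clickTime", "3 hours"),
  expr("clickAdId = impressionAdId AND clickTime >= impressionTime AND " +
       "clickTime <= impressionTime + interval 1 hour"),
  "leftSemi")
```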
4136edb [SPARK-33210][SQL][DOCS][FOLLOWUP] Fix descriptions of the SQL configs for the parquet INT96 rebase modes ### What changes were proposed in this pull request? Fix descriptions of the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInRead` and `spark.sql.legacy.parquet.int96RebaseModeInWrite`, and mention `EXCEPTION` as the default value. ### Why are the changes needed? This fixes incorrect descriptions that can mislead users. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `./dev/scalastyle`. Closes #31557 from MaxGekk/int96-exception-by-default-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 1a11fe55017a79016dd138dd2afb4edd0a6cef2f) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 16 February 2021, 02:56:03 UTC
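For reference, an illustrative way to set the two configs named above at runtime (assuming an active `spark` session and the standard rebase-mode values EXCEPTION/LEGACY/CORRECTED, with EXCEPTION as the default the fix mentions):
```scala
// Only override the EXCEPTION default when you know how the INT96 data was written.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
```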
a75dc10 [SPARK-33434][PYTHON][DOCS] Added RuntimeConfig to PySpark docs Documentation for `SparkSession.conf.isModifiable` is missing from the Python API site, so we added a Configuration section to the Spark SQL page to expose docs for the `RuntimeConfig` class (the class containing `isModifiable`). Then a `:class:` reference to `RuntimeConfig` was added to the `SparkSession.conf` docstring to create a link there as well. No docs were generated for `pyspark.sql.conf.RuntimeConfig`. Yes--a new Configuration section to the Spark SQL page and a `Returns` section of the `SparkSession.conf` docstring, so this will now show a link to the `pyspark.sql.conf.RuntimeConfig` page. This is a change compared to both the released Spark version and the unreleased master branch. First built the Python docs: ```bash cd $SPARK_HOME/docs SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve ``` Then verified all pages and links: 1. Configuration link displayed on the API Reference page, and it clicks through to Spark SQL page: http://localhost:4000/api/python/reference/index.html ![image](https://user-images.githubusercontent.com/1160861/107601918-a2f02380-6bed-11eb-9b8f-974a0681a2a9.png) 2. Configuration section displayed on the Spark SQL page, and the RuntimeConfig link clicks through to the RuntimeConfig page: http://localhost:4000/api/python/reference/pyspark.sql.html#configuration ![image](https://user-images.githubusercontent.com/1160861/107602058-0d08c880-6bee-11eb-8cbb-ad8c47588085.png)** 3. RuntimeConfig page displayed: http://localhost:4000/api/python/reference/api/pyspark.sql.conf.RuntimeConfig.html ![image](https://user-images.githubusercontent.com/1160861/107602278-94eed280-6bee-11eb-95fc-445ea62ac1a4.png) 4. SparkSession.conf page displays the RuntimeConfig link, and it navigates to the RuntimeConfig page: http://localhost:4000/api/python/reference/api/pyspark.sql.SparkSession.conf.html ![image](https://user-images.githubusercontent.com/1160861/107602435-1f373680-6bef-11eb-985a-b72432464940.png) Closes #31483 from Eric-Lemmon/SPARK-33434-document-isModifiable. Authored-by: Eric Lemmon <eric@lemmon.cc> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit e3b6e4ad435b31aabd6781df63c50c1f92bdfeac) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 16 February 2021, 01:25:23 UTC
21754a1 [SPARK-34431][CORE] Only load `hive-site.xml` once ### What changes were proposed in this pull request? Lazily load Hive's configuration properties from `hive-site.xml` only once. ### Why are the changes needed? It is expensive to parse the same file over and over. ### Does this PR introduce _any_ user-facing change? Should not. The changes can improve performance slightly. ### How was this patch tested? By existing test suites such as `SparkContextSuite`. Closes #31556 from MaxGekk/load-hive-site-once. Authored-by: herman <herman@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 4fd3247bca400f31b0175813df811352b906acbf) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 February 2021, 17:32:27 UTC
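A rough sketch of the "parse hive-site.xml only once" idea — this is not Spark's actual implementation, just the memoization pattern the description refers to:
```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration

object HiveSitePropsSketch {
  // Parsed at most once per JVM; later Hadoop-conf construction reuses the same map.
  lazy val props: Map[String, String] = {
    val url = Thread.currentThread().getContextClassLoader.getResource("hive-site.xml")
    if (url == null) {
      Map.empty
    } else {
      val conf = new Configuration(false) // don't load core-default.xml etc.
      conf.addResource(url)
      conf.iterator().asScala.map(e => e.getKey -> e.getValue).toMap
    }
  }
}
```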
2d4a515 [SPARK-34080][ML][PYTHON][FOLLOW-UP] Update score function in UnivariateFeatureSelector document ### What changes were proposed in this pull request? This follows up #31160 to update score function in the document. ### Why are the changes needed? Currently we use `f_classif`, `ch2`, `f_regression`, which sound to me the sklearn's naming. It is good to have it but I think it is nice if we have formal score function name with sklearn's ones. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No, only doc change. Closes #31531 from viirya/SPARK-34080-minor. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 1fbd5764105e2c09caf4ab57a7095dd794307b02) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 10 February 2021, 00:24:41 UTC
5d0c84a [SPARK-34334][K8S] Correctly identify timed out pending pod requests as excess request ### What changes were proposed in this pull request? Fixing identification of timed-out pending pod requests as excess requests to delete when the excess is higher than the newly created timed out requests and there is some non-timed out newly created requests too. ### Why are the changes needed? After https://github.com/apache/spark/pull/29981 only timed out newly created requests and timed out pending requests are taken as excess request. But there is small bug when the excess is higher than the newly created timed out requests and there is some non-timed out newly created requests as well. Because all the newly created requests are counted as excess request when items are chosen from the timed out pod pending requests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? There is new unit test added: `SPARK-34334: correctly identify timed out pending pod requests as excess`. Closes #31445 from attilapiros/SPARK-34334. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Holden Karau <hkarau@apple.com> (cherry picked from commit b2dc38b6546552cf3fcfdcd466d7d04d9aa3078c) Signed-off-by: Holden Karau <hkarau@apple.com> 09 February 2021, 18:09:26 UTC
800be71 [MINOR][ML][TESTS] Increase tolerance to make NaiveBayesSuite more robust ### What changes were proposed in this pull request? Increase the rel tol from 0.2 to 0.35. ### Why are the changes needed? Fix flaky test ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT. Closes #31536 from WeichenXu123/ES-65815. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 18b30107adb37d3c7a767a20cc02813f0fdb86da) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 09 February 2021, 14:00:27 UTC
b0ffbf7 [SPARK-34407][K8S] KubernetesClusterSchedulerBackend.stop should clean up K8s resources ### What changes were proposed in this pull request? This PR aims to fix `KubernetesClusterSchedulerBackend.stop` to wrap `super.stop` with `Utils.tryLogNonFatalError`. ### Why are the changes needed? [CoarseGrainedSchedulerBackend.stop](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L559) may throw `SparkException` and this causes K8s resource (pod and configmap) leakage. ### Does this PR introduce _any_ user-facing change? No. This is a bug fix. ### How was this patch tested? Pass the CI with the newly added test case. Closes #31533 from dongjoon-hyun/SPARK-34407. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit ea339c38b43c59931257386efdd490507f7de64d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 February 2021, 05:47:35 UTC
7047923 [SPARK-34405][CORE] Fix mean value of timersLabels in the PrometheusServlet class ### What changes were proposed in this pull request? The getMetricsSnapshot method of the PrometheusServlet class has a wrong value, It should be taking the mean value but it's taking the max value. ### Why are the changes needed? The mean value of timersLabels in the PrometheusServlet class is wrong, You can look at line 105 of this class: L105. ``` sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMax}\n") ``` it should be ``` sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMean}\n") ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ![image](https://user-images.githubusercontent.com/5170878/107313576-cc199280-6acd-11eb-9384-b6abf71c0f90.png) Closes #31532 from 397090770/SPARK-34405. Authored-by: wyp <wyphao.2007@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a1e75edc39c11e85d8a4917c3e82282fa974be96) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 February 2021, 05:18:45 UTC
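A small sketch of the corrected line using the Dropwizard `Snapshot` API that these timer metrics come from; the registry, metric name, and label string are illustrative:
```scala
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry

val registry = new MetricRegistry()
val timer = registry.timer("illustrativeTimer")
timer.update(5, TimeUnit.MILLISECONDS)

val snapshot = timer.getSnapshot
val prefix = "metrics_illustrativeTimer_"
val timersLabels = """{type="timers"}"""
val sb = new StringBuilder
sb.append(s"${prefix}Max$timersLabels ${snapshot.getMax}\n")
sb.append(s"${prefix}Mean$timersLabels ${snapshot.getMean}\n") // the fix: getMean, not getMax
```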
c7e9da9 Revert "[SPARK-34352][SQL] Improve SQLQueryTestSuite so as could run on windows system" This reverts commit db8db0da1c2da24c191b0b89a0fcaa55eafeb7ef. 09 February 2021, 03:03:11 UTC
db8db0d [SPARK-34352][SQL] Improve SQLQueryTestSuite so as could run on windows system ### What changes were proposed in this pull request? The current implementation of `SQLQueryTestSuite` cannot run on Windows, because the code below fails there: `assume(TestUtils.testCommandAvailable("/bin/bash"))` For operating systems that do not support `/bin/bash`, we just skip some tests. ### Why are the changes needed? SQLQueryTestSuite has a bug on Windows. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test Closes #31466 from beliefer/SPARK-34352. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit e65b28cf7d9680ebdf96833a6f2d38ffd61c7d21) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 09 February 2021, 01:59:12 UTC
6ca8e75 [SPARK-33438][SQL] Eagerly init objects with defined SQL Confs for command `set -v` ### What changes were proposed in this pull request? In Spark, `set -v` is defined as "Queries all properties that are defined in the SQLConf of the sparkSession". But there are other external modules that also define properties and register them to SQLConf. In this case, it can't be displayed by `set -v` until the conf object is initiated (i.e. calling the object at least once). In this PR, I propose to eagerly initiate all the objects registered to SQLConf, so that `set -v` will always output the completed properties. ### Why are the changes needed? Improve the `set -v` command to produces completed and deterministic results ### Does this PR introduce _any_ user-facing change? `set -v` command will dump more configs ### How was this patch tested? existing tests Closes #30363 from linhongliu-db/set-v. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 037bfb2dbcb73cfbd73f0fd9abe0b38789a182a2) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 08 February 2021, 13:49:03 UTC
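How the improvement surfaces to users (illustrative, assuming an active `spark` session): properties registered by external modules now show up without first touching those modules' conf objects.
```scala
// Lists all defined SQL properties, including ones registered by external modules.
spark.sql("SET -v").show(1000, truncate = false)
```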
c4d90f3 Preparing development version 3.1.2-SNAPSHOT 08 February 2021, 13:10:08 UTC
cf0115a Preparing Spark release v3.1.1-rc2 08 February 2021, 13:10:01 UTC
76daa1f [MINOR][INFRA][DOC][3.1] Change the facetFilters of Docsearch to 3.1.1 ### What changes were proposed in this pull request? As https://github.com/algolia/docsearch-configs/pull/3391 is merged, This PR changes the facetFilters of Docsearch as 3.1.1. ### Why are the changes needed? So that the search result of the published Spark site will points to https://spark.apache.org/docs/3.1.1 instead of https://spark.apache.org/docs/latest/. This is useful for searching the docs of 3.1.1 after more new Spark releases in the future. ### Does this PR introduce _any_ user-facing change? Yes, the search result of 3.1.1 Spark doc site is based on https://spark.apache.org/docs/3.1.1 instead of https://spark.apache.org/docs/latest/ ### How was this patch tested? Just configuration changes. Closes #31525 from gengliangwang/changeDocSearchVersion. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 08 February 2021, 12:42:24 UTC
0ac4f04 [SPARK-33354][DOC] Remove an unnecessary quote in doc Remove an unnecessary quote in the documentation. Super trivial. Fix a mistake. No Just doc Closes #31523 from gengliangwang/removeQuote. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 88ced28141beb696791ae67eac35219de942bf31) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 08 February 2021, 12:09:30 UTC
2fe20ef [SPARK-34346][CORE][TESTS][FOLLOWUP] Fix UT by removing core-site.xml ### What changes were proposed in this pull request? This is a follow-up for SPARK-34346 which causes a flakiness due to `core-site.xml` test resource file addition. This PR aims to remove the test resource `core/src/test/resources/core-site.xml` from `core` module. ### Why are the changes needed? Due to the test resource `core-site.xml`, YARN UT becomes flaky in GitHub Action and Jenkins. ``` $ build/sbt "yarn/testOnly *.YarnClusterSuite -- -z SPARK-16414" -Pyarn ... [info] YarnClusterSuite: [info] - yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) *** FAILED *** (20 seconds, 209 milliseconds) [info] FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:210) ``` To isolate more, we may use `SPARK_TEST_HADOOP_CONF_DIR` like `yarn` module's `yarn/Client`, but it seems an overkill in `core` module. ``` // SPARK-23630: during testing, Spark scripts filter out hadoop conf dirs so that user's // environments do not interfere with tests. This allows a special env variable during // tests so that custom conf dirs can be used by unit tests. val confDirs = Seq("HADOOP_CONF_DIR", "YARN_CONF_DIR") ++ (if (Utils.isTesting) Seq("SPARK_TEST_HADOOP_CONF_DIR") else Nil) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #31515 from dongjoon-hyun/SPARK-34346-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit dcaf62afea8791e49a44c2062fe14bafdcc0e92f) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 08 February 2021, 02:32:36 UTC
14d0419 [SPARK-34158] Incorrect url of the only developer Matei in pom.xml ### What changes were proposed in this pull request? Update the Incorrect URL of the only developer Matei in pom.xml ### Why are the changes needed? The current link was broken ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? change the link to https://cs.stanford.edu/people/matei/ Closes #31512 from pingsutw/SPARK-34158. Authored-by: Kevin <pingsutw@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 30ef3d6e1c00bd1f28e71511576daad223ba8b22) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 08 February 2021, 00:15:18 UTC
de51f25 [SPARK-34398][DOCS] Fix PySpark migration link ### What changes were proposed in this pull request? docs/pyspark-migration-guide.md ### Why are the changes needed? broken link ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually build and check Closes #31514 from raphaelauv/patch-2. Authored-by: raphaelauv <raphaelauv@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 34a1a65b398c4469eb97cd458ee172dc76b7ef56) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 08 February 2021, 00:12:28 UTC
64444a9 [PYTHON][MINOR] Fix docstring of DataFrame.join ### What changes were proposed in this pull request? Fix docstring of PySpark `DataFrame.join`. ### Why are the changes needed? For a better view of PySpark documentation. ### Does this PR introduce _any_ user-facing change? No (only documentation changes). ### How was this patch tested? Manual test. From ![image](https://user-images.githubusercontent.com/47337188/106977730-c14ab080-670f-11eb-8df8-5aea90902104.png) To ![image](https://user-images.githubusercontent.com/47337188/106977834-ed663180-670f-11eb-9c5e-d09be26e0ca8.png) Closes #31463 from xinrong-databricks/fixDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 747ad1809b4026aae4a7bedec2cac485bddcd5f2) Signed-off-by: Sean Owen <srowen@gmail.com> 06 February 2021, 15:09:01 UTC
0099daf [SPARK-34346][CORE][SQL][3.1] io.file.buffer.size set by spark.buffer.size will override by loading hive-site.xml accidentally may cause perf regression backport #31460 to branch 3.1 ### What changes were proposed in this pull request? In many real-world cases, when interacting with hive catalog through Spark SQL, users may just share the `hive-site.xml` for their hive jobs and make a copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop configurations, we will use `spark.buffer.size(65536)` to reset `io.file.buffer.size(4096)`. But when we load the hive-site.xml, we may ignore this behavior and reset `io.file.buffer.size` again according to `hive-site.xml`. 1. The configuration priority for setting Hadoop and Hive config here is not right, while literally, the order should be `spark > spark.hive > spark.hadoop > hive > hadoop` 2. This breaks `spark.buffer.size` congfig's behavior for tuning the IO performance w/ HDFS if there is an existing `io.file.buffer.size` in hive-site.xml ### Why are the changes needed? bugfix for configuration behavior and fix performance regression by that behavior change ### Does this PR introduce _any_ user-facing change? this pr restores silent user face change ### How was this patch tested? new tests Closes #31482 from yaooqinn/SPARK-34346-31. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 February 2021, 14:13:45 UTC
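A hedged way to observe the precedence described above (session setup is illustrative and assumes Hive classes are on the classpath): with the fix, `spark.buffer.size` keeps driving `io.file.buffer.size` even when a copied `hive-site.xml` also sets `io.file.buffer.size`.
```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder()
  .master("local[*]")
  .config("spark.buffer.size", "65536")
  .enableHiveSupport()
  .getOrCreate()

// Expected to print 65536 after the fix, regardless of hive-site.xml contents.
println(session.sparkContext.hadoopConfiguration.get("io.file.buffer.size"))
```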
7c87b48 [SPARK-34359][SQL][3.1] Add a legacy config to restore the output schema of SHOW DATABASES This backports https://github.com/apache/spark/pull/31474 to 3.1/3.0 ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/26006 In #26006 , we merged the v1 and v2 SHOW DATABASES/NAMESPACES commands, but we missed a behavior change that the output schema of SHOW DATABASES becomes different. This PR adds a legacy config to restore the old schema, with a migration guide item to mention this behavior change. ### Why are the changes needed? Improve backward compatibility ### Does this PR introduce _any_ user-facing change? No (the legacy config is false by default) ### How was this patch tested? a new test Closes #31486 from cloud-fan/command-schema. Authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 February 2021, 12:43:48 UTC
ba60b74 [SQL][MINOR][TEST][3.1] Re-enable some DS v2 char/varchar test ### What changes were proposed in this pull request? Some tests are skipped in branch 3.1 because some bug fixes were not backported. This PR re-enables some tests that are working fine now. ### Why are the changes needed? enable test ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #31481 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 February 2021, 12:41:23 UTC
48d0007 [SPARK-34331][SQL] Speed up DS v2 metadata col resolution ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/28027 https://github.com/apache/spark/pull/28027 added a DS v2 API that allows data sources to produce metadata/hidden columns that can only be seen when it's explicitly selected. The way we integrate this API into Spark is: 1. The v2 relation gets normal output and metadata output from the data source, and the metadata output is excluded from the plan output by default. 2. column resolution can resolve `UnresolvedAttribute` with metadata columns, even if the child plan doesn't output metadata columns. 3. An analyzer rule searches the query plan, trying to find a node that has missing inputs. If such node is found, transform the sub-plan of this node, and update the v2 relation to include the metadata output. The analyzer rule in step 3 brings a perf regression, for queries that do not read v2 tables at all. This rule will calculate `QueryPlan.inputSet` (which builds an `AttributeSet` from outputs of all children) and `QueryPlan.missingInput` (which does a set exclusion and creates a new `AttributeSet`) for every plan node in the query plan. In our benchmark, the TPCDS query compilation time gets increased by more than 10% This PR proposes a simple way to improve it: we add a special metadata entry to the metadata attribute, which allows us to quickly check if a plan needs to add metadata columns: we just check all the references of this plan, and see if the attribute contains the special metadata entry, instead of calculating `QueryPlan.missingInput`. This PR also fixes one bug: we should not change the final output schema of the plan, if we only use metadata columns in operators like filter, sort, etc. ### Why are the changes needed? Fix perf regression in SQL query compilation, and fix a bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Run `org.apache.spark.sql.TPCDSQuerySuite`, before this PR, `AddMetadataColumns` is the top 4 rule ranked by running time ``` === Metrics of Analyzer/Optimizer Rules === Total number of runs: 407641 Total time: 47.257239779 seconds Rule Effective Time / Total Time Effective Runs / Total Runs OptimizeSubqueries 4157690003 / 8485444626 49 / 2778 Analyzer$ResolveAggregateFunctions 1238968711 / 3369351761 49 / 2141 ColumnPruning 660038236 / 2924755292 338 / 6391 Analyzer$AddMetadataColumns 0 / 2918352992 0 / 2151 ``` after this PR: ``` Analyzer$AddMetadataColumns 0 / 122885629 0 / 2151 ``` This rule is 20 times faster and is negligible to the total compilation time. This PR also add new tests to verify the bug fix. Closes #31440 from cloud-fan/metadata-col. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 989eb6884d77226ab4f494a4237e09aea54a032d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 February 2021, 08:37:46 UTC
5f3b8b8 [SPARK-34318][SQL][3.1] Dataset.colRegex should work with column names and qualifiers which contain newlines ### What changes were proposed in this pull request? Backport of #31426 for the record. This PR fixes an issue that `Dataset.colRegex` doesn't work with column names or qualifiers which contain newlines. In the current master, if column names or qualifiers passed to `colRegex` contain newlines, it throws exception. ``` val df = Seq(1, 2, 3).toDF("test\n_column").as("test\n_table") val col1 = df.colRegex("`tes.*\n.*mn`") org.apache.spark.sql.AnalysisException: Cannot resolve column name "`tes.* .*mn`" among (test _column) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272) at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263) at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407) ... 47 elided val col2 = df.colRegex("test\n_table.`tes.*\n.*mn`") org.apache.spark.sql.AnalysisException: Cannot resolve column name "test _table.`tes.* .*mn`" among (test _column) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272) at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263) at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407) ... 47 elided ``` ### Why are the changes needed? Column names and qualifiers can contain newlines but `colRegex` can't work with them, so it's a bug. ### Does this PR introduce _any_ user-facing change? Yes. users can pass column names and qualifiers even though they contain newlines. ### How was this patch tested? New test. Closes #31457 from sarutak/SPARK-34318-branch-3.1. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 04 February 2021, 01:55:17 UTC
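The failing scenario from the description, which resolves after the fix (sketch assuming an active `spark` session, e.g. spark-shell):
```scala
import spark.implicits._

val df = Seq(1, 2, 3).toDF("test\n_column").as("test\n_table")
val col1 = df.colRegex("`tes.*\n.*mn`")              // unqualified regex now resolves
val col2 = df.colRegex("test\n_table.`tes.*\n.*mn`") // qualified form also resolves
df.select(col1).show()
```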
a3e2197 [SPARK-34326][CORE][SQL] Fix UTs added in SPARK-31793 depending on the length of temp path ### What changes were proposed in this pull request? This PR proposes to fix the UTs being added in SPARK-31793, so that all things contributing the length limit are properly accounted. ### Why are the changes needed? The test `DataSourceScanExecRedactionSuite.SPARK-31793: FileSourceScanExec metadata should contain limited file paths` is failing conditionally, depending on the length of the temp directory. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Modified UTs explain the missing points, which also do the test. Closes #31449 from HeartSaVioR/SPARK-34326-v2. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 44dcf0062c41ff4230096bee800d9b4f70c424ce) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 03 February 2021, 23:46:26 UTC
94245c4 [SPARK-34327][BUILD] Strip passwords from inlining into build information while releasing ### What changes were proposed in this pull request? Strip passwords from getting inlined into build information, inadvertently. ` https://user:pass@domain/foo -> https://domain/foo` ### Why are the changes needed? This can be a serious security issue, esp. during a release. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested by executing the following command on both Mac OSX and Ubuntu. ``` echo url=$(git config --get remote.origin.url | sed 's|https://\(.*\)@\(.*\)|https://\2|') ``` Closes #31436 from ScrapCodes/strip_pass. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 89bf2afb3337a44f34009a36cae16dd0ff86b353) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 February 2021, 06:02:49 UTC
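A hedged Scala equivalent of the sed one-liner above, for illustration only (the URL is made up):
```scala
// Drop any user:password@ credential segment from a remote URL before it is
// inlined into build information.
val remote = "https://user:pass@example.com/apache/spark.git"
val sanitized = remote.replaceFirst("https://[^/@]+@", "https://")
// sanitized == "https://example.com/apache/spark.git"
```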
3eb94de Revert "[SPARK-34326][CORE][SQL] Fix UTs added in SPARK-31793 depending on the length of temp path" This reverts commit d9e54381e32bbc86247cf18b7d2ca1e3126bd917. 03 February 2021, 03:33:16 UTC
18def59 [SPARK-33591][3.1][SQL][FOLLOWUP] Add legacy config for recognizing null partition spec values ### What changes were proposed in this pull request? This PR is to backport https://github.com/apache/spark/pull/31421 and https://github.com/apache/spark/pull/31434 to branch 3.1 This is a follow up for https://github.com/apache/spark/pull/30538. It adds a legacy conf `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` in case users wants the legacy behavior. It also adds document for the behavior change. ### Why are the changes needed? In case users want the legacy behavior, they can set `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` as true. ### Does this PR introduce _any_ user-facing change? Yes, adding a legacy configuration to restore the old behavior. ### How was this patch tested? Unit test. Closes #31439 from gengliangwang/backportLegacyConf3.1. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 February 2021, 00:29:35 UTC
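An illustrative opt-in to the legacy behavior; the config name comes from the description above, while the table `t`, partition column `p`, and the exact parsing effect are my reading of it, so treat this as a sketch rather than documentation:
```scala
// With the flag on, the unquoted `null` below is parsed as the string literal
// "null" again, instead of a NULL partition value.
spark.sql("SET spark.sql.legacy.parseNullPartitionSpecAsStringLiteral=true")
spark.sql("ALTER TABLE t DROP PARTITION (p = null)")
```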
bb0efc1 [SPARK-34212][SQL][FOLLOWUP] Parquet vectorized reader can read decimal fields with a larger precision ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/31357 #31357 added a very strong restriction to the vectorized parquet reader, that the spark data type must exactly match the physical parquet type, when reading decimal fields. This restriction is actually not necessary, as we can safely read parquet decimals with a larger precision. This PR releases this restriction a little bit. ### Why are the changes needed? To not fail queries unnecessarily. ### Does this PR introduce _any_ user-facing change? Yes, now users can read parquet decimals with mismatched `DecimalType` as long as the scale is the same and precision is larger. ### How was this patch tested? updated test. Closes #31443 from cloud-fan/improve. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 00120ea53748d84976e549969f43cf2a50778c1c) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 February 2021, 00:26:51 UTC
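An illustrative use of the relaxed rule (assuming an active `spark` session; the path is made up): parquet written as DECIMAL(10, 2) can be read back with a larger precision and the same scale.
```scala
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

spark.range(5).selectExpr("CAST(id AS DECIMAL(10, 2)) AS d")
  .write.mode("overwrite").parquet("/tmp/dec_parquet_sketch")

// Same scale, larger precision: accepted by the vectorized reader after this change.
val widened = StructType(Seq(StructField("d", DecimalType(12, 2))))
spark.read.schema(widened).parquet("/tmp/dec_parquet_sketch").show()
```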
d9e5438 [SPARK-34326][CORE][SQL] Fix UTs added in SPARK-31793 depending on the length of temp path ### What changes were proposed in this pull request? This PR proposes to fix the UTs being added in SPARK-31793, so that all things contributing the length limit are properly accounted. ### Why are the changes needed? The test `DataSourceScanExecRedactionSuite.SPARK-31793: FileSourceScanExec metadata should contain limited file paths` is failing conditionally, depending on the length of the temp directory. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Modified UTs explain the missing points, which also do the test. Closes #31435 from HeartSaVioR/SPARK-34326. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 63866025d2e4bb89251ba7e29160fb30dd48ddf7) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 02 February 2021, 22:35:43 UTC
a12e29b [SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas ### What changes were proposed in this pull request? Resolve duplicate attributes for `FlatMapCoGroupsInPandas`. ### Why are the changes needed? When performing self-join on top of `FlatMapCoGroupsInPandas`, analysis can fail because of conflicting attributes. For example, ```scala df = spark.createDataFrame([(1, 1)], ("column", "value")) row = df.groupby("ColUmn").cogroup( df.groupby("COLUMN") ).applyInPandas(lambda r, l: r + l, "column long, value long") row.join(row).show() ``` error: ```scala ... Conflicting attributes: column#163321L,value#163322L ;; ’Join Inner :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] : :- Project [ColUmn#163312L, column#163312L, value#163313L] : : +- LogicalRDD [column#163312L, value#163313L], false : +- Project [COLUMN#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] :- Project [ColUmn#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- Project [COLUMN#163312L, column#163312L, value#163313L] +- LogicalRDD [column#163312L, value#163313L], false ... ``` ### Does this PR introduce _any_ user-facing change? yes, the query like the above example won't fail. ### How was this patch tested? Adde unit tests. Closes #31429 from Ngone51/fix-conflcting-attrs-of-FlatMapCoGroupsInPandas. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: wuyi <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit e9362c2571f4a329218ff466fce79eef45e8f992) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 02 February 2021, 07:25:49 UTC
6831308 [SPARK-34300][PYSPARK][DOCS][MINOR] Fix some typos and syntax issues in docstrings and output of `dev/lint-python` This changeset is published into the public domain. ### What changes were proposed in this pull request? Some typos and syntax issues in docstrings and the output of `dev/lint-python` have been fixed. ### Why are the changes needed? In some places, the documentation did not refer to parameters or classes by the full and correct name, potentially causing uncertainty in the reader or rendering issues in Sphinx. Also, a typo in the standard output of `dev/lint-python` was fixed. ### Does this PR introduce _any_ user-facing change? Slight improvements in documentation, and in standard output of `dev/lint-python`. ### How was this patch tested? Manual testing and `dev/lint-python` run. No new Sphinx warnings arise due to this change. Closes #31401 from DavidToneian/SPARK-34300. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit d99d0d27be875bba692bcfe376f90c930e170380) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 02 February 2021, 00:31:10 UTC
4ac9fef [SPARK-34310][CORE][SQL] Replaces map and flatten with flatMap Replaces `collection.map(f1).flatten(f2)` with `collection.flatMap` if possible. It's semantically consistent, but looks simpler. Code simplifications. No. Pass the Jenkins or GitHub Action. Closes #31416 from LuciferYang/SPARK-34310. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 9db566a8821c02427434c551ee6e4d2501563dfa) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 02 February 2021, 00:25:18 UTC
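The rewrite pattern in isolation, on made-up data:
```scala
val words = Seq("ab", "cd")
val before = words.map(_.toSeq).flatten // map then flatten: two steps
val after  = words.flatMap(_.toSeq)     // equivalent single flatMap
// before == after == Seq('a', 'b', 'c', 'd')
```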
8091be4 [SPARK-34083][SQL][3.1] Using TPCDS original definitions for char/varchar colums ### What changes were proposed in this pull request? backport c36446d819ce16b3e25bca9033490f523cf60106 to 3.1 This PR changes the column types in the table definitions of `TPCDSBase` from string to char and varchar, with respect to the original definitions for char/varchar columns in the official doc - [TPC-DS_v2.9.0](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf). ### Why are the changes needed? Comply with both TPCDS standard and ANSI, and using string will get wrong results with those TPCDS queries ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? plan stability check Closes #31359 from yaooqinn/tpcds31. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 February 2021, 14:31:24 UTC
c1999fb [SPARK-34233][SQL][3.1] FIX NPE for char padding in binary comparison As mentioned at https://github.com/apache/spark/pull/31336#issuecomment-768786706 the PR https://github.com/apache/spark/pull/31336 has been reverted to from branch 3.1 due to test failures, this pr fixes for that, the test failure is due to another patch is missing in branch 3.1, so in this pr, we wait for https://github.com/apache/spark/commit/fc3f22645e5c542e80a086d96da384feb6afe121 to be backport first ### What changes were proposed in this pull request? we need to check whether the `lit` is null before calling `numChars` ### Why are the changes needed? fix an obvious NPE bug ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #31407 from yaooqinn/npe. Lead-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 February 2021, 14:19:07 UTC
bfe4a0c [SPARK-33990][SQL][TESTS][3.1] Remove partition data by v2 `ALTER TABLE .. DROP PARTITION` ### What changes were proposed in this pull request? Remove partition data by `ALTER TABLE .. DROP PARTITION` in V2 table catalog used in tests. ### Why are the changes needed? This is a bug fix. Before the fix, `ALTER TABLE .. DROP PARTITION` does not remove the data belongs to the dropped partition. As a consequence of that, the `select` query returns removed data. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the tests suite: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTablePartitionV2SQLSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Dongjoon Hyun <dhyunapple.com> (cherry picked from commit fc3f22645e5c542e80a086d96da384feb6afe121) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #31411 from MaxGekk/fix-drop-partition-v2-3.1. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 February 2021, 09:18:24 UTC
3b5956c [SPARK-34270][SS] Combine StateStoreMetrics should not override StateStoreCustomMetric This patch proposes to sum up custom metric values instead of taking arbitrary one when combining `StateStoreMetrics`. For stateful join in structured streaming, we need to combine `StateStoreMetrics` from both left and right side. Currently we simply take arbitrary one from custom metrics with same name from left and right. By doing this we miss half of metric number. Yes, this corrects metrics collected for stateful join. Unit test. Closes #31369 from viirya/SPARK-34270. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 50d14c98c3828d8d9cc62ebc61ad4d20398ee6c6) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 January 2021, 04:52:16 UTC
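A rough sketch of the combination rule the patch moves to, using assumed shapes rather than Spark's actual classes: custom metric values with the same name are summed instead of one side being kept arbitrarily.
```scala
case class CustomMetricSketch(name: String)
case class MetricsSketch(numKeys: Long, custom: Map[CustomMetricSketch, Long])

def combine(left: MetricsSketch, right: MetricsSketch): MetricsSketch = {
  val allNames = left.custom.keySet ++ right.custom.keySet
  val summed = allNames.map { m =>
    m -> (left.custom.getOrElse(m, 0L) + right.custom.getOrElse(m, 0L))
  }.toMap
  MetricsSketch(left.numKeys + right.numKeys, summed)
}
```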
d4ca766 [SPARK-34154][YARN][FOLLOWUP] Fix flaky LocalityPlacementStrategySuite test ### What changes were proposed in this pull request? Fixing the flaky `handle large number of containers and tasks (SPARK-18750)` by avoiding to use `DNSToSwitchMapping` as in some situation DNS lookup could be extremely slow. ### Why are the changes needed? After https://github.com/apache/spark/pull/31363 was merged the flaky `handle large number of containers and tasks (SPARK-18750)` test failed again in some other PRs but now we have the exact place where the test is stuck. It is in the DNS lookup: ``` [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (30 seconds, 4 milliseconds) [info] Failed with an exception or a timeout at thread join: [info] [info] java.lang.RuntimeException: Timeout at waiting for thread to stop (its stack trace is added to the exception) [info] at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) [info] at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929) [info] at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324) [info] at java.net.InetAddress.getAllByName0(InetAddress.java:1277) [info] at java.net.InetAddress.getAllByName(InetAddress.java:1193) [info] at java.net.InetAddress.getAllByName(InetAddress.java:1127) [info] at java.net.InetAddress.getByName(InetAddress.java:1077) [info] at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:568) [info] at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:585) [info] at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109) [info] at org.apache.spark.deploy.yarn.SparkRackResolver.coreResolve(SparkRackResolver.scala:75) [info] at org.apache.spark.deploy.yarn.SparkRackResolver.resolve(SparkRackResolver.scala:66) [info] at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.$anonfun$localityOfRequestedContainers$3(LocalityPreferredContainerPlacementStrategy.scala:142) [info] at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy$$Lambda$658/1080992036.apply$mcVI$sp(Unknown Source) [info] at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) [info] at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.localityOfRequestedContainers(LocalityPreferredContainerPlacementStrategy.scala:138) [info] at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.org$apache$spark$deploy$yarn$LocalityPlacementStrategySuite$$runTest(LocalityPlacementStrategySuite.scala:94) [info] at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite$$anon$1.run(LocalityPlacementStrategySuite.scala:40) [info] at java.lang.Thread.run(Thread.java:748) (LocalityPlacementStrategySuite.scala:61) ... ``` This could be because of the DNS servers used by those build machines are not configured to handle IPv6 queries and the client has to wait for the IPv6 query to timeout before falling back to IPv4. This even make the tests more consistent. As when a single host was given to lookup via `resolve(hostName: String)` it gave a different answer from calling `resolve(hostNames: Seq[String])` with a `Seq` containing that single host. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #31397 from attilapiros/SPARK-34154-2nd. 
Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit d3f049cbc274ee64bb9b56d6addba4f2cb8f1f0a) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 29 January 2021, 14:55:04 UTC
47f2372 [SPARK-33163][SQL][TESTS][FOLLOWUP] Fix the test for the parquet metadata key 'org.apache.spark.legacyDateTime' ### What changes were proposed in this pull request? 1. Test both date and timestamp column types 2. Write the timestamp as the `TIMESTAMP_MICROS` logical type 3. Change the timestamp value to `'1000-01-01 01:02:03'` to check exception throwing. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite" ``` Closes #31396 from MaxGekk/parquet-test-metakey-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 588ddcdf22fccec2ea3775d17ac3d19cd5328eb5) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 29 January 2021, 13:25:14 UTC
4c65231 [SPARK-34144][SQL] Exception thrown when trying to write LocalDate and Instant values to a JDBC relation ### What changes were proposed in this pull request? When writing rows to a table only the old date time API types are handled in org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#makeSetter. If the new API is used (spark.sql.datetime.java8API.enabled=true) casting Instant and LocalDate to Timestamp and Date respectively fails. The proposed change is to handle Instant and LocalDate values and transform them to Timestamp and Date. ### Why are the changes needed? In the current state writing Instant or LocalDate values to a table fails with something like: Caused by: java.lang.ClassCastException: class java.time.LocalDate cannot be cast to class java.sql.Date (java.time.LocalDate is in module java.base of loader 'bootstrap'; java.sql.Date is in module java.sql of loader 'platform') at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$11(JdbcUtils.scala:573) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeSetter$11$adapted(JdbcUtils.scala:572) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:678) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:858) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:856) at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:994) at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:994) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) ... 3 more ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added tests Closes #31264 from cristichircu/SPARK-34144. Lead-authored-by: Chircu <chircu@arezzosky.com> Co-authored-by: Cristi Chircu <cristian.chircu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 520e5d2ab8c25e99c6149fb752b18a1f65bd9fa0) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 29 January 2021, 08:48:25 UTC
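A hedged sketch of the added handling, not Spark's actual setter code: when `spark.sql.datetime.java8API.enabled=true`, values arrive as `LocalDate`/`Instant` and need converting before being bound to the JDBC statement.
```scala
import java.sql.{Date, PreparedStatement, Timestamp}
import java.time.{Instant, LocalDate}

def setDateValue(stmt: PreparedStatement, pos: Int, value: Any): Unit = value match {
  case ld: LocalDate => stmt.setDate(pos, Date.valueOf(ld)) // new API type
  case d: Date       => stmt.setDate(pos, d)                // old API type
}

def setTimestampValue(stmt: PreparedStatement, pos: Int, value: Any): Unit = value match {
  case i: Instant    => stmt.setTimestamp(pos, Timestamp.from(i)) // new API type
  case t: Timestamp  => stmt.setTimestamp(pos, t)                 // old API type
}
```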
80f2bb8 [SPARK-34273][CORE] Do not reregister BlockManager when SparkContext is stopped ### What changes were proposed in this pull request? This PR aims to prevent `HeartbeatReceiver` from asking `Executor` to re-register the block manager when the SparkContext is already stopped. ### Why are the changes needed? Currently, `HeartbeatReceiver` blindly asks for re-registration on each new heartbeat message. However, when SparkContext is stopped, we don't need to re-register a new block manager. Re-registration causes unnecessary executor logs and a delay on job termination. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the newly added test case. Closes #31373 from dongjoon-hyun/SPARK-34273. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bc41c5a0e598e6b697ed61c33e1bea629dabfc57) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 January 2021, 21:06:53 UTC
391ba89 [SPARK-34281][K8S] Promote spark.kubernetes.executor.podNamePrefix to the public conf ### What changes were proposed in this pull request? This PR aims to remove `internal()` from `spark.kubernetes.executor.podNamePrefix` in order to make the configuration public. ### Why are the changes needed? In line with K8s GA, this will allow users to officially control the full executor pod names. This is useful when we want a custom executor pod name pattern independent of the app name. ### Does this PR introduce _any_ user-facing change? No, this has been there since Apache Spark 2.3.0. ### How was this patch tested? N/A. Closes #31386 from dongjoon-hyun/SPARK-34281. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 78244bafe858370deb6638b2a8d7206a195dbe52) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 January 2021, 21:01:30 UTC
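For illustration, a hedged sketch of setting the now-public conf; the master URL, container image, and prefix value are placeholders, not from the commit:

```scala
import org.apache.spark.SparkConf

// Sketch only: values below are examples.
val conf = new SparkConf()
  .setAppName("pod-name-prefix-demo")
  .setMaster("k8s://https://kube-apiserver.example.com:6443")        // assumed API server
  .set("spark.kubernetes.container.image", "example/spark:3.1.1")    // assumed image
  .set("spark.kubernetes.executor.podNamePrefix", "billing-nightly") // custom prefix
// Executor pods would then be named like billing-nightly-exec-1, billing-nightly-exec-2, ...
```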
697bdca [SPARK-32866][K8S] Fix docker cross-build ### What changes were proposed in this pull request? Add `--push` to the docker script for buildx ### Why are the changes needed? Doing a separate docker push with `buildx` images doesn't work. ### Does this PR introduce _any_ user-facing change? Automatically pushes work when cross-building. ### How was this patch tested? cross-built docker containers Closes #31299 from holdenk/SPARK-32866-docker-buildx-update. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com> (cherry picked from commit 497f599a37d2250c14a7bd2699bd3ac65bd08a58) Signed-off-by: Holden Karau <hkarau@apple.com> 28 January 2021, 19:58:27 UTC
451baad [SPARK-34262][SQL][3.1] Refresh cached data of v1 table in `ALTER TABLE .. SET LOCATION` ### What changes were proposed in this pull request? Invoke `CatalogImpl.refreshTable()` in v1 implementation of the `ALTER TABLE .. SET LOCATION` command to refresh cached table data. ### Why are the changes needed? The example below portraits the issue: - Create a source table: ```sql spark-sql> CREATE TABLE src_tbl (c0 int, part int) USING hive PARTITIONED BY (part); spark-sql> INSERT INTO src_tbl PARTITION (part=0) SELECT 0; spark-sql> SHOW TABLE EXTENDED LIKE 'src_tbl' PARTITION (part=0); default src_tbl false Partition Values: [part=0] Location: file:/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0 ... ``` - Set new location for the empty partition (part=0): ```sql spark-sql> CREATE TABLE dst_tbl (c0 int, part int) USING hive PARTITIONED BY (part); spark-sql> ALTER TABLE dst_tbl ADD PARTITION (part=0); spark-sql> INSERT INTO dst_tbl PARTITION (part=1) SELECT 1; spark-sql> CACHE TABLE dst_tbl; spark-sql> SELECT * FROM dst_tbl; 1 1 spark-sql> ALTER TABLE dst_tbl PARTITION (part=0) SET LOCATION '/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0'; spark-sql> SELECT * FROM dst_tbl; 1 1 ``` The last query does not return new loaded data. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, the example above works correctly: ```sql spark-sql> ALTER TABLE dst_tbl PARTITION (part=0) SET LOCATION '/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0'; spark-sql> SELECT * FROM dst_tbl; 0 0 1 1 ``` ### How was this patch tested? Added new test to `org.apache.spark.sql.hive.CachedTableSuite`: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: HyukjinKwon <gurwls223apache.org> (cherry picked from commit d242166b8fd741fdd46d9048f847b2fd6e1d07b1) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #31379 from MaxGekk/refresh-cache-set-location-3.1. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 28 January 2021, 09:59:34 UTC
a9048fd [SPARK-34275][CORE][SQL][MLLIB] Replaces filter and size with count ### What changes were proposed in this pull request? Use `count` to simplify the `filter + size` (or `length`) operation; it is semantically equivalent but looks simpler. **Before** ``` seq.filter(p).size ``` **After** ``` seq.count(p) ``` ### Why are the changes needed? Code simplification. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #31374 from LuciferYang/SPARK-34275. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 15445a8d9e8dd8660aa668a5b82ba2cbc6a5a233) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 28 January 2021, 06:27:24 UTC
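A small plain-Scala illustration of the pattern being replaced (example data only):

```scala
val xs = Seq(1, 2, 3, 4, 5)

// Before: builds an intermediate collection just to measure its size
val before = xs.filter(_ % 2 == 0).size

// After: counts matching elements directly, same result
val after = xs.count(_ % 2 == 0)

assert(before == after) // both are 2
```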
d147799 [SPARK-34268][SQL][DOCS] Correct the documentation of the concat_ws function ### What changes were proposed in this pull request? This PR corrects the documentation of the `concat_ws` function. ### Why are the changes needed? `concat_ws` doesn't require any str or array(str) arguments beyond the separator: ``` scala> sql("""select concat_ws("s")""").show +------------+ |concat_ws(s)| +------------+ | | +------------+ ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` build/sbt "sql/testOnly *.ExpressionInfoSuite" ``` Closes #31370 from wangyum/SPARK-34268. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 01d11da84ef7c3abbfd1072c421505589ac1e9b2) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 28 January 2021, 05:06:52 UTC
0c8f111 [SPARK-34260][SQL] Fix UnresolvedException when creating temp view twice ### What changes were proposed in this pull request? In PR #30140, it will compare new and old plans when replacing view and uncache data if the view has changed. But the compared new plan is not analyzed which will cause `UnresolvedException` when calling `sameResult`. So in this PR, we use the analyzed plan to compare to fix this problem. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? newly added tests Closes #31360 from linhongliu-db/SPARK-34260. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit cf1400c8ddc3bd534455227c40e5fb53ecf9cdee) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 January 2021, 05:00:48 UTC
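A hedged, minimal sketch of the kind of scenario described above (not the exact reproduction from the PR): a cached temporary view is replaced a second time, which exercises the old-vs-new plan comparison.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("CREATE OR REPLACE TEMPORARY VIEW v AS SELECT 1 AS c")
spark.sql("CACHE TABLE v")
// Replacing the cached view triggers the comparison between the old and new plans.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW v AS SELECT 2 AS c")
spark.sql("SELECT * FROM v").show()
```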
ed3c479 Revert "[SPARK-34233][SQL] FIX NPE for char padding in binary comparison" This reverts commit cf21e8898ab484a833b6696d0cf4bb0c871e7ff6. 28 January 2021, 04:07:08 UTC
4ca628e [SPARK-33867][SQL] Instant and LocalDate values aren't handled when generating SQL queries ### What changes were proposed in this pull request? When generating SQL queries only the old date time API types are handled for values in org.apache.spark.sql.jdbc.JdbcDialect#compileValue. If the new API is used (spark.sql.datetime.java8API.enabled=true) Instant and LocalDate values are not quoted and errors are thrown. The change proposed is to handle Instant and LocalDate values the same way that Timestamp and Date are. ### Why are the changes needed? In the current state if an Instant is used in a filter, an exception will be thrown. Ex (dataset was read from PostgreSQL): dataset.filter(current_timestamp().gt(col(VALID_FROM))) Stacktrace (the T11 is from an instant formatted like yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z'): Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11"Caused by: org.postgresql.util.PSQLException: ERROR: syntax error at or near "T11" Position: 285 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2103) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1836) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:512) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:388) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:273) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:304) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test added Closes #31148 from cristichircu/SPARK-33867. Lead-authored-by: Chircu <chircu@arezzosky.com> Co-authored-by: Cristi Chircu <chircu@arezzosky.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 829f118f98ef0732c8dd784f06298465e47ee3a0) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 28 January 2021, 02:58:45 UTC
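A sketch of the read-side scenario, assuming a local PostgreSQL source with placeholder table and column names; the point is the `java.time` literal in a pushed-down filter:

```scala
import java.time.Instant
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.datetime.java8API.enabled", "true")
  .getOrCreate()

val events = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb") // placeholder
  .option("dbtable", "events")                              // placeholder
  .option("user", "test")
  .option("password", "test")
  .load()

// The Instant literal ends up in the generated WHERE clause; before this fix,
// JdbcDialect.compileValue did not quote java.time values, producing invalid SQL.
events.filter(col("valid_from") < lit(Instant.parse("2021-01-01T00:00:00Z"))).show()
```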
6f1bd9b [SPARK-34154][YARN] Extend LocalityPlacementStrategySuite's test with a timeout ### What changes were proposed in this pull request? This PR extends the `handle large number of containers and tasks (SPARK-18750)` test with a time limit and in case of timeout it saves the stack trace of the running thread to provide extra information about the reason why it got stuck. ### Why are the changes needed? This is a flaky test which sometime runs for hours without stopping. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I checked it with a temporary code change: by adding a `Thread.sleep` to `LocalityPreferredContainerPlacementStrategy#expectedHostToContainerCount`. The stack trace showed the correct method: ``` [info] LocalityPlacementStrategySuite: [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (30 seconds, 26 milliseconds) [info] Failed with an exception or a timeout at thread join: [info] [info] java.lang.RuntimeException: Timeout at waiting for thread to stop (its stack trace is added to the exception) [info] at java.lang.Thread.sleep(Native Method) [info] at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.$anonfun$expectedHostToContainerCount$1(LocalityPreferredContainerPlacementStrategy.scala:198) [info] at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy$$Lambda$281/381161906.apply(Unknown Source) [info] at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) [info] at scala.collection.TraversableLike$$Lambda$16/322836221.apply(Unknown Source) [info] at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234) [info] at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) [info] at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) [info] at scala.collection.TraversableLike.map(TraversableLike.scala:238) [info] at scala.collection.TraversableLike.map$(TraversableLike.scala:231) [info] at scala.collection.AbstractTraversable.map(Traversable.scala:108) [info] at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.expectedHostToContainerCount(LocalityPreferredContainerPlacementStrategy.scala:188) [info] at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.localityOfRequestedContainers(LocalityPreferredContainerPlacementStrategy.scala:112) [info] at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.org$apache$spark$deploy$yarn$LocalityPlacementStrategySuite$$runTest(LocalityPlacementStrategySuite.scala:94) [info] at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite$$anon$1.run(LocalityPlacementStrategySuite.scala:40) [info] at java.lang.Thread.run(Thread.java:748) (LocalityPlacementStrategySuite.scala:61) ... ``` Closes #31363 from attilapiros/SPARK-34154. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 0dedf24cd0359b36f655adbf22bd5048b7288ba5) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 27 January 2021, 23:04:39 UTC
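A generic sketch of the technique (not the suite's actual code): run the body in a separate thread, join with a timeout, and attach the still-running thread's stack trace to the failure instead of hanging the build.

```scala
def runWithTimeout(timeoutMs: Long)(body: => Unit): Unit = {
  var thrown: Option[Throwable] = None
  val t = new Thread(new Runnable {
    override def run(): Unit = try body catch { case e: Throwable => thrown = Some(e) }
  })
  t.setDaemon(true)
  t.start()
  t.join(timeoutMs)
  if (t.isAlive) {
    // The thread is stuck: report where it is rather than waiting forever.
    val trace = t.getStackTrace.mkString("\n  at ", "\n  at ", "")
    throw new RuntimeException(s"Timeout waiting for thread to stop$trace")
  }
  thrown.foreach(e => throw e)
}

// Usage: fail after 30 seconds instead of running for hours.
runWithTimeout(30000) {
  // ... the placement computation under test ...
}
```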
46dce63 [SPARK-34193][CORE] TorrentBroadcast block manager decommissioning race fix ### What changes were proposed in this pull request? Allow broadcast blocks to be put during decommissioning since migrations don't apply to them and they may be stored as part of job exec. ### Why are the changes needed? Potential race condition. ### Does this PR introduce _any_ user-facing change? Removal of race condition. ### How was this patch tested? New unit test. Closes #31298 from holdenk/SPARK-34193-torrentbroadcast-blockmanager-decommissioning-potential-race-condition. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 9d83d62f142ba89518194f176bb81adadc28951b) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 27 January 2021, 21:16:40 UTC
609af85 [SPARK-34221][WEBUI] Ensure if a stage fails in the UI page, the corresponding error message can be displayed correctly ### What changes were proposed in this pull request? Ensure that if a stage fails in the UI page, the corresponding error message can be displayed correctly. ### Why are the changes needed? The error message is not handled properly in the JavaScript. If 'at' does not exist in the message, the error message on the page will be blank. I made two changes: 1. `msg.indexOf("at")` => `msg.indexOf("\n")` ![image](https://user-images.githubusercontent.com/52202080/105663531-7362cb00-5f0d-11eb-87fd-008ed65c33ca.png) As shown above, truncating at the 'at' position results in a strange abstract of the error message. If there is a `\n`, it is more reasonable to truncate at the '\n' position. 2. If the `\n` does not exist, check whether the msg is longer than 100 characters. If true, truncate the display to avoid an overly long error message. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test, shown below; it is just a JS change. Before the change: ![problem](https://user-images.githubusercontent.com/52202080/105712153-661cff00-5f54-11eb-80bf-e33c323c4e55.png) After the change: ![after modified](https://user-images.githubusercontent.com/52202080/105712180-6c12e000-5f54-11eb-8998-ff8bc8a0a503.png) Closes #31314 from akiyamaneko/error_message_display_empty. Authored-by: neko <echohlne@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f1bc37e6244e959f1d950c450010dd6024b6ba5f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 January 2021, 18:02:13 UTC
5a2eb64 [SPARK-34212][SQL][FOLLOWUP] Refine the behavior of reading parquet non-decimal fields as decimal ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/31319 . When reading parquet int/long as decimal, the behavior should be the same as reading int/long and then cast to the decimal type. This PR changes to the expected behavior. When reading parquet binary as decimal, we don't really know how to interpret the binary (it may from a string), and should fail. This PR changes to the expected behavior. ### Why are the changes needed? To make the behavior more sane. ### Does this PR introduce _any_ user-facing change? Yes, but it's a followup. ### How was this patch tested? updated test Closes #31357 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 2dbb7d5af8f498e49488cd8876bd3d0b083723b7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 January 2021, 17:34:47 UTC
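An illustrative sketch of the expected semantics described above; the path and schema are placeholders, and the behavior is shown as intended after the follow-up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val path = "/tmp/ints_parquet" // assumed scratch location
spark.range(5).selectExpr("id AS a").write.mode("overwrite").parquet(path)

// Reading an int/long column with a decimal read schema should behave like
// reading it as-is and then casting.
val viaReadSchema = spark.read.schema("a DECIMAL(10, 2)").parquet(path)
val viaCast = spark.read.parquet(path).selectExpr("CAST(a AS DECIMAL(10, 2)) AS a")

viaReadSchema.show()
viaCast.show()
```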
374aeef [SPARK-34231][AVRO][TEST] Make proper use of resource file within AvroSuite test case Change `AvroSuite."Ignore corrupt Avro file if flag IGNORE_CORRUPT_FILES"` to use `episodesAvro`, which is loaded as a resource using the classloader, instead of trying to read `episodes.avro` directly from a relative file path. This is the proper way to read resource files, and currently this test will fail when called from my IntelliJ IDE, though it will succeed when called from Maven/sbt, presumably due to different working directory handling. No, unit test only. Previous failure from IntelliJ: ``` Source 'src/test/resources/episodes.avro' does not exist java.io.FileNotFoundException: Source 'src/test/resources/episodes.avro' does not exist at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405) at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:1072) at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:1040) at org.apache.spark.sql.avro.AvroSuite.$anonfun$new$34(AvroSuite.scala:397) at org.apache.spark.sql.avro.AvroSuite.$anonfun$new$34$adapted(AvroSuite.scala:388) ``` Now it succeeds. Closes #31332 from xkrogen/xkrogen-SPARK-34231-avrosuite-testfix. Authored-by: Erik Krogen <xkrogen@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b2c104bd87361e5d85f7c227c60419af16b718f2) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 27 January 2021, 07:17:52 UTC
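A generic sketch of the resource-loading pattern referred to above; the destination path is a placeholder:

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}

// Resolve the test file through the classloader rather than a path relative to
// the working directory, so it is found from sbt, Maven, and IDEs alike.
val resourceUrl = Thread.currentThread().getContextClassLoader.getResource("episodes.avro")
require(resourceUrl != null, "resource not found on the test classpath")

val dest = Paths.get("/tmp/episodes-copy.avro") // placeholder destination
Files.copy(resourceUrl.openStream(), dest, StandardCopyOption.REPLACE_EXISTING)
```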
cf21e88 [SPARK-34233][SQL] FIX NPE for char padding in binary comparison ### What changes were proposed in this pull request? we need to check whether the `lit` is null before calling `numChars` ### Why are the changes needed? fix an obvious NPE bug ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #31336 from yaooqinn/SPARK-34233. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 764582c07a263ae0bef4a080a84a66be60d1aab9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 January 2021, 07:00:15 UTC
c9b0dd5 [SPARK-34236][SQL] Fix v2 Overwrite w/ null static partition raise Cannot translate expression to source filter: null ### What changes were proposed in this pull request? For v2 static partitions overwriting, we use `EqualTo ` to generate the `deleteExpr` This is not right for null partition values, and cause the problem like below because `ConstantFolding` converts it to lit(null) ```scala SPARK-34223: static partition with null raise NPE *** FAILED *** (19 milliseconds) [info] org.apache.spark.sql.AnalysisException: Cannot translate expression to source filter: null [info] at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.$anonfun$applyOrElse$1(V2Writes.scala:50) [info] at scala.collection.immutable.List.flatMap(List.scala:366) [info] at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:47) [info] at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:39) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317) [info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) ``` The right way is to use EqualNullSafe instead to delete the null partitions. ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? an original test to new place Closes #31339 from yaooqinn/SPARK-34236. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 91ca21d7006169d95940506b8de154b96b4fae20) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 January 2021, 04:06:13 UTC
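A small illustration of why `EqualNullSafe` is the right choice for a null partition value (column name and data are made up for the example): `===` against null never matches, while `<=>` treats two nulls as equal.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("a", null).toDF("part")

df.filter(col("part") === lit(null)).show() // matches nothing: the comparison yields NULL
df.filter(col("part") <=> lit(null)).show() // matches the row whose part IS NULL
```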
bcad88f [SPARK-34212][SQL] Fix incorrect decimal reading from Parquet files ### What changes were proposed in this pull request? This PR aims to the correctness issues during reading decimal values from Parquet files. - For **MR** code path, `ParquetRowConverter` can read Parquet's decimal values with the original precision and scale written in the corresponding footer. - For **Vectorized** code path, `VectorizedColumnReader` throws `SchemaColumnConvertNotSupportedException`. ### Why are the changes needed? Currently, Spark returns incorrect results when the Parquet file's decimal precision and scale are different from the Spark's schema. This happens when there is multiple files with different decimal schema or HiveMetastore has a new schema. **BEFORE (Simplified example for correctness)** ```scala scala> sql("SELECT 1.0 a").write.parquet("/tmp/decimal") scala> spark.read.schema("a DECIMAL(3,2)").parquet("/tmp/decimal").show +----+ | a| +----+ |0.10| +----+ ``` This works correctly in the other data sources, `ORC/JSON/CSV`, like the following. ```scala scala> sql("SELECT 1.0 a").write.orc("/tmp/decimal_orc") scala> spark.read.schema("a DECIMAL(3,2)").orc("/tmp/decimal_orc").show +----+ | a| +----+ |1.00| +----+ ``` **AFTER** 1. **Vectorized** path: Instead of incorrect result, we will raise an explicit exception. ```scala scala> spark.read.schema("a DECIMAL(3,2)").parquet("/tmp/decimal").show java.lang.UnsupportedOperationException: Schema evolution not supported. ``` 2. **MR** path (complex schema or explicit configuration): Spark returns correct results. ```scala scala> spark.read.schema("a DECIMAL(3,2), b DECIMAL(18, 3), c MAP<INT,INT>").parquet("/tmp/decimal").show +----+-------+--------+ | a| b| c| +----+-------+--------+ |1.00|100.000|{1 -> 2}| +----+-------+--------+ scala> spark.read.schema("a DECIMAL(3,2), b DECIMAL(18, 3), c MAP<INT,INT>").parquet("/tmp/decimal").printSchema root |-- a: decimal(3,2) (nullable = true) |-- b: decimal(18,3) (nullable = true) |-- c: map (nullable = true) | |-- key: integer | |-- value: integer (valueContainsNull = true) ``` ### Does this PR introduce _any_ user-facing change? Yes. This fixes the correctness issue. ### How was this patch tested? Pass with the newly added test case. Closes #31319 from dongjoon-hyun/SPARK-34212. Lead-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit dbf051c50a17d644ecc1823e96eede4a5a6437fd) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 January 2021, 23:13:56 UTC
c46a3a9 [SPARK-34052][FOLLOWUP][DOC] Add document in SQL migration guide ### What changes were proposed in this pull request? Add document for the behavior change in SPARK-34052, in SQL migration guide. ### Why are the changes needed? Document behavior change for Spark users. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #31351 from sunchao/SPARK-34052-followup. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c2320a43c7b40c270232e6c0affcbbe01776af61) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 January 2021, 23:11:59 UTC
c2b22e3 [SPARK-34244][SQL] Remove the Scala function version of regexp_extract_all ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/27507 implements `regexp_extract_all` and added the Scala function version of it. According to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59, it seems reasonable to remove the Scala function version if we follow that guidance, even though regexp_extract_all itself is very useful. ### Why are the changes needed? `regexp_extract_all` is less common. ### Does this PR introduce _any_ user-facing change? 'No'. `regexp_extract_all` was added in Spark 3.1.0 which isn't released yet. ### How was this patch tested? Jenkins test. Closes #31346 from beliefer/SPARK-24884-followup. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 99b6af2dd2f3d1dfad6b3f9110657662afa45069) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 January 2021, 21:53:06 UTC
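With the Scala wrapper removed, the SQL function remains reachable through `expr`/`selectExpr`; a small hedged example with made-up data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("100-200, 300-400").toDF("s")

// Extract the first capture group of every match; expected result: [100, 300]
df.select(expr("""regexp_extract_all(s, '(\\d+)-(\\d+)', 1)""").as("nums")).show(false)
```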
8cf02b3 [SPARK-34235][SS] Make spark.sql.hive as a private package Follow the comment https://github.com/apache/spark/pull/31271#discussion_r562598983: - Remove the API tag `Unstable` for `HiveSessionStateBuilder` - Add document for spark.sql.hive package to emphasize it's a private package Follow the rule for a private package. No. Doc change only. Closes #31321 from xuanyuanking/SPARK-34185-follow. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 0a1a029622eb49e7943f87cfae6942d09bc121a6) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 January 2021, 10:10:23 UTC
545ab05 Revert "[SPARK-34235][SS] Make spark.sql.hive as a private package" This reverts commit dca76206c05702a43ffdd780fde82dd7295d1537. 26 January 2021, 10:09:52 UTC
82da778 [SPARK-34224][CORE][SQL][SS][DSTREAM][YARN][TEST][EXAMPLES] Ensure all resource opened by `Source.fromXXX` are closed ### What changes were proposed in this pull request? Using a function like `.mkString` or `.getLines` directly on a `scala.io.Source` opened by `fromFile`, `fromURL`, or `fromURI` will leak the underlying file handle. This PR uses the `Utils.tryWithResource` method to wrap the `BufferedSource` and ensure these `BufferedSource`s are closed. ### Why are the changes needed? Avoid file handle leaks. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #31323 from LuciferYang/source-not-closed. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 8999e8805d7e9786cdb5b96575b264f922c232a2) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 January 2021, 10:06:50 UTC
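`Utils.tryWithResource` is Spark-internal; a plain-Scala sketch of the same leak-free shape looks like this:

```scala
import scala.io.Source

// Close the BufferedSource even if reading throws, instead of relying on GC.
def readFileFully(path: String): String = {
  val source = Source.fromFile(path)
  try source.mkString finally source.close()
}
```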
dca7620 [SPARK-34235][SS] Make spark.sql.hive as a private package Follow the comment https://github.com/apache/spark/pull/31271#discussion_r562598983: - Remove the API tag `Unstable` for `HiveSessionStateBuilder` - Add document for spark.sql.hive package to emphasize it's a private package Follow the rule for a private package. No. Doc change only. Closes #31321 from xuanyuanking/SPARK-34185-follow. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 0a1a029622eb49e7943f87cfae6942d09bc121a6) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 January 2021, 08:15:07 UTC
6712e28 [SPARK-34232][CORE] Redact SparkListenerEnvironmentUpdate event in log ### What changes were proposed in this pull request? Redact event SparkListenerEnvironmentUpdate in log when its processing time exceeded logSlowEventThreshold ### Why are the changes needed? Credentials could be exposed when its processing time exceeded logSlowEventThreshold ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested Closes #31335 from warrenzhu25/34232. Authored-by: Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 68b765e6b800ea7753cbf4ba5a2a5d2749eb2a57) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 January 2021, 08:11:05 UTC
f4340c4 [SPARK-34229][SQL] Avro should read decimal values with the file schema ### What changes were proposed in this pull request? This PR aims to fix Avro data source to use the decimal precision and scale of file schema. ### Why are the changes needed? The decimal value should be interpreted with its original precision and scale. Otherwise, it returns incorrect result like the following. The schema mismatch happens when we use `userSpecifiedSchema` or there are multiple files with inconsistent schema or HiveMetastore schema is updated by the user. ```scala scala> sql("SELECT 3.14 a").write.format("avro").save("/tmp/avro") scala> spark.read.schema("a DECIMAL(4, 3)").format("avro").load("/tmp/avro").show +-----+ | a| +-----+ |0.314| +-----+ ``` ### Does this PR introduce _any_ user-facing change? Yes, this will return correct result. ### How was this patch tested? Pass the CI with the newly added test case. Closes #31329 from dongjoon-hyun/SPARK-34229. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 7d09eac1ccb6a14a36fce30ae7cda575c29e1974) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 January 2021, 05:24:20 UTC
db4ff7e [SPARK-32852][SQL][FOLLOW_UP] Add notice about keep hive version consistence when config hive jars location ### What changes were proposed in this pull request? Add a notice about keeping the Hive version consistent when configuring the Hive jars location. With PR #29881, if we don't keep the Hive version consistent, we will get the error below. ``` Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.8 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.8. ``` ![image](https://user-images.githubusercontent.com/46485123/105795169-512d8380-5fc7-11eb-97c3-0259a0d2aa58.png) ### Why are the changes needed? Make the config doc more detailed ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed Closes #31317 from AngersZhuuuu/SPARK-32852-followup. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 7bd4165c115509aec6e143f00c58b8c6083e9900) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 January 2021, 04:40:44 UTC
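A hedged sketch of keeping the two settings consistent, as the notice advises; the metastore version and jar path below are placeholders for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "1.2.1")              // must match the jars below
  .config("spark.sql.hive.metastore.jars", "/opt/hive-1.2.1/lib/*") // jars of that same version
  .getOrCreate()
```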
f2ab524 [SPARK-34223][SQL] FIX NPE for static partition with null in InsertIntoHadoopFsRelationCommand ### What changes were proposed in this pull request? with a simple case, the null will be passed to InsertIntoHadoopFsRelationCommand blindly, we should avoid the npe ```scala test("NPE") { withTable("t") { sql(s"CREATE TABLE t(i STRING, c string) USING $format PARTITIONED BY (c)") sql("INSERT OVERWRITE t PARTITION (c=null) VALUES ('1')") checkAnswer(spark.table("t"), Row("1", null)) } } ``` ```logtalk java.lang.NullPointerException at scala.collection.immutable.StringOps$.length(StringOps.scala:51) at scala.collection.immutable.StringOps.length(StringOps.scala:51) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:35) at scala.collection.IndexedSeqOptimized.foreach at scala.collection.immutable.StringOps.foreach(StringOps.scala:33) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.escapePathName(ExternalCatalogUtils.scala:69) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.orig-s0.0000030000-r30676-expand-or-complete(InsertIntoHadoopFsRelationCommand.scala:231) ``` ### Why are the changes needed? a bug fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #31320 from yaooqinn/SPARK-34223. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b3915ddd919fac11254084a2b138ad730fa8e5b0) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 January 2021, 04:06:17 UTC
4cf94f3 [SPARK-31768][ML][FOLLOWUP] add getMetrics in Evaluators: cleanup ### What changes were proposed in this pull request? 1, make `silhouette` a method; 2, change return type of `setDistanceMeasure` to `this.type`; ### Why are the changes needed? see comments in https://github.com/apache/spark/pull/28590 ### Does this PR introduce _any_ user-facing change? No, 3.1 has not been released ### How was this patch tested? existing testsuites Closes #31334 from zhengruifeng/31768-followup. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> (cherry picked from commit cb37c962bec25083a67d65387ed88e7d4ee556ca) Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 26 January 2021, 03:58:14 UTC
05d5a2f [SPARK-34203][SQL][TESTS][3.1][FOLLOWUP] Fix null partition values UT failure ### What changes were proposed in this pull request? Forward port changes in tests from https://github.com/apache/spark/pull/31326. ### Why are the changes needed? This fixes a test failure. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suite: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite" ``` Closes #31331 from MaxGekk/insert-overwrite-null-part-3.1. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 January 2021, 22:43:38 UTC
d35b504 [SPARK-34192][SQL] Move char padding to write side and remove length check on read side too On the read-side, the char length check and padding bring issues to CBO and predicate pushdown and other issues to the catalyst. This PR reverts 6da5cdf1dbfc35cee0ce32aa9e44c0b4187373d9 that added read side length check) so that we only do length check for the write side, and data sources/vendors are responsible to enforce the char/varchar constraints for data import operations like ADD PARTITION. It doesn't make sense for Spark to report errors on the read-side if the data is already dirty. This PR also moves the char padding to the write-side, so that it 1) avoids read side issues like CBO and filter pushdown. 2) the data source can preserve char type semantic better even if it's read by systems other than Spark. fix perf regression when tables have char/varchar type columns closes #31278 yes, spark will not raise error for oversized char/varchar values in read side modified ut the dropped read side benchmark ``` ================================================================================================ Char Varchar Read Side Perf w/o Tailing Spaces ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 20 1564 1573 9 63.9 15.6 1.0X read char with length 20 1532 1551 18 65.3 15.3 1.0X read varchar with length 20 1520 1531 13 65.8 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 40: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 40 1573 1613 41 63.6 15.7 1.0X read char with length 40 1575 1577 2 63.5 15.7 1.0X read varchar with length 40 1568 1576 11 63.8 15.7 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 60: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 60 1526 1540 23 65.5 15.3 1.0X read char with length 60 1514 1539 23 66.0 15.1 1.0X read varchar with length 60 1486 1497 10 67.3 14.9 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 80: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 80 1531 1542 19 65.3 15.3 1.0X read char with length 80 1514 1529 15 66.0 15.1 1.0X read varchar with length 80 1524 1565 42 65.6 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 100 1597 1623 25 62.6 
16.0 1.0X read char with length 100 1499 1512 16 66.7 15.0 1.1X read varchar with length 100 1517 1524 8 65.9 15.2 1.1X ================================================================================================ Char Varchar Read Side Perf w/ Tailing Spaces ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 20 1524 1526 1 65.6 15.2 1.0X read char with length 20 1532 1537 9 65.3 15.3 1.0X read varchar with length 20 1520 1532 15 65.8 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 40: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 40 1556 1580 32 64.3 15.6 1.0X read char with length 40 1600 1611 17 62.5 16.0 1.0X read varchar with length 40 1648 1716 88 60.7 16.5 0.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 60: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 60 1504 1524 20 66.5 15.0 1.0X read char with length 60 1509 1512 3 66.2 15.1 1.0X read varchar with length 60 1519 1535 21 65.8 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 80: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 80 1640 1652 17 61.0 16.4 1.0X read char with length 80 1625 1666 35 61.5 16.3 1.0X read varchar with length 80 1590 1605 13 62.9 15.9 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 100 1622 1628 5 61.6 16.2 1.0X read char with length 100 1614 1646 30 62.0 16.1 1.0X read varchar with length 100 1594 1606 11 62.7 15.9 1.0X ``` Closes #31281 from yaooqinn/SPARK-34192. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d1177b52304217f4cb86506fd1887ec98879ed16) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 January 2021, 18:30:10 UTC
b1a2466 [SPARK-34203][SQL] Convert `null` partition values to `__HIVE_DEFAULT_PARTITION__` in v1 `In-Memory` catalog In the PR, I propose to convert `null` partition values to `"__HIVE_DEFAULT_PARTITION__"` before storing in the `In-Memory` catalog internally. Currently, the `In-Memory` catalog maintains null partitions as `"__HIVE_DEFAULT_PARTITION__"` in file system but as `null` values in memory that could cause some issues like in SPARK-34203. `InMemoryCatalog` stores partitions in the file system in the Hive compatible form, for instance, it converts the `null` partition value to `"__HIVE_DEFAULT_PARTITION__"` but at the same time it keeps null as is internally. That causes an issue demonstrated by the example below: ``` $ ./bin/spark-shell -c spark.sql.catalogImplementation=in-memory ``` ```scala scala> spark.conf.get("spark.sql.catalogImplementation") res0: String = in-memory scala> sql("CREATE TABLE tbl (col1 INT, p1 STRING) USING parquet PARTITIONED BY (p1)") res1: org.apache.spark.sql.DataFrame = [] scala> sql("INSERT OVERWRITE TABLE tbl VALUES (0, null)") res2: org.apache.spark.sql.DataFrame = [] scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)") org.apache.spark.sql.catalyst.analysis.NoSuchPartitionsException: The following partitions not found in table 'tbl' database 'default': Map(p1 -> null) at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.dropPartitions(InMemoryCatalog.scala:440) ``` Yes. After the changes, `ALTER TABLE .. DROP PARTITION` can drop the `null` partition in `In-Memory` catalog: ```scala scala> spark.table("tbl").show(false) +----+----+ |col1|p1 | +----+----+ |0 |null| +----+----+ scala> sql("ALTER TABLE tbl DROP PARTITION (p1 = null)") res4: org.apache.spark.sql.DataFrame = [] scala> spark.table("tbl").show(false) +----+---+ |col1|p1 | +----+---+ +----+---+ ``` Added new test to `AlterTableDropPartitionSuiteBase`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #31322 from MaxGekk/insert-overwrite-null-part. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 January 2021, 15:44:13 UTC
63f07c7 [SPARK-33726][SQL] Fix for Duplicate field names during Aggregation ### What changes were proposed in this pull request? The `RowBasedKeyValueBatch` has two different implementations depending on whether the aggregation key and value use only fixed-length data types (`FixedLengthRowBasedKeyValueBatch`) or not (`VariableLengthRowBasedKeyValueBatch`). Before this PR, the decision about which implementation to use was made by accessing the schema fields by their name. But if two fields have the same name, one with a variable-length type and the other with a fixed-length type (and all the other fields have fixed-length types), a bad decision could be made. When `FixedLengthRowBasedKeyValueBatch` is chosen but there is a variable-length field, an aggregation function could compute with invalid values. This case is illustrated by the example used in the unit test: `with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x` where the 'x' column in the left side of the join is a Long but on the right side is a String. ### Why are the changes needed? Fixes the issue where aggregation over duplicate field names produces null values in the DataFrame. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT, tested manually on spark shell. Closes #30788 from yliou/SPARK-33726. Authored-by: yliou <yliou@berkeley.edu> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 512cacf7c61acb3282720192b875555543a1f3eb) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 January 2021, 06:53:37 UTC
7feb0ea [SPARK-34133][AVRO] Respect case sensitivity when performing Catalyst-to-Avro field matching ### What changes were proposed in this pull request? Make the field name matching between Avro and Catalyst schemas, on both the reader and writer paths, respect the global SQL settings for case sensitivity (i.e. case-insensitive by default). `AvroSerializer` and `AvroDeserializer` share a common utility in `AvroUtils` to search for an Avro field to match a given Catalyst field. ### Why are the changes needed? Spark SQL is normally case-insensitive (by default), but currently when `AvroSerializer` and `AvroDeserializer` perform matching between Catalyst schemas and Avro schemas, the matching is done in a case-sensitive manner. So for example the following will fail: ```scala val avroSchema = """ |{ | "type" : "record", | "name" : "test_schema", | "fields" : [ | {"name": "foo", "type": "int"}, | {"name": "BAR", "type": "int"} | ] |} """.stripMargin val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar") df.write.option("avroSchema", avroSchema).format("avro").save(savePath) ``` The same is true on the read path, if we assume `testAvro` has been written using the schema above, the below will fail to match the fields: ```scala df.read.schema(new StructType().add("FOO", IntegerType).add("bar", IntegerType)) .format("avro").load(testAvro) ``` ### Does this PR introduce _any_ user-facing change? When reading Avro data, or writing Avro data using the `avroSchema` option, field matching will be performed with case sensitivity respecting the global SQL settings. ### How was this patch tested? New tests added to `AvroSuite` to validate the case sensitivity logic in an end-to-end manner through the SQL engine. Closes #31201 from xkrogen/xkrogen-SPARK-34133-avro-serde-casesensitivity-errormessages. Authored-by: Erik Krogen <xkrogen@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9371ea8c7bd87b87c4d3dfb4c830c65643e48f54) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 January 2021, 04:54:53 UTC
3036dd3 [SPARK-34185][DOCS] Review and fix issues in API docs Compare the 3.1.1 API doc with the latest release version 3.0.1. Fix the following issues: - Add missing `Since` annotation for new APIs - Remove the leaking class/object in API doc Fix the issues in the Spark 3.1.1 release API docs. Yes, API doc changes. Manually test. Closes #31271 from xuanyuanking/SPARK-34185. Lead-authored-by: Yuanjian Li <yuanjian.li@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 59cbacaddfa05848c2237a573e11561c704554d0) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 25 January 2021, 02:40:02 UTC
2bca383 [SPARK-34217][INFRA] Fix Scala 2.12 release profile ### What changes were proposed in this pull request? This PR aims to fix the Scala 2.12 release profile in `release-build.sh`. ### Why are the changes needed? Since 3.0.0 (SPARK-26132), the release script is using `SCALA_2_11_PROFILES` to publish Scala 2.12 artifacts. After looking at the code, this is not a blocker because `-Pscala-2.11` is no-op in `branch-3.x`. In addition `scala-2.12` profile is enabled by default and it's an empty profile without any configuration technically. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is used by release manager only. Manually. This should land at `master/3.1/3.0`. Closes #31310 from dongjoon-hyun/SPARK-34217. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b8fc6f88b5cae23cc6783707c127f39b91fc0cfe) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 January 2021, 23:00:09 UTC
37e05ed [SPARK-34187][SS] Use available offset range obtained during polling when checking offset validation ### What changes were proposed in this pull request? This patch uses the available offset range obtained while polling Kafka records to do the offset validation check. ### Why are the changes needed? We have supported non-consecutive offsets for Kafka since 2.4.0. In `fetchRecord`, we do offset validation by checking if the offset is in the available offset range. But currently we obtain the latest available offset range to do the check. This is not correct, because the available offset range could change during the batch, so it can differ from the range observed when polling the records from Kafka. It is possible that an offset is valid when polling, but by the time we do the above check it is out of the latest available offset range. We would then wrongly consider it a data-loss case and fail the query or drop the record. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This should pass existing unit tests. This is hard to cover with a unit test as the Kafka producer and the consumer are asynchronous. Further, we would also need to make the offset fall out of the new available offset range. Closes #31275 from viirya/SPARK-34187. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit ab6c0e5d10e594318d6421ee6c099f90dbda3a02) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 January 2021, 19:51:05 UTC
57120b8 [SPARK-34052][SQL][3.1] store SQL text for a temp view created using "CACHE TABLE .. AS SELECT ..." This is a backport of #31107 to branch-3.1. ### What changes were proposed in this pull request? This passes original SQL text to `CacheTableCommand` command in DSv1 so that it will be stored instead of the analyzed logical plan, similar to `CREATE VIEW` command. In addition, this changes the behavior of dropping temporary view to also invalidate dependent caches in a cascade, when the config `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is false (which is the default value). ### Why are the changes needed? Currently, after creating a temporary view with `CACHE TABLE ... AS SELECT` command, the view can still be queried even after the source table is dropped or replaced (in v2). This can cause correctness issue. For instance, in the following: ```sql > CREATE TABLE t ...; > CACHE TABLE v AS SELECT * FROM t; > DROP TABLE t; > SELECT * FROM v; ``` The last select query still returns the old (and stale) result instead of fail. Note that the cache is already invalidated as part of dropping table `t`, but the temporary view `v` still exist. On the other hand, the following: ```sql > CREATE TABLE t ...; > CREATE TEMPORARY VIEW v AS SELECT * FROM t; > CACHE TABLE v; > DROP TABLE t; > SELECT * FROM v; ``` will throw "Table or view not found" error in the last select query. This is related to #30567 which aligns the behavior of temporary view and global view by storing the original SQL text for temporary view, as opposed to the analyzed logical plan. However, the PR only handles `CreateView` case but not the `CacheTableAsSelect` case. This also changes uncache logic and use cascade invalidation for temporary views created above. This is to align its behavior to how a permanent view is handled as of today, and also to avoid potential issues where a dependent view becomes invalid while its data is still kept in cache. ### Does this PR introduce _any_ user-facing change? Yes, now when `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is set to false (the default value), whenever a table/permanent view/temp view that a cached view depends on is dropped, the cached view itself will become invalid during analysis, i.e., user will get "Table or view not found" error. In addition, when the dependent is a temp view in the previous case, the cache itself will also be invalidated. ### How was this patch tested? Added new test cases. Also modified and enhanced some existing related tests. Closes #31300 from sunchao/SPARK-34052-branch-3.1. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 24 January 2021, 03:31:56 UTC
5d829b9 [SPARK-34213][SQL] Refresh cached data of v1 table in `LOAD DATA` Invoke `CatalogImpl.refreshTable()` instead of `SessionCatalog.refreshTable` in v1 implementation of the `LOAD DATA` command. `SessionCatalog.refreshTable` just refreshes metadata comparing to `CatalogImpl.refreshTable()` which refreshes cached table data as well. The example below portraits the issue: - Create a source table: ```sql spark-sql> CREATE TABLE src_tbl (c0 int, part int) USING hive PARTITIONED BY (part); spark-sql> INSERT INTO src_tbl PARTITION (part=0) SELECT 0; spark-sql> SHOW TABLE EXTENDED LIKE 'src_tbl' PARTITION (part=0); default src_tbl false Partition Values: [part=0] Location: file:/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0 ... ``` - Load data from the source table to a cached destination table: ```sql spark-sql> CREATE TABLE dst_tbl (c0 int, part int) USING hive PARTITIONED BY (part); spark-sql> INSERT INTO dst_tbl PARTITION (part=1) SELECT 1; spark-sql> CACHE TABLE dst_tbl; spark-sql> SELECT * FROM dst_tbl; 1 1 spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0); spark-sql> SELECT * FROM dst_tbl; 1 1 ``` The last query does not return new loaded data. Yes. After the changes, the example above works correctly: ```sql spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0); spark-sql> SELECT * FROM dst_tbl; 0 0 1 1 ``` Added new test to `org.apache.spark.sql.hive.CachedTableSuite`: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite" ``` Closes #31304 from MaxGekk/load-data-refresh-cache. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f8bf72ed5d1c25cb9068dc80d3996fcd5aade3ae) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2021, 23:50:33 UTC
0f7e4bc [SPARK-34202][SQL][TEST] Add ability to fetch spark release package from internal environment in HiveExternalCatalogVersionsSuite ### What changes were proposed in this pull request? `HiveExternalCatalogVersionsSuite` can't run in orgs internal environment where access to outside internet is not allowed because `HiveExternalCatalogVersionsSuite` will download spark release package from internet. Similar to SPARK-32998, this pr add 1 environment variables `SPARK_RELEASE_MIRROR` to let user can specify an accessible download address of spark release package and run `HiveExternalCatalogVersionsSuite` in orgs internal environment. ### Why are the changes needed? Let `HiveExternalCatalogVersionsSuite` can run in orgs internal environment without relying on external spark release download address. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test with and without env variables set in internal environment can't access internet. execute ``` mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -PhPhive -pl sql/hive -am -DskipTests mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -PhPhive -pl sql/hive -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite -Dtest=none ``` **Without env** ``` HiveExternalCatalogVersionsSuite: 19:50:35.123 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 3.0.1 from https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz: Network is unreachable (connect failed) 19:50:35.126 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 3.0.1 from https://dist.apache.org/repos/dist/release/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz: Network is unreachable (connect failed) org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** Exception encountered when invoking run on a nested suite - Unable to download Spark 3.0.1 (HiveExternalCatalogVersionsSuite.scala:125) Run completed in 2 seconds, 669 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 ``` **With env** ``` export SPARK_RELEASE_MIRROR=${spark-release.internal.com}/dist/release/ ``` ``` HiveExternalCatalogVersionsSuite - backward compatibility Run completed in 1 minute, 32 seconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #31294 from LuciferYang/SPARK-34202. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit e48a8ad1a20446fcaaee6750084faa273028df3d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2021, 16:03:08 UTC
2e9be48 [SPARK-34191][PYTHON][SQL] Add typing for udf overload ### What changes were proposed in this pull request? Added typing for the keyword-only, single-argument udf overload. ### Why are the changes needed? The intended use case is: ``` @udf(returnType="string") def f(x): ... ``` ### Does this PR introduce _any_ user-facing change? Yes, a new typing for udf is considered valid. ### How was this patch tested? Existing tests. Closes #31282 from pgrz/patch-1. Authored-by: pgrz <grzegorski.piotr@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 121eb0130eaaa3ed13b366bda236ce499fbc6b4e) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 22 January 2021, 12:19:34 UTC
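A slightly fuller sketch of the intended use case, showing the keyword-only call next to the positional form that was already typed (runtime behaviour is unchanged, only the type stubs are affected; the function names and bodies here are illustrative):

```python
from pyspark.sql.functions import udf

# Keyword-only, single-argument overload now accepted by the type stubs.
@udf(returnType="string")
def to_upper(s):
    return s.upper() if s is not None else None

# Positional form that was already covered.
@udf("string")
def to_lower(s):
    return s.lower() if s is not None else None
```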
2a89a8e [SPARK-34200][SQL] Ambiguous column reference should consider attribute availability ### What changes were proposed in this pull request? This is a long-standing bug that has existed since we added the ambiguous self-join check. A column reference is not ambiguous if it can only come from one join side (e.g. the other side has a project to only pick a few columns). An example is ``` Join(b#1 = 3) TableScan(t, [a#0, b#1]) Project(a#2) TableScan(t, [a#2, b#3]) ``` It's a self-join, but `b#1` is not ambiguous because it can't come from the right side, which only has column `a`. ### Why are the changes needed? To not fail valid self-join queries. ### Does this PR introduce _any_ user-facing change? Yes, as a bug fix. ### How was this patch tested? A new test. Closes #31287 from cloud-fan/self-join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b8a69066271e82f146bbf6cd5638c544e49bb27f) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 22 January 2021, 11:12:08 UTC
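A DataFrame-level sketch of the plan shape above (not necessarily the exact reproduction used in the new test; `t` is a hypothetical table with columns `a` and `b`): the right side projects `b` away, so a reference to `b` in the join condition can only bind to the left side and is therefore not ambiguous.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.table("t")               # columns: a, b
right = spark.table("t").select("a")  # self-join, but only column a survives

# b can only come from the left side, so this should be treated as a valid,
# non-ambiguous self-join.
left.join(right, left["b"] == 3).show()
```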
cd009c8 [SPARK-33813][SQL][3.1] Fix the issue that JDBC source can't treat MS SQL Server's spatial types ### What changes were proposed in this pull request? This PR backports SPARK-33813 (#31283). This PR fixes an issue where reading tables that contain spatial data types from MS SQL Server fails. MS SQL Server supports two non-standard spatial JDBC types, `geometry` and `geography`, but Spark SQL can't handle them: ``` java.sql.SQLException: Unrecognized SQL type -157 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366) at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240) at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381) ``` Considering what the [data type mapping](https://docs.microsoft.com/ja-jp/sql/connect/jdbc/using-basic-data-types?view=sql-server-ver15) says, those spatial types can be mapped to Catalyst's `BinaryType`. ### Why are the changes needed? To provide better support. ### Does this PR introduce _any_ user-facing change? Yes. MS SQL Server users can use `geometry` and `geography` types in datasource tables. ### How was this patch tested? New test case added to `MsSqlServerIntegrationSuite`. Closes #31288 from sarutak/SPARK-33813-branch-3.1. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 January 2021, 08:53:14 UTC
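With the mapping in place, reading such a table over JDBC should simply surface the spatial columns as `BinaryType`. A hedged sketch (all connection details are placeholders, not taken from the PR, and it assumes the Microsoft JDBC driver is on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders: substitute a real SQL Server host, database, table and credentials.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://host:1433;databaseName=db")
      .option("dbtable", "dbo.places")   # a table with geometry/geography columns
      .option("user", "user")
      .option("password", "password")
      .load())

df.printSchema()  # the spatial columns are expected to show up as binary
```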
5eaa872 [SPARK-34190][DOCS] Supplement the description for Python Package Management ### What changes were proposed in this pull request? This PR supplements the contents of the "Python Package Management" page. If Python is not installed locally on all nodes when using `venv-pack`, the job fails as below. ```python >>> from pyspark.sql.functions import pandas_udf >>> @pandas_udf('double') ... def pandas_plus_one(v: pd.Series) -> pd.Series: ... return v + 1 ... >>> spark.range(10).select(pandas_plus_one("id")).show() ... Cannot run program "./environment/bin/python": error=2, No such file or directory ... ``` This is because the Python in the [packed environment via `venv-pack` has a symbolic link](https://github.com/jcrist/venv-pack/issues/5) that connects Python to the local one. To avoid this confusion, it seems better to have an additional explanation for this. ### Why are the changes needed? To provide more detailed information to users so that they don’t get confused. ### Does this PR introduce _any_ user-facing change? Yes, this PR fixes part of "Python Package Management" in the "User Guide" documents. ### How was this patch tested? Manually built the doc. ![Screen Shot 2021-01-21 at 7 10 38 PM](https://user-images.githubusercontent.com/44108233/105336258-5e8bec00-5c1c-11eb-870c-86acfc77c082.png) Closes #31280 from itholic/SPARK-34190. Authored-by: itholic <haejoon309@naver.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 28131a7794568944173e66de930c86d498ab55b5) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 January 2021, 13:16:07 UTC
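A small diagnostic sketch of the symlink behaviour described above, meant to be run where the archive has been unpacked (the `./environment` path is a placeholder): `venv-pack` ships `bin/python` as a link to the interpreter on the machine that built the archive, so that interpreter must also be present on every node.

```python
import os

# Placeholder: the directory the packed archive was extracted into on a node.
python_bin = "./environment/bin/python"

if os.path.islink(python_bin):
    target = os.path.realpath(python_bin)
    print(f"{python_bin} -> {target}")
    if not os.path.exists(target):
        print("The linked interpreter does not exist on this node; this is the "
              "'No such file or directory' failure described above.")
```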
acb803b [SPARK-33901][SQL][FOLLOWUP] Add drop table in charvarchar test ### What changes were proposed in this pull request? Add `drop table` statements to the charvarchar SQL test. ### Why are the changes needed? 1. `drop table` is also a test case, so this gives better coverage. 2. It's clearer to drop the tables that are created in the current test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a test. Closes #31277 from ulysses-you/SPARK-33901-FOLLOWUP. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit da4b50f8e2f7299c9e6f8bafb541c9f785938e4a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 January 2021, 12:42:04 UTC
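The pattern being added, sketched outside the SQL test file (table name and columns are illustrative, not the ones from `charvarchar.sql`): create the char/varchar table, exercise it, and drop it at the end so the test also covers `DROP TABLE` and cleans up after itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE TABLE char_tbl (c CHAR(5), v VARCHAR(10)) USING parquet")
spark.sql("INSERT INTO char_tbl VALUES ('ab', 'hello')")
spark.sql("SELECT c, v FROM char_tbl").show()

# Dropping the table is itself part of the coverage, and keeps the test self-contained.
spark.sql("DROP TABLE char_tbl")
```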
ec4dd18 [SPARK-34181][DOC] Update Prerequisites for build doc of ruby 3.0 issue ### What changes were proposed in this pull request? When the Ruby version is 3.0, the jekyll server fails with ``` yi.zhu$ SKIP_API=1 jekyll serve --watch Configuration file: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs/_config.yml Source: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs Destination: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs/_site Incremental build: disabled. Enable with --incremental Generating... done in 5.085 seconds. Auto-regeneration: enabled for '/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs' ------------------------------------------------ Jekyll 4.2.0 Please append `--trace` to the `serve` command for any additional information or backtrace. ------------------------------------------------ <internal:/usr/local/Cellar/ruby/3.0.0_1/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb>:85:in `require': cannot load such file -- webrick (LoadError) from <internal:/usr/local/Cellar/ruby/3.0.0_1/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb>:85:in `require' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve/servlet.rb:3:in `<top (required)>' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:179:in `require_relative' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:179:in `setup' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:100:in `process' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `block in process_with_graceful_fail' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `each' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `process_with_graceful_fail' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:86:in `block (2 levels) in init_with_program' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program' from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/exe/jekyll:15:in `<top (required)>' from /usr/local/bin/jekyll:23:in `load' from /usr/local/bin/jekyll:23:in `<main>' ``` This issue is solved in https://github.com/jekyll/jekyll/issues/8523 ### Why are the changes needed? Fix the doc build issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed. Closes #31263 from AngersZhuuuu/SPARK-34181. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit faa4f0c2bd6aae82c79067cb255d5708aa632078) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 January 2021, 02:36:20 UTC
dad201e [MINOR][TESTS] Increase tolerance to 0.2 for NaiveBayesSuite ### What changes were proposed in this pull request? This test fails flakily. I found it failing in 1 out of 80 runs. ``` Expected -0.35667494393873245 and -0.41914521201224453 to be within 0.15 using relative tolerance. ``` Increasing the relative tolerance to 0.2 should reduce the flakiness. ``` 0.2 * 0.35667494393873245 = 0.071 > 0.062 = |-0.35667494393873245 - (-0.41914521201224453)| ``` ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #31266 from Loquats/NaiveBayesSuite-reltol. Authored-by: Andy Zhang <yue.zhang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c8c70d50026a8ed0b202f456b02df5adc905c4f7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 January 2021, 00:38:20 UTC
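The tolerance check is a plain relative-error comparison; a worked sketch of the arithmetic quoted above, using the values from the failure message:

```python
expected = -0.35667494393873245
actual = -0.41914521201224453

def within_rel_tol(expected, actual, rel_tol):
    # |expected - actual| must not exceed rel_tol * |expected|
    return abs(expected - actual) <= rel_tol * abs(expected)

print(abs(expected - actual))                  # ~0.0625
print(within_rel_tol(expected, actual, 0.15))  # False: 0.0625 > 0.0535
print(within_rel_tol(expected, actual, 0.2))   # True:  0.0625 < 0.0713
```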
c2ac1da [SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance ### What changes were proposed in this pull request? Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`. ### Why are the changes needed? ```scala val df = spark.range(2) df.join(df, df("id") <=> df("id")).show() ``` For this query, it's supposed to be a non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition: https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125 However, `DetectAmbiguousSelfJoin` cannot apply this prediction because the right-side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tag info for the copied node. Fortunately, the query is still considered a non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left-side plan and the reference is the same as the one in the left-side plan. However, this is not the expected behavior but only a coincidence. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated a unit test. Closes #31260 from Ngone51/fix-missing-tags. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f4989772229e2ba35f1d005727b7d4d9f1369895) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 January 2021, 13:36:30 UTC
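For reference, the same query in PySpark (a usage sketch only; `eqNullSafe` is the `<=>` operator): with the tags copied to the new instance, `DetectAmbiguousSelfJoin` treats this as a non-ambiguous join by design rather than by coincidence.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(2)
# Null-safe self-equality on the same attribute reference: a valid self-join.
df.join(df, df["id"].eqNullSafe(df["id"])).show()
```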
7b870e3 [SPARK-34005][CORE][3.1] Update peak memory metrics for each Executor on task end ### What changes were proposed in this pull request? This PR backports SPARK-34005 (#31029). This PR makes `AppStatusListener` update the peak memory metrics for each Executor on task end, like the other peak memory metrics (e.g., stages, executors in a stage). ### Why are the changes needed? When `AppStatusListener#onExecutorMetricsUpdate` is called, peak memory metrics for Executors, stages and executors in a stage are updated, but currently the metrics for Executors alone are not updated on task end. ### Does this PR introduce _any_ user-facing change? Yes. Executor peak memory metrics are updated more accurately. ### How was this patch tested? After I ran a job with `local-cluster[1,1,1024]` and visited `/api/v1/<appid>/executors`, I confirmed the `peakExecutorMemory` metrics are shown for an Executor even though the lifetime of each job is very short. I also modified the JSON files for `HistoryServerSuite`. Closes #31261 from sarutak/SPARK-34005-branch-3.1. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 20 January 2021, 11:50:05 UTC
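A hedged sketch of the verification described above: fetch the executors endpoint of a running application and print each executor's peak memory metrics. The host, port, application id and the metrics field name are placeholders/assumptions here, so adjust them to your deployment and Spark version.

```python
import json
import urllib.request

# Placeholders: a live driver UI (or history server) and a real application id.
app_id = "app-00000000000000-0000"
url = f"http://localhost:4040/api/v1/applications/{app_id}/executors"

with urllib.request.urlopen(url) as resp:
    executors = json.load(resp)

for e in executors:
    # With this change the peak metrics are also updated on task end,
    # so they should appear even for very short-lived jobs.
    print(e.get("id"), e.get("peakMemoryMetrics"))
```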