https://github.com/apache/spark

6bb3523 Preparing Spark release v3.2.0-rc1 20 August 2021, 12:40:40 UTC
fafdc14 Revert "Preparing Spark release v3.2.0-rc1" This reverts commit 8e58fafb05b6f2bde76af69b827b3b6357c33a5b. 20 August 2021, 12:07:02 UTC
c829ed5 Revert "Preparing development version 3.2.1-SNAPSHOT" This reverts commit 4f1d21571d21392c7fcf8aaa03e18b15964f15d5. 20 August 2021, 12:07:01 UTC
f47a519 [SPARK-36551][BUILD] Add sphinx-plotly-directive in Spark release Dockerfile ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/32726, Python doc build requires `sphinx-plotly-directive`. This PR is to install it from `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully. Also, this PR mentions it in the README of docs. ### Why are the changes needed? Fix release script and update README of docs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test locally. Closes #33797 from gengliangwang/fixReleaseDocker. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 42eebb84f5b1624d4c0f8dee4fbbdd5e2b02f2a3) Signed-off-by: Gengliang Wang <gengliang@apache.org> 20 August 2021, 12:02:44 UTC
6357b22 [SPARK-36547][BUILD] Downgrade scala-maven-plugin to 4.3.0 ### What changes were proposed in this pull request? When preparing Spark 3.2.0 RC1, I hit the same issue of https://github.com/apache/spark/pull/31031. ``` [INFO] Compiling 21 Scala sources and 3 Java sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes ... [ERROR] ## Exception when compiling 24 sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes java.lang.SecurityException: class "javax.servlet.SessionCookieConfig"'s signer information does not match signer information of other classes in the same package java.lang.ClassLoader.checkCerts(ClassLoader.java:891) java.lang.ClassLoader.preDefineClass(ClassLoader.java:661) ``` This PR is to apply the same fix again by downgrading scala-maven-plugin to 4.3.0 ### Why are the changes needed? To unblock the release process. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Build test Closes #33791 from gengliangwang/downgrade. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit f0775d215e73de9e34d36b3ea2468d97e6c79b3f) Signed-off-by: Gengliang Wang <gengliang@apache.org> 20 August 2021, 02:45:35 UTC
36c24a0 [SPARK-35312][SS][FOLLOW-UP] More documents and checking logic for the new options ### What changes were proposed in this pull request? Add more documents and checking logic for the new options `minOffsetsPerTrigger` and `maxTriggerDelay`. ### Why are the changes needed? Have a clear description of the behavior introduced in SPARK-35312 ### Does this PR introduce _any_ user-facing change? Yes. If the user sets minOffsetsPerTrigger > maxOffsetsPerTrigger, the new code will throw an AnalysisException. The original behavior is to ignore the maxOffsetsPerTrigger silently. ### How was this patch tested? Existing tests. Closes #33792 from xuanyuanking/SPARK-35312-follow. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit a0b24019edcd268968a7e0074b0a54988e408699) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 20 August 2021, 01:41:54 UTC
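A rough illustration of how the two options from the commit above fit together (broker, topic, and threshold values are made up; `spark` is an existing SparkSession, e.g. in spark-shell). The source waits for at least the given number of new offsets before triggering a batch, unless the delay cap is hit first:

```scala
// Hypothetical broker/topic names; the two options are the ones discussed in SPARK-35312.
val kafkaEvents = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("minOffsetsPerTrigger", "100000") // wait until ~100k new offsets are available...
  .option("maxTriggerDelay", "15m")         // ...but trigger a batch anyway after 15 minutes
  .load()
```

Per the commit, specifying minOffsetsPerTrigger larger than maxOffsetsPerTrigger now fails with an AnalysisException instead of silently ignoring maxOffsetsPerTrigger.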
4f1d215 Preparing development version 3.2.1-SNAPSHOT 19 August 2021, 14:08:32 UTC
8e58faf Preparing Spark release v3.2.0-rc1 19 August 2021, 14:08:26 UTC
7041c0f [SPARK-36428][TESTS][FOLLOWUP] Revert mistaken change to DateExpressionsSuite ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/33775 committed the debug code mistakenly. This PR reverts the test path. ### Why are the changes needed? Remove the debug code. ### Does this PR introduce _any_ user-facing change? 'No'. Just adjusts a test. ### How was this patch tested? Revert non-ansi test path. Closes #33787 from beliefer/SPARK-36428-followup2. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 462aa7cd3cd761b62ed9ea365427d5e816f95c5a) Signed-off-by: Gengliang Wang <gengliang@apache.org> 19 August 2021, 13:33:39 UTC
fb56627 Revert "[SPARK-35083][FOLLOW-UP][CORE] Add migration guide for the remote scheduler pool files support" This reverts commit e3902d1975ee6a6a6f672eb6b4f318bcdd237e3f. The feature is an improvement rather than a behavior change. Closes #33789 from gengliangwang/revertDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit b36b1c7e8a69f0aa02d7471fc7dadd32ed57ade1) Signed-off-by: Gengliang Wang <gengliang@apache.org> 19 August 2021, 13:30:19 UTC
5b97165 [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning ### What changes were proposed in this pull request? Remove `OptimizeSubqueries` from batch of `PartitionPruning` to make DPP support more cases. For example: ```sql SELECT date_id, product_id FROM fact_sk f JOIN (select store_id + 3 as new_store_id from dim_store where country = 'US') s ON f.store_id = s.new_store_id ``` Before this PR: ``` == Physical Plan == *(2) Project [date_id#3998, product_id#3999] +- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false :- *(2) ColumnarToRow : +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(true)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#274] +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997] +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3))) +- *(1) ColumnarToRow +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string> ``` After this PR: ``` == Physical Plan == *(2) Project [date_id#3998, product_id#3999] +- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false :- *(2) ColumnarToRow : +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN dynamicpruning#4007)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int> : +- SubqueryBroadcast dynamicpruning#4007, 0, [new_store_id#3997], [id=#263] : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262] : +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997] : +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3))) : +- *(1) ColumnarToRow : +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string> +- ReusedExchange [new_store_id#3997], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262] ``` This is because `OptimizeSubqueries` will infer more filters, so we cannot reuse broadcasts. 
The following is the plan if disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`: ``` == Physical Plan == *(2) Project [date_id#3998, product_id#3999] +- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false :- *(2) ColumnarToRow : +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN subquery#4009)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int> : +- Subquery subquery#4009, [id=#284] : +- *(2) HashAggregate(keys=[new_store_id#3997#4008], functions=[]) : +- Exchange hashpartitioning(new_store_id#3997#4008, 5), ENSURE_REQUIREMENTS, [id=#280] : +- *(1) HashAggregate(keys=[new_store_id#3997 AS new_store_id#3997#4008], functions=[]) : +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997] : +- *(1) Filter (((isnotnull(store_id#4002) AND isnotnull(country#4004)) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3))) : +- *(1) ColumnarToRow : +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(store_id#4002), isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002..., Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(store_id), IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#305] +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997] +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3))) +- *(1) ColumnarToRow +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string> ``` ### Why are the changes needed? Improve DPP to support more cases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test and benchmark test: SQL | Before this PR(Seconds) | After this PR(Seconds) -- | -- | -- TPC-DS q58 | 40 | 20 TPC-DS q83 | 18 | 14 Closes #33664 from wangyum/SPARK-36444. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 2310b99e146b9856766611a2b4359f0fbee2dd44) Signed-off-by: Yuming Wang <yumwang@ebay.com> 19 August 2021, 08:45:22 UTC
9544c24 [SPARK-35083][FOLLOW-UP][CORE] Add migration guide for the remote scheduler pool files support ### What changes were proposed in this pull request? Add remote scheduler pool files support to the migration guide. ### Why are the changes needed? To highlight this useful support. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass existing tests. Closes #33785 from Ngone51/SPARK-35083-follow-up. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: wuyi <yi.wu@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit e3902d1975ee6a6a6f672eb6b4f318bcdd237e3f) Signed-off-by: Gengliang Wang <gengliang@apache.org> 19 August 2021, 08:29:19 UTC
54cca7f [SPARK-36519][SS] Store RocksDB format version in the checkpoint for streaming queries ### What changes were proposed in this pull request? RocksDB provides backward compatibility but it doesn't always provide forward compatibility. It's better to store the RocksDB format version in the checkpoint so that it would give us more information to provide the rollback guarantee when we upgrade the RocksDB version that may introduce an incompatible change in a new Spark version. A typical case is when a user upgrades their query to a new Spark version, and this new Spark version has a new RocksDB version which may use a new format. But the user hits some bug and decides to roll back. But in the old Spark version, the old RocksDB version cannot read the new format. In order to handle this case, we will write the RocksDB format version to the checkpoint. When restarting from a checkpoint, we will force RocksDB to use the format version stored in the checkpoint. This will ensure the user can roll back their Spark version if needed. We also provide a config `spark.sql.streaming.stateStore.rocksdb.formatVersion` for users who don't need to roll back their Spark versions, to overwrite the format version specified in the checkpoint. ### Why are the changes needed? Provide the Spark version rollback guarantee for streaming queries when a new RocksDB introduces an incompatible format change. ### Does this PR introduce _any_ user-facing change? No. RocksDB state store is a new feature in Spark 3.2, which has not yet been released. ### How was this patch tested? The new unit tests. Closes #33749 from zsxwing/SPARK-36519. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit ea4919801aa91800bf91c561a0c1c9f3f7dfd0e7) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 19 August 2021, 07:23:52 UTC
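A minimal configuration sketch for the behavior described above, assuming the standard Spark 3.2 RocksDB provider class and an existing SparkSession `spark`; the format-version value is only an example and normally does not need to be set, since it is read from the checkpoint:

```scala
// Use the RocksDB-backed state store (provider class name assumed from Spark 3.2).
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

// Optional override of the format version recorded in an existing checkpoint
// (the config key named in the commit above); 5 is just an illustrative value.
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.formatVersion", "5")
```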
8f3b4c4 [SPARK-33687][SQL][DOC][FOLLOWUP] Merge the doc pages of ANALYZE TABLE and ANALYZE TABLES ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30648 ANALYZE TABLE and TABLES are essentially the same command, it's weird to put them in 2 different doc pages. This PR proposes to merge them into one doc page. ### Why are the changes needed? simplify the doc ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #33781 from cloud-fan/doc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 07d173a8b0a19a2912905387bcda10e9db3c43c6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 August 2021, 03:04:20 UTC
3d69d0d [SPARK-36428][SQL][FOLLOWUP] Simplify the implementation of make_timestamp ### What changes were proposed in this pull request? The implementation in https://github.com/apache/spark/pull/33665 made `make_timestamp` accept integer type as the seconds parameter. This PR lets `make_timestamp` accept `decimal(16,6)` type as the seconds parameter; casting integer to `decimal(16,6)` is safe, so we can simplify the code. ### Why are the changes needed? Simplify `make_timestamp`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? New tests. Closes #33775 from beliefer/SPARK-36428-followup. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 707eefa3c706561f904dad65f3e347028dafb6ea) Signed-off-by: Gengliang Wang <gengliang@apache.org> 18 August 2021, 14:57:27 UTC
181d33e [SPARK-36532][CORE] Fix deadlock in CoarseGrainedExecutorBackend.onDisconnected to avoid executor shutdown hang ### What changes were proposed in this pull request? Instead of exiting the executor within the RpcEnv's thread, exit the executor in a separate thread. ### Why are the changes needed? The current exit path in `onDisconnected` can cause a deadlock, which has the exact same root cause as https://github.com/apache/spark/pull/12012: * `onDisconnected` -> `System.exit` are called in sequence in the thread of `MessageLoop.threadpool` * `System.exit` triggers shutdown hooks and `executor.stop` is one of the hooks. * `executor.stop` stops the `Dispatcher`, which waits for the `MessageLoop.threadpool` to shut down further. * Thus, the thread which runs `System.exit` waits for hooks to be done, but the `MessageLoop.threadpool` in the hook waits for that thread to finish. Finally, this mutual dependence results in the deadlock. ### Does this PR introduce _any_ user-facing change? Yes, the executor shutdown won't hang. ### How was this patch tested? Pass existing tests. Closes #33759 from Ngone51/fix-executor-shutdown-hang. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 996551fecee8c3549438c4f536f8ab9607c644c5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 August 2021, 14:47:06 UTC
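The gist of the fix can be sketched as follows; this is not the actual CoarseGrainedExecutorBackend code, just the pattern of moving System.exit off the RpcEnv message loop so the shutdown hook can stop the dispatcher without waiting on itself:

```scala
// Sketch only: exit in a dedicated thread instead of inside the RPC message loop thread.
def exitExecutorAsync(code: Int): Unit = {
  val exitThread = new Thread(() => System.exit(code), "executor-self-exit")
  exitThread.setDaemon(true) // must not itself block JVM shutdown
  exitThread.start()
}
```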
b749b49 [SPARK-36400][SPARK-36398][SQL][WEBUI] Make ThriftServer recognize spark.sql.redaction.string.regex ### What changes were proposed in this pull request? This PR fixes an issue that ThriftServer doesn't recognize `spark.sql.redaction.string.regex`. The problem is that sensitive information included in queries can be exposed. ![thrift-password1](https://user-images.githubusercontent.com/4736016/129440772-46379cc5-987b-41ac-adce-aaf2139f6955.png) ![thrift-password2](https://user-images.githubusercontent.com/4736016/129440775-fd328c0f-d128-4a20-82b0-46c331b9fd64.png) ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran ThriftServer, connect to it and execute `CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde");` with `spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')` Then, confirmed UI. ![thrift-hide-password1](https://user-images.githubusercontent.com/4736016/129440863-cabea247-d51f-41a4-80ac-6c64141e1fb7.png) ![thrift-hide-password2](https://user-images.githubusercontent.com/4736016/129440874-96cd0f0c-720b-4010-968a-cffbc85d2be5.png) Closes #33743 from sarutak/thrift-redact. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit b914ff7d54bd7c07e7313bb06a1fa22c36b628d2) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 18 August 2021, 04:32:03 UTC
528fca8 [SPARK-36370][PYTHON][FOLLOWUP] Use LooseVersion instead of pkg_resources.parse_version ### What changes were proposed in this pull request? This is a follow-up of #33687. Use `LooseVersion` instead of `pkg_resources.parse_version`. ### Why are the changes needed? In the previous PR, `pkg_resources.parse_version` was used, but we should use `LooseVersion` instead to be consistent in the code base. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33768 from ueshin/issues/SPARK-36370/LooseVersion. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7fb8ea319e4931f7721ac6f9c12100c95d252cd2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 August 2021, 01:36:17 UTC
5107ad3 [SPARK-36535][SQL] Refine the sql reference doc ### What changes were proposed in this pull request? Refine the SQL reference doc: - remove useless subitems in the sidebar - remove useless sub-menu-pages (e.g. `sql-ref-syntax-aux.md`) - avoid using `#####` in `sql-ref-literals.md` ### Why are the changes needed? The subitems in the sidebar are quite useless, as the menu page serves the same functionality: <img width="1040" alt="WX20210817-2358402x" src="https://user-images.githubusercontent.com/3182036/129765924-d7e69bc1-e351-4581-a6de-f2468022f372.png"> It's also extra work to keep the menu page and sidebar subitems in sync (the ANSI compliance page is already out of sync). The sub-menu-pages are only referenced by the sidebar, and duplicate the content of the menu page. As a result, the `sql-ref-syntax-aux.md` is already outdated compared to the menu page. It's easier to just look at the menu page. The `#####` is not rendered properly: <img width="776" alt="WX20210818-0001192x" src="https://user-images.githubusercontent.com/3182036/129766760-6f385443-e597-44aa-888d-14d128d45f84.png"> It's better to avoid using it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #33767 from cloud-fan/doc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 4b015e8d7d6f5972341104f2a359bb9d09c4385b) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 August 2021, 19:46:49 UTC
e15daa3 [SPARK-36370][PYTHON] _builtin_table directly imported from pandas instead of being redefined ### What changes were proposed in this pull request? Suggesting to refactor the way the _builtin_table is defined in the `python/pyspark/pandas/groupby.py` module. Pandas has recently refactored the way the _builtin_table is imported; it is now part of the pandas.core.common module instead of being an attribute of the pandas.core.base.SelectionMixin class. ### Why are the changes needed? This change is not strictly needed, but the current implementation redefines this table within pyspark, so any changes to this table in the pandas library would need to be mirrored in the pyspark repository as well. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran the following command successfully: ```sh python/run-tests --testnames 'pyspark.pandas.tests.test_groupby' ``` Tests passed in 327 seconds Closes #33687 from Cedric-Magnan/_builtin_table_from_pandas. Authored-by: Cedric-Magnan <cedric.magnan@artefact.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit 964dfe254ff8ebf9d7f5c7115ff8f79da3f28261) Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 17 August 2021, 17:47:01 UTC
70635b4 Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases" ### What changes were proposed in this pull request? Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129) ### Why are the changes needed? It turns out that many users are using the group by alias feature. Spark has its precedence rule when alias names conflict with column names in Group by clause: always use the table column. This should be reasonable and acceptable. Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too. As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode. ### Does this PR introduce _any_ user-facing change? No, the feature is not released yet. ### How was this patch tested? Unit tests Closes #33758 from gengliangwang/revertGroupByAlias. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 8bfb4f1e72f33205b94957f7dacf298b0c8bde17) Signed-off-by: Gengliang Wang <gengliang@apache.org> 17 August 2021, 12:24:09 UTC
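For context, the feature kept by the revert above is the usual group-by-alias shorthand. A hypothetical example (table and column names invented; `spark` is an existing SparkSession) that continues to work under ANSI mode after this revert:

```scala
spark.sql("SET spark.sql.ansi.enabled=true")

// Grouping by the SELECT alias `yr` rather than repeating year(order_date).
spark.sql("""
  SELECT year(order_date) AS yr, count(*) AS cnt
  FROM orders
  GROUP BY yr
""").show()
```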
07c6976 [SPARK-36524][SQL] Common class for ANSI interval types ### What changes were proposed in this pull request? Add new type `AnsiIntervalType` to `AbstractDataType.scala`, and extend it by `YearMonthIntervalType` and by `DayTimeIntervalType` ### Why are the changes needed? To improve code maintenance. The change will allow to replace checking of both `YearMonthIntervalType` and `DayTimeIntervalType` by a check of `AnsiIntervalType`, for instance: ```scala case _: YearMonthIntervalType | _: DayTimeIntervalType => false ``` by ```scala case _: AnsiIntervalType => false ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By existing test suites. Closes #33753 from MaxGekk/ansi-interval-type-trait. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 82a31508afffd089048e28276c75b5deb1ada47f) Signed-off-by: Max Gekk <max.gekk@gmail.com> 17 August 2021, 09:28:07 UTC
eb09be9 [SPARK-36052][K8S] Introducing a limit for pending PODs Introducing a limit for pending PODs (newly created/requested executors included). This limit is global for all the resource profiles. So first we have to count all the newly created and pending PODs (decreased by the ones which requested to be deleted) then we can share the remaining pending POD slots among the resource profiles. Without this PR dynamic allocation could request too many PODs and the K8S scheduler could be overloaded and scheduling of PODs will be affected by the load. No. With new unit tests. Closes #33492 from attilapiros/SPARK-36052. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 1dced492fb286a7ada73d886fe264f5df0e2b3da) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 August 2021, 23:06:29 UTC
41e5144 [SPARK-36521][SQL] Disallow comparison between Interval and String ### What changes were proposed in this pull request? Disallow comparison between Interval and String in the default type coercion rules. ### Why are the changes needed? If a binary comparison contains interval type and string type, we can't decide which interval type the string should be promoted as. There are many possible interval types, such as year interval, month interval, day interval, hour interval, etc. ### Does this PR introduce _any_ user-facing change? No, the new interval type is not released yet. ### How was this patch tested? Existing UT Closes #33750 from gengliangwang/disallowCom. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 26d6b952dcf7d387930701396de9cef679df7432) Signed-off-by: Max Gekk <max.gekk@gmail.com> 16 August 2021, 19:41:25 UTC
4caa43e [SPARK-36041][SS][DOCS] Introduce the RocksDBStateStoreProvider in the programming guide ### What changes were proposed in this pull request? Add the document for the new RocksDBStateStoreProvider. ### Why are the changes needed? User guide for the new feature. ### Does this PR introduce _any_ user-facing change? No, doc only. ### How was this patch tested? Doc only. Closes #33683 from xuanyuanking/SPARK-36041. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 3d57e00a7f8d8f2f7dc0bfbfb0466ef38fb3da08) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 16 August 2021, 19:32:19 UTC
2fb62e0 [SPARK-35548][CORE][SHUFFLE] Handling new attempt has started error message in BlockPushErrorHandler in client ### What changes were proposed in this pull request? Add a new type of error message in BlockPushErrorHandler which indicates the PushblockStream message is received after a new application attempt has started. This error message should be correctly handled in client without retrying the block push. ### Why are the changes needed? When we get a block push failure because of the too old attempt, we will not retry pushing the block nor log the exception on the client side. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add the corresponding unit test. Closes #33617 from zhuqi-lucas/master. Authored-by: zhuqi-lucas <821684824@qq.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 05cd5f97c3dea25dacdbdb319243cdab9667c774) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 16 August 2021, 18:59:54 UTC
cb14a32 [SPARK-36469][PYTHON] Implement Index.map ### What changes were proposed in this pull request? Implement `Index.map`. The PR is based on https://github.com/databricks/koalas/pull/2136. Thanks awdavidson for the prototype. `map` of CategoricalIndex and DatetimeIndex will be implemented in separate PRs. ### Why are the changes needed? Mapping values using input correspondence (a dict, Series, or function) is supported in pandas as [Index.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html). We shall also support that. ### Does this PR introduce _any_ user-facing change? Yes. `Index.map` is available now. ```py >>> psidx = ps.Index([1, 2, 3]) >>> psidx.map({1: "one", 2: "two", 3: "three"}) Index(['one', 'two', 'three'], dtype='object') >>> psidx.map(lambda id: "{id} + 1".format(id=id)) Index(['1 + 1', '2 + 1', '3 + 1'], dtype='object') >>> pser = pd.Series(["one", "two", "three"], index=[1, 2, 3]) >>> psidx.map(pser) Index(['one', 'two', 'three'], dtype='object') ``` ### How was this patch tested? Unit tests. Closes #33694 from xinrong-databricks/index_map. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit 4dcd74602571d36a3b9129f0886e1cfc33d7fdc8) Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 16 August 2021, 18:06:23 UTC
9149cad [SPARK-32210][CORE] Fix NegativeArraySizeException in MapOutputTracker with large spark.default.parallelism ### What changes were proposed in this pull request? The current `MapOutputTracker` class may throw `NegativeArraySizeException` with a large number of partitions. Within the serializeOutputStatuses() method, it is trying to compress an array of mapStatuses and outputting the binary data into (Apache)ByteArrayOutputStream. Inside the (Apache)ByteArrayOutputStream.toByteArray(), a negative index exception happens because the index is an int and overflows (2GB limit) when the output binary size is too large. This PR proposes two high-level ideas: 1. Use `org.apache.spark.util.io.ChunkedByteBufferOutputStream`, which has a way to output the underlying buffer as `Array[Array[Byte]]`. 2. Change the signatures from `Array[Byte]` to `Array[Array[Byte]]` in order to handle over 2GB compressed data. ### Why are the changes needed? This issue seems to have been missed out in the earlier effort of addressing 2GB limitations [SPARK-6235](https://issues.apache.org/jira/browse/SPARK-6235). Without this fix, `spark.default.parallelism` needs to be kept at a low number. The drawback of setting a smaller spark.default.parallelism is that it requires more executor memory (more data per partition). Setting `spark.io.compression.zstd.level` to a higher number (default 1) hardly helps. That essentially means we have a data size limit for shuffling that does not scale. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed existing tests ``` build/sbt "core/testOnly org.apache.spark.MapOutputTrackerSuite" ``` Also added a new unit test ``` build/sbt "core/testOnly org.apache.spark.MapOutputTrackerSuite -- -z SPARK-32210" ``` Ran the benchmark using GitHub Actions and did not observe any performance penalties. The results are attached in this PR ``` core/benchmarks/MapStatusesSerDeserBenchmark-jdk11-results.txt core/benchmarks/MapStatusesSerDeserBenchmark-results.txt ``` Closes #33721 from kazuyukitanimura/SPARK-32210. Authored-by: Kazuyuki Tanimura <ktanimura@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 8ee464cd7a09302cacc47a4cbc98fdf307f39dbd) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 August 2021, 16:11:51 UTC
233af3d [SPARK-36374][SHUFFLE][DOC] Push-based shuffle high level user documentation ### What changes were proposed in this pull request? Document the push-based shuffle feature with a high level overview of the feature and corresponding configuration options for both shuffle server side as well as client side. This is how the changes to the doc looks on the browser ([img](https://user-images.githubusercontent.com/8871522/129231582-ad86ee2f-246f-4b42-9528-4ccd693e86d2.png)) ### Why are the changes needed? Helps users understand the feature ### Does this PR introduce _any_ user-facing change? Docs ### How was this patch tested? N/A Closes #33615 from venkata91/SPARK-36374. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 2270ecf32f7ae478570145219d2ce71a642076cf) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 16 August 2021, 15:25:33 UTC
3aa933b [SPARK-36465][SS] Dynamic gap duration in session window ### What changes were proposed in this pull request? This patch supports dynamic gap duration in session window. ### Why are the changes needed? The gap duration used in session windows is currently a static value. To support more complex usage, it is better to support a dynamic gap duration which determines the gap by looking at the current data. For example, in our use case, we may have a different gap depending on a certain column in the input rows. ### Does this PR introduce _any_ user-facing change? Yes, users can specify dynamic gap duration. ### How was this patch tested? Modified existing tests and added a new test. Closes #33691 from viirya/dynamic-session-window-gap. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 8b8d91cf64aeb4ccc51dfe914f307e28c57081f8) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 16 August 2021, 02:06:16 UTC
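A hedged sketch of what a dynamic gap looks like at the API level (the `events` streaming DataFrame, the column names `eventTime`, `eventType`, `userId`, and the gap values are all assumptions): the gap argument becomes an expression evaluated per input row instead of a fixed string.

```scala
import org.apache.spark.sql.functions.{col, count, session_window, when}

val sessions = events                       // `events` is an assumed streaming DataFrame
  .withWatermark("eventTime", "10 minutes")
  .groupBy(
    session_window(
      col("eventTime"),
      when(col("eventType") === "click", "5 minutes").otherwise("30 minutes")),
    col("userId"))
  .agg(count("*").as("numEvents"))
```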
b8c1014 Update Spark key negotiation protocol 14 August 2021, 14:08:29 UTC
ede1d1e [SPARK-34952][SQL][FOLLOWUP] Normalize pushed down aggregate col name and group by col name ### What changes were proposed in this pull request? Normalize pushed down aggregate col names and group by col names ... ### Why are the changes needed? to handle case sensitive col names ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modify existing test Closes #33739 from huaxingao/normalize. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3f8ec0dae4ddfd7ee55370dad5d44d03a9f10387) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 August 2021, 05:31:30 UTC
c898a94 [SPARK-36508][SQL] ANSI type coercion: disallow binary operations between Interval and String literal ### What changes were proposed in this pull request? If a binary operation contains interval type and string literal, we can't decide which interval type the string literal should be promoted as. There are many possible interval types, such as year interval, month interval, day interval, hour interval, etc. The related binary operations for Interval include - Add - Subtract - Comparisons Note that `Interval Multiply/Divide StringLiteral` is still valid, as these are not binary operators in this sense (the left and right are not of the same type). This PR also adds tests for them. ### Why are the changes needed? Avoid ambiguous implicit casting of string literals to interval types. ### Does this PR introduce _any_ user-facing change? No, the ANSI type coercion is not released yet. ### How was this patch tested? New tests. Closes #33737 from gengliangwang/disallowStringAndInterval. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit ecdea9160213fefb1936ef41f9968d394d109389) Signed-off-by: Gengliang Wang <gengliang@apache.org> 14 August 2021, 02:50:43 UTC
101f720 [SPARK-32920][CORE][FOLLOW-UP] Fix string interpolator in the log ### What changes were proposed in this pull request? fix string interpolator ### Why are the changes needed? To log the correct stage info. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass existed tests. Closes #33738 from Ngone51/fix. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a47ceaf5492040063e31e17570678dc06846c36c) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 August 2021, 12:44:30 UTC
09a1ddb [SPARK-36500][CORE] Fix temp_shuffle file leaking when a task is interrupted ### What changes were proposed in this pull request? When a task thread is interrupted, the underlying output stream referred to by `DiskBlockObjectWriter.mcs` may have been closed, and we then get an IOException when flushing the buffered data. This breaks the assumption that `revertPartialWritesAndClose()` should not throw exceptions. To fix the issue, we can catch the IOException in `ManualCloseOutputStream.manualClose()`. ### Why are the changes needed? Previously the IOException was not captured, thus `revertPartialWritesAndClose()` threw an exception. When this happens, `BypassMergeSortShuffleWriter.stop()` would stop deleting the temp_shuffle files tracked by `partitionWriters`, hence leading to temp_shuffle file leaks. ### Does this PR introduce _any_ user-facing change? No, this is an internal bug fix. ### How was this patch tested? Tested by running a longevity stress test. After the fix, there are no more leaked temp_shuffle files. Closes #33731 from jiangxb1987/temp_shuffle. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ec5f3a17e33f7afe03e48f8b7690a8b18ae0c058) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 August 2021, 10:25:27 UTC
eaf92be [SPARK-36499][SQL][TESTS] Test Interval multiply / divide null ### What changes were proposed in this pull request? Test the following valid operations: ``` year-month interval * null null * year-month interval year-month interval / null ``` and invalid operations: ``` null / interval int / interval ``` ### Why are the changes needed? Improve test coverage ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass CI Closes #33729 from gengliangwang/addTest. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit eb6be7f1ee076aeaa312f7a3ff0c88db516b793b) Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 August 2021, 08:06:09 UTC
eb84057 [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type ### What changes were proposed in this pull request? With ANSI mode, `SELECT make_timestamp(1, 1, 1, 1, 1, 1)` fails, because the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be implicitly cast to DECIMAL(8,6) under ANSI mode. ``` org.apache.spark.sql.AnalysisException cannot resolve 'make_timestamp(1, 1, 1, 1, 1, 1)' due to data type mismatch: argument 6 requires decimal(8,6) type, however, '1' is of int type.; line 1 pos 7 ``` We should update the function `make_timestamp` to allow an integer-typed 'seconds' parameter. ### Why are the changes needed? Make `make_timestamp` accept an integer as the 'seconds' parameter. ### Does this PR introduce _any_ user-facing change? 'Yes'. `make_timestamp` now accepts an integer as the 'seconds' parameter. ### How was this patch tested? New tests. Closes #33665 from beliefer/SPARK-36428. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 7d823367348c0235ffb212b09afcf053d070068d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 August 2021, 05:13:15 UTC
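To make the change above concrete, a small hedged example (argument values are arbitrary; `spark` is an existing SparkSession) that fails before this patch under ANSI mode and succeeds after it:

```scala
spark.sql("SET spark.sql.ansi.enabled=true")

// Integer seconds now accepted; previously only a DECIMAL(8,6)-compatible value worked under ANSI mode.
spark.sql("SELECT make_timestamp(2021, 8, 13, 5, 13, 15)").show()
```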
ca91292 [SPARK-36447][SQL] Avoid inlining non-deterministic With-CTEs This PR fixes an existing correctness issue where a non-deterministic With-CTE can be executed multiple times producing different results, by deferring the inline of With-CTE to after the analysis stage. This fix also provides the future opportunity of performance improvement by executing deterministic With-CTEs only once in some circumstances. The major changes include: 1. Added new With-CTE logical nodes: `CTERelationDef`, `CTERelationRef`, `WithCTE`. Each `CTERelationDef` has a unique ID and the mapping between CTE def and CTE ref is based on IDs rather than names. `WithCTE` is a resolved version of `With`, only that: 1) `WithCTE` is a multi-children logical node so that most logical rules can automatically apply to CTE defs; 2) In the main query and each subquery, there can only be at most one `WithCTE`, which means nested With-CTEs are combined. 2. Changed `CTESubstitution` rule so that if NOT in legacy mode, CTE defs will not be inlined immediately, but rather transformed into a `CTERelationRef` per reference. 3. Added new With-CTE rules: 1) `ResolveWithCTE` - to update `CTERelationRef`s with resolved output from corresponding `CTERelationDef`s; 2) `InlineCTE` - to inline deterministic CTEs or non-deterministic CTEs with only ONE reference; 3) `UpdateCTERelationStats` - to update stats for `CTERelationRef`s that are not inlined. 4. Added a CTE physical planning strategy to plan `CTERelationRef`s as an independent shuffle with round-robin partitioning so that such CTEs will only be materialized once and different references will later be a shuffle reuse. A current limitation is that With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces. This is a correctness issue. Non-deterministic CTEs should produce the same output regardless of how many times it is referenced/used in query, while under the current implementation there is no such guarantee and would lead to incorrect query results. No. Added UTs. Regenerated golden files for TPCDS plan stability tests. There is NO change to the `simplified.txt` files, the only differences are expression IDs. Closes #33671 from maryannxue/spark-36447. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 29b1e394c688599963294e4f5a875a8d97233fbd) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 August 2021, 03:48:46 UTC
f701769 [SPARK-36497][SQL] Support Interval add/subtract NULL ### What changes were proposed in this pull request? Currently, `null + interval` will become `cast(cast(null as timestamp) + interval) as null`. This is an unexpected behavior and the result should not be of null type. This weird behavior applies to `null - interval`, `interval + null`, `interval - null` as well. To change it, I propose to cast the null as the same data type as the other element in the add/subtract: ``` null + interval => cast(null as interval) + interval null - interval => cast(null as interval) - interval interval + null => interval + cast(null as interval) interval - null => interval - cast(null as interval) ``` ### Why are the changes needed? Change the confusing behavior of `Interval +/- NULL` and `NULL +/- Interval` ### Does this PR introduce _any_ user-facing change? No, the new interval type is not released yet. ### How was this patch tested? Existing UT Closes #33727 from gengliangwang/intervalTypeCoercion. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d4466d55cadbcea9233cb8fbb90a62a7e56a2da8) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 August 2021, 03:10:45 UTC
99a0085 [SPARK-36501][ML] Fix random col names in LSHModel.approxSimilarityJoin ### What changes were proposed in this pull request? Random.nextString() can include characters that are not valid in identifiers or likely to be buggy, e.g. non-printing characters, ".", "`". Instead use a utility that will always generate valid alphanumeric identifiers ### Why are the changes needed? To deflake BucketedRandomProjectionLSHSuite and avoid similar failures that could be encountered by users. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran org.apache.spark.ml.feature.BucketedRandomProjectionLSHSuite Closes #33730 from timarmstrong/flaky-lsb. Authored-by: Tim Armstrong <tim.armstrong@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 886dbe01cdd9082f3a82bb31598e22fd4c9a7e5a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 August 2021, 03:04:54 UTC
32d67d1 [SPARK-36483][CORE][TESTS] Fix intermittent test failures at Netty 4.1.52+ ### What changes were proposed in this pull request? Fix an intermittent test failure due to Netty dependency version bump. Starting from Netty 4.1.52, its AbstractChannel will throw a new `StacklessClosedChannelException` for channel closed exception. A hardcoded list of Strings to match for channel closed exception in `RPCIntegrationSuite` was not updated, thus leading to the intermittent test failure reported in #33613 ### Why are the changes needed? Fix intermittent test failure ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Closes #33713 from Victsm/SPARK-36378-followup. Authored-by: Min Shen <mshen@linkedin.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit b8e2186fe13d609a777da9bf68fa31ea38da50fe) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 13 August 2021, 01:15:58 UTC
3785738 [SPARK-36445][SQL][FOLLOWUP] ANSI type coercion: revisit promoting string literals in datetime expressions ### What changes were proposed in this pull request? 1. Promote more string literals in subtractions. In the ANSI type coercion rule, we already promoted ``` string - timestamp => cast(string as timestamp) - timestamp ``` This PR is to promote the following string literals: ``` string - date => cast(string as date) - date date - string => date - cast(string as date) timestamp - string => timestamp - cast(string as timestamp) ``` It is very straightforward to cast the string literal as the data type of the other side in the subtraction. 2. Merge the string promotion logic from the rule `StringLiteralCoercion`: ``` date_sub(date, string) => date_sub(date, cast(string as int)) date_add(date, string) => date_add(date, cast(string as int)) ``` ### Why are the changes needed? 1. Promote the string literal in the subtraction as the data type of the other side. This is straightforward and consistent with PostgreSQL. 2. Centralize all the string literal promotion in the ANSI type coercion rule ### Does this PR introduce _any_ user-facing change? No, the new ANSI type coercion rules are not released yet. ### How was this patch tested? Existing UT Closes #33724 from gengliangwang/datetimeTypeCoercion. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 48e333af5473f849274c4247313ea7b51de70faf) Signed-off-by: Gengliang Wang <gengliang@apache.org> 12 August 2021, 17:02:54 UTC
89cc547 [SPARK-35881][SQL][FOLLOWUP] Remove the AQE post stage creation extension ### What changes were proposed in this pull request? This is a followup of #33140 It turns out that we may be able to complete the AQE and columnar execution integration without the AQE post stage creation extension. The rule `ApplyColumnarRulesAndInsertTransitions` can add to-columnar transition if the shuffle/broadcast supports columnar. ### Why are the changes needed? remove APIs that are not needed. ### Does this PR introduce _any_ user-facing change? No, the APIs are not released yet. ### How was this patch tested? existing and manual tests Closes #33701 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 124d011ee73f9805ac840aa5a6eddc27cd09b2e1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 August 2021, 13:35:44 UTC
3b3eb6f [SPARK-36489][SQL] Aggregate functions over no grouping keys, on tables with a single bucket, return multiple rows ### What changes were proposed in this pull request? This PR fixes a bug in `DisableUnnecessaryBucketedScan`. When running any aggregate function, without any grouping keys, on a table with a single bucket, multiple rows are returned. This happens because the aggregate function satisfies the `AllTuples` distribution, no `Exchange` will be planned, and the bucketed scan will be disabled. ### Why are the changes needed? Bug fixing. Aggregates over no grouping keys should return a single row. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new test in `DisableUnnecessaryBucketedScanSuite`. Closes #33711 from IonutBoicuAms/fix-bug-disableunnecessarybucketedscan. Authored-by: IonutBoicuAms <ionut.boicu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 2b665751d9c7e4fb07ea18ce6611328e24f3dce9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 August 2021, 07:22:50 UTC
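A reproducer sketch for the scenario described above (table name and values are made up; `spark` is an existing SparkSession): a single-bucket table plus a global aggregate, which must yield exactly one row.

```scala
spark.sql("CREATE TABLE single_bucket_t (id INT) USING parquet CLUSTERED BY (id) INTO 1 BUCKETS")
spark.sql("INSERT INTO single_bucket_t VALUES (1), (2), (3)")

// Before the fix this could return multiple rows when the bucketed scan was disabled;
// an aggregate with no grouping keys must always produce a single row.
spark.sql("SELECT count(*) FROM single_bucket_t").show()
```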
3e3c33d [SPARK-36479][SQL][TEST] Improve datetime test coverage in SQL files ### What changes were proposed in this pull request? This PR adds more datetime tests in `date.sql` and `timestamp.sql`, especially for string promotion. ### Why are the changes needed? improve test coverage ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #33707 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 00a4364f3869a844168f20374d2246f6bfc092e3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 August 2021, 04:52:03 UTC
9df850d [SPARK-36480][SS] SessionWindowStateStoreSaveExec should not filter input rows against watermark ### What changes were proposed in this pull request? This PR proposes to remove the filter applying to input rows against watermark in SessionWindowStateStoreSaveExec, since SessionWindowStateStoreSaveExec is expected to store all inputs into state store, and apply eviction later. ### Why are the changes needed? The code is logically not right, though I can't reproduce the actual problem. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. I can't come up with broken case failing on previous code, but we can review the logic instead. Closes #33708 from HeartSaVioR/SPARK-36480. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit fac4e5eb3eae4cafe7bd6672911792612c2aaca0) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 12 August 2021, 03:11:07 UTC
1a371fb [SPARK-36482][BUILD] Bump orc to 1.6.10 ### What changes were proposed in this pull request? This PR aims to bump ORC to 1.6.10 ### Why are the changes needed? This will bring the latest bug fixes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #33712 from williamhyun/orc. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit aff1b5594aa29dce1828bb58c47ee457ef5c2c24) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 August 2021, 18:32:18 UTC
8dbcbeb [SPARK-36468][SQL][DOCS] Update docs about ANSI interval literals ### What changes were proposed in this pull request? In the PR, I propose to update the doc page https://spark.apache.org/docs/latest/sql-ref-literals.html#interval-literal, and describe formats of ANSI interval literals. <img width="1032" alt="Screenshot 2021-08-11 at 10 31 36" src="https://user-images.githubusercontent.com/1580697/128988454-7a6ac435-409b-4961-9b79-ebecfb141d5e.png"> <img width="1030" alt="Screenshot 2021-08-10 at 20 58 04" src="https://user-images.githubusercontent.com/1580697/128912018-a4ea3ee5-f252-49c7-a90e-5beaf7ac868f.png"> ### Why are the changes needed? To improve UX with Spark SQL, and inform users about recently added ANSI interval literals. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually checked the generated docs: ``` $ SKIP_API=1 SKIP_RDOC=1 SKIP_PYTHONDOC=1 SKIP_SCALADOC=1 bundle exec jekyll build ``` Closes #33693 from MaxGekk/doc-ansi-interval-literals. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit bbf988bd73b00d18dd1d443f225b3915a2c4433f) Signed-off-by: Max Gekk <max.gekk@gmail.com> 11 August 2021, 10:38:52 UTC
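As a quick illustration of the literal formats that page now documents (values chosen arbitrarily; `spark` is an existing SparkSession):

```scala
// Year-month and day-time ANSI interval literals.
spark.sql("SELECT INTERVAL '2-3' YEAR TO MONTH").show()
spark.sql("SELECT INTERVAL '1 10:30:45.123' DAY TO SECOND").show()
```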
293a6cb [SPARK-36445][SQL] ANSI type coercion rule for date time operations ### What changes were proposed in this pull request? Implement a new rule for the date-time operations in the ANSI type coercion system: 1. Date will be converted to Timestamp when it is in a subtraction with Timestamp. 2. Promote string literals in date_add/date_sub/time_add ### Why are the changes needed? Currently the type coercion rule `DateTimeOperations` doesn't match the design of the ANSI type coercion system: 1. For date_add/date_sub, if the input is timestamp type, Spark should not convert it into date type since date type is narrower than the timestamp type. 2. For date_add/date_sub/time_add, a string value can be implicitly cast to date/timestamp only when it is a literal. Thus, we need to have a new rule for the date-time operations in the ANSI type coercion system. ### Does this PR introduce _any_ user-facing change? No, the ANSI type coercion rules are not released yet. ### How was this patch tested? New UT Closes #33666 from gengliangwang/datetimeOp. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 3029e62a828e33a6f45828dec57f6c7709cb32f7) Signed-off-by: Gengliang Wang <gengliang@apache.org> 11 August 2021, 03:56:00 UTC
161908c [SPARK-36463][SS] Prohibit update mode in streaming aggregation with session window ### What changes were proposed in this pull request? This PR proposes to prohibit update mode in streaming aggregation with session window. UnsupportedOperationChecker will check and prohibit the case. As a side effect, this PR also simplifies the code as we can remove the iterator implementation supporting outputs of update mode. This PR also cleans up test code via deduplicating. ### Why are the changes needed? The semantics of "update" mode for session window based streaming aggregation are quite unclear. For normal streaming aggregation, Spark will provide the outputs which can be "upsert"ed based on the grouping key. This is based on the fact that the grouping key won't change. This doesn't hold true for session window based streaming aggregation, as the session range changes. If end users leverage their knowledge about streaming aggregation, they will consider the key as grouping key + session (since they'll specify these things in groupBy), and it's highly likely that the existing row is not updated (overwritten), ending up with different rows. If end users consider the key as grouping key, there's a small chance for end users to upsert the session correctly, though only the last updated session will be stored, so it won't work with event time processing where there could be multiple active sessions. ### Does this PR introduce _any_ user-facing change? No, as we haven't released this feature. ### How was this patch tested? Updated tests. Closes #33689 from HeartSaVioR/SPARK-36463. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit ed60aaa9f19f619e61e34e4aa948e15ca78060bd) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 11 August 2021, 01:46:03 UTC
c6b683e [SPARK-36378][SHUFFLE] Switch to using RPCResponse to communicate common block push failures to the client We have run performance evaluations on the version of push-based shuffle committed to upstream so far, and have identified a few places for further improvements: 1. On the server side, we have noticed that the usage of `String.format`, especially when receiving a block push request, has a much higher overhead compared with string concatenation. 2. On the server side, the usage of `Throwables.getStackTraceAsString` in the `ErrorHandler.shouldRetryError` and `ErrorHandler.shouldLogError` has generated quite some overhead. These 2 issues are related to how we are currently handling certain common block push failures. We are communicating such failures via `RPCFailure` by transmitting the exception stack trace. This generates the overhead on both server and client side for creating these exceptions and makes checking the type of failures fragile and inefficient with string matching of exception stack trace. To address these, this PR also proposes to encode the common block push failure as an error code and send that back to the client with a proper RPC message. Improve shuffle service efficiency for push-based shuffle. Improve code robustness for handling block push failures. No Existing unit tests. Closes #33613 from Victsm/SPARK-36378. Lead-authored-by: Min Shen <mshen@linkedin.com> Co-authored-by: Min Shen <victor.nju@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 3f09093a21306b0fbcb132d4c9f285e56ac6b43c) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 10 August 2021, 21:54:21 UTC
c230ca9 [SPARK-36464][CORE] Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data ### What changes were proposed in this pull request? The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream`. This PR proposes to change the underlying `_size` variable from `Int` to `Long` at initialization. ### Why are the changes needed? Because the `size` method of `ChunkedByteBufferOutputStream` incorrectly returns a negative value when over 2GB data is written. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Passed existing tests ``` build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite" ``` Also added a new unit test ``` build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z SPARK-36464" ``` Closes #33690 from kazuyukitanimura/SPARK-36464. Authored-by: Kazuyuki Tanimura <ktanimura@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c888bad6a12b45f3eda8d898bdd90405985ee05c) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 10 August 2021, 17:30:07 UTC
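The overflow itself is easy to demonstrate in isolation (the names below are illustrative, not the actual ChunkedByteBufferOutputStream fields): summing chunk sizes past 2GB in an Int wraps negative, while a Long tracks the true size.

```scala
val chunkSize = 64 * 1024 * 1024          // 64 MB per chunk
val numChunks = 40                        // ~2.5 GB in total, past Int.MaxValue

val sizeAsInt: Int   = (0 until numChunks).map(_ => chunkSize).sum          // wraps to a negative value
val sizeAsLong: Long = (0 until numChunks).map(_ => chunkSize.toLong).sum   // 2684354560, correct

println(s"Int size: $sizeAsInt, Long size: $sizeAsLong")
```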
6018d44 [SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported ### What changes were proposed in this pull request? Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior is different between `from_json` and `from_csv`. ``` -- !query select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) -- !query schema struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>> -- !query output {"t":null} ``` ``` -- !query select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) -- !query schema struct<> -- !query output java.lang.Exception Unsupported type: timestamp_ntz ``` We should make `from_json` throw an exception too. This PR fixes the issue discussed at https://github.com/apache/spark/pull/33640#discussion_r682862523 ### Why are the changes needed? Make the behavior of `from_json` more reasonable. ### Does this PR introduce _any_ user-facing change? 'Yes'. `from_json` now throws an exception when we set spark.sql.timestampType=TIMESTAMP_NTZ. ### How was this patch tested? Tests updated. Closes #33684 from beliefer/SPARK-36429-new. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 186815be1c0832e2f976faf6f2d5c31e42eebf1a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 August 2021, 14:52:39 UTC
fb6f379 [SPARK-36431][SQL] Support TypeCoercion of ANSI intervals with different fields ### What changes were proposed in this pull request? Support TypeCoercion of ANSI intervals with different fields ### Why are the changes needed? Support TypeCoercion of ANSI intervals with different fields ### Does this PR introduce _any_ user-facing change? After this pr user can - use comparison function with different fields of DayTimeIntervalType/YearMonthIntervalType such as `INTERVAL '1' YEAR` > `INTERVAL '11' MONTH` - support different field of ansi interval type in collection function such as `array(INTERVAL '1' YEAR, INTERVAL '11' MONTH)` - support different field of ansi interval type in `coalesce` etc.. ### How was this patch tested? Added UT Closes #33661 from AngersZhuuuu/SPARK-SPARK-36431. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 89d8a4eacfd09f67ad31bf1cbf7d4b88de3b1e24) Signed-off-by: Max Gekk <max.gekk@gmail.com> 10 August 2021, 11:22:47 UTC
45acd00 [SPARK-36466][SQL] Table in unloaded catalog referenced by view should load correctly ### What changes were proposed in this pull request? Retain `spark.sql.catalog.*` confs when resolving a view. ### Why are the changes needed? Currently, if a view in the default catalog references a table in another catalog (e.g. JDBC), `org.apache.spark.sql.AnalysisException: Table or view not found: cat.t` will be thrown on accessing the view if the catalog has not been loaded yet. ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Add UT. Closes #33692 from pan3793/SPARK-36466. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 7f56b73cad0e38498aa3b2bdf6a5c22388175dea) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 August 2021, 09:31:36 UTC
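A hedged sketch of the scenario this fixes, spark-shell style; the catalog name, JDBC URL, and driver below are illustrative assumptions, not values taken from the change itself:

```scala
// Register a v2 JDBC catalog named "cat" (URL and driver are made-up examples).
spark.conf.set("spark.sql.catalog.cat",
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.cat.url", "jdbc:h2:mem:testdb")
spark.conf.set("spark.sql.catalog.cat.driver", "org.h2.Driver")

// A view in the default catalog that references a table in the "cat" catalog.
spark.sql("CREATE VIEW v AS SELECT * FROM cat.t")

// In a session where the "cat" confs were not retained while resolving the view,
// this used to fail with: Table or view not found: cat.t
spark.sql("SELECT * FROM v").show()
```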
882ef6d [SPARK-36449][SQL] v2 ALTER TABLE REPLACE COLUMNS should check duplicates for the user specified columns ### What changes were proposed in this pull request? Currently, v2 ALTER TABLE REPLACE COLUMNS does not check duplicates for the user specified columns. For example, ``` spark.sql(s"CREATE TABLE $t (id int) USING $v2Format") spark.sql(s"ALTER TABLE $t REPLACE COLUMNS (data string, data string)") ``` doesn't fail the analysis, and it's up to the catalog implementation to handle it. ### Why are the changes needed? To check the duplicate columns during analysis. ### Does this PR introduce _any_ user-facing change? Yes, now the above command will print out the following: ``` org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data` ``` ### How was this patch tested? Added new unit tests Closes #33676 from imback82/replace_cols_duplicates. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e1a5d9411733437b5a18045bbd18b48f7aa40f46) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 August 2021, 05:20:50 UTC
1a432fe [SPARK-36460][SHUFFLE] Pull out NoOpMergedShuffleFileManager inner class outside ### What changes were proposed in this pull request? Pull out the NoOpMergedShuffleFileManager inner class to a top-level class. This is required since passing a dollar sign ($) in the config (`spark.shuffle.server.mergedShuffleFileManagerImpl`) value can be an issue. Currently `spark.shuffle.server.mergedShuffleFileManagerImpl` is by default set to `org.apache.spark.network.shuffle.ExternalBlockHandler$NoOpMergedShuffleFileManager`. After this change the default value will be set to `org.apache.spark.network.shuffle.NoOpMergedShuffleFileManager`. ### Why are the changes needed? Passing `$` for the config value can be an issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modified existing unit tests. Closes #33688 from venkata91/SPARK-36460. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit df0de83c4679a5c8a9c0e3c7995bc7d692e121f0) Signed-off-by: Gengliang Wang <gengliang@apache.org> 10 August 2021, 02:19:41 UTC
38dd42a [SPARK-36332][SHUFFLE] Cleanup RemoteBlockPushResolver log messages ### What changes were proposed in this pull request? Cleanup `RemoteBlockPushResolver` log messages by using `AppShufflePartitionInfo#toString()` to avoid duplications. Currently this is based off of https://github.com/apache/spark/pull/33034; those changes will be removed once it is merged, along with the WIP marker. ### Why are the changes needed? Minor cleanup to make code more readable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No tests, just changing log messages Closes #33561 from venkata91/SPARK-36332. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com> Signed-off-by: yi.wu <yi.wu@databricks.com> (cherry picked from commit ab897109a3fa9f83a20857a292dd68fe97e447a8) Signed-off-by: yi.wu <yi.wu@databricks.com> 10 August 2021, 01:54:54 UTC
10f7f6e [SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2 ### What changes were proposed in this pull request? not push down partition filter to `ORCScan` for DSv2 ### Why are the changes needed? Seems to me that partition filter is only used for partition pruning and shouldn't be pushed down to `ORCScan`. We don't push down partition filter to ORCScan in DSv1 ``` == Physical Plan == *(1) Filter (isnotnull(value#19) AND NOT (value#19 = a)) +- *(1) ColumnarToRow +- FileScan orc [value#19,p1#20,p2#21] Batched: true, DataFilters: [isnotnull(value#19), NOT (value#19 = a)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/pt/_5f4sxy56x70dv9zpz032f0m0000gn/T/spark-c1..., PartitionFilters: [isnotnull(p1#20), isnotnull(p2#21), (p1#20 = 1), (p2#21 = 2)], PushedFilters: [IsNotNull(value), Not(EqualTo(value,a))], ReadSchema: struct<value:string> ``` Also, we don't push down partition filter for parquet in DSv2. https://github.com/apache/spark/pull/30652 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suites Closes #33680 from huaxingao/orc_filter. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b04330cd38e2817748ff50a7bf63b7145ea85cd4) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 August 2021, 17:47:23 UTC
9dc8d0c [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState ### What changes were proposed in this pull request? This PR proposes to add a new example of complex sessionization, which leverages flatMapGroupsWithState. ### Why are the changes needed? We have replaced an example of sessionization from flatMapGroupsWithState to native support of session window. Given there are still use cases on sessionization which native support of session window cannot cover, it would be nice if we can demonstrate such case. It will also be used as an example of flatMapGroupsWithState. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested. Example data is given in class doc. Closes #33681 from HeartSaVioR/SPARK-36455. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 004b87d9a1fbe9063e52523505a247806d50455a) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 09 August 2021, 10:40:37 UTC
5bddafe [SPARK-36430][SQL] Adaptively calculate the target size when coalescing shuffle partitions in AQE ### What changes were proposed in this pull request? This PR fixes a performance regression introduced in https://github.com/apache/spark/pull/33172 Before #33172 , the target size is adaptively calculated based on the default parallelism of the spark cluster. Sometimes it's very small and #33172 sets a min partition size to fix perf issues. Sometimes the calculated size is reasonable, such as dozens of MBs. After #33172 , we no longer calculate the target size adaptively, and by default always coalesce the partitions into 1 MB. This can cause perf regression if the adaptively calculated size is reasonable. This PR brings back the code that adaptively calculate the target size based on the default parallelism of the spark cluster. ### Why are the changes needed? fix perf regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #33655 from cloud-fan/minor. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9a539d5846814f5fd5317b9d0b7eb1a41299f092) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 August 2021, 09:26:11 UTC
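To make "adaptively calculate the target size" concrete, here is a rough, hedged sketch of the idea (not Spark's actual implementation; the function name and the exact capping rules are assumptions): spread the total shuffle bytes across the cluster's default parallelism, but never go below the configured minimum partition size.

```scala
// Hedged sketch: derive a coalesce target size from the total shuffle size and
// the cluster's default parallelism, with the min partition size as a floor.
def adaptiveTargetSize(
    totalShuffleBytes: Long,
    defaultParallelism: Int,
    minPartitionSizeBytes: Long): Long = {
  val perTask = math.ceil(totalShuffleBytes.toDouble / defaultParallelism).toLong
  math.max(perTask, minPartitionSizeBytes)
}

// e.g. 10GB of shuffle data, defaultParallelism = 200, 1MB floor:
// roughly 51MB per coalesced partition rather than a fixed 1MB target.
adaptiveTargetSize(10L * 1024 * 1024 * 1024, 200, 1024L * 1024)
```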
2a46bf6 [SPARK-36271][SQL] Unify V1 insert field name check before preparing writer ### What changes were proposed in this pull request? Unify the DataSource V1 insert schema field-name check before preparing the writer. This PR also adds the check for Avro V1 inserts. ### Why are the changes needed? Unify the code and add the check for Avro V1 inserts too. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #33566 from AngersZhuuuu/SPARK-36271. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f3e079b09b5877c97fc10864937e76d866935880) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 August 2021, 09:18:20 UTC
a5ecf2a [SPARK-36352][SQL] Spark should check result plan's output schema name ### What changes were proposed in this pull request? Spark should check the result plan's output schema name. ### Why are the changes needed? In the current code, some optimizer rules may change a plan's output schema, since we always use semantic equality to check the output, which does not catch changes to output names. For example, for SchemaPruning, if we have a plan ``` Project[a, B] |--Scan[A, b, c] ``` the origin output schema is `a, B`; after SchemaPruning it becomes ``` Project[A, b] |--Scan[A, b] ``` which changes the plan's schema. When we use CTAS, the table schema is the same as the query plan's output, so once the schema is changed it is no longer consistent with the original SQL. Therefore we need to check the final result plan's schema against the original plan's schema. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing UT Closes #33583 from AngersZhuuuu/SPARK-36352. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e051a540a10cdda42dc86a6195c0357aea8900e4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 August 2021, 08:48:10 UTC
94dc3c7 [SPARK-35881][SQL][FOLLOWUP] Add a boolean flag in AdaptiveSparkPlanExec to ask for columnar output ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/33140 to propose a simpler idea for integrating columnar execution into AQE. Instead of making the `ColumnarToRowExec` and `RowToColumnarExec` dynamic to handle `AdaptiveSparkPlanExec`, it's simpler to let the consumer decide if it needs columnar output or not, and pass a boolean flag to `AdaptiveSparkPlanExec`. For Spark vendors, they can set the flag differently in their custom columnar parquet writing command when the input plan is `AdaptiveSparkPlanExec`. One argument is if we need to look at the final plan of AQE and consume the data differently (either row or columnar format). I can't think of a use case and I think we can always statically know if the AQE plan should output row or columnar data. ### Why are the changes needed? code simplification. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? manual test Closes #33624 from cloud-fan/aqe. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 8714eefe6f975e6b106b59e2ab3af53df4555dce) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 August 2021, 08:34:10 UTC
f0844f7 [SPARK-36432][BUILD] Upgrade Jetty version to 9.4.43 ### What changes were proposed in this pull request? This PR upgrades Jetty version to `9.4.43.v20210629`. ### Why are the changes needed? To address vulnerability https://nvd.nist.gov/vuln/detail/CVE-2021-34429 which affects Jetty `9.4.42.v20210604`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI Closes #33656 from this/upgrade-jetty-9.4.43. Lead-authored-by: Sajith Ariyarathna <sajith.janaprasad@gmail.com> Co-authored-by: Sajith Ariyarathna <this@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5a22f9ceaf4fa61e4c2d3ad47e7bf5071c0d6eaa) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 August 2021, 01:14:16 UTC
552a332 [SPARK-36423][SHUFFLE] Randomize order of blocks in a push request to improve block merge ratio for push-based shuffle ### What changes were proposed in this pull request? On the client side, we are currently randomizing the order of push requests before processing each request. In addition, we can further randomize the order of blocks within each push request before pushing them. In our benchmark, this has resulted in a 60%-70% reduction of blocks that fail to be merged due to block collision (the existing block merge ratio is already pretty good in general, and this further improves it). ### Why are the changes needed? Improve block merge ratio for push-based shuffle ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Straightforward small change, no additional test needed. Closes #33649 from Victsm/SPARK-36423. Lead-authored-by: Min Shen <mshen@linkedin.com> Co-authored-by: Min Shen <victor.nju@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 6e729515fd2bb228afed964b50f0d02329684934) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 06 August 2021, 14:48:31 UTC
a5d0eaf [SPARK-595][DOCS] Add local-cluster mode option in Documentation ### What changes were proposed in this pull request? Add local-cluster mode option to submitting-applications.md ### Why are the changes needed? Help users to find/use this option for unit tests. ### Does this PR introduce _any_ user-facing change? Yes, docs changed. ### How was this patch tested? `SKIP_API=1 bundle exec jekyll build` <img width="460" alt="docchange" src="https://user-images.githubusercontent.com/87687356/127125380-6beb4601-7cf4-4876-b2c6-459454ce2a02.png"> Closes #33537 from yutoacts/SPARK-595. Lead-authored-by: Yuto Akutsu <yuto.akutsu@jp.nttdata.com> Co-authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com> Co-authored-by: Yuto Akutsu <87687356+yutoacts@users.noreply.github.com> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit 41b011e416286374e2e8e8dea36ba79f4c403040) Signed-off-by: Thomas Graves <tgraves@apache.org> 06 August 2021, 14:27:24 UTC
586eb5d Revert "[SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported" ### What changes were proposed in this pull request? This PR reverts the change in SPARK-36429 (#33654). See [conversation](https://github.com/apache/spark/pull/33654#issuecomment-894160037). ### Why are the changes needed? To recover CIs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #33670 from sarutak/revert-SPARK-36429. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit e17612d0bfa1b1dc719f6f2c202e2a4ea7870ff1) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 06 August 2021, 11:56:40 UTC
33e4ce5 [SPARK-36339][SQL] References to grouping that are not part of aggregation should be replaced ### What changes were proposed in this pull request? Currently, references to grouping sets are reported as errors after aggregated expressions, e.g. ``` SELECT count(name) c, name FROM VALUES ('Alice'), ('Bob') people(name) GROUP BY name GROUPING SETS(name); ``` Error in query: expression 'people.`name`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; ### Why are the changes needed? The anonymous function passed to `map` in the `constructAggregateExprs` function does not use underscores, which keeps references to grouping that are not part of the aggregation from being replaced; this PR fixes that. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests. Closes #33574 from gaoyajun02/SPARK-36339. Lead-authored-by: gaoyajun02 <gaoyajun02@gmail.com> Co-authored-by: gaoyajun02 <gaoyajun02@meituan.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 888f8f03c89ea7ee8997171eadf64c87e17c4efe) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 August 2021, 08:35:01 UTC
f3761bd [SPARK-36429][SQL][FOLLOWUP] Update a golden file to comply with the change in SPARK-36429 ### What changes were proposed in this pull request? This PR updates a golden file to comply with the change in SPARK-36429 (#33654). ### Why are the changes needed? To recover the GA failure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA itself. Closes #33663 from sarutak/followup-SPARK-36429. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 63c7d1847d97dca5ceb9a46c77a623cb78565f5b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 August 2021, 07:21:29 UTC
be19270 [SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported ### What changes were proposed in this pull request? Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior is different between `from_json` and `from_csv`. ``` -- !query select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) -- !query schema struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>> -- !query output {"t":null} ``` ``` -- !query select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) -- !query schema struct<> -- !query output java.lang.Exception Unsupported type: timestamp_ntz ``` We should make `from_json` throw an exception too. This PR fixes the issue discussed at https://github.com/apache/spark/pull/33640#discussion_r682862523 ### Why are the changes needed? Make the behavior of `from_json` more reasonable. ### Does this PR introduce _any_ user-facing change? 'Yes'. `from_json` now throws an exception when we set spark.sql.timestampType=TIMESTAMP_NTZ. ### How was this patch tested? Tests updated. Closes #33654 from beliefer/SPARK-36429. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 August 2021, 06:01:13 UTC
f719e9c [SPARK-36409][SQL][TESTS] Splitting test cases from datetime.sql ### What changes were proposed in this pull request? Currently `datetime.sql` contains a lot of tests and will be run 3 times: default mode, ansi mode, ntz mode. This wastes test time, and such large test files are also hard to read. This PR proposes to split it into smaller ones: 1. `date.sql`, which contains date literals, functions and operations. It will be run twice with default and ansi mode. 2. `timestamp.sql`, which contains timestamp (no ltz or ntz suffix) literals, functions and operations. It will be run 4 times: default mode + ansi off, default mode + ansi on, ntz mode + ansi off, ntz mode + ansi on. 3. `datetime_special.sql`, which creates datetime values whose year is outside of [0, 9999]. This is a separate file as JDBC doesn't support them and needs to ignore this test file. It will be run 4 times as well. 4. `timestamp_ltz.sql`, which contains timestamp_ltz literals and constructors. It will be run twice with default and ntz mode, to make sure its result doesn't change with the timestamp mode. Note that operations with ltz are tested by `timestamp.sql`. 5. `timestamp_ntz.sql`, which contains timestamp_ntz literals and constructors. It will be run twice with default and ntz mode, to make sure its result doesn't change with the timestamp mode. Note that operations with ntz are tested by `timestamp.sql`. ### Why are the changes needed? Reduce test run time. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #33640 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 06 August 2021, 04:55:31 UTC
87291dc [SPARK-36415][SQL][DOCS] Add docs for try_cast/try_add/try_divide ### What changes were proposed in this pull request? Add documentation for new functions try_cast/try_add/try_divide ### Why are the changes needed? Better documentation. These new functions are useful when migrating to the ANSI dialect. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Build docs and preview: ![image](https://user-images.githubusercontent.com/1097932/128209312-34a6cc6a-a73d-4aed-8646-22b1cb7ce702.png) Closes #33638 from gengliangwang/addDocForTry. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8a35243fa7494f3650e3277956da60823f691c8f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 06 August 2021, 03:33:19 UTC
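For readers unfamiliar with these functions, a quick hedged usage example (spark-shell style; the inline comments describe the documented NULL-on-error behavior rather than output from a verified run):

```scala
// The try_* variants return NULL instead of raising a runtime error, which eases
// migration to the ANSI dialect.
spark.sql("SELECT try_divide(10, 0)").show()        // NULL instead of a divide-by-zero error
spark.sql("SELECT try_cast('abc' AS INT)").show()   // NULL instead of a cast failure
spark.sql("SELECT try_add(2147483647, 1)").show()   // NULL instead of an integer overflow error
```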
bcf2169 [SPARK-36393][BUILD][FOLLOW-UP] Try to raise memory for GHA ### What changes were proposed in this pull request? As a follow-up, raise memory in two places that were previously missed. ### Why are the changes needed? Raise memory for GHA. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes #33658 from viirya/increasing-mem-ga-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit cd070f1b9ce92e616d9d8af535a7d643ed95d065) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 06 August 2021, 03:09:40 UTC
0bb88c9 [SPARK-36421][SQL][DOCS] Use ConfigEntry.key to fix docs and set command results ### What changes were proposed in this pull request? This PR fixes the issue that a `ConfigEntry` is interpolated into a doc field directly without calling `.key`, which causes malformed documents on the web site and in the result of `SET -v`. 1. https://spark.apache.org/docs/3.1.2/configuration.html#static-sql-configuration - spark.sql.hive.metastore.jars 2. set -v ![image](https://user-images.githubusercontent.com/8326978/128292412-85100f95-24fd-4b40-a14f-d31a256dab7d.png) ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no, but contains doc fix ### How was this patch tested? new tests Closes #33647 from yaooqinn/SPARK-36421. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c7fa3c9090d88115c088570627138e1171c7a0eb) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 06 August 2021, 02:01:54 UTC
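The bug class being fixed is easy to see with a toy model. The sketch below is illustrative only (the `ConfigEntry` shown is a stand-in, not Spark's actual class): interpolating the entry object into a doc string leaks its `toString`, while interpolating `.key` renders just the config name.

```scala
// Stand-in for a config entry, just to show the difference in interpolation.
case class ConfigEntry[T](key: String, default: T, doc: String) {
  override def toString: String = s"ConfigEntry(key=$key, defaultValue=$default)"
}

val HIVE_METASTORE_JARS =
  ConfigEntry("spark.sql.hive.metastore.jars", "builtin", "Location of Hive metastore jars.")

// Wrong: the whole ConfigEntry toString ends up in the generated docs / SET -v output.
val malformedDoc = s"See $HIVE_METASTORE_JARS for details."
// Right: only the config name appears.
val properDoc    = s"See ${HIVE_METASTORE_JARS.key} for details."
```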
efd33b7 [SPARK-36441][INFRA] Fix GA failure related to downloading lintr dependencies ### What changes were proposed in this pull request? This PR fixes a GA failure which is related to downloading lintr dependencies. ``` * installing *source* package ‘devtools’ ... ** package ‘devtools’ successfully unpacked and MD5 sums checked ** using staged installation ** R ** inst ** byte-compile and prepare package for lazy loading ** help *** installing help indices *** copying figures ** building package indices ** installing vignettes ** testing if installed package can be loaded from temporary location ** testing if installed package can be loaded from final location ** testing if installed package keeps a record of temporary installation path * DONE (devtools) The downloaded source packages are in ‘/tmp/Rtmpv53Ix4/downloaded_packages’ Using bundled GitHub PAT. Please add your own PAT to the env var `GITHUB_PAT` Error: Failed to install 'unknown package' from GitHub: HTTP error 401. Bad credentials ``` I re-triggered the GA job but it still fail with the same error. https://github.com/apache/spark/runs/3257853825 The issue seems to happen when downloading lintr dependencies from GitHub. So, the solution is to change the way to download them. ### Why are the changes needed? To recover GA. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA itself. Closes #33660 from sarutak/fix-r-package-issue. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e13bd586f1854df872f32ebd3c7686fc4013e06e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 06 August 2021, 01:49:37 UTC
1785ead [SPARK-36414][SQL] Disable timeout for BroadcastQueryStageExec in AQE ### What changes were proposed in this pull request? This reverts SPARK-31475, as there are always more concurrent jobs running in AQE mode, especially when running multiple queries at the same time. Currently, the broadcast timeout is not measured for the BroadcastQueryStageExec alone; it also includes the time spent waiting to be scheduled. If all the resources are occupied materializing other stages, it times out without ever getting a chance to actually run. ![image](https://user-images.githubusercontent.com/8326978/128169612-4c96c8f6-6f8e-48ed-8eaf-450f87982c3b.png) The default value is 300s, and it's hard to adjust the timeout for AQE mode. Usually, you need an extremely large number for real-world cases. As you can see in the example above, the timeout we used was 1800s, and it obviously needed 3x more or so. ### Why are the changes needed? AQE is the default now; we can make it more stable with this PR. ### Does this PR introduce _any_ user-facing change? yes, the broadcast timeout is no longer used for AQE ### How was this patch tested? modified test Closes #33636 from yaooqinn/SPARK-36414. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 0c94e47aecab0a8c346e1a004686d1496a9f2b07) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2021, 13:15:48 UTC
bf4edb5 [SPARK-36353][SQL] RemoveNoopOperators should keep output schema ### What changes were proposed in this pull request? RemoveNoopOperators should keep the output schema. ### Why are the changes needed? Expand function ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed Closes #33587 from AngersZhuuuu/SPARK-36355. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 02810eecbfae6cbfcd91c2a8f9a95aee93031451) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2021, 12:43:48 UTC
712c311 [SPARK-36393][BUILD] Try to raise memory for GHA ### What changes were proposed in this pull request? According to the feedback from GitHub, the change causing memory issue has been rolled back. We can try to raise memory again for GA. ### Why are the changes needed? Trying higher memory settings for GA. It could speed up the testing time. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes #33623 from viirya/increasing-mem-ga. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 7d13ac177ba745e42c406ee8aa1594cc3448dbc6) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 05 August 2021, 08:31:45 UTC
b4c065b [SPARK-36391][SHUFFLE] When state is removed an NPE will be thrown, and we should improve the error message ### What changes were proposed in this pull request? When a channel is terminated, `connectionTerminated` will be called and the corresponding StreamState removed; then all incoming requests on this StreamState will throw an NPE like ``` 2021-07-31 22:00:24,810 ERROR server.ChunkFetchRequestHandler (ChunkFetchRequestHandler.java:lambda$respond$1(146)) - Error sending result ChunkFetchFailure[streamChunkId=StreamChunkId[streamId=1119950114515,chunkIndex=0],errorString=java.lang.NullPointerException at org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:80) at org.apache.spark.network.server.ChunkFetchRequestHandler.processFetchRequest(ChunkFetchRequestHandler.java:101) at org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:82) at org.apache.spark.network.server.ChunkFetchRequestHandler.channelRead0(ChunkFetchRequestHandler.java:51) at org.sparkproject.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:61) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:370) at org.sparkproject.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at org.sparkproject.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at org.sparkproject.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) at org.sparkproject.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at org.sparkproject.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at org.sparkproject.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748) ] to /ip:53818; closing connection java.nio.channels.ClosedChannelException at org.sparkproject.io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957) at org.sparkproject.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865) at org.sparkproject.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702) at org.sparkproject.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:112) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702) at org.sparkproject.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764) at org.sparkproject.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104) at org.sparkproject.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at org.sparkproject.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at org.sparkproject.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497) at org.sparkproject.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at org.sparkproject.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at org.sparkproject.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748) ``` Since the JVM will not show the stack trace of an NPE if it happens many times, the log ends up looking like ``` 2021-07-28 08:25:44,720 ERROR server.ChunkFetchRequestHandler (ChunkFetchRequestHandler.java:lambda$respond$1(146)) - Error sending result ChunkFetchFailure[streamChunkId=StreamChunkId[streamId=1187623335353,chunkIndex=11],errorString=java.lang.NullPointerException ] to /10.130.10.5:42148; closing connection java.nio.channels.ClosedChannelException ``` which confuses users, so we should improve this error message. ### Why are the changes needed? Improve error message ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Closes #33622 from AngersZhuuuu/SPARK-36391. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: yi.wu <yi.wu@databricks.com> (cherry picked from commit b377ea26e2f03e263f317caeabbf8ec866e890c3) Signed-off-by: yi.wu <yi.wu@databricks.com> 05 August 2021, 07:32:05 UTC
cc2a5ab [SPARK-36384][CORE][DOC] Add doc for shuffle checksum ### What changes were proposed in this pull request? Add doc for the shuffle checksum configs in `configuration.md`. ### Why are the changes needed? doc ### Does this PR introduce _any_ user-facing change? No, since Spark 3.2 hasn't been released. ### How was this patch tested? Pass existed tests. Closes #33637 from Ngone51/SPARK-36384. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 3b92c721b5c08c76c3aad056d3170553d0b52f85) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 August 2021, 01:16:53 UTC
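As a hedged illustration of the configs that this doc change describes (the config names and defaults below are recalled from the shuffle checksum work and should be treated as assumptions to verify against `configuration.md`):

```scala
import org.apache.spark.sql.SparkSession

// Enable shuffle checksums and pick the checksum algorithm when building the session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("shuffle-checksum-config-sketch")
  .config("spark.shuffle.checksum.enabled", "true")       // write checksums for shuffle blocks
  .config("spark.shuffle.checksum.algorithm", "ADLER32")  // checksum algorithm, e.g. ADLER32 or CRC32
  .getOrCreate()
```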
070169e [SPARK-36068][BUILD][TEST] No tests in hadoop-cloud run unless hadoop-3.2 profile is activated explicitly ### What changes were proposed in this pull request? This PR fixes an issue that no tests in `hadoop-cloud` are compiled and run unless `hadoop-3.2` profile is activated explicitly. The root cause seems similar to SPARK-36067 (#33276) so the solution is to activate `hadoop-3.2` profile in `hadoop-cloud/pom.xml` by default. This PR introduced an empty profile for `hadoop-2.7`. Without this, building with `hadoop-2.7` fails. ### Why are the changes needed? `hadoop-3.2` profile should be activated by default so tests in `hadoop-cloud` also should be compiled and run without activating `hadoop-3.2` profile explicitly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Confirmed tests in `hadoop-cloud` ran with both SBT and Maven. ``` build/sbt -Phadoop-cloud "hadoop-cloud/test" ... [info] CommitterBindingSuite: [info] - BindingParquetOutputCommitter binds to the inner committer (258 milliseconds) [info] - committer protocol can be serialized and deserialized (11 milliseconds) [info] - local filesystem instantiation (3 milliseconds) [info] - reject dynamic partitioning (1 millisecond) [info] Run completed in 1 second, 234 milliseconds. [info] Total number of tests run: 4 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. build/mvn -Phadoop-cloud -pl hadoop-cloud test ... CommitterBindingSuite: - BindingParquetOutputCommitter binds to the inner committer - committer protocol can be serialized and deserialized - local filesystem instantiation - reject dynamic partitioning Run completed in 560 milliseconds. Total number of tests run: 4 Suites: completed 2, aborted 0 Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` I also confirmed building with `-Phadoop-2.7` successfully finishes with both SBT and Maven. ``` build/sbt -Phadoop-cloud -Phadoop-2.7 "hadoop-cloud/Test/compile" build/mvn -Phadoop-cloud -Phadoop-2.7 -pl hadoop-cloud testCompile ``` Closes #33277 from sarutak/fix-hadoop-3.2-cloud. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 0f5c3a4fd642dcfcdfcf1ccfba4556acd333b764) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 August 2021, 00:39:37 UTC
7d685df [SPARK-36354][CORE] EventLogFileReader should skip rolling event log directories with no logs ### What changes were proposed in this pull request? This PR aims to skip rolling event log directories which has only `appstatus` file. ### Why are the changes needed? Currently, Spark History server shows `IllegalArgumentException` warning, but the event log might arrive later. The situation also can happen when the job is killed before uploading its first log to the remote storages like S3. ``` 21/07/30 07:38:26 WARN FsHistoryProvider: Error while reading new log s3a://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771 java.lang.IllegalArgumentException: requirement failed: Log directory must contain at least one event log file! ... at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:216) ``` ### Does this PR introduce _any_ user-facing change? Yes. Users will not see `IllegalArgumentException` warnings. ### How was this patch tested? Pass the CIs with the newly added test case. Closes #33586 from dongjoon-hyun/SPARK-36354. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 28a2a2238fbaf4fad3c98cfef2b3049c1f4616c8) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 04 August 2021, 11:26:24 UTC
0a98e51 [SPARK-32923][FOLLOW-UP] Clean up older shuffleMergeId shuffle files when finalize request for higher shuffleMergeId is received ### What changes were proposed in this pull request? Clean up older shuffleMergeId shuffle files when a finalize request for a higher shuffleMergeId is received and no blocks were pushed for the corresponding shuffleMergeId. This is identified as part of https://github.com/apache/spark/pull/33034#discussion_r680610872. ### Why are the changes needed? Without this change, older shuffleMergeId files won't be cleaned up properly. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added changes to existing unit test to address this case. Closes #33605 from venkata91/SPARK-32923-follow-on. Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit d8169493b662acac31d3fc5e6c5051917428c974) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 04 August 2021, 08:30:56 UTC
f7ab2bf [SPARK-35811][PYTHON][FOLLOWUP] Deprecate DataFrame.to_spark_io ### What changes were proposed in this pull request? This PR is followup for https://github.com/apache/spark/pull/32964, to improve the warning message. ### Why are the changes needed? To improve the warning message. ### Does this PR introduce _any_ user-facing change? The warning is changed from "Deprecated in 3.2, Use `spark.to_spark_io` instead." to "Deprecated in 3.2, Use `DataFrame.spark.to_spark_io` instead." ### How was this patch tested? Manually run `dev/lint-python` Closes #33631 from itholic/SPARK-35811-followup. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 3d72c20e64c18e6e10dc862ab19f07342fcdb2d6) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 August 2021, 07:20:37 UTC
6e8187b [MINOR][DOC] Remove obsolete `contributing-to-spark.md` ### What changes were proposed in this pull request? This PR removes obsolete `contributing-to-spark.md` which is not referenced from anywhere. ### Why are the changes needed? Just clean up. ### Does this PR introduce _any_ user-facing change? No. Users can't have access to contributing-to-spark.html unless they directly point to the URL. ### How was this patch tested? Built the document and confirmed that this change doesn't affect the result. Closes #33619 from sarutak/remove-obsolete-contribution-doc. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c31b653806f19ffce018651b6953bf47d019d7e8) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 August 2021, 01:19:32 UTC
8c42232 [SPARK-36381][SQL] Add case sensitive and case insensitive compare for checking column name exist when alter table ### What changes were proposed in this pull request? Pass the Resolver to `checkColumnNotExists` so the column-name existence check respects case sensitivity. ### Why are the changes needed? Right now the resolver used by `findNestedField`, which is called by `checkColumnNotExists`, is `_ == _`; this change passes `alter.conf.resolver` to it instead. [SPARK-36381](https://issues.apache.org/jira/browse/SPARK-36381) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add ut tests Closes #33618 from Peng-Lei/sensitive-cloumn-name. Authored-by: PengLei <peng.8lei@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 87d49cbcb1b9003763148dec6a3b067cf86f6ab3) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 August 2021, 01:04:25 UTC
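The "resolver" mentioned above is just a name-comparison function. A minimal sketch of the two strategies (mirroring what case-sensitive and case-insensitive resolution do; the value names here are illustrative):

```scala
// A resolver decides whether two column names refer to the same column.
val caseSensitiveResolver: (String, String) => Boolean   = _ == _
val caseInsensitiveResolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

caseSensitiveResolver("ID", "id")    // false: "ID" is treated as a different column
caseInsensitiveResolver("ID", "id")  // true:  "ID" and "id" are the same column
```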
bd33408 [SPARK-36349][SQL] Disallow ANSI intervals in file-based datasources ### What changes were proposed in this pull request? In the PR, I propose to ban `YearMonthIntervalType` and `DayTimeIntervalType` at the analysis phase while creating a table using a built-in file-based datasource or writing a dataset to such a datasource. In particular, add the following case: ```scala case _: DayTimeIntervalType | _: YearMonthIntervalType => false ``` to all methods that override either: - V2 `FileTable.supportsDataType()` - V1 `FileFormat.supportDataType()` ### Why are the changes needed? To improve user experience with Spark SQL, and output a proper error message at the analysis phase. ### Does this PR introduce _any_ user-facing change? Yes, but ANSI interval types haven't been released yet, so for users this is new behavior. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt -Phive-2.3 "test:testOnly *HiveOrcSourceSuite" ``` Closes #33580 from MaxGekk/interval-ban-in-ds. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 67cbc932638179925ebbeb76d6d6e6f25a3cb2e2) Signed-off-by: Max Gekk <max.gekk@gmail.com> 03 August 2021, 17:30:33 UTC
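A hedged sketch of where that `case` lands, shaped like a `supportsDataType` override (simplified; the surrounding cases are illustrative and not copied from any particular `FileFormat`):

```scala
import org.apache.spark.sql.types._

// ANSI interval types are rejected so that CREATE TABLE / write fails at analysis
// time with a proper error instead of a late runtime failure.
def supportsDataType(dt: DataType): Boolean = dt match {
  case _: DayTimeIntervalType | _: YearMonthIntervalType => false  // the new ban
  case ArrayType(elementType, _) => supportsDataType(elementType)
  case MapType(keyType, valueType, _) =>
    supportsDataType(keyType) && supportsDataType(valueType)
  case StructType(fields) => fields.forall(f => supportsDataType(f.dataType))
  case _ => true  // other types: handled as before (sketch simplification)
}
```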
144040a [SPARK-36383][CORE][3.2] Avoid NullPointerException during executor shutdown ### What changes were proposed in this pull request? Fix `NullPointerException` in `Executor.stop()`. ### Why are the changes needed? Some initialization steps could fail before the initialization of `metricsPoller`, `heartbeater`, `threadPool`, which results in the null of `metricsPoller`, `heartbeater`, `threadPool`. For example, I encountered a failure of: https://github.com/apache/spark/blob/c20af535803a7250fef047c2bf0fe30be242369d/core/src/main/scala/org/apache/spark/executor/Executor.scala#L137 where the executor itself failed to register at the driver. This PR helps to eliminate the error messages when the issue happens to not confuse users: <details> <summary><mark><font color=darkred>[click to see the detailed error message]</font></mark></summary> <pre> 21/07/23 16:04:10 WARN Executor: Unable to stop executor metrics poller java.lang.NullPointerException at org.apache.spark.executor.Executor.stop(Executor.scala:318) at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 21/07/23 16:04:10 WARN Executor: Unable to stop heartbeater java.lang.NullPointerException at org.apache.spark.executor.Executor.stop(Executor.scala:324) at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 21/07/23 16:04:10 ERROR Utils: Uncaught exception in thread shutdown-hook-0 
java.lang.NullPointerException at org.apache.spark.executor.Executor.$anonfun$stop$3(Executor.scala:334) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:231) at org.apache.spark.executor.Executor.stop(Executor.scala:334) at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025) at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) </pre> </details> ### Does this PR introduce _any_ user-facing change? Yes, users won't see error messages of `NullPointerException` after this fix. ### How was this patch tested? Pass existing tests. Closes #33620 from Ngone51/spark-36383-3.2. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 03 August 2021, 15:50:11 UTC
cc75618 [SPARK-35815][SQL][FOLLOWUP] Add test considering the case spark.sql.legacy.interval.enabled is true ### What changes were proposed in this pull request? This PR adds test considering the case `spark.sql.legacy.interval.enabled` is `true` for SPARK-35815. ### Why are the changes needed? SPARK-35815 (#33456) changes `Dataset.withWatermark` to accept ANSI interval literals as `delayThreshold` but I noticed the change didn't work with `spark.sql.legacy.interval.enabled=true`. We can't detect this issue because there is no test which considers the legacy interval type at that time. In SPARK-36323 (#33551), this issue was resolved but it's better to add test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #33606 from sarutak/test-watermark-with-legacy-interval. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 92cdb17d1a005d4f22647c1f6ec0b0761ac8b7cb) Signed-off-by: Max Gekk <max.gekk@gmail.com> 03 August 2021, 10:48:53 UTC
8d817dc [SPARK-36315][SQL] Only skip AQEShuffleReadRule in the final stage if it breaks the distribution requirement ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30494 This PR proposes a new way to optimize the final query stage in AQE. We first collect the effective user-specified repartition (semantic-wise, user-specified repartition is only effective if it's the root node or under a few simple nodes), and get the required distribution for the final plan. When we optimize the final query stage, we skip certain `AQEShuffleReadRule` if it breaks the required distribution. ### Why are the changes needed? The current solution for optimizing the final query stage is pretty hacky and overkill. As an example, the newly added rule `OptimizeSkewInRebalancePartitions` can hardly apply as it's very common that the query plan has shuffles with origin `ENSURE_REQUIREMENTS`, which is not supported by `OptimizeSkewInRebalancePartitions`. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests Closes #33541 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit dd80457ffb1c129a1ca3c53bcf3ea5feed7ebc57) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 August 2021, 10:29:07 UTC
7c58684 [SPARK-36380][SQL] Simplify the logical plan names for ALTER TABLE ... COLUMN ### What changes were proposed in this pull request? This is a followup of recent work such as https://github.com/apache/spark/pull/33200 For `ALTER TABLE` commands, the logical plans do not have the common `AlterTable` prefix in the name and just use names like `SetTableLocation`. This PR proposes to follow the same naming rule in `ALTER TABLE ... COLUMN` commands. This PR also moves these AlterTable commands to an individual file and gives them a base trait. ### Why are the changes needed? name simplification ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing test Closes #33609 from cloud-fan/dsv2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 7cb9c1c2415a0984515e4d4733f816673e4ae3c8) Signed-off-by: Max Gekk <max.gekk@gmail.com> 03 August 2021, 07:43:15 UTC
c22a25b [SPARK-36192][PYTHON] Better error messages for DataTypeOps against lists ### What changes were proposed in this pull request? Better error messages for DataTypeOps against lists. ### Why are the changes needed? Currently, DataTypeOps against lists throw a Py4JJavaError, we shall throw a TypeError with proper messages instead. ### Does this PR introduce _any_ user-facing change? Yes. A TypeError message will be showed rather than a Py4JJavaError. From: ```py >>> import pyspark.pandas as ps >>> ps.Series([1, 2, 3]) > [3, 2, 1] Traceback (most recent call last): ... py4j.protocol.Py4JJavaError: An error occurred while calling o107.gt. : java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [3, 2, 1] ... ``` To: ```py >>> import pyspark.pandas as ps >>> ps.Series([1, 2, 3]) > [3, 2, 1] Traceback (most recent call last): ... TypeError: The operation can not be applied to list. ``` ### How was this patch tested? Unit tests. Closes #33581 from xinrong-databricks/data_type_ops_list. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8ca11fe39f6828bb08f123d05c2a4b44da5231b7) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2021, 07:25:59 UTC
369781a [SPARK-36389][CORE][SHUFFLE] Revert the change that accepts negative mapId in ShuffleBlockId ### What changes were proposed in this pull request? With SPARK-32922, we added a change that ShuffleBlockId can have a negative mapId. This was to support push-based shuffle where -1 as mapId indicated a push-merged block. However with SPARK-32923, a different type of BlockId was introduced - ShuffleMergedId, but reverting the change to ShuffleBlockId was missed. ### Why are the changes needed? This reverts the changes to `ShuffleBlockId` which will never have a negative mapId. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modified the unit test to verify the newly added ShuffleMergedBlockId. Closes #33616 from otterc/SPARK-36389. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 2712343a276a11b46f0771fe6a6d26ee1834a34f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 03 August 2021, 07:18:06 UTC
71e4c56 [SPARK-36367][3.2][PYTHON] Partially backport to avoid unexpected error with pandas 1.3 ### What changes were proposed in this pull request? Partially backport from #33598 to avoid unexpected errors caused by pandas 1.3. ### Why are the changes needed? If users try to use pandas 1.3 as the underlying pandas, it will raise unexpected errors caused by removed APIs or behavior changes. Note that pandas API on Spark 3.2 will still follow the pandas 1.2 behavior. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33614 from ueshin/issues/SPARK-36367/3.2/partially_backport. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2021, 05:03:35 UTC
de837a0 [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines ### What changes were proposed in this pull request? Adds ANSI/ISO SQLSTATE standards to the error guidelines. ### Why are the changes needed? Provides visibility and consistency to the SQLSTATEs assigned to error classes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed; docs only Closes #33560 from karenfeng/sqlstate-manual. Authored-by: Karen Feng <karen.feng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 63517eb430a55176bdf5ad9192d72e80f25e61e8) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2021, 04:58:07 UTC
9eec11b [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode This PR proposes to fail properly so JSON parser can proceed and parse the input with the permissive mode. Previously, we passed `null`s as are, the root `InternalRow`s became `null`s, and it causes the query fails even with permissive mode on. Now, we fail explicitly if `null` is passed when the input array contains `null`. Note that this is consistent with non-array JSON input: **Permissive mode:** ```scala spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect() ``` ``` res0: Array[org.apache.spark.sql.Row] = Array([str], [null]) ``` **Failfast mode**: ```scala spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", """null""").toDS).collect() ``` ``` org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'. at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70) at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) ``` To make the permissive mode to proceed and parse without throwing an exception. **Permissive mode:** ```scala spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect() ``` Before: ``` java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) ``` After: ``` res0: Array[org.apache.spark.sql.Row] = Array([null]) ``` NOTE that this behaviour is consistent when JSON object is malformed: ```scala spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 123}]""").toDS).collect() ``` ``` res0: Array[org.apache.spark.sql.Row] = Array([null]) ``` Since we're parsing _one_ JSON array, related records all fail together. **Failfast mode:** ```scala spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, null]""").toDS).collect() ``` Before: ``` java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) ``` After: ``` org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'. at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70) at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) ``` Manually tested, and unit test was added. Closes #33608 from HyukjinKwon/SPARK-36379. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 0bbcbc65080cd67a9997f49906d9d48fdf21db10) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 02 August 2021, 17:02:09 UTC
ea559ad [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name ### What changes were proposed in this pull request? Without this patch, the added UT fails as below: ``` [info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds) [info] java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken. [info] at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229) [info] at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) [info] at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) [info] at scala.collection.immutable.List.foldLeft(List.scala:91) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179) [info] at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88) ``` When CollapseProject replaces an alias, it should use the original column name. ### Why are the changes needed? Fixes a bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a UT. Closes #33576 from AngersZhuuuu/SPARK-36086. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f3173956cbd64c056424b743aff8d17dd7c61fd7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 August 2021, 16:08:30 UTC
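Not from the PR itself, but the kind of plan shape the CollapseProject rule rewrites can be reproduced with two stacked projections; the column names and aliases below are made up for the sketch.

```scala
import org.apache.spark.sql.SparkSession

object CollapseProjectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("collapse-project").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "x"), (2, "y")).toDF("id", "name")

    // Two adjacent Project nodes: the optimizer's CollapseProject rule merges
    // them into one projection, substituting the alias `renamed` with the
    // expression it stands for (the original column `id`).
    val collapsed = df
      .select($"id".as("renamed"))
      .select(($"renamed" + 1).as("plus_one"))

    // The optimized plan should contain a single Project over the local relation.
    collapsed.explain(extended = true)

    spark.stop()
  }
}
```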
e26cb96 [SPARK-36224][SQL] Use Void as the type name of NullType ### What changes were proposed in this pull request? Change `NullType.simpleString` to "void" so that "void" becomes the formal type name of `NullType`. ### Why are the changes needed? This PR is intended to address the type name discussion in PR #28833. Here are the reasons: 1. The type name of NullType is displayed everywhere, e.g. schema strings, error messages, and documentation. Hence it's not possible to hide it from users, and we have to choose a proper name. 2. "void" is widely used as the type name of "NULL", e.g. in Hive and PostgreSQL. 3. Changing to "void" enables the round trip of `toDDL`/`fromDDL` for NullType (i.e. it makes `from_json(col, schema.toDDL)` work). ### Does this PR introduce _any_ user-facing change? Yes, the type name of "NULL" is changed from "null" to "void". For example: ``` scala> sql("select null as a, 1 as b").schema.catalogString res5: String = struct<a:void,b:int> ``` ### How was this patch tested? Existing test cases. Closes #33437 from linhongliu-db/SPARK-36224-void-type-name. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 2f700773c2e8fac26661d0aa8024253556a921ba) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 August 2021, 15:20:11 UTC
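A minimal sketch of the `toDDL`/`fromDDL` round trip that the "void" name enables; the exact DDL casing and nullability flags after the round trip are assumptions and may differ.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object VoidTypeNameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("void-type-name").getOrCreate()

    // A schema containing a NULL-typed column.
    val schema = spark.sql("select null as a, 1 as b").schema

    // With "void" as the formal type name, the generated DDL string can be
    // parsed back; nullability flags may still differ after the round trip.
    val ddl = schema.toDDL
    val roundTripped = StructType.fromDDL(ddl)

    println(ddl)
    println(roundTripped.map(f => s"${f.name}: ${f.dataType.simpleString}"))

    spark.stop()
  }
}
```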
df43300 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum ### What changes were proposed in this pull request? This PR adds support for diagnosing shuffle data corruption. The diagnosis mechanism works like this: the shuffle reader calculates the checksum (c1) for the corrupted shuffle block and sends it to the server where the block is stored. The server reads back the checksum (c2) stored in the checksum file and recalculates the checksum (c3) for the corresponding shuffle block. Then, if c2 != c3, we suspect the corruption is caused by a disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by a network issue. Otherwise, the checksum verification passes and the cause of the corruption remains unknown. After the shuffle reader receives the diagnosis response, it takes action based on the type of cause: only in the case of a network issue do we retry; otherwise, we throw the fetch failure directly. Also note that if the corruption happens inside BufferReleasingInputStream, the reducer throws the fetch failure immediately regardless of the cause, since the data has already been partially consumed by downstream RDDs. If corruption happens again after a retry, the reducer throws the fetch failure directly this time without running the diagnosis. Please check out https://github.com/apache/spark/pull/32385 to see the complete proposal of the shuffle checksum project. ### Why are the changes needed? Shuffle data corruption is a long-standing issue in Spark. For example, in SPARK-18105, people continually report corruption issues. However, data corruption is difficult to reproduce in most cases, and its root cause is even harder to determine; we don't know whether it's a Spark issue or not. With diagnosis support for shuffle corruption, Spark itself can at least distinguish between disk and network causes, which is very important for users. ### Does this PR introduce _any_ user-facing change? Yes, users may learn the cause of shuffle corruption after this change. ### How was this patch tested? Added tests. Closes #33451 from Ngone51/SPARK-36206. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit a98d919da470abaf2e99060f99007a5373032fe1) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 02 August 2021, 14:59:30 UTC
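The decision logic described above, restated as a self-contained sketch; the names (`Cause`, `diagnose`, `shouldRetry`) are hypothetical and do not correspond to Spark's internal API.

```scala
// Illustrative-only sketch of the checksum-based diagnosis; not Spark's internal API.
object ShuffleCorruptionDiagnosisSketch {

  sealed trait Cause
  case object DiskIssue extends Cause            // data on disk no longer matches its recorded checksum
  case object NetworkIssue extends Cause         // disk is consistent, so the bytes changed in transit
  case object ChecksumVerifyPassed extends Cause // all checksums agree; the cause stays unknown

  /**
   * c1: checksum the reader recomputed over the corrupted block it received
   * c2: checksum stored in the checksum file on the server
   * c3: checksum the server recomputed over the block it has on disk
   */
  def diagnose(c1: Long, c2: Long, c3: Long): Cause =
    if (c2 != c3) DiskIssue
    else if (c1 != c3) NetworkIssue
    else ChecksumVerifyPassed

  /** Only a suspected network issue warrants retrying the fetch; everything else fails fast. */
  def shouldRetry(cause: Cause): Boolean = cause == NetworkIssue

  def main(args: Array[String]): Unit = {
    println(diagnose(c1 = 1L, c2 = 2L, c3 = 3L)) // DiskIssue
    println(diagnose(c1 = 1L, c2 = 3L, c3 = 3L)) // NetworkIssue
    println(diagnose(c1 = 3L, c2 = 3L, c3 = 3L)) // ChecksumVerifyPassed
  }
}
```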