https://github.com/apache/spark

78a5825 Preparing Spark release v3.2.2-rc1 11 July 2022, 15:13:02 UTC
ba978b3 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases Add missed dependencies to `dev/create-release/spark-rm/Dockerfile`. To be able to build Spark releases. No. By building the Spark 3.3 release via: ``` $ dev/create-release/do-release-docker.sh -d /home/ubuntu/max/spark-3.3-rc1 ``` Closes #36449 from MaxGekk/deps-Dockerfile. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 4b1c2fb7a27757ebf470416c8ec02bb5c1f7fa49) Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 6a61f95a359e6aa9d09f8044019074dc7effcf30) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 July 2022, 00:36:26 UTC
001d8b0 [SPARK-37554][BUILD] Add PyArrow, pandas and plotly to release Docker image dependencies ### What changes were proposed in this pull request? This PR proposes to add plotly, pyarrow and pandas dependencies for generating the API documentation for pandas API on Spark. The versions of `pandas==1.1.5 pyarrow==3.0.0 plotly==5.4.0` are matched with the current versions being used in branch-3.2 at Python 3.6. ### Why are the changes needed? Currently, the function references for pandas API on Spark are all missing: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/series.html due to missing dependencies when building the docs. ### Does this PR introduce _any_ user-facing change? Yes, the broken links of documentation at https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/series.html will all be recovered. ### How was this patch tested? To be honest, it has not been tested. I don't have the nerve to run Docker releasing script for the sake of testing so I defer to the next release manager. The combinations of the dependency versions are being tested in GitHub Actions at `branch-3.2`. Closes #34813 from HyukjinKwon/SPARK-37554. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 03750c046b55f60b43646c8108e5f2e540782755) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 July 2022, 00:27:52 UTC
9dd4c07 [SPARK-37730][PYTHON][FOLLOWUP] Split comments to comply pycodestyle check ### What changes were proposed in this pull request? This is a follow-up of a backporting commit, https://github.com/apache/spark/commit/bc54a3f0c2e08893702c3929bfe7a9d543a08cdb . ### Why are the changes needed? The original commit doesn't pass the linter check because there was no lint check between SPARK-37380 and SPARK-37834. The content of this PR is a part of SPARK-37834. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Python Linter. Closes #37146 from dongjoon-hyun/SPARK-37730. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 July 2022, 00:24:11 UTC
bc54a3f [SPARK-37730][PYTHON] Replace use of MPLPlot._add_legend_handle with MPLPlot._append_legend_handles_labels ### What changes were proposed in this pull request? Replace use of MPLPlot._add_legend_handle (removed in pandas) with MPLPlot._append_legend_handles_labels in histogram and KDE plots. Based on: https://github.com/pandas-dev/pandas/commit/029907c9d69a0260401b78a016a6c4515d8f1c40 ### Why are the changes needed? Fix of SPARK-37730. plot.hist and plot.kde don't throw AttributeError for pandas=1.3.5. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ~~Tested with existing plot test on CI (for older pandas only).~~ (it seems that CI doesn't run matplotlib tests, see https://github.com/apache/spark/pull/35000#issuecomment-1001267197) I've run tests on a local computer, see https://github.com/apache/spark/pull/35000#issuecomment-1001494019 : ``` $ python python/pyspark/pandas/tests/plot/test_series_plot_matplotlib.py ``` :question: **QUESTION:** Maybe add plot testing for pandas 1.3.5 on CI? (I've noticed that CI uses `pandas=1.3.4`, maybe update it to `1.3.5`?) Closes #35000 from mslapek/fixpythonplot. Authored-by: Michał Słapek <28485371+mslapek@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 371e307686debc4f7b44a37d2345a1a512f3fdcc) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 July 2022, 21:39:38 UTC
c5983c1 [SPARK-38018][SQL][3.2] Fix ColumnVectorUtils.populate to handle CalendarIntervalType correctly ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/35314 to branch-3.2. See that original PR for context. ### Why are the changes needed? To fix a potential correctness issue. ### Does this PR introduce _any_ user-facing change? No, but it fixes the existing correctness issue when reading a partition column with CalendarInterval type. ### How was this patch tested? Added unit test in `ColumnVectorSuite.scala`. Closes #37114 from c21/branch-3.2. Authored-by: Cheng Su <scnju13@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 July 2022, 14:51:40 UTC
32aff86 [SPARK-39447][SQL][3.2] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast This is a backport of https://github.com/apache/spark/pull/36974 for branch-3.2 ### What changes were proposed in this pull request? Change `currentPhysicalPlan` to `inputPlan` when we restore the broadcast exchange for DPP. ### Why are the changes needed? The currentPhysicalPlan can be wrapped in a broadcast query stage, so it is not safe to match against it. For example: the broadcast exchange added by DPP may run earlier than the normal broadcast exchange (e.g. one introduced by a join). ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Added a test. Closes #37087 from ulysses-you/inputplan-3.2. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 July 2022, 14:49:03 UTC
be891ad [SPARK-39551][SQL][3.2] Add AQE invalid plan check ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/36953 This PR adds a check for invalid plans in AQE replanning process. The check will throw exceptions when it detects an invalid plan, causing AQE to void the current replanning result and keep using the latest valid plan. ### Why are the changes needed? AQE logical optimization rules can lead to invalid physical plans and cause runtime exceptions as certain physical plan nodes are not compatible with others. E.g., `BroadcastExchangeExec` can only work as a direct child of broadcast join nodes, but it could appear under other incompatible physical plan nodes because of empty relation propagation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UT. Closes #37108 from dongjoon-hyun/SPARK-39551. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 July 2022, 04:35:04 UTC
1c0bd4c [SPARK-39656][SQL][3.2] Fix wrong namespace in DescribeNamespaceExec Backport of https://github.com/apache/spark/pull/37049 for branch-3.2. ### What changes were proposed in this pull request? Change `ns.last` to `ns.quoted` in DescribeNamespaceExec. ### Why are the changes needed? DescribeNamespaceExec should show the whole namespace rather than only its last part. ### Does this PR introduce _any_ user-facing change? Yes, a small bug fix. ### How was this patch tested? Fixed test. Closes #37072 from ulysses-you/desc-namespace-3.2. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> 05 July 2022, 19:40:26 UTC
3d084fe [SPARK-39677][SQL][DOCS][3.2] Fix args formatting of the regexp and like functions ### What changes were proposed in this pull request? In the PR, I propose to fix args formatting of some regexp functions by adding explicit new lines. That fixes the following items in arg lists. Before: (screenshot: https://user-images.githubusercontent.com/1580697/177274234-04209d43-a542-4c71-b5ca-6f3239208015.png) After: (screenshot: https://user-images.githubusercontent.com/1580697/177280718-cb05184c-8559-4461-b94d-dfaaafda7dd2.png) ### Why are the changes needed? To improve readability of Spark SQL docs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By building docs and checking manually: ``` $ SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 bundle exec jekyll build ``` Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 4e42f8b12e8dc57a15998f22d508a19cf3c856aa) Signed-off-by: Max Gekk <max.gekk@gmail.com> Closes #37093 from MaxGekk/fix-regexp-docs-3.2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 05 July 2022, 15:28:35 UTC
6ae97e2 [SPARK-39611][PYTHON][PS] Fix wrong aliases in __array_ufunc__ ### What changes were proposed in this pull request? This PR fix the wrong aliases in `__array_ufunc__` ### Why are the changes needed? When running test with numpy 1.23.0 (current latest), hit a bug: `NotImplementedError: pandas-on-Spark objects currently do not support <ufunc 'divide'>.` In `__array_ufunc__` we first call `maybe_dispatch_ufunc_to_dunder_op` to try dunder methods first, and then we try pyspark API. `maybe_dispatch_ufunc_to_dunder_op` is from pandas code. pandas fix a bug https://github.com/pandas-dev/pandas/pull/44822#issuecomment-991166419 https://github.com/pandas-dev/pandas/pull/44822/commits/206b2496bc6f6aa025cb26cb42f52abeec227741 when upgrade to numpy 1.23.0, we need to also sync this. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Current CI passed - The exsiting UT `test_series_datetime` already cover this, I also test it in my local env with 1.23.0 ```shell pip install "numpy==1.23.0" python/run-tests --testnames 'pyspark.pandas.tests.test_series_datetime SeriesDateTimeTest.test_arithmetic_op_exceptions' ``` Closes #37078 from Yikun/SPARK-39611. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit fb48a14a67940b9270390b8ce74c19ae58e2880e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 July 2022, 11:52:56 UTC
9adfc3a [SPARK-39650][SS] Fix incorrect value schema in streaming deduplication with backward compatibility ### What changes were proposed in this pull request? This PR proposes to fix the incorrect value schema in streaming deduplication. It stores the empty row having a single column with null (using NullType), but the value schema is specified as all columns, which leads incorrect behavior from state store schema compatibility checker. This PR proposes to set the schema of value as `StructType(Array(StructField("__dummy__", NullType)))` to fit with the empty row. With this change, the streaming queries creating the checkpoint after this fix would work smoothly. To not break the existing streaming queries having incorrect value schema, this PR proposes to disable the check for value schema on streaming deduplication. Disabling the value check was there for the format validation (we have two different checkers for state store), but it has been missing for state store schema compatibility check. To avoid adding more config, this PR leverages the existing config "format validation" is using. ### Why are the changes needed? This is a bug fix. Suppose the streaming query below: ``` # df has the columns `a`, `b`, `c` val df = spark.readStream.format("...").load() val query = df.dropDuplicate("a").writeStream.format("...").start() ``` while the query is running, df can produce a different set of columns (e.g. `a`, `b`, `c`, `d`) from the same source due to schema evolution. Since we only deduplicate the rows with column `a`, the change of schema should not matter for streaming deduplication, but state store schema checker throws error saying "value schema is not compatible" before this fix. ### Does this PR introduce _any_ user-facing change? No, this is basically a bug fix which end users wouldn't notice unless they encountered a bug. ### How was this patch tested? New tests. Closes #37041 from HeartSaVioR/SPARK-39650. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit fe536033bdd00d921b3c86af329246ca55a4f46a) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 02 July 2022, 14:07:54 UTC
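A minimal sketch of the query shape above for spark-shell (assumptions: the built-in `rate` source stands in for the real source, and the checkpoint path is hypothetical; note the DataFrame API is `dropDuplicates`):

```scala
// Hedged sketch, not the tests added by this PR: any streaming source works; "rate" is
// used only so the stream has extra columns besides the deduplication key.
import org.apache.spark.sql.streaming.Trigger

val df = spark.readStream.format("rate").load()        // columns: timestamp, value

val query = df.dropDuplicates("value")                 // with this fix, the state value is stored as an empty row
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/dedup-ckpt")     // hypothetical path
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
```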
1387af7 [SPARK-39553][CORE] Multi-thread unregister shuffle shouldn't throw NPE when using Scala 2.13 This pr add a `shuffleStatus != null` condition to `o.a.s.MapOutputTrackerMaster#unregisterShuffle` method to avoid throwing NPE when using Scala 2.13. Ensure that no NPE is thrown when `o.a.s.MapOutputTrackerMaster#unregisterShuffle` is called by multiple threads, this pr is only for Scala 2.13. `o.a.s.MapOutputTrackerMaster#unregisterShuffle` method will be called concurrently by the following two paths: - BlockManagerStorageEndpoint: https://github.com/apache/spark/blob/6f1046afa40096f477b29beecca5ca6286dfa7f3/core/src/main/scala/org/apache/spark/storage/BlockManagerStorageEndpoint.scala#L56-L62 - ContextCleaner: https://github.com/apache/spark/blob/6f1046afa40096f477b29beecca5ca6286dfa7f3/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L234-L241 When test with Scala 2.13, for example `sql/core` module, there are many log as follows,although these did not cause UTs failure: ``` 17:44:09.957 WARN org.apache.spark.storage.BlockManagerMaster: Failed to remove shuffle 87 - null java.lang.NullPointerException at org.apache.spark.MapOutputTrackerMaster.$anonfun$unregisterShuffle$1(MapOutputTracker.scala:882) at org.apache.spark.MapOutputTrackerMaster.$anonfun$unregisterShuffle$1$adapted(MapOutputTracker.scala:881) at scala.Option.foreach(Option.scala:437) at org.apache.spark.MapOutputTrackerMaster.unregisterShuffle(MapOutputTracker.scala:881) at org.apache.spark.storage.BlockManagerStorageEndpoint$$anonfun$receiveAndReply$1.$anonfun$applyOrElse$3(BlockManagerStorageEndpoint.scala:59) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.scala:17) at org.apache.spark.storage.BlockManagerStorageEndpoint.$anonfun$doAsync$1(BlockManagerStorageEndpoint.scala:89) at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:678) at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:467) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 17:44:09.958 ERROR org.apache.spark.ContextCleaner: Error cleaning shuffle 94 java.lang.NullPointerException at org.apache.spark.MapOutputTrackerMaster.$anonfun$unregisterShuffle$1(MapOutputTracker.scala:882) at org.apache.spark.MapOutputTrackerMaster.$anonfun$unregisterShuffle$1$adapted(MapOutputTracker.scala:881) at scala.Option.foreach(Option.scala:437) at org.apache.spark.MapOutputTrackerMaster.unregisterShuffle(MapOutputTracker.scala:881) at org.apache.spark.ContextCleaner.doCleanupShuffle(ContextCleaner.scala:241) at org.apache.spark.ContextCleaner.$anonfun$keepCleaning$3(ContextCleaner.scala:202) at org.apache.spark.ContextCleaner.$anonfun$keepCleaning$3$adapted(ContextCleaner.scala:195) at scala.Option.foreach(Option.scala:437) at org.apache.spark.ContextCleaner.$anonfun$keepCleaning$1(ContextCleaner.scala:195) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1432) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:189) at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:79) ``` I think this is a bug of Scala 2.13.8 and already submit an issue to https://github.com/scala/bug/issues/12613, this PR is only for protection, we should remove this protection after Scala 2.13(maybe https://github.com/scala/scala/pull/9957) fixes this issue. 
No - Pass GA - Add new test `SPARK-39553: Multi-thread unregister shuffle shouldn't throw NPE` to `MapOutputTrackerSuite`, we can test manually as follows: ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -pl core -am -Pscala-2.13 mvn clean test -pl core -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.MapOutputTrackerSuite ``` **Before** ``` - SPARK-39553: Multi-thread unregister shuffle shouldn't throw NPE *** FAILED *** 3 did not equal 0 (MapOutputTrackerSuite.scala:971) Run completed in 17 seconds, 505 milliseconds. Total number of tests run: 25 Suites: completed 2, aborted 0 Tests: succeeded 24, failed 1, canceled 0, ignored 1, pending 0 *** 1 TEST FAILED *** ``` **After** ``` - SPARK-39553: Multi-thread unregister shuffle shouldn't throw NPE Run completed in 17 seconds, 996 milliseconds. Total number of tests run: 25 Suites: completed 2, aborted 0 Tests: succeeded 25, failed 0, canceled 0, ignored 1, pending 0 All tests passed. ``` Closes #37024 from LuciferYang/SPARK-39553. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 29258964cae45cea43617ade971fb4ea9fe2902a) Signed-off-by: Sean Owen <srowen@gmail.com> 29 June 2022, 23:11:38 UTC
c0d9109 [SPARK-39570][SQL] Inline table should allow expressions with alias ### What changes were proposed in this pull request? `ResolveInlineTables` requires the column expressions to be foldable, however, `Alias` is not foldable. Inline-table does not use the names in the column expressions, and we should trim aliases before checking foldable. We did something similar in `ResolvePivot`. ### Why are the changes needed? To make inline-table handle more cases, and also fixed a regression caused by https://github.com/apache/spark/pull/31844 . After https://github.com/apache/spark/pull/31844 , we always add an alias for function literals like `current_timestamp`, which breaks inline table. ### Does this PR introduce _any_ user-facing change? yea, some failed queries can be run after this PR. ### How was this patch tested? new tests Closes #36967 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1df992f03fd935ac215424576530ab57d1ab939b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 June 2022, 14:41:40 UTC
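A hedged sketch of the affected query shape for spark-shell (the exact regressions are covered by the new tests in the PR, not reproduced here): an inline VALUES table whose column expression is a function literal such as `current_timestamp`, which gets auto-aliased since #31844.

```scala
// Foldable once the alias is trimmed; before this fix the auto-added alias made
// ResolveInlineTables reject the expression as non-foldable.
spark.sql(
  "SELECT * FROM VALUES (1, current_timestamp), (2, current_timestamp) AS t(id, ts)"
).show(false)
```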
359bc3f [SPARK-39621][PYTHON][TESTS] Make `run-tests.py` robust by avoiding `rmtree` on MacOS This PR aims to make `run-tests.py` robust by avoiding `rmtree` on MacOS. There exists a race condition in Python and it causes flakiness in MacOS - https://bugs.python.org/issue29699 - https://github.com/python/cpython/issues/73885 No. After passing CIs, this should be tested on MacOS. Closes #37010 from dongjoon-hyun/SPARK-39621. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 432945db743965f1beb59e3a001605335ca2df83) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 June 2022, 19:53:45 UTC
61dc08d [SPARK-39575][AVRO] Add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer ### What changes were proposed in this pull request? Add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer. ### Why are the changes needed? - HeapBuffer.get(bytes) puts the data from POS to the end into bytes, and sets POS to the end. The next call will return empty bytes. - The second call of AvroDeserializer will return an InternalRow with an empty binary column when the avro record has a binary column. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a unit test in AvroCatalystDataConversionSuite. Closes #36973 from wzx140/avro-fix. Authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 558b395880673ec45bf9514c98983e50e21d9398) Signed-off-by: Sean Owen <srowen@gmail.com> 27 June 2022, 02:05:27 UTC
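A plain `java.nio` illustration of the behavior described above (this is not the AvroDeserializer code, just a sketch of why the rewind is needed):

```scala
import java.nio.ByteBuffer

val buf = ByteBuffer.wrap(Array[Byte](1, 2, 3, 4))

// First read: copies remaining() bytes and moves the position to the limit.
val first = new Array[Byte](buf.remaining())
buf.get(first)                          // first = [1, 2, 3, 4], position == limit

// Second read without rewinding: remaining() is 0, so nothing is copied ("empty bytes").
val empty = new Array[Byte](buf.remaining())
buf.get(empty)                          // empty = []

// The fix rewinds after reading so the buffer can be consumed again.
buf.rewind()
val second = new Array[Byte](buf.remaining())
buf.get(second)                         // second = [1, 2, 3, 4]
```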
f5bc48b [SPARK-39548][SQL] CreateView command with a window clause query hit a spurious "window definition not found" issue 1. In the inline CTE code path, fix a bug where the top-down unresolved-window-expression check misclassifies a defined window expression as unresolved. 2. Move the unresolved window expression check in Project to `CheckAnalysis`. This bug fails a correct query. No. UT Closes #36947 from amaliujia/improvewindow. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4718d59c6c4e201bf940303a4311dfb753372395) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 June 2022, 05:32:59 UTC
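A hedged sketch of the query shape involved (this mirrors, but is not, the UT added by the PR): a view whose body defines a named window in a WINDOW clause and references it from the select list.

```scala
// Before the fix, analyzing a view definition of this shape could wrongly report the
// window definition `w` as not found; it should resolve and run.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW windowed AS
  SELECT id, sum(id) OVER w AS running_sum
  FROM range(10)
  WINDOW w AS (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
""")
spark.sql("SELECT * FROM windowed").show()
```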
725ce33 [SPARK-39543] The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1 The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1, to support something such as compressed formats example: `spark.range(0, 100).writeTo("t1").option("compression", "zstd").using("parquet").create` **before** gen: part-00000-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.**snappy**.parquet ... **after** gen: part-00000-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.**zstd**.parquet ... No new test Closes #36941 from Yikf/writeV2option. Authored-by: Yikf <yikaifei1@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e5b7fb85b2d91f2e84dc60888c94e15b53751078) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 June 2022, 05:07:28 UTC
22dae38 [SPARK-38614][SQL] Don't push down limit through window that's using percent_rank Change `LimitPushDownThroughWindow` so that it no longer supports pushing down a limit through a window using percent_rank. Given a query with a limit of _n_ rows, and a window whose child produces _m_ rows, percent_rank will label the _nth_ row as 100% rather than the _mth_ row. This behavior conflicts with Spark 3.1.3, Hive 2.3.9 and Prestodb 0.268. Assume this data: ``` create table t1 stored as parquet as select * from range(101); ``` And also assume this query: ``` select id, percent_rank() over (order by id) as pr from t1 limit 3; ``` With Spark 3.2.1, 3.3.0, and master, the limit is applied before the percent_rank: ``` 0 0.0 1 0.5 2 1.0 ``` With Spark 3.1.3, and Hive 2.3.9, and Prestodb 0.268, the limit is applied _after_ percent_rank: Spark 3.1.3: ``` 0 0.0 1 0.01 2 0.02 ``` Hive 2.3.9: ``` 0: jdbc:hive2://localhost:10000> select id, percent_rank() over (order by id) as pr from t1 limit 3; . . . . . . . . . . . . . . . .> . . . . . . . . . . . . . . . .> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases. +-----+-------+ | id | pr | +-----+-------+ | 0 | 0.0 | | 1 | 0.01 | | 2 | 0.02 | +-----+-------+ 3 rows selected (4.621 seconds) 0: jdbc:hive2://localhost:10000> ``` Prestodb 0.268: ``` id | pr ----+------ 0 | 0.0 1 | 0.01 2 | 0.02 (3 rows) ``` With this PR, Spark will apply the limit after percent_rank. No (besides changing percent_rank's behavior to be more like Spark 3.1.3, Hive, and Prestodb). New unit tests. Closes #36951 from bersprockets/percent_rank_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit a63ce5676e79f15903e9fd533a26a6c3ec9bf7a8) Signed-off-by: Yuming Wang <yumwang@ebay.com> 23 June 2022, 00:46:53 UTC
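The same comparison can be driven from spark-shell (a small harness around the SQL already quoted above; the post-fix expectation is the 0.0 / 0.01 / 0.02 ranking):

```scala
// Build the 101-row table from the example above and run the limited window query.
spark.range(101).createOrReplaceTempView("t1")
spark.sql("SELECT id, percent_rank() OVER (ORDER BY id) AS pr FROM t1 LIMIT 3").show()
// With the limit no longer pushed below the window, pr is computed over all 101 rows:
// id 0 -> 0.0, id 1 -> 0.01, id 2 -> 0.02
```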
497d17f [SPARK-39496][SQL] Handle null struct in `Inline.eval` Change `Inline.eval` to return a row of null values rather than a null row in the case of a null input struct. Consider the following query: ``` set spark.sql.codegen.wholeStage=false; select inline(array(named_struct('a', 1, 'b', 2), null)); ``` This query fails with a `NullPointerException`: ``` 22/06/16 15:10:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122) ``` (In Spark 3.1.3, you don't need to set `spark.sql.codegen.wholeStage` to false to reproduce the error, since Spark 3.1.3 has no codegen path for `Inline`). This query fails regardless of the setting of `spark.sql.codegen.wholeStage`: ``` val dfWide = (Seq((1)) .toDF("col0") .selectExpr(Seq.tabulate(99)(x => s"$x as col${x + 1}"): _*)) val df = (dfWide .selectExpr("*", "array(named_struct('a', 1, 'b', 2), null) as struct_array")) df.selectExpr("*", "inline(struct_array)").collect ``` It fails with ``` 22/06/16 15:18:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1] java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:80) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_8$(Unknown Source) ``` When `Inline.eval` returns a null row in the collection, GenerateExec gets a NullPointerException either when joining the null row with required child output, or projecting the null row. This PR avoids producing the null row and produces a row of null values instead: ``` spark-sql> set spark.sql.codegen.wholeStage=false; spark.sql.codegen.wholeStage false Time taken: 3.095 seconds, Fetched 1 row(s) spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null)); 1 2 NULL NULL Time taken: 1.214 seconds, Fetched 2 row(s) spark-sql> ``` No. New unit test. Closes #36903 from bersprockets/inline_eval_null_struct_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c4d5390dd032d17a40ad50e38f0ed7bd9bbd4698) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 June 2022, 00:27:07 UTC
07edae9 [SPARK-39505][UI] Escape log content rendered in UI ### What changes were proposed in this pull request? Escape log content rendered to the UI. ### Why are the changes needed? Log content may contain reserved characters or other code in the log and be misinterpreted in the UI as HTML. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #36902 from srowen/LogViewEscape. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 June 2022, 18:32:14 UTC
380177d [SPARK-39476][SQL] Disable unwrap-cast optimization when casting from Long to Float/Double or from Integer to Float Casting from Integer to Float or from Long to Double/Float may lose precision if the Integer/Long has more digits than the **significant digits** of a Double (15 or 16 digits) or a Float (7 or 8 digits). For example, ```select *, cast(a as int) from (select cast(33554435 as float) a )``` gives `33554436` instead of `33554435`. When the optimization rule `UnwrapCastInBinaryComparison` kicks in, it may produce an incorrect (confusing) result. We can reproduce it with the following script. ``` spark.range(10).map(i => 64707595868612313L).createOrReplaceTempView("tbl") val df = sql("select * from tbl where cast(value as double) = cast('64707595868612313' as double)") df.explain(true) df.show() ``` If we disable this optimization rule, it returns 10 records. But if we enable it, it returns an empty result, since the sql is optimized to ``` select * from tbl where value = 64707595868612312L ``` Fix the behavior that may confuse users (or maybe a bug?) No Add a new UT Closes #36873 from WangGuangxin/SPARK-24994-followup. Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9612db3fc9c38204b2bf9f724dedb9ec5f636556) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 June 2022, 01:33:18 UTC
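The precision loss itself can be seen without Spark (plain Scala; the values match the ones quoted above):

```scala
// A Float has only ~7 significant decimal digits, so 33554435 is not exactly representable:
println(33554435.toFloat.toInt)              // 33554436

// A Double has ~15-16 significant digits, so this Long rounds to a neighbouring value:
println(64707595868612313L.toDouble.toLong)  // 64707595868612312
```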
f23a544 [SPARK-39061][SQL] Set nullable correctly for `Inline` output attributes Change `Inline#elementSchema` to make each struct field nullable when the containing array has a null element. This query returns incorrect results (the last row should be `NULL NULL`): ``` spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null)); 1 2 -1 -1 Time taken: 4.053 seconds, Fetched 2 row(s) spark-sql> ``` And this query gets a NullPointerException: ``` spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null)); 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.NullPointerException: null at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(Buffere ``` When an array of structs is created by `CreateArray`, and no struct field contains a literal null value, the schema for the struct will have non-nullable fields, even if the array itself has a null entry (as in the example above). As a result, the output attributes for the generator will be non-nullable. When the output attributes for `Inline` are non-nullable, `GenerateUnsafeProjection#writeExpressionsToBuffer` generates incorrect code for null structs. In more detail, the issue is this: `GenerateExec#codeGenCollection` generates code that will check if the struct instance (i.e., array element) is null and, if so, set a boolean for each struct field to indicate that the field contains a null. However, unless the generator's output attributes are nullable, `GenerateUnsafeProjection#writeExpressionsToBuffer` will not generate any code to check those booleans. Instead it will generate code to write out whatever is in the variables that normally hold the struct values (which will be garbage if the array element is null). Arrays of structs from file sources do not have this issue. In that case, each `StructField` will have nullable=true due to [this](https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L417). (Note: the eval path for `Inline` has a different bug with null array elements that occurs even when `nullable` is set correctly in the schema, but I will address that in a separate PR). No. New unit test. Closes #36883 from bersprockets/inline_struct_nullability_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit fc385dafabe3c609b38b81deaaf36e5eb6ee341b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 June 2022, 00:39:42 UTC
333df85 [SPARK-39355][SQL] Single column uses quoted to construct UnresolvedAttribute Use `UnresolvedAttribute.quoted` in `Alias.toAttribute` to avoid calling `UnresolvedAttribute.apply` causing `ParseException`. ```sql SELECT * FROM ( SELECT '2022-06-01' AS c1 ) a WHERE c1 IN ( SELECT date_add('2022-06-01', 0) ); ``` ``` Error in query: mismatched input '(' expecting {<EOF>, '.', '-'}(line 1, pos 8) == SQL == date_add(2022-06-01, 0) --------^^^ ``` No add UT Closes #36740 from cxzl25/SPARK-39355. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 June 2022, 08:04:51 UTC
861df43 [SPARK-39419][SQL][3.2] Fix ArraySort to throw an exception when the comparator returns null ### What changes were proposed in this pull request? Backport of #36812. Fixes `ArraySort` to throw an exception when the comparator returns `null`. Also updates the doc to follow the corrected behavior. ### Why are the changes needed? When the comparator of `ArraySort` returns `null`, currently it handles it as `0` (equal). According to the doc, ``` It returns -1, 0, or 1 as the first element is less than, equal to, or greater than the second element. If the comparator function returns other values (including null), the function will fail and raise an error. ``` It's fine to return non -1, 0, 1 integers to follow the Java convention (still need to update the doc, though), but it should throw an exception for `null` result. ### Does this PR introduce _any_ user-facing change? Yes, if a user uses a comparator that returns `null`, it will throw an error after this PR. The legacy flag `spark.sql.legacy.allowNullComparisonResultInArraySort` can be used to restore the legacy behavior that handles `null` as `0` (equal). ### How was this patch tested? Added some tests. Closes #36835 from ueshin/issues/SPARK-39419/3.2/array_sort. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 June 2022, 23:50:20 UTC
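A hedged sketch of a comparator that can return NULL (spark-shell; after this change the query raises an error unless the legacy flag mentioned above is enabled):

```scala
// The lambda has no ELSE branch, so it evaluates to NULL whenever l > r; post-fix that
// is rejected at runtime instead of being silently treated as 0 (equal).
spark.sql("""
  SELECT array_sort(array(3, 1, 2),
                    (l, r) -> CASE WHEN l = r THEN 0 WHEN l < r THEN -1 END) AS sorted
""").show(false)

// To restore the old behavior described above:
// spark.conf.set("spark.sql.legacy.allowNullComparisonResultInArraySort", "true")
```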
7c0b9e6 [SPARK-38918][SQL][3.2] Nested column pruning should filter out attributes that do not belong to the current relation ### What changes were proposed in this pull request? Backport #36216 to branch-3.2 ### Why are the changes needed? To fix a bug in `SchemaPruning`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #36386 from allisonwang-db/spark-38918-branch-3.2. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 June 2022, 22:49:24 UTC
af92121 [SPARK-39437][3.2][SQL][TEST] Normalize plan id separately in PlanStabilitySuite backport https://github.com/apache/spark/pull/36827 ### What changes were proposed in this pull request? In `PlanStabilitySuite`, we normalize expression IDs by matching `#\d+` in the explain string. However, this regex can match plan id in `Exchange` node as well, which will mess up the normalization if expression IDs and plan IDs overlap. This PR normalizes plan id separately in `PlanStabilitySuite`. ### Why are the changes needed? Make the plan golden file more stable. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #36828 from cloud-fan/test2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 June 2022, 15:19:15 UTC
7fd2e96 [SPARK-39422][SQL] Improve error message for 'SHOW CREATE TABLE' with unsupported serdes ### What changes were proposed in this pull request? This PR improves the error message that is thrown when trying to run `SHOW CREATE TABLE` on a Hive table with an unsupported serde. Currently this results in an error like ``` org.apache.spark.sql.AnalysisException: Failed to execute SHOW CREATE TABLE against table rcFileTable, which is created by Hive and uses the following unsupported serde configuration SERDE: org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe INPUTFORMAT: org.apache.hadoop.hive.ql.io.RCFileInputFormat OUTPUTFORMAT: org.apache.hadoop.hive.ql.io.RCFileOutputFormat ``` This patch improves this error message by adding a suggestion to use `SHOW CREATE TABLE ... AS SERDE`: ``` org.apache.spark.sql.AnalysisException: Failed to execute SHOW CREATE TABLE against table rcFileTable, which is created by Hive and uses the following unsupported serde configuration SERDE: org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe INPUTFORMAT: org.apache.hadoop.hive.ql.io.RCFileInputFormat OUTPUTFORMAT: org.apache.hadoop.hive.ql.io.RCFileOutputFormat Please use `SHOW CREATE TABLE rcFileTable AS SERDE` to show Hive DDL instead. ``` The suggestion's wording is consistent with other error messages thrown by SHOW CREATE TABLE. ### Why are the changes needed? The existing error message is confusing. ### Does this PR introduce _any_ user-facing change? Yes, it improves a user-facing error message. ### How was this patch tested? Manually tested with ``` CREATE TABLE rcFileTable(i INT) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat' SHOW CREATE TABLE rcFileTable ``` to trigger the error. Confirmed that the `AS SERDE` suggestion actually works. Closes #36814 from JoshRosen/suggest-show-create-table-as-serde-in-error-message. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Josh Rosen <joshrosen@databricks.com> (cherry picked from commit 8765eea1c08bc58a0cfc22b7cfbc0b5645cc81f9) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 09 June 2022, 19:41:00 UTC
a74a3e3 [SPARK-39421][PYTHON][DOCS] Pin the docutils version <0.18 in documentation build This PR fixes the Sphinx build failure below (see https://github.com/singhpk234/spark/runs/6799026458?check_suite_focus=true): ``` Moving to python/docs directory and building sphinx. Running Sphinx v3.0.4 WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched. /__w/spark/spark/python/pyspark/pandas/supported_api_gen.py:101: UserWarning: Warning: Latest version of pandas(>=1.4.0) is required to generate the documentation; however, your version was 1.3.5 warnings.warn( Warning, treated as error: node class 'meta' is already registered, its visitors will be overridden make: *** [Makefile:35: html] Error 2 ------------------------------------------------ Jekyll 4.2.1 Please append `--trace` to the `build` command for any additional information or backtrace. ------------------------------------------------ ``` Sphinx build fails apparently with the latest docutils (see also https://issues.apache.org/jira/browse/FLINK-24662). we should pin the version. To recover the CI. No, dev-only. CI in this PR should test it out. Closes #36813 from HyukjinKwon/SPARK-39421. Lead-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c196ff4dfa1d9f1a8e20b884ee5b4a4e6e65a6e3) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 June 2022, 05:27:39 UTC
d42f53b [SPARK-39393][SQL] Parquet data source only supports push-down predicate filters for non-repeated primitive types ### What changes were proposed in this pull request? In Spark version 3.1.0 and newer, Spark creates extra filter predicate conditions for repeated parquet columns. These fields do not have the ability to have a filter predicate, according to the [PARQUET-34](https://issues.apache.org/jira/browse/PARQUET-34) issue in the parquet library. This PR solves this problem until the appropriate functionality is provided by the parquet. Before this PR: Assume follow Protocol buffer schema: ``` message Model { string name = 1; repeated string keywords = 2; } ``` Suppose a parquet file is created from a set of records in the above format with the help of the parquet-protobuf library. Using Spark version 3.1.0 or newer, we get following exception when run the following query using spark-shell: ``` val data = spark.read.parquet("/path/to/parquet") data.registerTempTable("models") spark.sql("select * from models where array_contains(keywords, 'X')").show(false) ``` ``` Caused by: java.lang.IllegalArgumentException: FilterPredicates do not currently support repeated columns. Column keywords is repeated. at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:176) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:149) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:89) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:56) at org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:192) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:61) at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:95) at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:45) at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149) at org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:72) at org.apache.parquet.hadoop.ParquetFileReader.filterRowGroups(ParquetFileReader.java:870) at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:789) at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:657) at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162) at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:373) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) ... ``` The cause of the problem is due to a change in the data filtering conditions: ``` spark.sql("select * from log where array_contains(keywords, 'X')").explain(true); // Spark 3.0.2 and older == Physical Plan == ... +- FileScan parquet [link#0,keywords#1] DataFilters: [array_contains(keywords#1, Google)] PushedFilters: [] ... // Spark 3.1.0 and newer == Physical Plan == ... 
+- FileScan parquet [link#0,keywords#1] DataFilters: [isnotnull(keywords#1), array_contains(keywords#1, Google)] PushedFilters: [IsNotNull(keywords)] ... ``` Pushing filters down for repeated columns of parquet is not necessary because it is not supported by parquet library for now. So we can exclude them from pushed predicate filters and solve issue. ### Why are the changes needed? Predicate filters that are pushed down to parquet should not be created on repeated-type fields. ### Does this PR introduce any user-facing change? No, It's only fixed a bug and before this, due to the limitations of the parquet library, no more work was possible. ### How was this patch tested? Add an extra test to ensure problem solved. Closes #36781 from Borjianamin98/master. Authored-by: Amin Borjian <borjianamin98@outlook.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit ac2881a8c3cfb196722a5680a62ebd6bb9fba728) Signed-off-by: huaxingao <huaxin_gao@apple.com> 08 June 2022, 20:31:28 UTC
d611d1f [SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries ### What changes were proposed in this pull request? Apply the optimizer rule ComputeCurrentTime consistently across subqueries. This is a backport of https://github.com/apache/spark/pull/36654. ### Why are the changes needed? At the moment timestamp functions like now() can return different values within a query if subqueries are involved ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? A new unit test was added Closes #36753 from olaky/SPARK-39259-spark_3_2. Lead-authored-by: Ole Sasse <ole.sasse@databricks.com> Co-authored-by: Josh Rosen <joshrosen@databricks.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com> 07 June 2022, 10:54:46 UTC
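A quick way to observe the property being fixed (hedged sketch; `now()` is an alias of `current_timestamp()`):

```scala
// With ComputeCurrentTime applied consistently across subqueries, the outer and inner
// timestamps come from the same evaluation instant, so this should return true.
spark.sql("SELECT now() = (SELECT now()) AS consistent").show()
```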
d9477dd [SPARK-39376][SQL] Hide duplicated columns in star expansion of subquery alias from NATURAL/USING JOIN ### What changes were proposed in this pull request? Follows up from https://github.com/apache/spark/pull/31666. This PR introduced a bug where the qualified star expansion of a subquery alias containing a NATURAL/USING output duplicated columns. ### Why are the changes needed? Duplicated, hidden columns should not be output from a star expansion. ### Does this PR introduce _any_ user-facing change? The query ``` val df1 = Seq((3, 8)).toDF("a", "b") val df2 = Seq((8, 7)).toDF("b", "d") val joinDF = df1.join(df2, "b") joinDF.alias("r").select("r.*") ``` Now outputs a single column `b`, instead of two (duplicate) columns for `b`. ### How was this patch tested? UTs Closes #36763 from karenfeng/SPARK-39376. Authored-by: Karen Feng <karen.feng@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 June 2022, 13:01:04 UTC
4a529a0 [SPARK-39367][DOCS][SQL][3.2] Make SQL error objects private ### What changes were proposed in this pull request? Partially backport https://github.com/apache/spark/pull/36754 to branch-3.2, make the following objects as private so that they won't show up in API doc: * QueryCompilationErrors * QueryExecutionErrors * QueryParsingErrors ### Why are the changes needed? Fix bug in doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA tests Closes #36761 from gengliangwang/fix3.2Doc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 June 2022, 12:53:39 UTC
5f571a2 [SPARK-39373][SPARK-39273][SPARK-39252][PYTHON][3.2] Recover branch-3.2 build broken by SPARK-39273 and SPARK-39252 ### What changes were proposed in this pull request? Backporting SPARK-39273 and SPARK-39252 brought some mistakes together into branch-3.2. This PR fixes them. One is to avoid an exact match (because of a different pandas version). The other is an unused import. ### Why are the changes needed? To recover the build. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? CI in this PR should test it out. Closes #36759 from HyukjinKwon/SPARK-39373. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 June 2022, 12:52:11 UTC
19ce3e6 [SPARK-38807][CORE] Fix the startup error of spark shell on Windows ### What changes were proposed in this pull request? On Windows, the File.getCanonicalPath method returns a drive-qualified path. The RpcEnvFileServer.validateDirectoryUri method uses File.getCanonicalPath to process the baseUri, which makes the baseUri violate the URI validation rules. For example, `/classes` is turned into `F:\classes`. This causes the SparkContext to fail to start on Windows. This PR modifies the RpcEnvFileServer.validateDirectoryUri method and replaces `new File(baseUri).getCanonicalPath` with `new URI(baseUri).normalize().getPath`, which works correctly on Windows. ### Why are the changes needed? Fix the startup error of spark shell on Windows. [[SPARK-35691](https://issues.apache.org/jira/browse/SPARK-35691)] introduced this regression. ### Does this PR introduce any user-facing change? No ### How was this patch tested? CI Closes #36447 from 1104056452/master. Lead-authored-by: Ming Li <1104056452@qq.com> Co-authored-by: ming li <1104056452@qq.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit a760975083ea0696e8fd834ecfe3fb877b7f7449) Signed-off-by: Sean Owen <srowen@gmail.com> 02 June 2022, 12:44:38 UTC
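The underlying difference can be shown with plain JVM APIs (an illustrative sketch, not the RpcEnvFileServer code itself):

```scala
import java.io.File
import java.net.URI

val baseUri = "/classes"

// On Windows this resolves against the current drive, e.g. "F:\classes",
// which then fails the URI validation described above.
println(new File(baseUri).getCanonicalPath)

// The replacement keeps it a plain, OS-independent path: "/classes".
println(new URI(baseUri).normalize().getPath)
```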
606830e [SPARK-39283][CORE] Fix deadlock between TaskMemoryManager and UnsafeExternalSorter.SpillableIterator ### What changes were proposed in this pull request? This PR fixes a deadlock between TaskMemoryManager and UnsafeExternalSorter.SpillableIterator. ### Why are the changes needed? We are facing the deadlock issue b/w TaskMemoryManager and UnsafeExternalSorter.SpillableIterator during the join. It turns out that in UnsafeExternalSorter.SpillableIterator#spill() function, it tries to get lock on UnsafeExternalSorter`SpillableIterator` and UnsafeExternalSorter and call `freePage` to free all allocated pages except the last one which takes the lock on TaskMemoryManager. At the same time, there can be another `MemoryConsumer` using `UnsafeExternalSorter` as part of sorting can try to allocatePage needs to get lock on `TaskMemoryManager` which can cause spill to happen which requires lock on `UnsafeExternalSorter` again causing deadlock. There is a similar fix here as well: https://issues.apache.org/jira/browse/SPARK-27338 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #36680 from sandeepvinayak/SPARK-39283. Authored-by: sandeepvinayak <sandeep.pal@outlook.com> Signed-off-by: Josh Rosen <joshrosen@databricks.com> (cherry picked from commit 8d0c035f102b005c2e85f03253f1c0c24f0a539f) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 31 May 2022, 22:31:17 UTC
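The lock-ordering cycle described above, reduced to a generic sketch (stand-in locks and method names, not Spark's classes; running the two paths from different threads can genuinely deadlock):

```scala
// Path 1 mirrors SpillableIterator.spill(): sorter monitor first, then the memory manager.
// Path 2 mirrors another consumer allocating a page: memory manager first, then a forced
// spill that needs the sorter monitor. Opposite acquisition order => possible deadlock.
val sorterLock = new Object   // stands in for UnsafeExternalSorter / SpillableIterator
val memMgrLock = new Object   // stands in for TaskMemoryManager

def spillPath(): Unit = sorterLock.synchronized {
  memMgrLock.synchronized { /* freePage(...) for all but the last page */ }
}

def allocatePath(): Unit = memMgrLock.synchronized {
  sorterLock.synchronized { /* forces the sorter to spill */ }
}
```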
30b9e3a [SPARK-39340][SQL][3.2] DS v2 agg pushdown should allow dots in the name of top-level columns ### What changes were proposed in this pull request? In the first version of DS v2 aggregate pushdown, we don't want to support nested fields and we picked `PushableColumnWithoutNestedColumn` to match top-level columns. However, this introduced an unexpected limitation: the column name cannot contain dots. This makes sense for DS v2 filter pushdown to keep backward compatibility, but DS v2 agg pushdown is a completely new feature and we don't need to worry about backward compatibility. This PR removes this limitation. Note that in Spark 3.3, DS v2 agg pushdown supports nested fields and this limitation is gone. ### Why are the changes needed? Unexpected limitation is a mistake ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new tests Closes #36727 from cloud-fan/agg. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 May 2022, 14:16:38 UTC
7b33e39 [SPARK-39293][SQL] Fix the accumulator of ArrayAggregate to handle complex types properly ### What changes were proposed in this pull request? Fix the accumulator of `ArrayAggregate` to handle complex types properly. The accumulator of `ArrayAggregate` should copy the intermediate result if it is a string, struct, array, or map. ### Why are the changes needed? If the intermediate data of `ArrayAggregate` holds reusable data, the result will be duplicated. ```scala import org.apache.spark.sql.functions._ val reverse = udf((s: String) => s.reverse) val df = Seq(Array("abc", "def")).toDF("array") val testArray = df.withColumn( "agg", aggregate( col("array"), array().cast("array<string>"), (acc, s) => concat(acc, array(reverse(s))))) testArray.show(truncate=false) ``` should be: ``` +----------+----------+ |array |agg | +----------+----------+ |[abc, def]|[cba, fed]| +----------+----------+ ``` but: ``` +----------+----------+ |array |agg | +----------+----------+ |[abc, def]|[fed, fed]| +----------+----------+ ``` ### Does this PR introduce _any_ user-facing change? Yes, this fixes the correctness issue. ### How was this patch tested? Added a test. Closes #36674 from ueshin/issues/SPARK-39293/array_aggregate. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit d6a11cb4b411c8136eb241aac167bc96990f5421) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 26 May 2022, 01:36:23 UTC
e5f6c77 [SPARK-39252][PYSPARK][TESTS] Remove flaky test_df_is_empty This PR removes flaky `test_df_is_empty` as reported in https://issues.apache.org/jira/browse/SPARK-39252. I will open a follow-up PR to reintroduce the test and fix the flakiness (or see if it was a regression). No. Existing unit tests. Closes #36656 from sadikovi/SPARK-39252. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9823bb385cd6dca7c4fb5a6315721420ad42f80a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 May 2022, 02:42:02 UTC
d0fd5fc [SPARK-39273][PS][TESTS] Make PandasOnSparkTestCase inherit ReusedSQLTestCase ### What changes were proposed in this pull request? This PR proposes to make `PandasOnSparkTestCase` inherit `ReusedSQLTestCase`. ### Why are the changes needed? We don't need this: ```python classmethod def tearDownClass(cls): # We don't stop Spark session to reuse across all tests. # The Spark session will be started and stopped at PyTest session level. # Please see pyspark/pandas/conftest.py. pass ``` anymore in Apache Spark. This has existed to speed up the tests when the codes are in Koalas repository where the tests run sequentially in single process. In Apache Spark, we run in multiple processes, and we don't need this anymore. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Existing CI should test it out. Closes #36652 from HyukjinKwon/SPARK-39273. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a6dd6076d708713d11585bf7f3401d522ea48822) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 May 2022, 00:57:17 UTC
66172be [SPARK-39258][TESTS] Fix `Hide credentials in show create table` ### What changes were proposed in this pull request? [SPARK-35378-FOLLOWUP](https://github.com/apache/spark/pull/36632) changes the return value of `CommandResultExec.executeCollect()` from `InternalRow` to `UnsafeRow`, this change causes the result of `r.tostring` in the following code: https://github.com/apache/spark/blob/de73753bb2e5fd947f237e731ff05aa9f2711677/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L1143-L1148 change from ``` [CREATE TABLE tab1 ( NAME STRING, THEID INT) USING org.apache.spark.sql.jdbc OPTIONS ( 'dbtable' = 'TEST.PEOPLE', 'password' = '*********(redacted)', 'url' = '*********(redacted)', 'user' = 'testUser') ] ``` to ``` [0,10000000d5,5420455441455243,62617420454c4241,414e20200a282031,4e4952545320454d,45485420200a2c47,a29544e49204449,726f20474e495355,6568636170612e67,732e6b726170732e,a6362646a2e6c71,20534e4f4954504f,7462642720200a28,203d2027656c6261,45502e5453455427,200a2c27454c504f,6f77737361702720,2a27203d20276472,2a2a2a2a2a2a2a2a,6574636164657228,2720200a2c272964,27203d20276c7275,2a2a2a2a2a2a2a2a,746361646572282a,20200a2c27296465,3d20277265737527,7355747365742720,a29277265] ``` and the UT `JDBCSuite$Hide credentials in show create table` failed in master branch. This pr is change to use `executeCollectPublic()` instead of `executeCollect()` to fix this UT. ### Why are the changes needed? Fix UT failed in mater branch after [SPARK-35378-FOLLOWUP](https://github.com/apache/spark/pull/36632) ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? - GitHub Action pass - Manual test Run `mvn clean install -DskipTests -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.JDBCSuite` **Before** ``` - Hide credentials in show create table *** FAILED *** "[0,10000000d5,5420455441455243,62617420454c4241,414e20200a282031,4e4952545320454d,45485420200a2c47,a29544e49204449,726f20474e495355,6568636170612e67,732e6b726170732e,a6362646a2e6c71,20534e4f4954504f,7462642720200a28,203d2027656c6261,45502e5453455427,200a2c27454c504f,6f77737361702720,2a27203d20276472,2a2a2a2a2a2a2a2a,6574636164657228,2720200a2c272964,27203d20276c7275,2a2a2a2a2a2a2a2a,746361646572282a,20200a2c27296465,3d20277265737527,7355747365742720,a29277265]" did not contain "TEST.PEOPLE" (JDBCSuite.scala:1146) ``` **After** ``` Run completed in 24 seconds, 868 milliseconds. Total number of tests run: 93 Suites: completed 2, aborted 0 Tests: succeeded 93, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #36637 from LuciferYang/SPARK-39258. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6eb15d12ae6bd77412dbfbf46eb8dbeec1eab466) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2022, 21:28:36 UTC
ff682c4 [MINOR][ML][DOCS] Fix sql data types link in the ml-pipeline page ### What changes were proposed in this pull request? <img width="939" alt="image" src="https://user-images.githubusercontent.com/8326978/169767919-6c48554c-87ff-4d40-a47d-ec4da0c993f7.png"> [Spark SQL datatype reference](https://spark.apache.org/docs/latest/sql-reference.html#data-types) - `https://spark.apache.org/docs/latest/sql-reference.html#data-types` is invalid and it shall be [Spark SQL datatype reference](https://spark.apache.org/docs/latest/sql-ref-datatypes.html) - `https://spark.apache.org/docs/latest/sql-ref-datatypes.html` https://spark.apache.org/docs/latest/ml-pipeline.html#dataframe ### Why are the changes needed? doc fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? `bundle exec jekyll serve` Closes #36633 from yaooqinn/minor. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit de73753bb2e5fd947f237e731ff05aa9f2711677) Signed-off-by: huaxingao <huaxin_gao@apple.com> 23 May 2022, 14:46:50 UTC
687f36c [SPARK-35378][SQL][FOLLOW-UP] Fix incorrect return type in CommandResultExec.executeCollect() ### What changes were proposed in this pull request? This PR is a follow-up for https://github.com/apache/spark/pull/32513 and fixes an issue introduced by that patch. CommandResultExec is supposed to return `UnsafeRow` records in all of the `executeXYZ` methods but `executeCollect` was left out which causes issues like this one: ``` Error in SQL statement: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow ``` We need to return `unsafeRows` instead of `rows` in `executeCollect` similar to other methods in the class. ### Why are the changes needed? Fixes a bug in CommandResultExec. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a unit test to check the return type of all commands. Closes #36632 from sadikovi/fix-command-exec. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a0decfc7db68c464e3ba2c2fb0b79a8b0c464684) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 May 2022, 08:59:24 UTC
0eb0abe [SPARK-39237][DOCS][3.2] Update the ANSI SQL mode documentation ### What changes were proposed in this pull request? 1. Remove the Experimental notation in ANSI SQL compliance doc 2. Update the description of `spark.sql.ansi.enabled` ### Why are the changes needed? 1. The ANSI SQL dialect is GAed in Spark 3.2 release: https://spark.apache.org/releases/spark-release-3-2-0.html We should not mark it as "Experimental" in the doc. 2. Mention type coercion in the doc of `spark.sql.ansi.enabled` ### Does this PR introduce _any_ user-facing change? No, just doc change ### How was this patch tested? Doc preview: <img width="700" alt="image" src="https://user-images.githubusercontent.com/1097932/169444094-de9c33c2-1b01-4fc3-b583-b752c71e16d8.png"> <img width="699" alt="image" src="https://user-images.githubusercontent.com/1097932/169499090-690ce919-1de8-4a64-bd12-32fe5549d890.png"> Closes #36618 from gengliangwang/backportAnsiDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 21 May 2022, 05:32:57 UTC
68d8989 [SPARK-39240][INFRA][BUILD] Source and binary releases using different tool to generate hashes for integrity ### What changes were proposed in this pull request? Unify the hash generator for release files. ### Why are the changes needed? Currently, we use `shasum` for source but `gpg` for binary, since https://github.com/apache/spark/pull/30123. This is confusing when validating the integrity of the Spark 3.3.0 RC at https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/ ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested the script manually. Closes #36619 from yaooqinn/SPARK-39240. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 3e783375097d14f1c28eb9b0e08075f1f8daa4a2) Signed-off-by: Sean Owen <srowen@gmail.com> 20 May 2022, 15:55:10 UTC
6587d29 [SPARK-39219][DOC] Promote Structured Streaming over DStream ### What changes were proposed in this pull request? This PR proposes to add NOTE section for DStream guide doc to promote Structured Streaming. Screenshot: <img width="992" alt="screenshot-spark-streaming-programming-guide-change" src="https://user-images.githubusercontent.com/1317309/168977732-4c32db9a-0fb1-4a82-a542-bf385e5f3683.png"> ### Why are the changes needed? We see efforts of community are more focused on Structured Streaming (based on Spark SQL) than Spark Streaming (DStream). We would like to encourage end users to use Structured Streaming than Spark Streaming whenever possible for their workloads. ### Does this PR introduce _any_ user-facing change? Yes, doc change. ### How was this patch tested? N/A Closes #36590 from HeartSaVioR/SPARK-39219. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 7d153392db2f61104da0af1cb175f4ee7c7fbc38) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 19 May 2022, 02:50:54 UTC
cb85686 [SPARK-39104][SQL] InMemoryRelation#isCachedColumnBuffersLoaded should be thread-safe ### What changes were proposed in this pull request? Add `synchronized` on method `isCachedColumnBuffersLoaded` ### Why are the changes needed? `isCachedColumnBuffersLoaded` should has `synchronized` wrapped, otherwise may cause NPE when modify `_cachedColumnBuffers` concurrently. ``` def isCachedColumnBuffersLoaded: Boolean = { _cachedColumnBuffers != null && isCachedRDDLoaded } def isCachedRDDLoaded: Boolean = { _cachedColumnBuffersAreLoaded || { val bmMaster = SparkEnv.get.blockManager.master val rddLoaded = _cachedColumnBuffers.partitions.forall { partition => bmMaster.getBlockStatus(RDDBlockId(_cachedColumnBuffers.id, partition.index), false) .exists { case(_, blockStatus) => blockStatus.isCached } } if (rddLoaded) { _cachedColumnBuffersAreLoaded = rddLoaded } rddLoaded } } ``` ``` java.lang.NullPointerException at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedRDDLoaded(InMemoryRelation.scala:247) at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedColumnBuffersLoaded(InMemoryRelation.scala:241) at org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8(CacheManager.scala:189) at org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8$adapted(CacheManager.scala:176) at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303) at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297) at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) at scala.collection.TraversableLike.filter(TraversableLike.scala:395) at scala.collection.TraversableLike.filter$(TraversableLike.scala:395) at scala.collection.AbstractTraversable.filter(Traversable.scala:108) at org.apache.spark.sql.execution.CacheManager.recacheByCondition(CacheManager.scala:219) at org.apache.spark.sql.execution.CacheManager.uncacheQuery(CacheManager.scala:176) at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3220) at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3231) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT. Closes #36496 from pan3793/SPARK-39104. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 3c8d8d7a864281fbe080316ad8de9b8eac80fa71) Signed-off-by: Sean Owen <srowen@gmail.com> 17 May 2022, 23:27:14 UTC
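The gist of the fix is easiest to see in isolation. Below is a hedged sketch of the locking pattern, where `CachedBufferHolder`, `buffers`, and the method names are stand-ins for illustration, not Spark's actual `CachedRDDBuilder` code.
```scala
// Sketch only: the null check and the dereference happen under the same lock that
// writers take, so a concurrent clear() cannot null the field in between.
class CachedBufferHolder {
  private var buffers: Array[Byte] = _ // stand-in for _cachedColumnBuffers

  def set(b: Array[Byte]): Unit = synchronized { buffers = b }

  def clear(): Unit = synchronized { buffers = null }

  // Equivalent in spirit to isCachedColumnBuffersLoaded after the patch.
  def isLoaded: Boolean = synchronized {
    buffers != null && buffers.nonEmpty
  }
}
```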
e7060b7 [SPARK-39183][BUILD] Upgrade Apache Xerces Java to 2.12.2 ### What changes were proposed in this pull request? Upgrade Apache Xerces Java to 2.12.2 [Release notes](https://xerces.apache.org/xerces2-j/releases.html) ### Why are the changes needed? [Infinite Loop in Apache Xerces Java](https://github.com/advisories/GHSA-h65f-jvqw-m9fj) There's a vulnerability within the Apache Xerces Java (XercesJ) XML parser when handling specially crafted XML document payloads. This causes, the XercesJ XML parser to wait in an infinite loop, which may sometimes consume system resources for prolonged duration. This vulnerability is present within XercesJ version 2.12.1 and the previous versions. References https://nvd.nist.gov/vuln/detail/CVE-2022-23437 https://lists.apache.org/thread/6pjwm10bb69kq955fzr1n0nflnjd27dl http://www.openwall.com/lists/oss-security/2022/01/24/3 https://www.oracle.com/security-alerts/cpuapr2022.html ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #36544 from bjornjorgensen/Upgrade-xerces-to-2.12.2. Authored-by: bjornjorgensen <bjornjorgensen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 181436bd990d3bdf178a33fa6489ad416f3e7f94) Signed-off-by: Sean Owen <srowen@gmail.com> 16 May 2022, 23:10:28 UTC
5481840 [SPARK-39186][PYTHON] Make pandas-on-Spark's skew consistent with pandas The logic for computing skewness differs between Spark SQL and pandas: Spark SQL: [`sqrt(n) * m3 / sqrt(m2 * m2 * m2)`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L304) pandas: [`(count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2**1.5)`](https://github.com/pandas-dev/pandas/blob/main/pandas/core/nanops.py#L1221) This change makes the computation consistent with pandas. Yes, the logic to compute skew was changed. Added a UT. Closes #36549 from zhengruifeng/adjust_pandas_skew. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7e4519c9a8ba35958ef6d408be3ca4e97917c965) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 386c75693b5b9dd5e3b2147d49f0284badaa7d6d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 May 2022, 00:31:57 UTC
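The two linked formulas differ only by a bias-correction factor. The throwaway sketch below (names such as `SkewFormulas`, `sparkSkew`, and `pandasSkew` are ad hoc, not Spark or pandas APIs) evaluates both on a small sample, taking `m2` and `m3` to be the sums of squared and cubed deviations as in the quoted code.
```scala
object SkewFormulas {
  // Returns (n, sum of squared deviations, sum of cubed deviations).
  private def moments(xs: Seq[Double]): (Double, Double, Double) = {
    val n = xs.size.toDouble
    val mean = xs.sum / n
    val m2 = xs.map(x => math.pow(x - mean, 2)).sum
    val m3 = xs.map(x => math.pow(x - mean, 3)).sum
    (n, m2, m3)
  }

  // Spark SQL style: sqrt(n) * m3 / sqrt(m2^3)
  def sparkSkew(xs: Seq[Double]): Double = {
    val (n, m2, m3) = moments(xs)
    math.sqrt(n) * m3 / math.sqrt(m2 * m2 * m2)
  }

  // pandas style: adds the bias-correction factor n * sqrt(n - 1) / (n - 2)
  def pandasSkew(xs: Seq[Double]): Double = {
    val (n, m2, m3) = moments(xs)
    (n * math.sqrt(n - 1) / (n - 2)) * (m3 / math.pow(m2, 1.5))
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(1.0, 2.0, 2.0, 3.0, 10.0)
    println(f"spark-style: ${sparkSkew(sample)}%.4f, pandas-style: ${pandasSkew(sample)}%.4f")
  }
}
```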
df0347c [SPARK-37544][SQL] Correct date arithmetic in sequences ### What changes were proposed in this pull request? Change `InternalSequenceBase` to pass a time-zone aware value to `DateTimeUtils#timestampAddInterval`, rather than a time-zone agnostic value, when performing `Date` arithmetic. ### Why are the changes needed? The following query gets the wrong answer if run in the America/Los_Angeles time zone: ``` spark-sql> select sequence(date '2021-01-01', date '2022-01-01', interval '3' month) x; [2021-01-01,2021-03-31,2021-06-30,2021-09-30,2022-01-01] Time taken: 0.664 seconds, Fetched 1 row(s) spark-sql> ``` The answer should be ``` [2021-01-01,2021-04-01,2021-07-01,2021-10-01,2022-01-01] ``` `InternalSequenceBase` converts the date to micros by multiplying days by micros per day. This converts the date into a time-zone agnostic timestamp. However, `InternalSequenceBase` uses `DateTimeUtils#timestampAddInterval` to perform the arithmetic, and that function assumes a _time-zone aware_ timestamp. One simple fix would be to call `DateTimeUtils#timestampNTZAddInterval` instead for date arithmetic. However, Spark date arithmetic is typically time-zone aware (see the comment in the test added by this PR), so this PR converts the date to a time-zone aware value before calling `DateTimeUtils#timestampAddInterval`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #36546 from bersprockets/date_sequence_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 14ee0d8f04f218ad61688196a0b984f024151468) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 May 2022, 00:26:41 UTC
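A minimal way to re-run the query from the description end to end; the session setup below (local master and the `spark.sql.session.timeZone` setting) is illustrative boilerplate, not part of the patch.
```scala
import org.apache.spark.sql.SparkSession

object DateSequenceCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .config("spark.sql.session.timeZone", "America/Los_Angeles")
      .getOrCreate()

    // With the fix, every step lands on the first of the month:
    // [2021-01-01, 2021-04-01, 2021-07-01, 2021-10-01, 2022-01-01]
    spark.sql(
      "select sequence(date '2021-01-01', date '2022-01-01', interval '3' month) as x"
    ).show(false)

    spark.stop()
  }
}
```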
12d6ecc [SPARK-39174][SQL] Catalogs loading swallows missing classname for ClassNotFoundException ### What changes were proposed in this pull request? This PR captures the actual missing class name when catalog loading hits a ClassNotFoundException. ### Why are the changes needed? A ClassNotFoundException can occur when dependencies are missing; we should not always report that the catalog class itself is missing. ### Does this PR introduce _any_ user-facing change? Yes, when loading catalogs and a ClassNotFoundException occurs, the error now shows the correct missing class. ### How was this patch tested? New test added. Closes #36534 from yaooqinn/SPARK-39174. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1b37f19876298e995596a30edc322c856ea1bbb4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 May 2022, 10:45:54 UTC
6f9e303 [SPARK-39060][SQL][3.2] Typo in error messages of decimal overflow ### What changes were proposed in this pull request? This PR removes extra curly bracket from debug string for Decimal type in SQL. This is a backport from master branch. Commit: https://github.com/apache/spark/commit/165ce4eb7d6d75201beb1bff879efa99fde24f94 ### Why are the changes needed? Typo in error messages of decimal overflow. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running tests: ``` $ build/sbt "sql/testOnly" ``` Closes #36458 from vli-databricks/SPARK-39060-3.2. Authored-by: Vitalii Li <vitalii.li@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 12 May 2022, 05:13:51 UTC
d1cae9c [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index ### What changes were proposed in this pull request? Remove outdated statements on distributed-sequence default index. ### Why are the changes needed? Since distributed-sequence default index is updated to be enforced only while execution, there are stale statements in documents to be removed. ### Does this PR introduce _any_ user-facing change? No. Doc change only. ### How was this patch tested? Manual tests. Closes #36513 from xinrong-databricks/defaultIndexDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f37150a5549a8f3cb4c1877bcfd2d1459fc73cac) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2022, 00:13:49 UTC
345bf8c Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index" This reverts commit 0a9a21df7dea62c5f129d91c049d0d3408d0366e. 12 May 2022, 00:12:39 UTC
0a9a21d [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index ### What changes were proposed in this pull request? Remove outdated statements on distributed-sequence default index. ### Why are the changes needed? Since distributed-sequence default index is updated to be enforced only while execution, there are stale statements in documents to be removed. ### Does this PR introduce _any_ user-facing change? No. Doc change only. ### How was this patch tested? Manual tests. Closes #36513 from xinrong-databricks/defaultIndexDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit cec1e7b4e68deac321f409d424a3acdcd4cb91be) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2022, 00:08:08 UTC
69df6ec [SPARK-39107][SQL] Account for empty string input in regex replace ### What changes were proposed in this pull request? When trying to perform a regex replace, account for the possibility of having empty strings as input. ### Why are the changes needed? https://github.com/apache/spark/pull/29891 was merged to address https://issues.apache.org/jira/browse/SPARK-30796 and introduced a bug that would not allow regex matching on empty strings, as it would account for position within substring but not consider the case where input string has length 0 (empty string) From https://issues.apache.org/jira/browse/SPARK-39107 there is a change in behavior between spark versions. 3.0.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | <empty>| +---+--------+ ``` 3.1.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | | +---+--------+ ``` The 3.0.2 outcome is the expected and correct one ### Does this PR introduce _any_ user-facing change? Yes compared to spark 3.2.1, as it brings back the correct behavior when trying to regex match empty strings, as shown in the example above. ### How was this patch tested? Added special casing test in `RegexpExpressionsSuite.RegexReplace` with empty string replacement. Closes #36457 from LorenzoMartini/lmartini/fix-empty-string-replace. Authored-by: Lorenzo Martini <lmartini@palantir.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 731aa2cdf8a78835621fbf3de2d3492b27711d1a) Signed-off-by: Sean Owen <srowen@gmail.com> 10 May 2022, 00:45:12 UTC
04161ac [SPARK-38786][SQL][TEST] Bug in StatisticsSuite 'change stats after add/drop partition command' ### What changes were proposed in this pull request? https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979 It should be `partDir2` instead of `partDir1`. Looks like it is a copy paste bug. ### Why are the changes needed? Due to this test bug, the drop command was dropping a wrong (`partDir1`) underlying file in the test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added extra underlying file location check. Closes #36075 from kazuyukitanimura/SPARK-38786. Authored-by: Kazuyuki Tanimura <ktanimura@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> (cherry picked from commit a6b04f007c07fe00637aa8be33a56f247a494110) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a52a245d11c20b0360d463c973388f3ee05768ac) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 May 2022, 00:33:57 UTC
599adc8 [SPARK-39083][CORE] Fix race condition between update and clean app data ### What changes were proposed in this pull request? make `cleanAppData` atomic to prevent race condition between update and clean app data. When the race condition happens, it could lead to a scenario when `cleanAppData` delete the entry of ApplicationInfoWrapper for an application right after it has been updated by `mergeApplicationListing`. So there will be cases when the HS Web UI displays `Application not found` for applications whose logs does exist. #### Error message ``` 22/04/29 17:16:21 DEBUG FsHistoryProvider: New/updated attempts found: 1 ArrayBuffer(viewfs://iu/log/spark3/application_1651119726430_138107_1) 22/04/29 17:16:21 INFO FsHistoryProvider: Parsing viewfs://iu/log/spark3/application_1651119726430_138107_1 for listing data... 22/04/29 17:16:21 INFO FsHistoryProvider: Looking for end event; skipping 10805037 bytes from viewfs://iu/log/spark3/application_1651119726430_138107_1... 22/04/29 17:16:21 INFO FsHistoryProvider: Finished parsing viewfs://iu/log/spark3/application_1651119726430_138107_1 22/04/29 17:16:21 ERROR Utils: Uncaught exception in thread log-replay-executor-7 java.util.NoSuchElementException at org.apache.spark.util.kvstore.InMemoryStore.read(InMemoryStore.java:85) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkAndCleanLog$3(FsHistoryProvider.scala:927) at scala.Option.foreach(Option.scala:407) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkAndCleanLog$1(FsHistoryProvider.scala:926) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:2032) at org.apache.spark.deploy.history.FsHistoryProvider.checkAndCleanLog(FsHistoryProvider.scala:916) at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:712) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$15(FsHistoryProvider.scala:576) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` #### Background Currently, the HS runs the `checkForLogs` to build the application list based on the current contents of the log directory for every 10 seconds by default. - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L296-L299 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L472 In each turn of execution, this method scans the specified logDir and parse the log files to update its KVStore: - detect new updated/added files to process : https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L574-L578 - detect stale data to remove: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600 These 2 operations are executed in different threads as `submitLogProcessTask` uses `replayExecutor` to submit tasks. https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1389-L1401 ### When does the bug happen? 
`Application not found` error happens in the following scenario: In the first run of `checkForLogs`, it detected a newly-added log `viewfs://iu/log/spark3/AAA_1.inprogress` (log of an in-progress application named AAA). So it will add 2 entries to the KVStore - one entry of key-value as the key is the logPath (`viewfs://iu/log/spark3/AAA_1.inprogress`) and the value is an instance of LogInfo represented the log - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L495-L505 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L545-L552 - one entry of key-value as the key is the applicationId (`AAA`) and the value is an instance of ApplicationInfoWrapper holding the information of the application. - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L825 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172 In the next run of `checkForLogs`, now the AAA application has finished, the log `viewfs://iu/log/spark3/AAA_1.inprogress` has been deleted and a new log `viewfs://iu/log/spark3/AAA_1` is created. So `checkForLogs` will do the following 2 things in 2 different threads: - Thread 1: parsing the new log `viewfs://iu/log/spark3/AAA_1` and update data in its KVStore - add a new entry of key: `viewfs://iu/log/spark3/AAA_1` and value: an instance of LogInfo represented the log - updated the entry with key=applicationId (`AAA`) with new value of an instance of ApplicationInfoWrapper (for example: the isCompleted flag now change from false to true) - Thread 2: data related to `viewfs://iu/log/spark3/AAA_1.inprogress` is now considered as stale and it must be deleted. - clean App data for `viewfs://iu/log/spark3/AAA_1.inprogress` https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600 - Inside `cleanAppData`, first it loads the latest information of `ApplicationInfoWrapper` from the KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L632 For most of the time, when this line is executed, Thread 1 already finished `updating the entry with key=applicationId (AAA) with new value of an instance of ApplicationInfoWrapper` so this condition https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L637 will be evaluated as false, so `isStale` will be false. However, in some rare cases, when Thread1 does not finish the update yet, the old data of ApplicationInfoWrapper will be load, so `isStale` will be true and it leads to deleting the entry of ApplicationInfoWrapper in KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L656-L662 and the worst thing is it delete the entry right after when Thread 1 has finished updating the entry with key=applicationId (`AAA`) with new value of an instance of ApplicationInfoWrapper. So the entry for the ApplicationInfoWrapper of applicationId= `AAA` is removed forever then when users access the Web UI for this application, and `Application not found` is shown up while the log for the app does exist. 
So here we make the `cleanAppData` method atomic just like the `addListing` method https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172 so that - If Thread 1 gets the lock on the `listing` before Thread 2, it will update the entry for the application, so in Thread 2 `isStale` will be false and the entry for the application will not be removed from the KVStore - If Thread 2 gets the lock on the `listing` before Thread 1, then `isStale` will be true and the entry for the application will be removed from the KVStore, but after that it will be added again by Thread 1. In either case, the entry for the application will not be deleted unexpectedly from the KVStore. ### Why are the changes needed? Fix the bug causing the HS Web UI to display `Application not found` for applications whose logs do exist. ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Manual test. Deployed in our Spark HS and the `java.util.NoSuchElementException` exception does not happen anymore. The `Application not found` error does not happen anymore. Closes #36424 from tanvn/SPARK-39083. Authored-by: tan.vu <tan.vu@linecorp.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 29643265a9f5e8142d20add5350c614a55161451) Signed-off-by: Sean Owen <srowen@gmail.com> 08 May 2022, 13:09:42 UTC
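A stripped-down sketch of the synchronization pattern described above; `ListingStore`, its fields, and the `isStale` callback are simplifications for illustration, not the actual FsHistoryProvider code.
```scala
import scala.collection.mutable

// Both the updater and the cleaner lock the same `listing`, so the cleaner can never
// act on a half-updated application entry (and vice versa).
class ListingStore {
  private val listing = mutable.Map[String, String]() // appId -> serialized app info

  def addListing(appId: String, info: String): Unit = listing.synchronized {
    listing(appId) = info
  }

  def cleanAppData(appId: String, isStale: String => Boolean): Unit = listing.synchronized {
    // The staleness check and the removal are now atomic with respect to addListing.
    listing.get(appId).filter(isStale).foreach(_ => listing.remove(appId))
  }
}
```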
744b5f4 [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. Fixes the JVM crash when checking isEmpty() on a dataset. No. I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 May 2022, 23:33:59 UTC
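Conceptually, the fix relies on an iterator wrapper that stops yielding rows once the task is done (Spark's existing ContextAwareIterator plays that role). The standalone sketch below uses a plain boolean callback in place of TaskContext, so the names and wiring are illustrative only.
```scala
class StopOnCompletionIterator[T](underlying: Iterator[T], isTaskCompleted: () => Boolean)
    extends Iterator[T] {
  // Once the task is marked completed, stop pulling from the underlying iterator.
  override def hasNext: Boolean = !isTaskCompleted() && underlying.hasNext
  override def next(): T = underlying.next()
}

object StopOnCompletionIteratorExample {
  def main(args: Array[String]): Unit = {
    var completed = false
    val it = new StopOnCompletionIterator(Iterator.from(1), () => completed)
    println(it.take(3).toList) // List(1, 2, 3)
    completed = true
    println(it.hasNext)        // false: nothing more is consumed after task completion
  }
}
```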
514d189 Revert "[SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion" This reverts commit 4dba99ae359b07f814f68707073414f60616b564. 02 May 2022, 23:33:22 UTC
4dba99a [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. Fixes the JVM crash when checking isEmpty() on a dataset. No. I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 May 2022, 23:31:40 UTC
5b35cae [SPARK-39032][PYTHON][DOCS] Examples' tag for pyspark.sql.functions.when() ### What changes were proposed in this pull request? Fix missing keyword for `pyspark.sql.functions.when()` documentation. ### Why are the changes needed? [Documentation](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.when.html) is not formatted correctly ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? All tests passed. Closes #36369 from vadim/SPARK-39032. Authored-by: vadim <86705+vadim@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5b53bdfa83061c160652e07b999f996fc8bd2ece) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 April 2022, 07:56:43 UTC
3690c8c [SPARK-38868][SQL][3.2] Don't propagate exceptions from filter predicate when optimizing outer joins Backport of #36230 ### What changes were proposed in this pull request? Change `EliminateOuterJoin#canFilterOutNull` to return `false` when a `where` condition throws an exception. ### Why are the changes needed? Consider this query: ``` select * from (select id, id as b from range(0, 10)) l left outer join (select id, id + 1 as c from range(0, 10)) r on l.id = r.id where assert_true(c > 0) is null; ``` The query should succeed, but instead fails with ``` java.lang.RuntimeException: '(c#1L > cast(0 as bigint))' is not true! ``` This happens even though there is no row where `c > 0` is false. The `EliminateOuterJoin` rule checks if it can convert the outer join to a inner join based on the expression in the where clause, which in this case is ``` assert_true(c > 0) is null ``` `EliminateOuterJoin#canFilterOutNull` evaluates that expression with `c` set to `null` to see if the result is `null` or `false`. That rule doesn't expect the result to be a `RuntimeException`, but in this case it always is. That is, the assertion is failing during optimization, not at run time. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #36341 from bersprockets/outer_join_eval_assert_issue_32. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 April 2022, 07:38:22 UTC
d790347 [SPARK-39030][PYTHON] Rename sum to avoid shading the builtin Python function ### What changes were proposed in this pull request? Rename `sum` to something that does not shadow the built-in. ### Why are the changes needed? `sum` is a built-in function in Python. [sum() at the Python docs](https://docs.python.org/3/library/functions.html#sum) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Used existing tests. Closes #36364 from bjornjorgensen/rename-sum. Authored-by: bjornjorgensen <bjornjorgensen@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 3821d807a599a2d243465b4e443f1eb68251d432) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 April 2022, 01:10:34 UTC
93289a5 Revert "[SPARK-38916][CORE] Tasks not killed caused by race conditions between killTask() and launchTask()" This reverts commit 9dd64d40c91253c275fef2313c6a326ef72112cb. 25 April 2022, 03:57:51 UTC
4ab3b11 [SPARK-38955][SQL] Disable lineSep option in 'from_csv' and 'schema_of_csv' This PR proposes to disable the `lineSep` option in the `from_csv` and `schema_of_csv` expressions by setting it to a noncharacter, `\UFFFF`, per the [unicode specification](https://www.unicode.org/charts/PDF/UFFF0.pdf). According to the specification, this code point can be used for internal purposes in a program. The Univocity parser does not allow omitting the line separator (from my code reading), so this approach was proposed. This specific code path is not affected by our `encoding` or `charset` option because the Univocity parser handles the characters as Unicode internally. Currently, this option is weirdly effective. See the example of `from_csv` below: ```scala import org.apache.spark.sql.types._ import org.apache.spark.sql.functions._ Seq[String]("1,\n2,3,4,5").toDF.select( col("value"), from_csv( col("value"), StructType(Seq(StructField("a", LongType), StructField("b", StringType) )), Map[String,String]())).show() ``` ``` +-----------+---------------+ | value|from_csv(value)| +-----------+---------------+ |1,\n2,3,4,5| {1, null}| +-----------+---------------+ ``` `{1, null}` has to be `{1, \n2}`. The CSV expressions cannot easily support this because `lineSep` is a plan-wise option that can change the number of returned rows; however, the expressions are designed to emit one row only, whereas the option is easily effective in the scan plan with the CSV data source. Therefore, we should disable this option. Yes, the line separator characters can now appear in the output of `from_csv` and `schema_of_csv`. Manually tested, and a unit test was added. Closes #36294 from HyukjinKwon/SPARK-38955. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f3cc2814d4bc585dad92c9eca9a593d1617d27e9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit d63e42d128b8814e885b86533f187724fbb7e9fd) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 23 April 2022, 07:50:12 UTC
38d8189 [SPARK-38977][SQL] Fix schema pruning with correlated subqueries ### What changes were proposed in this pull request? This PR fixes schema pruning for queries with multiple correlated subqueries. Previously, Spark would throw an exception trying to determine root fields in `SchemaPruning$identifyRootFields`. That was happening because expressions in predicates that referenced attributes in subqueries were not ignored. That's why attributes from multiple subqueries could conflict with each other (e.g. incompatible types) even though they should be ignored. For instance, the following query would throw a runtime exception. ``` SELECT name FROM contacts c WHERE EXISTS (SELECT 1 FROM ids i WHERE i.value = c.id) AND EXISTS (SELECT 1 FROM first_names n WHERE c.name.first = n.value) ``` ``` [info] org.apache.spark.SparkException: Failed to merge fields 'value' and 'value'. Failed to merge incompatible data types int and string [info] at org.apache.spark.sql.errors.QueryExecutionErrors$.failedMergingFieldsError(QueryExecutionErrors.scala:936) ``` ### Why are the changes needed? These changes are needed to avoid exceptions for some queries with multiple correlated subqueries. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR comes with tests. Closes #36303 from aokolnychyi/spark-38977. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 0c9947dabcb71de414c97c0e60a1067e468f2642) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 22 April 2022, 21:12:14 UTC
1d524a8 [SPARK-38992][CORE] Avoid using bash -c in ShellBasedGroupsMappingProvider ### What changes were proposed in this pull request? This PR proposes to avoid using `bash -c` in `ShellBasedGroupsMappingProvider`. This could allow users a command injection. ### Why are the changes needed? For a security purpose. ### Does this PR introduce _any_ user-facing change? Virtually no. ### How was this patch tested? Manually tested. Closes #36315 from HyukjinKwon/SPARK-38992. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c83618e4e5fc092829a1f2a726f12fb832e802cc) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 April 2022, 10:01:51 UTC
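For context, here is a hedged sketch of the argv-based style this kind of fix moves toward; the command (`echo`), the object name, and the hostile input are made up for illustration and are not the provider's actual code.
```scala
import scala.sys.process._

object NoShellSketch {
  def main(args: Array[String]): Unit = {
    val userName = "alice; rm -rf /tmp/scratch" // hostile input, illustration only

    // Risky pattern (not executed here): Seq("bash", "-c", s"echo $userName").!!
    // The shell re-parses the string, so the injected command after ';' would run.

    // Safer pattern: the value is a single argv element and is never parsed by a shell,
    // so the ';' and everything after it stay inert.
    val out = Seq("echo", userName).!!
    println(out.trim) // prints the literal string "alice; rm -rf /tmp/scratch"
  }
}
```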
f8263bb [SPARK-38990][SQL] Avoid `NullPointerException` when evaluating date_trunc/trunc format as a bound reference ### What changes were proposed in this pull request? Change `TruncInstant.evalHelper` to pass the input row to `format.eval` when `format` is a not a literal (and therefore might be a bound reference). ### Why are the changes needed? This query fails with a `java.lang.NullPointerException`: ``` select date_trunc(col1, col2) from values ('week', timestamp'2012-01-01') as data(col1, col2); ``` This only happens if the data comes from an inline table. When the source is an inline table, `ConvertToLocalRelation` attempts to evaluate the function against the data in interpreted mode. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Update to unit tests. Closes #36312 from bersprockets/date_trunc_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 2e4f4abf553cedec1fa8611b9494a01d24e6238a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 April 2022, 03:31:08 UTC
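For convenience, the failing query from the description wrapped so it can be pasted into spark-shell (assuming a SparkSession named `spark`); with the fix it truncates the timestamp to the start of its week instead of throwing a NullPointerException while ConvertToLocalRelation evaluates the expression.
```scala
spark.sql(
  """select date_trunc(col1, col2)
    |from values ('week', timestamp'2012-01-01') as data(col1, col2)""".stripMargin
).show(false)
```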
83f7f40 [MINOR][DOCS] Also remove Google Analytics from Spark release docs, per ASF policy ### What changes were proposed in this pull request? Remove Google Analytics from Spark release docs. See also https://github.com/apache/spark-website/pull/384 ### Why are the changes needed? New ASF privacy policy requirement ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #36310 from srowen/PrivacyPolicy. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7a58670e2e68ee4950cf62c2be236e00eb8fc44b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 April 2022, 02:26:44 UTC
9dd64d4 [SPARK-38916][CORE] Tasks not killed caused by race conditions between killTask() and launchTask() ### What changes were proposed in this pull request? This PR fixes the race conditions between the killTask() call and the launchTask() call that sometimes causes tasks not to be killed properly. If killTask() probes the map of pendingTasksLaunches before launchTask() has had a chance to put the corresponding task into that map, the kill flag will be lost and the subsequent launchTask() call will just proceed and run that task without knowing this task should be killed instead. The fix adds a kill mark during the killTask() call so that subsequent launchTask() can pick up the kill mark and call kill() on the TaskLauncher. If killTask() happens to happen after the task has finished and thus makes the kill mark useless, it will be cleaned up in a background thread. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UTs. Closes #36238 from maryannxue/spark-38916. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit bb5092b9af60afdceeccb239d14be660f77ae0ea) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 April 2022, 08:31:43 UTC
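A simplified sketch of the kill-mark idea (class and field names are invented, the background cleanup of stale marks is omitted, and this is not the executor's real bookkeeping):
```scala
import scala.collection.mutable

class TaskTable {
  private val running = mutable.Map[Long, String]()   // taskId -> task payload
  private val killMarks = mutable.Map[Long, String]() // taskId -> kill reason

  def killTask(taskId: Long, reason: String): Unit = synchronized {
    if (running.remove(taskId).isEmpty) {
      // The task has not been launched yet: keep the request instead of losing it.
      killMarks(taskId) = reason
    }
  }

  def launchTask(taskId: Long, payload: String): Boolean = synchronized {
    killMarks.remove(taskId) match {
      case Some(_) => false // a kill arrived first, so the task is never started
      case None =>
        running(taskId) = payload
        true
    }
  }
}
```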
6bbfb24 [SPARK-38936][SQL] Script transform feed thread should have name ### What changes were proposed in this pull request? Re-add the thread name (`Thread-ScriptTransformation-Feed`). ### Why are the changes needed? The feed thread name was lost after the [SPARK-32105](https://issues.apache.org/jira/browse/SPARK-32105) refactoring. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #36245 from cxzl25/SPARK-38936. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4dc12eb54544a12ff7ddf078ca8bcec9471212c3) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 April 2022, 02:34:58 UTC
662c5df [SPARK-38922][CORE] TaskLocation.apply throw NullPointerException ### What changes were proposed in this pull request? `TaskLocation.apply` without a null check may throw an NPE and fail job scheduling: ``` Caused by: java.lang.NullPointerException at scala.collection.immutable.StringLike$class.stripPrefix(StringLike.scala:155) at scala.collection.immutable.StringOps.stripPrefix(StringOps.scala:29) at org.apache.spark.scheduler.TaskLocation$.apply(TaskLocation.scala:71) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal ``` For instance, `org.apache.spark.rdd.HadoopRDD#convertSplitLocationInfo` might generate unexpected `Some(null)` elements, which should be replaced by `Option.apply`. ### Why are the changes needed? Fix the NPE. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New tests Closes #36222 from yaooqinn/SPARK-38922. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 33e07f3cd926105c6d28986eb6218f237505549e) Signed-off-by: Kent Yao <yao@apache.org> 20 April 2022, 06:39:09 UTC
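The whole fix boils down to one idiom, shown below for a Scala REPL; the `hdfs_cache_` prefix is only an example string standing in for whatever TaskLocation strips.
```scala
val host: String = null
val bad: Option[String] = Some(host)    // Some(null): the null is smuggled through
val good: Option[String] = Option(host) // None: Option.apply folds null away
println(good.map(_.stripPrefix("hdfs_cache_"))) // None, no NPE
// bad.map(_.stripPrefix("hdfs_cache_"))        // would throw NullPointerException
```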
cc097a4 [SPARK-38931][SS] Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint ### What changes were proposed in this pull request? Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint. ### Why are the changes needed? If this fix is not introduced, we might meet exception below: ~~~java File /private/var/folders/rk/wyr101_562ngn8lp7tbqt7_00000gp/T/spark-ce4a0607-b1d8-43b8-becd-638c6b030019/state/1/1 does not exist java.io.FileNotFoundException: File /private/var/folders/rk/wyr101_562ngn8lp7tbqt7_00000gp/T/spark-ce4a0607-b1d8-43b8-becd-638c6b030019/state/1/1 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:93) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:353) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:400) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:626) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:701) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:697) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:703) at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createTempFile(CheckpointFileManager.scala:327) at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:140) at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:143) at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createAtomic(CheckpointFileManager.scala:333) at org.apache.spark.sql.execution.streaming.state.RocksDBFileManager.zipToDfsFile(RocksDBFileManager.scala:438) at org.apache.spark.sql.execution.streaming.state.RocksDBFileManager.saveCheckpointToDfs(RocksDBFileManager.scala:174) at org.apache.spark.sql.execution.streaming.state.RocksDBSuite.saveCheckpointFiles(RocksDBSuite.scala:566) at org.apache.spark.sql.execution.streaming.state.RocksDBSuite.$anonfun$new$35(RocksDBSuite.scala:179) ........ ~~~ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested via RocksDBSuite. Closes #36242 from Myasuka/SPARK-38931. Authored-by: Yun Tang <myasuka@live.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit abb1df9d190e35a17b693f2b013b092af4f2528a) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 19 April 2022, 11:47:06 UTC
1a35685 [SPARK-37670][SQL] Support predicate pushdown and column pruning for de-duped CTEs This PR adds predicate push-down and column pruning to CTEs that are not inlined as well as fixes a few potential correctness issues: 1) Replace (previously not inlined) CTE refs with Repartition operations at the end of logical plan optimization so that WithCTE is not carried over to physical plan. As a result, we can simplify the logic of physical planning, as well as avoid a correctness issue where the logical link of a physical plan node can point to `WithCTE` and lead to unexpected behaviors in AQE, e.g., class cast exceptions in DPP. 2) Pull (not inlined) CTE defs from subqueries up to the main query level, in order to avoid creating copies of the same CTE def during predicate push-downs and other transformations. 3) Make CTE IDs more deterministic by starting from 0 for each query. Improve de-duped CTEs' performance with predicate pushdown and column pruning; fixes de-duped CTEs' correctness issues. No. Added UTs. Closes #34929 from maryannxue/cte-followup. Lead-authored-by: Maryann Xue <maryann.xue@gmail.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 175e429cca29c2314ee029bf009ed5222c0bffad) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 April 2022, 03:06:48 UTC
35ec300 [SPARK-37643][SQL] when charVarcharAsString is true, for char datatype predicate query should skip rpadding rule ### What changes were proposed in this pull request? After the ApplyCharTypePadding rule was added, when a predicate column has the char data type and the column value is shorter than the defined length, the value is right-padded and the query returns an incorrect result. ### Why are the changes needed? Fix the incorrect query result when the predicate column has the char data type: in this case, when charVarcharAsString is true, we should skip the right-padding rule. ### Does this PR introduce _any_ user-facing change? Before this fix, when querying with a char data type in a predicate, one had to be careful to set charVarcharAsString to true. ### How was this patch tested? Added a new UT. Closes #36187 from fhygh/charpredicatequery. Authored-by: fhygh <283452027@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c1ea8b446d00dd0123a0fad93a3e143933419a76) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 April 2022, 15:17:49 UTC
eeb8b7a [SPARK-38816][ML][DOCS] Fix comment about choice of initial factors in ALS ### What changes were proposed in this pull request? Change a comment in ALS code to match impl. The comment refers to taking the absolute value of a Normal(0,1) value, but it doesn't. ### Why are the changes needed? The docs and impl are inconsistent. The current behavior actually seems fine, desirable, so, change the comments. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests Closes #36228 from srowen/SPARK-38816. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit b2b350b1566b8b45c6dba2f79ccbc2dc4e95816d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 April 2022, 23:55:12 UTC
71e3c8e [SPARK-38927][TESTS] Skip NumPy/Pandas tests in `test_rdd.py` if not available ### What changes were proposed in this pull request? This PR aims to skip NumPy/Pandas tests in `test_rdd.py` if they are not available. ### Why are the changes needed? Currently, the tests that involve NumPy or Pandas are failing because NumPy and Pandas are unavailable in underlying Python. The tests should be skipped instead instead of showing failure. **BEFORE** ``` ====================================================================== ERROR: test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests) ---------------------------------------------------------------------- Traceback (most recent call last): File ".../test_rdd.py", line 723, in test_take_on_jrdd_with_large_rows_should_not_cause_deadlock import numpy as np ModuleNotFoundError: No module named 'numpy' ---------------------------------------------------------------------- Ran 1 test in 1.990s FAILED (errors=1) ``` **AFTER** ``` Finished test(python3.9): pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (1s) ... 1 tests were skipped Tests passed in 1 seconds Skipped tests in pyspark.tests.test_rdd RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock with python3.9: test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests) ... skipped 'NumPy or Pandas not installed' ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #36235 from williamhyun/skipnumpy. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c34140d8d744dc75d130af60080a2a8e25d501b1) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 April 2022, 19:56:17 UTC
8a0a99f [SPARK-38892][SQL][TESTS] Fix a test case schema assertion of ParquetPartitionDiscoverySuite ### What changes were proposed in this pull request? In ParquetPartitionDiscoverySuite, some asserts have no practical significance: `assert(input.schema.sameType(input.schema))` ### Why are the changes needed? Fix these to assert the actual result. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated test suites Closes #36189 from fhygh/assertutfix. Authored-by: fhygh <283452027@qq.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4835946de2ef71b176da5106e9b6c2706e182722) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 April 2022, 11:42:45 UTC
0150413 [SPARK-38905][BUILD][3.2] Upgrade ORC to 1.6.14 ### What changes were proposed in this pull request? This PR aims to upgrade Apache ORC dependency to 1.6.14. ### Why are the changes needed? Apache ORC 1.6.14 is a maintenance release with 5 patches. Apache ORC community recommends to upgrade it. Here is the release note, https://orc.apache.org/news/2022/04/14/ORC-1.6.14/ - [ORC-1121](https://issues.apache.org/jira/browse/ORC-1121) Fix column coversion check bug which causes column filters don’t work - [ORC-1146](https://issues.apache.org/jira/browse/ORC-1146) Float category missing check if the statistic sum is a finite value - [ORC-1147](https://issues.apache.org/jira/browse/ORC-1147) Use isNaN instead of isFinite to determine the contain NaN values ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CI. Closes #36204 from dongjoon-hyun/SPARK-38905. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 April 2022, 00:41:13 UTC
d66350a [SPARK-38677][PYSPARK][3.2] Python MonitorThread should detect deadlock due to blocking I/O ### What changes were proposed in this pull request? This PR cherry-picks https://github.com/apache/spark/pull/36065 to branch-3.2. --- When calling a Python UDF on a DataFrame with large rows, a deadlock can occur involving the following three threads: 1. The Scala task executor thread. During task execution, this is responsible for reading output produced by the Python process. However, in this case the task has finished early, and this thread is no longer reading output produced by the Python process. Instead, it is waiting for the Scala WriterThread to exit so that it can finish the task. 2. The Scala WriterThread. This is trying to send a large row to the Python process, and is waiting for the Python process to read that row. 3. The Python process. This is trying to send a large output to the Scala task executor thread, and is waiting for that thread to read that output, which will never happen. We considered the following three solutions for the deadlock: 1. When the task completes, make the Scala task executor thread close the socket before waiting for the Scala WriterThread to exit. If the WriterThread is blocked on a large write, this would interrupt that write and allow the WriterThread to exit. However, it would prevent Python worker reuse. 2. Modify PythonWorkerFactory to use interruptible I/O. [java.nio.channels.SocketChannel](https://docs.oracle.com/javase/6/docs/api/java/nio/channels/SocketChannel.html#write(java.nio.ByteBuffer)) supports interruptible blocking operations. The goal is that when the WriterThread is interrupted, it should exit even if it was blocked on a large write. However, this would be invasive. 3. Add a watchdog thread similar to the existing PythonRunner.MonitorThread to detect this deadlock and kill the Python worker. The MonitorThread currently kills the Python worker only if the task itself is interrupted. In this case, the task completes normally, so the MonitorThread does not take action. We want the new watchdog thread (WriterMonitorThread) to detect that the task is completed but the Python writer thread has not stopped, indicating a deadlock. This PR implements Option 3. ### Why are the changes needed? To fix a deadlock that can cause PySpark queries to hang. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a test that previously encountered the deadlock and timed out, and now succeeds. Closes #36172 from HyukjinKwon/SPARK-38677-3.2. Authored-by: Ankur Dave <ankurdave@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 April 2022, 07:58:18 UTC
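A very rough standalone sketch of the watchdog idea from Option 3; the class name, callbacks, and timeouts below are invented for illustration and do not reflect the real PythonRunner internals.
```scala
class WriterMonitor(
    isTaskCompleted: () => Boolean,
    isWriterAlive: () => Boolean,
    killWorker: () => Unit,
    checkIntervalMs: Long = 100,
    graceMs: Long = 10000) extends Thread("Writer Monitor Sketch") {
  setDaemon(true)

  override def run(): Unit = {
    // Wait for the task to finish normally.
    while (!isTaskCompleted()) Thread.sleep(checkIntervalMs)
    // Give the writer a grace period to exit on its own.
    var waited = 0L
    while (isWriterAlive() && waited < graceMs) {
      Thread.sleep(checkIntervalMs)
      waited += checkIntervalMs
    }
    // Task done but writer still blocked: assume the deadlock and free it by
    // destroying the Python worker, which unblocks the pending write.
    if (isWriterAlive()) killWorker()
  }
}
```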
d76d29a Revert "[SPARK-38677][PYSPARK] Python MonitorThread should detect deadlock due to blocking I/O" This reverts commit 6bd695e9abc19c57fe34772813c9e61627017349. 13 April 2022, 04:49:51 UTC
6bd695e [SPARK-38677][PYSPARK] Python MonitorThread should detect deadlock due to blocking I/O When calling a Python UDF on a DataFrame with large rows, a deadlock can occur involving the following three threads: 1. The Scala task executor thread. During task execution, this is responsible for reading output produced by the Python process. However, in this case the task has finished early, and this thread is no longer reading output produced by the Python process. Instead, it is waiting for the Scala WriterThread to exit so that it can finish the task. 2. The Scala WriterThread. This is trying to send a large row to the Python process, and is waiting for the Python process to read that row. 3. The Python process. This is trying to send a large output to the Scala task executor thread, and is waiting for that thread to read that output, which will never happen. We considered the following three solutions for the deadlock: 1. When the task completes, make the Scala task executor thread close the socket before waiting for the Scala WriterThread to exit. If the WriterThread is blocked on a large write, this would interrupt that write and allow the WriterThread to exit. However, it would prevent Python worker reuse. 2. Modify PythonWorkerFactory to use interruptible I/O. [java.nio.channels.SocketChannel](https://docs.oracle.com/javase/6/docs/api/java/nio/channels/SocketChannel.html#write(java.nio.ByteBuffer)) supports interruptible blocking operations. The goal is that when the WriterThread is interrupted, it should exit even if it was blocked on a large write. However, this would be invasive. 3. Add a watchdog thread similar to the existing PythonRunner.MonitorThread to detect this deadlock and kill the Python worker. The MonitorThread currently kills the Python worker only if the task itself is interrupted. In this case, the task completes normally, so the MonitorThread does not take action. We want the new watchdog thread (WriterMonitorThread) to detect that the task is completed but the Python writer thread has not stopped, indicating a deadlock. This PR implements Option 3. To fix a deadlock that can cause PySpark queries to hang. No. Added a test that previously encountered the deadlock and timed out, and now succeeds. Closes #36065 from ankurdave/SPARK-38677. Authored-by: Ankur Dave <ankurdave@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 088e05d2518883aa27d0b8144107e45f41dd6b90) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 April 2022, 03:03:26 UTC
04612fd [SPARK-38830][CORE] Warn on corrupted block messages ### What changes were proposed in this pull request? This PR aims to warn when `NettyBlockRpcServer` receives a corrupted block RPC message or is under attack. - `IllegalArgumentException`: When the type is unknown/invalid when decoding. This fails at the Spark layer. - `NegativeArraySizeException`: When the size read is negative. This fails at the Spark layer during buffer creation. - `IndexOutOfBoundsException`: When the data field doesn't match the declared size. This fails at the Netty layer. ### Why are the changes needed? When the RPC messages are corrupted or the servers are under attack, Spark shows `IndexOutOfBoundsException` due to the failure from `Decoder`. Instead of surfacing the raw exception, we had better ignore the message and log a clear warning. ``` java.lang.IndexOutOfBoundsException: readerIndex(5) + length(602416) exceeds writerIndex(172): UnpooledUnsafeDirectByteBuf(ridx: 5, widx: 172, cap: 172/172) at io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1477) at io.netty.buffer.AbstractByteBuf.checkReadableBytes(AbstractByteBuf.java:1463) at io.netty.buffer.UnpooledDirectByteBuf.readBytes(UnpooledDirectByteBuf.java:316) at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:904) at org.apache.spark.network.protocol.Encoders$Strings.decode(Encoders.java:45) at org.apache.spark.network.shuffle.protocol.UploadBlock.decode(UploadBlock.java:112) at org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder.fromByteBuffer(BlockTransferMessage.java:71) at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:53) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:161) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53) ``` ### Does this PR introduce _any_ user-facing change? Yes, but only to clarify the log messages for exceptions such as `IndexOutOfBoundsException`. ### How was this patch tested? Pass the CIs with the newly added test suite. Closes #36116 from dongjoon-hyun/SPARK-38830. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 8ac06474f8cfa8e5619f817aaeea29a77ec8a2a4) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 April 2022, 00:44:59 UTC
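A simplified sketch of the behaviour described in the commit above; `decode` and `logWarning` are stand-ins, not the actual `NettyBlockRpcServer` code: the three decode-time failures are treated as corrupted or malicious input and reported as a warning instead of escaping as an exception.
```scala
import java.nio.ByteBuffer

object RpcDecodeGuardSketch {
  // Stand-in for Spark's logger.
  private def logWarning(msg: String, e: Throwable): Unit = Console.err.println(s"WARN $msg: $e")

  // Wraps an arbitrary decode function and swallows the three failure modes above.
  def receive(message: ByteBuffer)(decode: ByteBuffer => AnyRef): Option[AnyRef] =
    try Some(decode(message)) catch {
      case e @ (_: IllegalArgumentException      // unknown/invalid message type
              | _: NegativeArraySizeException    // negative size field
              | _: IndexOutOfBoundsException) => // data shorter than the declared size
        logWarning("Ignored a corrupted or malicious block RPC message", e)
        None
    }
}
```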
99e6adb [SPARK-38809][SS] Implement option to skip null values in symmetric hash implementation of stream-stream joins ### What changes were proposed in this pull request? In the symmetric hash join state manager, we can receive entries with null values for a key, and that can cause the `removeByValue` and get iterators to fail and run into a NullPointerException. This is possible if the recovered state was written by an old Spark version, is corrupted on disk, or is affected by issues with the iterators. Since we don't have a utility to query this state, we would like to provide a conf option to skip nulls for the symmetric hash implementation in stream-stream joins. ### Why are the changes needed? Without these changes, if we encounter null values for stream-stream joins, the executor task will repeatedly fail with NullPointerException and will terminate the stage and eventually the query as well. This change allows the user to set a config option to continue iterating by skipping null values for the symmetric hash based implementation of stream-stream joins. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests to test the new functionality by adding nulls in between and forcing the iteration/get calls with nulls in the mix, and tested the behavior with the config disabled as well as enabled. Sample output: ``` [info] SymmetricHashJoinStateManagerSuite: 15:07:50.627 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable [info] - StreamingJoinStateManager V1 - all operations (588 milliseconds) [info] - StreamingJoinStateManager V2 - all operations (251 milliseconds) 15:07:52.669 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=3 and endIndex=4. 15:07:52.671 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=3 and endIndex=3. 15:07:52.672 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=1 and endIndex=3. 15:07:52.672 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=1 and endIndex=1. [info] - StreamingJoinStateManager V1 - all operations with nulls (252 milliseconds) 15:07:52.896 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=3 and endIndex=4. 15:07:52.897 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=3 and endIndex=3. 15:07:52.898 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=1 and endIndex=3. 15:07:52.898 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=1 and endIndex=1.
[info] - StreamingJoinStateManager V2 - all operations with nulls (221 milliseconds) 15:07:53.114 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=5 and endIndex=6. 15:07:53.116 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=3 and endIndex=6. 15:07:53.331 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=5 and endIndex=6. 15:07:53.331 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=3 and endIndex=3. [info] - StreamingJoinStateManager V1 - all operations with nulls in middle (435 milliseconds) 15:07:53.549 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=5 and endIndex=6. 15:07:53.551 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=3 and endIndex=6. 15:07:53.785 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=5 and endIndex=6. 15:07:53.785 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager: `keyWithIndexToValue` returns a null value for indices with range from startIndex=3 and endIndex=3. [info] - StreamingJoinStateManager V2 - all operations with nulls in middle (456 milliseconds) [info] - SPARK-35689: StreamingJoinStateManager V1 - printable key of keyWithIndexToValue (390 milliseconds) [info] - SPARK-35689: StreamingJoinStateManager V2 - printable key of keyWithIndexToValue (216 milliseconds) 15:07:54.640 WARN org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManagerSuite: ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.SymmetricHashJoinStateManagerSuite, threads: rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true) ===== [info] Run completed in 5 seconds, 714 milliseconds. [info] Total number of tests run: 8 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 8, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` Closes #36090 from anishshri-db/bfix/SPARK-38809. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 61c489ea7ef51d7d0217f770ec358ed7a7b76b42) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 08 April 2022, 03:37:58 UTC
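A minimal sketch of the config-gated behaviour described in the commit above; the flag name and data layout are simplified stand-ins for the new SQL conf and the state store values: with the option enabled, null values encountered while iterating are skipped rather than triggering a NullPointerException.
```scala
object SkipNullsSketch extends App {
  // Hypothetical flag standing in for the new skip-nulls configuration option.
  val skipNullsForStreamStreamJoins = true

  // Values recovered from the state store; None models a null entry.
  val recovered: Seq[Option[String]] = Seq(Some("a"), None, Some("b"))

  val valueIterator: Iterator[String] =
    if (skipNullsForStreamStreamJoins) {
      recovered.iterator.collect { case Some(v) => v } // skip nulls and keep iterating
    } else {
      recovered.iterator.map(_.getOrElse(
        throw new NullPointerException("null value in keyWithIndexToValue state")))
    }

  println(valueIterator.toList) // List(a, b) when the option is enabled
}
```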
d2afd98 Revert "[MINOR][SQL][SS][DOCS] Add varargs to Dataset.observe(String, ..) with a documentation fix" This reverts commit 72a0562f62eb66388ca3d2b2e2b17928124e8e69. 07 April 2022, 00:44:10 UTC
6f6eb3f [SPARK-38787][SS] Replace found value with non-null element in the remaining list for key and remove remaining null elements from values in keyWithIndexToValue store for stream-stream joins ### What changes were proposed in this pull request? In stream-stream joins, for removing old state (watermark by value), we call the `removeByValue` function with a removal condition. Within the returned iterator, if the matched value is at a non-last index and the values at the end are null, we currently do not remove and swap the matched value. With this change, we find the first non-null value from the end, swap the value at the current index with it, remove all elements from index + 1 to the end, and then drop the last element as before. ### Why are the changes needed? This change fixes a bug where we were not replacing found/matching values for `removeByValue` when encountering nulls in the symmetric hash join code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test for this change with nulls added. Here is a sample output: ``` Executing tests from //sql/core:org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManagerSuite-hive-2.3__hadoop-3.2 ----------------------------------------------------------------------------- 2022-04-01 21:11:59,641 INFO CodeGenerator - Code generated in 225.884757 ms 2022-04-01 21:11:59,662 INFO CodeGenerator - Code generated in 10.870786 ms Run starting. Expected test count is: 4 … ===== TEST OUTPUT FOR o.a.s.sql.execution.streaming.state.SymmetricHashJoinStateManagerSuite: 'StreamingJoinStateManager V2 - all operations with nulls' ===== 2022-04-01 21:12:03,487 INFO StateStore - State Store maintenance task started 2022-04-01 21:12:03,508 INFO CheckpointFileManager - Writing atomically to /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyToNumValues/_metadata/schema using temp file /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyToNumValues/_metadata/.schema.9bcc39c9-721e-4ee0-b369-fb4f516c4fd6.tmp 2022-04-01 21:12:03,524 INFO CheckpointFileManager - Renamed temp file /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyToNumValues/_metadata/.schema.9bcc39c9-721e-4ee0-b369-fb4f516c4fd6.tmp to /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyToNumValues/_metadata/schema 2022-04-01 21:12:03,525 INFO StateStore - Retrieved reference to StateStoreCoordinator: org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef374ccb9 2022-04-01 21:12:03,525 INFO StateStore - Reported that the loaded instance StateStoreProviderId(StateStoreId(/tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292,0,0,left-keyToNumValues),47925997-9891-4025-a36a-3e18bc758b50) is active 2022-04-01 21:12:03,525 INFO HDFSBackedStateStoreProvider - Retrieved version 0 of HDFSStateStoreProvider[id = (op=0,part=0),dir = /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyToNumValues] for update 2022-04-01 21:12:03,525 INFO SymmetricHashJoinStateManager$KeyToNumValuesStore - Loaded store StateStoreId(/tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292,0,0,left-keyToNumValues) 2022-04-01 21:12:03,541 INFO CheckpointFileManager - Writing atomically to /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyWithIndexToValue/_metadata/schema using temp file /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyWithIndexToValue/_metadata/.schema.fcde2229-a4fa-409b-b3eb-751572f06c08.tmp 2022-04-01 21:12:03,556 INFO CheckpointFileManager - Renamed temp file
/tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyWithIndexToValue/_metadata/.schema.fcde2229-a4fa-409b-b3eb-751572f06c08.tmp to /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyWithIndexToValue/_metadata/schema 2022-04-01 21:12:03,558 INFO StateStore - Retrieved reference to StateStoreCoordinator: org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef1ea930eb 2022-04-01 21:12:03,559 INFO StateStore - Reported that the loaded instance StateStoreProviderId(StateStoreId(/tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292,0,0,left-keyWithIndexToValue),47925997-9891-4025-a36a-3e18bc758b50) is active 2022-04-01 21:12:03,559 INFO HDFSBackedStateStoreProvider - Retrieved version 0 of HDFSStateStoreProvider[id = (op=0,part=0),dir = /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyWithIndexToValue] for update 2022-04-01 21:12:03,559 INFO SymmetricHashJoinStateManager$KeyWithIndexToValueStore - Loaded store StateStoreId(/tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292,0,0,left-keyWithIndexToValue) 2022-04-01 21:12:03,564 INFO CheckpointFileManager - Writing atomically to /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyWithIndexToValue/1.delta using temp file /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyWithIndexToValue/.1.delta.86db3ac9-aa68-4a6b-9729-df93dc4b8a45.tmp 2022-04-01 21:12:03,568 INFO CheckpointFileManager - Writing atomically to /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyToNumValues/1.delta using temp file /tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyToNumValues/.1.delta.9673bc1b-2bbe-412d-a0af-69f237cde31e.tmp 2022-04-01 21:12:03,572 WARN SymmetricHashJoinStateManager - `keyWithIndexToValue` returns a null value for index 4 at current key [false,40,10.0]. 2022-04-01 21:12:03,574 WARN SymmetricHashJoinStateManager - `keyWithIndexToValue` returns a null value for index 3 at current key [false,40,10.0]. 2022-04-01 21:12:03,576 WARN SymmetricHashJoinStateManager - `keyWithIndexToValue` returns a null value for index 3 at current key [false,60,10.0]. 2022-04-01 21:12:03,576 WARN SymmetricHashJoinStateManager - `keyWithIndexToValue` returns a null value for index 1 at current key [false,40,10.0]. 2022-04-01 21:12:03,577 INFO SymmetricHashJoinStateManager$KeyToNumValuesStore - Aborted store StateStoreId(/tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292,0,0,left-keyToNumValues) 2022-04-01 21:12:03,577 INFO HDFSBackedStateStoreProvider - Aborted version 1 for HDFSStateStore[id=(op=0,part=0),dir=/tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyToNumValues] 2022-04-01 21:12:03,577 INFO SymmetricHashJoinStateManager$KeyWithIndexToValueStore - Aborted store StateStoreId(/tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292,0,0,left-keyWithIndexToValue) 2022-04-01 21:12:03,577 INFO HDFSBackedStateStoreProvider - Aborted version 1 for HDFSStateStore[id=(op=0,part=0),dir=/tmp/spark-d94b9f11-e04c-4871-aeea-1d0b5c62e292/0/0/left-keyWithIndexToValue] 2022-04-01 21:12:03,580 INFO StateStore - StateStore stopped 2022-04-01 21:12:03,580 INFO SymmetricHashJoinStateManagerSuite - ===== FINISHED o.a.s.sql.execution.streaming.state.SymmetricHashJoinStateManagerSuite: 'StreamingJoinStateManager V2 - all operations with nulls' ===== … 2022-04-01 21:12:04,205 INFO StateStore - StateStore stopped Run completed in 5 seconds, 908 milliseconds. Total number of tests run: 4 Suites: completed 1, aborted 0 Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 All tests passed. 
2022-04-01 21:12:04,605 INFO ShutdownHookManager - Shutdown hook called 2022-04-01 21:12:04,605 INFO ShutdownHookManager - Deleting directory /tmp/spark-37347802-bee5-4e7f-bffe-acb13eda1c5c 2022-04-01 21:12:04,608 INFO ShutdownHookManager - Deleting directory /tmp/spark-9e79a2e1-cec7-4fbf-804a-92e63913f516 ``` Closes #36073 from anishshri-db/bfix/SPARK-38787. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 6d9bfb675f3e58c6e7d9facd8cf3f22069c4cc48) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 06 April 2022, 21:07:38 UTC
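A plain-Scala sketch of the repair logic described in the commit above (not the actual `SymmetricHashJoinStateManager` code): when removing the value at a given index, find the last non-null element, move it into the removed slot, and truncate the trailing portion.
```scala
// values models the per-key value list in the keyWithIndexToValue store; the
// trailing nulls are the problematic entries described above.
def removeAt(values: Array[String], index: Int): Array[String] = {
  var last = values.length - 1
  while (last > index && values(last) == null) last -= 1 // first non-null value from the end
  if (last > index) values(index) = values(last)         // swap it into the removed slot
  values.take(last)                                       // drop the moved element and the trailing nulls
}

println(removeAt(Array("a", "b", "c", "d", null, null), 1).mkString(", ")) // a, d, c
```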
72a0562 [MINOR][SQL][SS][DOCS] Add varargs to Dataset.observe(String, ..) with a documentation fix ### What changes were proposed in this pull request? This PR proposes two minor changes: - Fixes the example at `Dataset.observe(String, ...)` - Adds `varargs` to be consistent with another overloaded version: `Dataset.observe(Observation, ..)` ### Why are the changes needed? To provide a correct example, support Java APIs properly via `varargs`, and keep the API consistent. ### Does this PR introduce _any_ user-facing change? Yes, the example is fixed in the documentation. Additionally, Java users should be able to use `Dataset.observe(String, ..)` thanks to `varargs`. ### How was this patch tested? Manually tested. CI should verify the changes too. Closes #36084 from HyukjinKwon/minor-docs. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit fb3f380b3834ca24947a82cb8d87efeae6487664) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 06 April 2022, 08:26:40 UTC
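A small illustration of what the `varargs` annotation buys; the class here is a hypothetical stand-in and the expressions are simplified to strings (the real `observe` takes `Column` arguments): `scala.annotation.varargs` makes the compiler emit a Java-friendly `observe(String, String...)` overload, matching the `Observation`-based variant.
```scala
import scala.annotation.varargs

class DatasetLike {
  // Without @varargs, Java callers would have to construct a Scala Seq explicitly.
  @varargs
  def observe(name: String, expr: String, exprs: String*): DatasetLike = {
    println(s"observing $name: ${(expr +: exprs).mkString(", ")}")
    this
  }
}

// Callable from Scala as usual, and from Java as observe("m", "count(*)", "sum(v)").
new DatasetLike().observe("m", "count(*)", "sum(v)")
```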
255a789 [MINOR][DOCS] Remove PySpark doc build warnings This PR fixes various documentation build warnings in the PySpark documentation, to render the docs better. Yes, it changes the documentation to be prettier. Pretty minor though. I manually tested it by building the PySpark documentation. Closes #36057 from HyukjinKwon/remove-pyspark-build-warnings. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a252c109b32bd3bbb269d6790f0c35e0a4ae705f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 April 2022, 06:17:45 UTC
44e2093 Revert "[SPARK-38754][SQL][TEST][3.1] Using EquivalentExpressions getEquivalentExprs function instead of getExprState at SubexpressionEliminationSuite" This reverts commit 02a055a42de5597cd42c1c0d4470f0e769571dc3. 03 April 2022, 17:59:24 UTC
6e55ff4 [SPARK-38446][CORE] Fix deadlock between ExecutorClassLoader and FileDownloadCallback caused by Log4j ### What changes were proposed in this pull request? When `log4j.ignoreTCL/log4j2.ignoreTCL` is false, which is the default, Log4j uses the context ClassLoader of the current Thread; see `org.apache.logging.log4j.util.LoaderUtil.loadClass`. When ExecutorClassLoader tries to load a class remotely through the FileDownload and an error occurs, we log it at debug level, and `log4j...LoaderUtil` is then blocked by the classloading lock that ExecutorClassLoader has acquired. Fortunately, this only happens when the ThresholdFilter's level is `debug`. Alternatively, we could set `log4j.ignoreTCL/log4j2.ignoreTCL` to true, but it is unclear what other side effects that would have. So in this PR, I simply remove the debug log which causes this deadlock. ### Why are the changes needed? Fix the deadlock. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? https://github.com/apache/incubator-kyuubi/pull/2046#discussion_r821414439, with a UT in the Kyuubi project; resolved (https://github.com/apache/incubator-kyuubi/actions/runs/1950222737) ### Additional Resources [ut.jstack.txt](https://github.com/apache/spark/files/8206457/ut.jstack.txt) Closes #35765 from yaooqinn/SPARK-38446. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit aef674564ff12e78bd2f30846e3dcb69988249ae) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 03 April 2022, 17:30:58 UTC
02a055a [SPARK-38754][SQL][TEST][3.1] Using EquivalentExpressions getEquivalentExprs function instead of getExprState at SubexpressionEliminationSuite ### What changes were proposed in this pull request? This PR uses the EquivalentExpressions getEquivalentExprs function instead of getExprState in SubexpressionEliminationSuite, and removes the cpus parameter. ### Why are the changes needed? Fixes a build error. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI tests Closes #36033 from monkeyboy123/SPARK-38754. Authored-by: Dereck Li <monkeyboy.ljh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f677272d08de030ff9c4045ceec062168105b75c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 April 2022, 14:54:38 UTC
7e542f1 [SPARK-38684][SS] Fix correctness issue on stream-stream outer join with RocksDB state store provider (Credit to alex-balikov for the inspiration of the root cause observation, and anishshri-db for looking into the issue together.) This PR fixes the correctness issue on stream-stream outer join with RocksDB state store provider, which can occur under certain conditions, like below: * stream-stream time interval outer join * left outer join has an issue on left side, right outer join has an issue on right side, full outer join has an issue on both sides * At batch N, produce non-late row(s) on the problematic side * At the same batch (batch N), some row(s) on the problematic side are evicted by the condition of watermark The root cause is the same as [SPARK-38320](https://issues.apache.org/jira/browse/SPARK-38320) - weak read consistency on iterator, especially with RocksDB state store provider. (Quoting from SPARK-38320: The problem is due to the StateStore.iterator not reflecting StateStore changes made after its creation.) More specifically, if updates are performed while processing input rows and they change the number of values for a grouping key, the update is not seen in SymmetricHashJoinStateManager.removeByValueCondition, and the method does the eviction with the number of values out of sync. Making it worse, if the method performs the eviction and updates the number of values for the grouping key, it "overwrites" the number of values, effectively dropping all rows inserted in the same batch. The code blocks below are references for understanding the details of the issue. https://github.com/apache/spark/blob/ca7200b0008dc6101a252020e6c34ef7b72d81d6/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L327-L339 https://github.com/apache/spark/blob/ca7200b0008dc6101a252020e6c34ef7b72d81d6/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L619-L627 https://github.com/apache/spark/blob/ca7200b0008dc6101a252020e6c34ef7b72d81d6/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala#L195-L201 https://github.com/apache/spark/blob/ca7200b0008dc6101a252020e6c34ef7b72d81d6/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala#L208-L223 This PR fixes the outer iterators to be lazily evaluated, ensuring all updates from processing input rows are reflected "before" the outer iterators are initialized. The bug is described in the section above. No. New UT added. Closes #36002 from HeartSaVioR/SPARK-38684. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 01 April 2022, 10:08:05 UTC
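A toy model of the idea behind the fix; the snapshot-based store below is a stand-in for the weak read consistency of `StateStore.iterator`, not Spark code: an iterator created eagerly misses updates made afterwards, while deferring its creation until after all input-row updates keeps the eviction pass consistent.
```scala
import scala.collection.mutable.ArrayBuffer

// Models an iterator that does not reflect changes made after its creation.
class SnapshotStoreSketch {
  private val values = ArrayBuffer(1, 2, 3)
  def update(v: Int): Unit = values += v
  def iterator: Iterator[Int] = values.toList.iterator // fixed snapshot at creation time
}

val store = new SnapshotStoreSketch
val eagerIter = store.iterator     // created before the batch's updates
store.update(4)                    // an update made while processing input rows
lazy val lateIter = store.iterator // created only when first consumed, i.e. after the updates

println(eagerIter.size) // 3 -- stale view, the source of the correctness issue
println(lateIter.size)  // 4 -- late evaluation sees every update
```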
6a4b2c2 [SPARK-38333][SQL][3.2][FOLLOWUP] fix compilation error 01 April 2022, 10:08:05 UTC
e78cca9 [SPARK-38333][SQL] PlanExpression expression should skip addExprTree function in Executor This is the master branch PR: [SPARK-38333](https://github.com/apache/spark/pull/35662). Bug fix; it is a potential issue. No UT. Closes #36012 from monkeyboy123/spark-38333. Authored-by: Dereck Li <monkeyboy.ljh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a40acd4392a8611062763ce6ec7bc853d401c646) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 March 2022, 13:35:40 UTC
01600ae [SPARK-38652][K8S] `uploadFileUri` should preserve file scheme ### What changes were proposed in this pull request? This PR replaces `new Path(fileUri.getPath)` with `new Path(fileUri)`. By using the `Path` class constructor with a URI parameter, we can preserve the file scheme. ### Why are the changes needed? If we use the `Path` class constructor with a `String` parameter, it loses the file scheme information. Although the original code works so far, it fails at Apache Hadoop 3.3.2 and breaks the dependency upload feature which is covered by K8s Minikube integration tests. ```scala test("uploadFileUri") { val fileUri = org.apache.spark.util.Utils.resolveURI("/tmp/1.txt") assert(new Path(fileUri).toString == "file:/private/tmp/1.txt") assert(new Path(fileUri.getPath).toString == "/private/tmp/1.txt") } ``` ### Does this PR introduce _any_ user-facing change? No, this will prevent a regression at Apache Spark 3.3.0 instead. ### How was this patch tested? Pass the CIs. In addition, this PR and #36009 will recover K8s IT `DepsTestsSuite`. Closes #36010 from dongjoon-hyun/SPARK-38652. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit cab8aa1c4fe66c4cb1b69112094a203a04758f76) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 March 2022, 15:27:06 UTC
d81ad57 [SPARK-38655][SQL] `OffsetWindowFunctionFrameBase` cannot find the offset row whose input is not-null ### What changes were proposed in this pull request? ``` select x, nth_value(x, 5) IGNORE NULLS over (order by x rows between unbounded preceding and current row) from (select explode(sequence(1, 3)) x) ``` The SQL output: ``` null null 3 ``` But it should return: ``` null null null ``` ### Why are the changes needed? Fix a bug where UnboundedPrecedingOffsetWindowFunctionFrame does not work correctly. ### Does this PR introduce _any_ user-facing change? 'Yes'. The output will be correct after fixing this bug. ### How was this patch tested? New tests. Closes #35971 from beliefer/SPARK-38655. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 March 2022, 06:23:28 UTC
7842621 [SPARK-38528][SQL][3.2] Eagerly iterate over aggregate sequence when building project list in `ExtractGenerator` Backport of #35837. ### What changes were proposed in this pull request? When building the project list from an aggregate sequence in `ExtractGenerator`, convert the aggregate sequence to an `IndexedSeq` before performing the flatMap operation. ### Why are the changes needed? This query fails with a `NullPointerException`: ``` val df = Seq(1, 2, 3).toDF("v") df.select(Stream(explode(array(min($"v"), max($"v"))), sum($"v")): _*).collect ``` If you change `Stream` to `Seq`, then it succeeds. `ExtractGenerator` uses a flatMap operation over `aggList` for two purposes: - To produce a new aggregate list - to update `projectExprs` (which is initialized as an array of nulls). When `aggList` is a `Stream`, the flatMap operation evaluates lazily, so all entries in `projectExprs` after the first will still be null when the rule completes. Changing `aggList` to an `IndexedSeq` forces the flatMap to evaluate eagerly. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit test Closes #35851 from bersprockets/generator_aggregate_issue_32. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 March 2022, 00:31:49 UTC
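A minimal reproduction of the laziness pitfall described above, outside of Spark: a side-effecting `flatMap` over a `Stream` only fills the first slot of the array, while an `IndexedSeq` fills them all.
```scala
def fillViaFlatMap(aggList: Seq[Int]): Array[String] = {
  val projectExprs = new Array[String](aggList.length) // starts as an array of nulls, like the rule
  aggList.zipWithIndex.flatMap { case (v, i) =>
    projectExprs(i) = s"expr$i" // side effect performed inside flatMap
    Seq(v)
  }
  projectExprs
}

println(fillViaFlatMap(Stream(1, 2, 3)).mkString(", "))     // expr0, null, null -- lazy tail never forced
println(fillViaFlatMap(IndexedSeq(1, 2, 3)).mkString(", ")) // expr0, expr1, expr2
```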
8621914 [SPARK-38570][SQL][3.2] Incorrect DynamicPartitionPruning caused by Literal This is a backport of #35878 to branch 3.2. ### What changes were proposed in this pull request? The return value of Literal.references is an empty AttributeSet, so Literal is mistaken for a partition column. For example, the SQL in the test case will generate the following physical plan when adaptive query execution is disabled: ```tex *(4) Project [store_id#5281, date_id#5283, state_province#5292] +- *(4) BroadcastHashJoin [store_id#5281], [store_id#5291], Inner, BuildRight, false :- Union : :- *(1) Project [4 AS store_id#5281, date_id#5283] : : +- *(1) Filter ((isnotnull(date_id#5283) AND (date_id#5283 >= 1300)) AND dynamicpruningexpression(4 IN dynamicpruning#5300)) : : : +- ReusedSubquery SubqueryBroadcast dynamicpruning#5300, 0, [store_id#5291], [id=#336] : : +- *(1) ColumnarToRow : : +- FileScan parquet default.fact_sk[date_id#5283,store_id#5286] Batched: true, DataFilters: [isnotnull(date_id#5283), (date_id#5283 >= 1300)], Format: Parquet, Location: CatalogFileIndex(1 paths)[file:/Users/dongdongzhang/code/study/spark/spark-warehouse/org.apache.s..., PartitionFilters: [dynamicpruningexpression(4 IN dynamicpruning#5300)], PushedFilters: [IsNotNull(date_id), GreaterThanOrEqual(date_id,1300)], ReadSchema: struct<date_id:int> : : +- SubqueryBroadcast dynamicpruning#5300, 0, [store_id#5291], [id=#336] : : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#335] : : +- *(1) Project [store_id#5291, state_province#5292] : : +- *(1) Filter (((isnotnull(country#5293) AND (country#5293 = US)) AND ((store_id#5291 <=> 4) OR (store_id#5291 <=> 5))) AND isnotnull(store_id#5291)) : : +- *(1) ColumnarToRow : : +- FileScan parquet default.dim_store[store_id#5291,state_province#5292,country#5293] Batched: true, DataFilters: [isnotnull(country#5293), (country#5293 = US), ((store_id#5291 <=> 4) OR (store_id#5291 <=> 5)), ..., Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/dongdongzhang/code/study/spark/spark-warehouse/org.apache...., PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US), Or(EqualNullSafe(store_id,4),EqualNullSafe(store_id,5))..., ReadSchema: struct<store_id:int,state_province:string,country:string> : +- *(2) Project [5 AS store_id#5282, date_id#5287] : +- *(2) Filter ((isnotnull(date_id#5287) AND (date_id#5287 <= 1000)) AND dynamicpruningexpression(5 IN dynamicpruning#5300)) : : +- ReusedSubquery SubqueryBroadcast dynamicpruning#5300, 0, [store_id#5291], [id=#336] : +- *(2) ColumnarToRow : +- FileScan parquet default.fact_stats[date_id#5287,store_id#5290] Batched: true, DataFilters: [isnotnull(date_id#5287), (date_id#5287 <= 1000)], Format: Parquet, Location: CatalogFileIndex(1 paths)[file:/Users/dongdongzhang/code/study/spark/spark-warehouse/org.apache.s..., PartitionFilters: [dynamicpruningexpression(5 IN dynamicpruning#5300)], PushedFilters: [IsNotNull(date_id), LessThanOrEqual(date_id,1000)], ReadSchema: struct<date_id:int> : +- ReusedSubquery SubqueryBroadcast dynamicpruning#5300, 0, [store_id#5291], [id=#336] +- ReusedExchange [store_id#5291, state_province#5292], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#335] ``` After this PR: ```tex *(4) Project [store_id#5281, date_id#5283, state_province#5292] +- *(4) BroadcastHashJoin [store_id#5281], [store_id#5291], Inner, BuildRight, false :- Union : :- *(1) Project [4 AS store_id#5281, date_id#5283] : : +- *(1) Filter
(isnotnull(date_id#5283) AND (date_id#5283 >= 1300)) : : +- *(1) ColumnarToRow : : +- FileScan parquet default.fact_sk[date_id#5283,store_id#5286] Batched: true, DataFilters: [isnotnull(date_id#5283), (date_id#5283 >= 1300)], Format: Parquet, Location: CatalogFileIndex(1 paths)[file:/Users/dongdongzhang/code/study/spark/spark-warehouse/org.apache.s..., PartitionFilters: [], PushedFilters: [IsNotNull(date_id), GreaterThanOrEqual(date_id,1300)], ReadSchema: struct<date_id:int> : +- *(2) Project [5 AS store_id#5282, date_id#5287] : +- *(2) Filter (isnotnull(date_id#5287) AND (date_id#5287 <= 1000)) : +- *(2) ColumnarToRow : +- FileScan parquet default.fact_stats[date_id#5287,store_id#5290] Batched: true, DataFilters: [isnotnull(date_id#5287), (date_id#5287 <= 1000)], Format: Parquet, Location: CatalogFileIndex(1 paths)[file:/Users/dongdongzhang/code/study/spark/spark-warehouse/org.apache.s..., PartitionFilters: [], PushedFilters: [IsNotNull(date_id), LessThanOrEqual(date_id,1000)], ReadSchema: struct<date_id:int> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#326] +- *(3) Project [store_id#5291, state_province#5292] +- *(3) Filter (((isnotnull(country#5293) AND (country#5293 = US)) AND ((store_id#5291 <=> 4) OR (store_id#5291 <=> 5))) AND isnotnull(store_id#5291)) +- *(3) ColumnarToRow +- FileScan parquet default.dim_store[store_id#5291,state_province#5292,country#5293] Batched: true, DataFilters: [isnotnull(country#5293), (country#5293 = US), ((store_id#5291 <=> 4) OR (store_id#5291 <=> 5)), ..., Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/dongdongzhang/code/study/spark/spark-warehouse/org.apache...., PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US), Or(EqualNullSafe(store_id,4),EqualNullSafe(store_id,5))..., ReadSchema: struct<store_id:int,state_province:string,country:string> ``` ### Why are the changes needed? Execution performance improvement ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test Closes #35967 from mcdull-zhang/spark_38570_3.2. Authored-by: mcdull-zhang <work4dong@163.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> 26 March 2022, 04:48:08 UTC
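A tiny model of the root cause called out above, simplified to plain string sets instead of Catalyst expressions and `AttributeSet`s (so not the actual fix): because `Literal.references` is empty, a plain subset check against the partition columns is vacuously true, and a literal "passes" a test meant for partition-column references.
```scala
// Hypothetical simplification: an expression's references modelled as column-name sets.
val partitionColumns = Set("store_id")

def looksLikePartitionFilter(references: Set[String]): Boolean =
  references.subsetOf(partitionColumns)

println(looksLikePartitionFilter(Set.empty))       // true  -- why a Literal is mistaken for a partition column
println(looksLikePartitionFilter(Set("store_id"))) // true  -- a genuine partition-column reference
println(looksLikePartitionFilter(Set("date_id")))  // false
```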
271b338 [SPARK-38631][CORE] Uses Java-based implementation for un-tarring at Utils.unpack ### What changes were proposed in this pull request? This PR proposes to use `FileUtil.unTarUsingJava`, a Java implementation for un-tarring `.tar` files. `unTarUsingJava` is not public but it exists in all Hadoop versions from 2.1+, see HADOOP-9264. The security issue reproduction requires a non-Windows platform, and a non-gzipped TAR archive file name (contents don't matter). ### Why are the changes needed? There is a risk of arbitrary shell command injection via `Utils.unpack` when the filename is controlled by a malicious user. This is due to an issue in Hadoop's `unTar`, which does not properly escape the filename before passing it to a shell command: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java#L904 ### Does this PR introduce _any_ user-facing change? Yes, it prevents a security issue that, previously, allowed users to execute arbitrary shell commands. ### How was this patch tested? Manually tested locally, and existing test cases should cover it. Closes #35946 from HyukjinKwon/SPARK-38631. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 057c051285ec32c665fb458d0670c1c16ba536b2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 March 2022, 03:12:57 UTC
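This is not the approach the PR takes (it switches `Utils.unpack` to Hadoop's Java-based `FileUtil.unTarUsingJava`), but a hedged illustration of the same principle, assuming Apache Commons Compress is available: un-tarring purely in the JVM means no shell is spawned, so the file name never reaches a command line.
```scala
import java.io.{BufferedInputStream, File, FileInputStream, FileOutputStream}
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.utils.IOUtils

// Illustrative only: extracts a plain (non-gzipped) .tar archive without invoking a shell.
def unTarWithoutShell(archive: File, destDir: File): Unit = {
  val in = new TarArchiveInputStream(new BufferedInputStream(new FileInputStream(archive)))
  try {
    val destPath = destDir.getCanonicalPath
    var entry = in.getNextTarEntry
    while (entry != null) {
      val target = new File(destDir, entry.getName)
      // Simplified guard against path traversal ("zip slip") in entry names.
      require(target.getCanonicalPath.startsWith(destPath), s"Bad entry: ${entry.getName}")
      if (entry.isDirectory) target.mkdirs()
      else {
        target.getParentFile.mkdirs()
        val out = new FileOutputStream(target)
        try IOUtils.copy(in, out) finally out.close()
      }
      entry = in.getNextTarEntry
    }
  } finally in.close()
}
```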