https://github.com/apache/spark

c34baeb [SPARK-47719][SQL] Change spark.sql.legacy.timeParserPolicy default to CORRECTED ### What changes were proposed in this pull request? We changed the time parser policy in Spark 3.0.0. The config has since defaulted to raise an exception if there is a potential conflict between the legacy and the new policy. Spark 4.0.0 is a good time to default to the new policy. ### Why are the changes needed? Move the product forward and retire legacy behavior over time. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Run existing unit tests and verify changes. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45859 from srielau/SPARK-47719-parser-policy-default-to-corrected. Lead-authored-by: Serge Rielau <serge@rielau.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 05 April 2024, 18:35:38 UTC
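A minimal PySpark sketch of what the default change means in practice; the config name and values come from the commit title, while the query itself is only an illustration:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With the new default (CORRECTED), datetime patterns are parsed with the
# Proleptic Gregorian calendar introduced in Spark 3.0.
spark.sql("SELECT to_timestamp('2024-04-05 18:35:38', 'yyyy-MM-dd HH:mm:ss')").show()

# Workloads that still rely on the pre-3.0 SimpleDateFormat semantics can opt back in:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```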
6bd0ccf [SPARK-47511][SQL][FOLLOWUP] Rename the config REPLACE_NULLIF_USING_WITH_EXPR to be more general ### What changes were proposed in this pull request? This is a follow-up of - #45649 `With` is not only used by `NullIf`, but also `Between`. This PR renames the config `REPLACE_NULLIF_USING_WITH_EXPR` to be more general, and uses it to control `Between` as well. ### Why are the changes needed? To have a single conf that controls all usages of `With`. ### Does this PR introduce _any_ user-facing change? No, not released yet ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45871 from cloud-fan/with-conf. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2024, 14:47:39 UTC
12d0367 [SPARK-47724][PYTHON][TESTS][FOLLOW-UP] Make testing script inherit SPARK_CONNECT_TESTING_REMOTE env ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/45868 that proposes to make the testing script inherit the SPARK_CONNECT_TESTING_REMOTE environment variable. ### Why are the changes needed? So the testing script can set `SPARK_CONNECT_TESTING_REMOTE` and make the environment variable effective. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested at https://github.com/apache/spark/pull/45870 ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45886 from HyukjinKwon/SPARK-47724-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2024, 14:46:08 UTC
97e63ff [SPARK-47735][PYTHON][TESTS] Make pyspark.testing.connectutils compatible with pyspark-connect ### What changes were proposed in this pull request? This PR proposes to make `pyspark.testing.connectutils` compatible with `pyspark-connect`. ### Why are the changes needed? This is the base work to set up the CI for pyspark-connect. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Tested in https://github.com/apache/spark/pull/45870. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45887 from HyukjinKwon/SPARK-47735. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2024, 14:41:28 UTC
aeb082e [SPARK-47081][CONNECT][TESTS][FOLLOW-UP] Skip the flaky doctests for now ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/45150 that skips flaky doctests. ### Why are the changes needed? In order to make the build stable. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? CI in this PR should verify it. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45889 from HyukjinKwon/SPARK-47081-followup2. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 April 2024, 05:22:39 UTC
d5620cb [SPARK-47289][SQL] Allow extensions to log extended information in explain plan ### What changes were proposed in this pull request? This addresses SPARK-47289 and adds a new section in the explain plan where Spark extensions can add additional information for end users. The section is included in the output only if the relevant configuration is enabled and if the extension actually adds some new information. ### Why are the changes needed? Extensions to Spark can add their own planning rules and sometimes may need to add additional information about how the plan was generated. This is useful for end users in determining if the extension's rules are working as intended. ### Does this PR introduce _any_ user-facing change? This PR increases the information logged in the UI in the query plan. The attached screenshot shows output from an extension which provides some of its own operations but does not support some operations. <img width="703" alt="Screenshot 2024-03-11 at 10 23 36 AM" src="https://github.com/apache/spark/assets/6529136/88594772-f85e-4fd4-8eac-33017ef0c1c6"> ### How was this patch tested? Unit test and manual testing ### Was this patch authored or co-authored using generative AI tooling? No Closes #45488 from parthchandra/explain_extensions. Authored-by: Parth Chandra <parthc@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 April 2024, 02:35:11 UTC
0107435 [SPARK-47734][PYTHON][TESTS] Fix flaky DataFrame.writeStream doctest by stopping streaming query ### What changes were proposed in this pull request? This PR deflakes the `pyspark.sql.dataframe.DataFrame.writeStream` doctest. PR https://github.com/apache/spark/pull/45298 aimed to fix that test but misdiagnosed the root issue. The problem is not that concurrent tests were colliding on a temporary directory. Rather, the issue is specific to the `DataFrame.writeStream` test's logic: that test starts a streaming query that writes files to the temporary directory, then exits the temp directory context manager without first stopping the streaming query. That creates a race condition where the context manager might be deleting the directory while the streaming query is writing new files into it, leading to the following type of error during cleanup: ``` File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in pyspark.sql.dataframe.DataFrame.writeStream Failed example: with tempfile.TemporaryDirectory() as d: # Create a table with Rate source. df.writeStream.toTable( "my_table", checkpointLocation=d) Exception raised: Traceback (most recent call last): File "/usr/lib/python3.11/doctest.py", line 1353, in __run exec(compile(example.source, filename, "single", File "<doctest pyspark.sql.dataframe.DataFrame.writeStream[3]>", line 1, in <module> with tempfile.TemporaryDirectory() as d: File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__ self.cleanup() File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors) File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree _rmtree(name, onerror=onerror) File "/usr/lib/python3.11/shutil.py", line 738, in rmtree onerror(os.rmdir, path, sys.exc_info()) File "/usr/lib/python3.11/shutil.py", line 736, in rmtree os.rmdir(path, dir_fd=dir_fd) OSError: [Errno 39] Directory not empty: '/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq' ``` In this PR, I update the doctest to properly stop the streaming query. ### Why are the changes needed? Fix flaky test. ### Does this PR introduce _any_ user-facing change? No, test-only. Small user-facing doc change, but one that is consistent with other doctest examples. ### How was this patch tested? Manually ran updated test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45885 from JoshRosen/fix-flaky-writestream-doctest. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 April 2024, 02:14:42 UTC
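A condensed sketch of the corrected doctest pattern described above: the streaming query is stopped before the temporary checkpoint directory is cleaned up. The rate source is used here only to make the example self-contained:
```
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.readStream.format("rate").load()

with tempfile.TemporaryDirectory() as d:
    # Start a streaming query that checkpoints into the temp directory.
    query = df.writeStream.toTable("my_table", checkpointLocation=d)
    # Stop it before the context manager deletes the directory, avoiding the race.
    query.stop()
```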
b9ca91d [SPARK-47712][CONNECT] Allow connect plugins to create and process Datasets ### What changes were proposed in this pull request? This PR adds new versions of `SparkSession.createDataset` and `SparkSession.createDataFrame` that take an `Array[Byte]` as input. The older versions that take a `protobuf.Any` are deprecated. This PR also adds new versions of `SparkConnectPlanner.transformRelation` and `SparkConnectPlanner.transformExpression` that take an `Array[Byte]`. ### Why are the changes needed? Without these changes it's difficult to create plugins for Spark Connect. The methods above used to take a protobuf class that is shaded as input, meaning that that plugins had to shade these classes in the exact same way. Now they can just serialize the protobuf object to bytes and pass that in instead. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests were added ### Was this patch authored or co-authored using generative AI tooling? No Closes #45850 from tomvanbussel/SPARK-47712. Authored-by: Tom van Bussel <tom.vanbussel@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 April 2024, 01:42:43 UTC
404d58c [SPARK-47081][CONNECT][FOLLOW-UP] Add the `shell` module into PyPI package ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/45150 that adds the new `shell` module into PyPI package. ### Why are the changes needed? So PyPI package contains `shell` module. ### Does this PR introduce _any_ user-facing change? No, the main change has not been released yet. ### How was this patch tested? The test case will be added at https://github.com/apache/spark/pull/45870. It was found out during working on that PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45882 from HyukjinKwon/SPARK-47081-followup. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 April 2024, 01:02:51 UTC
fb96b1a [SPARK-47723][CORE][TESTS] Introduce a tool that automatically sorts the enumeration fields in `LogEntry` alphabetically ### What changes were proposed in this pull request? The PR aims to introduce a tool that automatically sorts the enumeration fields in `LogEntry` alphabetically. ### Why are the changes needed? Enable developers to more conveniently write the enumeration values in `LogEntry` in alphabetical order, according to the requirements of the structured logging development documents. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Manual test. ``` SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "common-utils/testOnly *LogKeySuite -- -t \"LogKey enumeration fields are correctly sorted\"" ``` - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45867 from panbingkun/SPARK-47723. Lead-authored-by: panbingkun <panbingkun@baidu.com> Co-authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 05 April 2024, 00:04:53 UTC
240923c [SPARK-46812][PYTHON][TESTS][FOLLOWUP] Check should_test_connect and pyarrow to skip tests ### What changes were proposed in this pull request? This is a follow-up of SPARK-46812 to skip the tests more robustly and to recover PyPy CIs. - https://github.com/apache/spark/actions/runs/8556900899/job/23447948557 ### Why are the changes needed? - `should_test_connect` covers more edge cases than `have_pandas`. - `test_resources.py` has Arrow usage too. https://github.com/apache/spark/blob/25fc67fa114d2c34099c3ab50396870f543c338b/python/pyspark/resource/tests/test_resources.py#L85 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tests with `pandas` and without `pyarrow`. ``` $ pip3 freeze | grep pyarrow $ pip3 freeze | grep pandas pandas==2.2.1 pandas-stubs==1.2.0.53 $ python/run-tests --modules=pyspark-resource --parallelism=1 --python-executables=python3.10 Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log Will test against the following Python executables: ['python3.10'] Will test the following Python modules: ['pyspark-resource'] python3.10 python_implementation is CPython python3.10 version is: Python 3.10.13 Starting test(python3.10): pyspark.resource.profile (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/db9cb886-2698-49d9-a663-9b8bea79caba/python3.10__pyspark.resource.profile__8mg46xru.log) Finished test(python3.10): pyspark.resource.profile (1s) Starting test(python3.10): pyspark.resource.tests.test_connect_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/53f979bd-1073-41e6-99ba-8e787edc415b/python3.10__pyspark.resource.tests.test_connect_resources__hrgrs5sk.log) Finished test(python3.10): pyspark.resource.tests.test_connect_resources (0s) ... 1 tests were skipped Starting test(python3.10): pyspark.resource.tests.test_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/2b06c671-0199-4827-a0e5-f852a28313fd/python3.10__pyspark.resource.tests.test_resources__jis6mk9a.log) Finished test(python3.10): pyspark.resource.tests.test_resources (2s) ... 1 tests were skipped Tests passed in 4 seconds Skipped tests in pyspark.resource.tests.test_connect_resources with python3.10: test_profile_before_sc_for_connect (pyspark.resource.tests.test_connect_resources.ResourceProfileTests) ... skip (0.002s) Skipped tests in pyspark.resource.tests.test_resources with python3.10: test_profile_before_sc_for_sql (pyspark.resource.tests.test_resources.ResourceProfileTests) ... skip (0.001s) ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45880 from dongjoon-hyun/SPARK-46812-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 April 2024, 19:34:00 UTC
3fd0cd6 [SPARK-47598][CORE] MLLib: Migrate logError with variables to structured logging framework ### What changes were proposed in this pull request? The PR aims to migrate `logError` calls with variables in the `MLlib` module to the structured logging framework. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45837 from panbingkun/SPARK-47598. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 04 April 2024, 17:46:54 UTC
e3405c1 [SPARK-47610][CONNECT][FOLLOWUP] Add -Dio.netty.tryReflectionSetAccessible=true for spark-connect-scala-client ### What changes were proposed in this pull request? Add `-Dio.netty.tryReflectionSetAccessible=true` for `spark-connect-scala-client`. ### Why are the changes needed? The previous change missed spark-connect-scala-client, possibly due to a bad IDEA index. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA ### Was this patch authored or co-authored using generative AI tooling? No Closes #45860 from pan3793/SPARK-47610-followup. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 April 2024, 16:58:42 UTC
25fc67f [SPARK-47728][DOC] Document G1 Concurrent GC metrics ### What changes were proposed in this pull request? This is to document G1 Concurrent GC metrics introduced with https://issues.apache.org/jira/browse/SPARK-44162 ### Why are the changes needed? To improve the documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No Closes #45874 from LucaCanali/documentG1GCCurrentMetrics. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 April 2024, 14:37:34 UTC
5ca3467 [SPARK-47729][PYTHON][TESTS] Get the proper default port for pyspark-connect testcases ### What changes were proposed in this pull request? This PR proposes to get the proper default port for `pyspark-connect` testcases. ### Why are the changes needed? `pyspark-connect` cannot access the JVM, so it cannot get the randomized port assigned by the JVM. ### Does this PR introduce _any_ user-facing change? No, `pyspark-connect` is not published yet, and this is a test-only change. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45875 from HyukjinKwon/SPARK-47729. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 April 2024, 14:33:24 UTC
5f9f5db [SPARK-47689][SQL][FOLLOWUP] More accurate file path in TASK_WRITE_FAILED error ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/45797 . Instead of detecting query execution errors and not wrapping them, it's better to do the error wrapping only in the data writer, which has more context. We can provide the specific file path when the error happened, instead of the destination directory name. ### Why are the changes needed? better error message ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45844 from cloud-fan/write-error. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 April 2024, 14:05:09 UTC
3b8aea3 Revert "[SPARK-47708][CONNECT] Do not log gRPC exception to stderr in PySpark" This reverts commit d87ac8ef49dbd7a14d7a774b4ace1ab681a1bb01. It turns out the sparkconnect logger is disabled by default. This makes us lose the ability to log the error message if required (by explicitly turning on the logger). ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? Closes #45878 from nemanja-boric-databricks/revert-logger. Authored-by: Nemanja Boric <nemanja.boric@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 April 2024, 13:48:40 UTC
bffb02d [SPARK-47565][PYTHON] PySpark worker pool crash resilience ### What changes were proposed in this pull request? PySpark worker processes may die while they are idling. Here we aim to provide some resilience by validating process and selection-key liveness before returning a process from the idle pool. ### Why are the changes needed? To avoid failing queries when a Python process has crashed while idling. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added an appropriate test case. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45635 from sebastianhillig-db/python-worker-factory-crash. Authored-by: Ubuntu <sebastian.hillig@ip-10-110-16-229.us-west-2.compute.internal> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 April 2024, 11:15:53 UTC
f6999df [SPARK-47081][CONNECT] Support Query Execution Progress ### What changes were proposed in this pull request? This patch adds a new mechanism to push query execution progress for batch queries. We add a new response message type and periodically push query progress to the client. The client can consume this data to, for example, display a progress bar. This patch adds support for displaying a progress bar in the PySpark shell when started with Spark Connect. The proto message is defined as follows: ``` // This message is used to communicate progress about the query progress during the execution. message ExecutionProgress { // Captures the progress of each individual stage. repeated StageInfo stages = 1; // Captures the currently in progress tasks. int64 num_inflight_tasks = 2; message StageInfo { int64 stage_id = 1; int64 num_tasks = 2; int64 num_completed_tasks = 3; int64 input_bytes_read = 4; bool done = 5; } } ``` Clients can simply ignore the messages or consume them. On top of that, this adds the ability to register a callback for progress tracking on the SparkSession. ``` handler = lambda **kwargs: print(kwargs) spark.register_progress_handler(handler) spark.range(100).collect() spark.remove_progress_handler(handler) ``` #### Example 1 ![progress_medium_query_multi_stage mp4](https://github.com/apache/spark/assets/3421/5eff1ec4-def2-4d39-8a75-13a6af784c99) #### Example 2 ![progress_bar mp4](https://github.com/apache/spark/assets/3421/20638511-2da4-4bd6-83f2-da3b9f500bde) ### Why are the changes needed? Usability and Experience ### Does this PR introduce _any_ user-facing change? When the user opens the PySpark shell with Spark Connect mode, it will use the progress bar by default. ### How was this patch tested? Added new tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45150 from grundprinzip/SPARK-47081. Authored-by: Martin Grund <martin.grund@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 April 2024, 04:59:56 UTC
3f6ac60 [SPARK-47577][CORE][PART1] Migrate logError with variables to structured logging framework ### What changes were proposed in this pull request? Migrate logError with variables of core module to structured logging framework. This is part1 which transforms the logError entries of the following API ``` def logError(msg: => String): Unit ``` to ``` def logError(entry: LogEntry): Unit ``` ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? Yes, Spark core logs will contain additional MDC ### How was this patch tested? Compiler and scala style checks, as well as code review. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45834 from gengliangwang/coreError. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 04 April 2024, 04:42:25 UTC
d75c775 [SPARK-46812][PYTHON][TESTS][FOLLOWUP] Skip `pandas`-required tests if pandas is not available ### What changes were proposed in this pull request? This is a follow-up of the followings to skip `pandas`-related tests if pandas is not available. - #44852 - #45232 ### Why are the changes needed? `pandas` is an optional dependency. We had better skip it without causing failures. To recover the PyPy 3.8 CI, - https://github.com/apache/spark/actions/runs/8541011879/job/23421483071 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test. ``` $ python/run-tests --modules=pyspark-resource --parallelism=1 --python-executables=python3.10 Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log Will test against the following Python executables: ['python3.10'] Will test the following Python modules: ['pyspark-resource'] python3.10 python_implementation is CPython python3.10 version is: Python 3.10.13 Starting test(python3.10): pyspark.resource.profile (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/021bc7bb-242f-4cb4-8584-11ed6e711f78/python3.10__pyspark.resource.profile__jn89f1hh.log) Finished test(python3.10): pyspark.resource.profile (1s) Starting test(python3.10): pyspark.resource.tests.test_connect_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/244d6c6f-8799-4a2a-b7a7-20d7c50d643d/python3.10__pyspark.resource.tests.test_connect_resources__5ta1tf6e.log) Finished test(python3.10): pyspark.resource.tests.test_connect_resources (0s) ... 1 tests were skipped Starting test(python3.10): pyspark.resource.tests.test_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/671e7afa-e764-443f-bc40-7e940d7342ea/python3.10__pyspark.resource.tests.test_resources__lhbp6y5f.log) Finished test(python3.10): pyspark.resource.tests.test_resources (2s) ... 1 tests were skipped Tests passed in 4 seconds Skipped tests in pyspark.resource.tests.test_connect_resources with python3.10: test_profile_before_sc_for_connect (pyspark.resource.tests.test_connect_resources.ResourceProfileTests) ... skip (0.005s) Skipped tests in pyspark.resource.tests.test_resources with python3.10: test_profile_before_sc_for_sql (pyspark.resource.tests.test_resources.ResourceProfileTests) ... skip (0.001s) ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45869 from dongjoon-hyun/SPARK-46812. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 April 2024, 03:16:02 UTC
d272a1b [SPARK-47724][PYTHON][TESTS] Add an environment variable for testing remote pure Python library ### What changes were proposed in this pull request? This PR proposes to add an environment variable called `SPARK_CONNECT_TESTING_REMOTE` to set `remote` URL. ### Why are the changes needed? In order to test pure Python library with a remote server. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45868 from HyukjinKwon/SPARK-47724. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 April 2024, 03:15:11 UTC
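A hedged sketch of how the new environment variable might be used to point the pure Python test run at an already running Connect server; the URL and command line are illustrative, not taken from the PR:
```
import os
import subprocess

env = dict(os.environ, SPARK_CONNECT_TESTING_REMOTE="sc://localhost:15002")
# Run the connect test module against the external server (from the Spark repo root).
subprocess.run(
    ["python/run-tests", "--modules=pyspark-connect", "--parallelism=1"],
    env=env,
    check=True,
)
```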
c25fd93 [SPARK-47705][INFRA][FOLLOWUP] Sort LogKey alphabetically and build a test to ensure it ### What changes were proposed in this pull request? The PR aims to fix a bug introduced in https://github.com/apache/spark/pull/45857 ### Why are the changes needed? In fact, `LogKey.values.toSeq.sorted` did not sort alphabetically as expected. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. - Manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45864 from panbingkun/fix_sort_logkey. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 04 April 2024, 02:38:37 UTC
678aeb7 [SPARK-47683][PYTHON][BUILD] Decouple PySpark core API to pyspark.core package ### What changes were proposed in this pull request? This PR proposes to release a separate `pyspark-connect` package, see also [SPIP: Pure Python Package in PyPI (Spark Connect)](https://docs.google.com/document/d/1Pund40wGRuB72LX6L7cliMDVoXTPR-xx4IkPmMLaZXk/edit?usp=sharing). Today's PySpark package is roughly as follows: ``` pyspark ├── *.py # *Core / No Spark Connect support* ├── mllib # MLlib / No Spark Connect support ├── resource # Resource profile API / No Spark Connect support ├── streaming # DStream (deprecated) / No Spark Connect support ├── ml # ML │ └── connect # Spark Connect for ML ├── pandas # API on Spark with/without Spark Connect support └── sql # SQL └── connect # Spark Connect for SQL └── streaming # Spark Connect for Structured Streaming ``` There will be two packages available, `pyspark` and `pyspark-connect`. #### `pyspark` Same as today’s PySpark. But Core module is factored out to `pyspark.core.*`. User-facing interface stays the same at `pyspark.*`. ``` pyspark ├── core # *Core / No Spark Connect support* ├── mllib # MLlib / No Spark Connect support ├── resource # Resource profile API / No Spark Connect support ├── streaming # DStream (deprecated) / No Spark Connect support ├── ml # ML │ └── connect # Spark Connect for ML ├── pandas # API on Spark with/without Spark Connect support └── sql # SQL └── connect # Spark Connect for SQL └── streaming # Spark Connect for Structured Streaming ``` #### `pyspark-connect` Package after excluding modules that do not support Spark Connect, also excluding jars, that are, ml without jars: ``` pyspark ├── ml │ └── connect ├── pandas └── sql └── connect └── streaming ``` ### Why are the changes needed? To provide a pure Python library that does not depend on JVM. See also [SPIP: Pure Python Package in PyPI (Spark Connect)](https://docs.google.com/document/d/1Pund40wGRuB72LX6L7cliMDVoXTPR-xx4IkPmMLaZXk/edit?usp=sharing). ### Does this PR introduce _any_ user-facing change? Yes, users can install pure Python library via `pip install pyspark-connect`. ### How was this patch tested? Manually tested the basic set of tests. ```bash ./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar` ``` ```bash cd python python packaging/connect/setup.py sdist cd dist conda create -y -n clean-py-3.11 python=3.11 conda activate clean-py-3.11 pip install pyspark-connect-4.0.0.dev0.tar.gz python ``` ```python >>> import pyspark >>> from pyspark.sql import SparkSession >>> spark = SparkSession.builder.remote("sc://localhost").getOrCreate() >>> spark.range(10).show() ``` ``` +---+ | id| +---+ | 0| | 1| | 2| | 3| | 4| | 5| | 6| | 7| | 8| | 9| +---+ ``` They will be separated added, and set as a scheduled job in CI. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45053 from HyukjinKwon/refactoring-core. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 April 2024, 02:23:19 UTC
447f8af [SPARK-47720][CORE] Update `spark.speculation.multiplier` to 3 and `spark.speculation.quantile` to 0.9 ### What changes were proposed in this pull request? This PR aims to adjust the following in order to make Spark speculative execution behavior less aggressive from Apache Spark 4.0.0. - `spark.speculation.multiplier`: 1.5 -> 3 - `spark.speculation.quantile`: 0.75 -> 0.9 ### Why are the changes needed? Although `spark.speculation` is disabled by default, this has been used in many production use cases. ### Does this PR introduce _any_ user-facing change? This will make speculative execution less aggressive. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45858 from dongjoon-hyun/SPARK-47720. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 April 2024, 02:18:52 UTC
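For reference, a sketch of how a job could pin the previous, more aggressive values if the new 4.0.0 defaults are not desired; the old and new values are taken from the commit description:
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.speculation", "true")             # still disabled by default
    .config("spark.speculation.multiplier", "1.5")   # new default: 3
    .config("spark.speculation.quantile", "0.75")    # new default: 0.9
    .getOrCreate()
)
```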
d87ac8e [SPARK-47708][CONNECT] Do not log gRPC exception to stderr in PySpark ### What changes were proposed in this pull request? Currently, if there's any gRPC exception, instead of just handling it, PySpark's gRPC error handler prints it to stderr, not allowing the user to cleanly ignore the exception with a try/except control-flow statement. In this PR we remove the logger.exception call and rely on the downstream exception mechanism to report this to the user. ### Why are the changes needed? Without this change, there's no way for the user to ignore the gRPC error without piping stderr to /dev/null or equivalent. ### Does this PR introduce _any_ user-facing change? Yes, the stderr will not have the exception trace written twice. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45840 from nemanja-boric-databricks/no-log. Authored-by: Nemanja Boric <nemanja.boric@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 April 2024, 01:28:34 UTC
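A minimal sketch of the control flow this change enables: catching a failure from a Connect query without a duplicate trace being echoed to stderr. The table name is made up, and the broad PySparkException is used only for illustration:
```
from pyspark.errors import PySparkException
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

try:
    spark.sql("SELECT * FROM non_existent_table").collect()
except PySparkException:
    pass  # handled by the caller; the client no longer prints the trace itself
```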
e3aab8c [SPARK-47210][SQL] Addition of implicit casting without indeterminate support ### What changes were proposed in this pull request? This PR adds automatic casting and collations resolution as per `PGSQL` behaviour: 1. Collations set on the metadata level are implicit 2. Collations set using the `COLLATE` expression are explicit 3. When there is a combination of expressions of multiple collations the output will be: - if there are explicit collations and all of them are equal then that collation will be the output - if there are multiple different explicit collations `COLLATION_MISMATCH.EXPLICIT` will be thrown - if there are no explicit collations and only a single type of non default collation, that one will be used - if there are no explicit collations and multiple non-default implicit ones `COLLATION_MISMATCH.IMPLICIT` will be thrown INDETERMINATE_COLLATION should only be thrown on comparison operations and memory storing of data, and we should be able to combine different implicit collations for certain operations like concat and possibly others in the future. This is why we have to add another predefined collation id named INDETERMINATE_COLLATION_ID which means that the result is a combination of conflicting non-default implicit collations. Right now it has an id of -1, so it fails if it ever reaches the CollatorFactory. ### Why are the changes needed? We need to be able to compare columns and values with different collations and set a way of explicitly changing the collation we want to use. ### Does this PR introduce _any_ user-facing change? Yes. We add 3 new errors and enable collation casting. ### How was this patch tested? Tests in `CollationSuite` were done to check code validity. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45383 from mihailom-db/SPARK-47210. Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 April 2024, 23:50:02 UTC
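A hedged illustration of the resolution rules listed above; the collation names are examples (they may differ by Spark version) and the error class comes from the PR text:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Explicit collation from the COLLATE expression wins over the implicit default.
spark.sql("SELECT 'a' COLLATE UNICODE_CI = 'A'").show()

# Two different explicit collations conflict and raise COLLATION_MISMATCH.EXPLICIT.
spark.sql("SELECT 'a' COLLATE UNICODE_CI = 'A' COLLATE UTF8_BINARY").show()
```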
fbe6b1d [SPARK-47721][DOC] Guidelines for the Structured Logging Framework ### What changes were proposed in this pull request? As suggested in https://github.com/apache/spark/pull/45834/files#r1549565157, I am creating initial guidelines for the structured logging framework. ### Why are the changes needed? We need guidelines to align the logging migration works in the community. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It's just doc change. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: GitHub Copilot 1.2.17.2887 Closes #45862 from gengliangwang/logREADME. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 April 2024, 23:46:21 UTC
49eefc5 [SPARK-47722][SS] Wait until RocksDB background work finish before closing ### What changes were proposed in this pull request? When closing the rocksdb instance, we need to wait until all background work finish. If not, the following error could be observed: ``` 24/03/29 06:47:11 INFO RocksDB StateStoreId(opId=0,partId=0,name=default): [NativeRocksDB-2] [/error_handler.cc:396] Background IO error IO error: No such file or directory: While open a file for appending: /ephemeral/tmp/spark-efd53c17-2b8a-4f80-aca0-b767dc06be3d/StateStoreId(opId=0,partId=0,name=default)-732271d8-03b3-4046-a911-5797804df25c/workingDir-3a85e625-e4fb-4a78-b668-941ca16cc7a2/000008.sst: No such file or directory 24/03/29 06:47:11 ERROR RocksDB StateStoreId(opId=0,partId=0,name=default): [NativeRocksDB-3] [/db_impl/db_impl_compaction_flush.cc:3021] Waiting after background flush error: IO error: No such file or directory: While open a file for appending: /ephemeral/tmp/spark-efd53c17-2b8a-4f80-aca0-b767dc06be3d/StateStoreId(opId=0,partId=0,name=default)-732271d8-03b3-4046-a911-5797804df25c/workingDir-3a85e625-e4fb-4a78-b668-941ca16cc7a2/000008.sst: No such file or directoryAccumulated background error counts: 1 <TRUNCATED LOG> 24/03/29 11:54:09 INFO ShutdownHookManager: Deleting directory /ephemeral/tmp/spark-b5dac908-59cc-4276-80f7-34dab79716b7/StateStoreId(opId=0,partId=0,name=default)-702d3c8f-245e-4119-a763-b8e963d07e7b 24/03/29 06:47:12 INFO ShutdownHookManager: Deleting directory /ephemeral/tmp/spark-efd53c17-2b8a-4f80-aca0-b767dc06be3d/StateStoreId(opId=0,partId=4,name=default)-0eb30b1b-b92f-4744-aff6-85f9efd2bcf2 24/03/29 06:47:12 INFO ShutdownHookManager: Deleting directory /ephemeral/tmp/streaming.metadata-d281c16c-89c7-49b3-b65a-6eb2de6ddb6f pthread lock: Invalid argument ``` In the source code, after this error is thrown, there is a sleep for 1 second and then re lock the original mutex: https://github.com/facebook/rocksdb/blob/e46ab9d4f0a0e63bfc668421e2994efa918d6570/db/db_impl/db_impl_compaction_flush.cc#L2613 From the logs of RocksDB and ShutdownHookManager , we can see that exactly 1 second after rocks db throws, the pthread lock: Invalid argument is thrown. So it is likely that this mutex throws. ### Why are the changes needed? Bug fix for a transient issue ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test should be enough. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45863 from WweiL/SPARK-47722-rocksdb-cleanup. Authored-by: Wei Liu <wei.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 April 2024, 23:45:27 UTC
6a0555c [SPARK-47700][SQL] Fix formatting of error messages with treeNode ### What changes were proposed in this pull request? Fix formatting of error messages. Example: In several error messages, we are concatenating the plan without any separation: `in this locationFilter (dept_id#652`. We should add a colon and space or newline in between. Before: ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this locationFilter (dept_id#22 = outer(dept_id#16)) +- SubqueryAlias dept +- View (`DEPT`, [dept_id#22, dept_name#23, state#24]) +- Project [cast(dept_id#25 as int) AS dept_id#22, cast(dept_name#26 as string) AS dept_name#23, cast(state#27 as string) AS state#24] +- Project [dept_id#25, dept_name#26, state#27] +- SubqueryAlias DEPT +- LocalRelation [dept_id#25, dept_name#26, state#27] . SQLSTATE: 0A000; line 3 pos 19; ``` After: ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this location: Filter (dept_id#71 = outer(dept_id#65)) +- SubqueryAlias dept +- View (`DEPT`, [dept_id#71, dept_name#72, state#73]) +- Project [cast(dept_id#74 as int) AS dept_id#71, cast(dept_name#75 as string) AS dept_name#72, cast(state#76 as string) AS state#73] +- Project [dept_id#74, dept_name#75, state#76] +- SubqueryAlias DEPT +- LocalRelation [dept_id#74, dept_name#75, state#76] . SQLSTATE: 0A000; line 3 pos 19; ``` ### Why are the changes needed? Improve error messages readability. ### Does this PR introduce _any_ user-facing change? Improve error messages readability. ### How was this patch tested? Unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45825 from jchen5/treenode-error-format. Authored-by: Jack Chen <jack.chen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 April 2024, 23:39:52 UTC
7dec5eb [SPARK-47705][INFRA] Sort LogKey alphabetically and build a test to ensure it ### What changes were proposed in this pull request? This PR adds a unit test to ensure that the fields of the `LogKey` enumeration are sorted alphabetically, as specified by https://issues.apache.org/jira/browse/SPARK-47705. ### Why are the changes needed? This will make sure that the fields of the enumeration remain easy to read in the future as we add more cases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR adds testing coverage only. ### Was this patch authored or co-authored using generative AI tooling? GitHub copilot offered some suggestions, but I rejected them Closes #45857 from dtenedor/logs. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 03 April 2024, 21:38:52 UTC
1515c56 [SPARK-47553][SS] Add Java support for transformWithState operator APIs ### What changes were proposed in this pull request? Add Java support for transformWithState operator APIs ### Why are the changes needed? To add support for using transformWithState operator in Java ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added unit tests ``` [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testGroupByKey() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testCollect() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testKryoEncoderErrorMessageForPrivateClass() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testJavaBeanEncoder() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testTupleEncoder() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testPeriodEncoder() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testRowEncoder() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testNestedTupleEncoder() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testTupleEncoderSchema() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testMappingFunctionWithTestGroupState() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testReduce() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testSelect() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testInitialStateFlatMapGroupsWithState() started [info] Test test.org.apache.spark.sql.JavaDatasetSuite#testJavaEncoderErrorMessageForPrivateClass() started [info] Test run finished: 0 failed, 0 ignored, 45 total, 14.73s [info] Passed: Total 45, Failed 0, Errors 0, Passed 45 [success] Total time: 20 s, completed Mar 28, 2024, 12:37:30 PM ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #45758 from anishshri-db/task/SPARK-47553. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 03 April 2024, 19:55:23 UTC
a427a45 [SPARK-47710][SQL][DOCS] Postgres: Document Mapping Spark SQL Data Types from PostgreSQL ### What changes were proposed in this pull request? This PR adds a user document for mapping Spark SQL data types from PostgreSQL. The write-side document is not included yet, as it might need further verification. ### Why are the changes needed? Doc improvements. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added some tests for missing PG data types ![image](https://github.com/apache/spark/assets/8326978/7629fd87-b047-48c7-9892-42820f0bb430) ### Was this patch authored or co-authored using generative AI tooling? No Closes #45845 from yaooqinn/SPARK-47710. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 April 2024, 17:51:02 UTC
efe437e [SPARK-47711][BUILD][TESTS] Parameterize JDBC Driver versions for docker integration tests ### What changes were proposed in this pull request? We currently support configuring the versions of DB backends but not the versions of the JDBC drivers while testing. ### Why are the changes needed? This PR improves the development experience: developers can test JDBC drivers of lower or higher versions, look for behavior changes, etc. ### Does this PR introduce _any_ user-facing change? No, dev-only ### How was this patch tested? ``` ./build/sbt -Pdocker-integration-tests "docker-integration-tests/testOnly *MySQLIntegrationSuite" -Dmysql.connector.version=5.1.49 ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #45847 from yaooqinn/SPARK-47711. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 April 2024, 17:02:00 UTC
1404d1f [SPARK-47715][BUILD][STS] Upgrade hive-service-rpc 4.0.0 ### What changes were proposed in this pull request? This PR upgrades hive-service-rpc from 3.1.3 to 4.0.0, which has 3 changes. - https://issues.apache.org/jira/browse/HIVE-14388 (new added field is optional, leave it now and investigate later) - https://issues.apache.org/jira/browse/HIVE-24230 (not applicable for Spark) - https://issues.apache.org/jira/browse/HIVE-24893 (mark methods as not supported and investigate later) ### Why are the changes needed? Use the latest version of `hive-service-rpc`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45854 from pan3793/SPARK-47715. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 April 2024, 14:55:45 UTC
d487acd [SPARK-47707][SQL] Special handling of JSON type for MySQL Connector/J 5.x ### What changes were proposed in this pull request? The MySQL JDBC driver `mysql-connector-java-5.1.49.jar` converts the JSON type into Types.CHAR with a precision of Int.Max. When receiving CHAR with Int.Max precision, the Spark executor will throw `java.lang.OutOfMemoryError: Requested array size exceeds VM limit`. ### Why are the changes needed? Plan before this PR: ``` +----------------------------------------------------+ | plan | +----------------------------------------------------+ | == Physical Plan == *(1) Project [staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, readSidePadding, field_test#75, 2147483647, true, false, true) AS field_test#97] +- *(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$161874508 [field_test#75] PushedFilters: [], ReadSchema: struct<field_test:string> | +----------------------------------------------------+ ``` Plan after this PR: ``` +----------------------------------------------------+ | plan | +----------------------------------------------------+ | == Physical Plan == *(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$132d7eba2 [field_test#28] PushedFilters: [], ReadSchema: struct<field_test:string> | +----------------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? No Closes #45835 from beryllw/spark-43362. Authored-by: wangjunbo <wangjunbo@qiyi.com> Signed-off-by: Kent Yao <yao@apache.org> 03 April 2024, 10:47:12 UTC
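A hedged sketch of the read path this fixes: a MySQL table with a JSON column loaded through the JDBC data source. Connection options are placeholders, and the MySQL driver jar must be on the classpath:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/testdb")
    .option("dbtable", "json_table")      # table containing a JSON column
    .option("user", "user")
    .option("password", "password")
    .load()
)
# After the fix, the JSON column arrives as a plain string instead of CHAR(Int.Max).
df.printSchema()
```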
d16183d [SPARK-47653][SS] Add support for negative numeric types and range scan key encoder ### What changes were proposed in this pull request? Add support for negative numeric types and range scan key encoder ### Why are the changes needed? Without this change, sort ordering for `-ve` numbers is not maintained on iteration. Negative numbers would appear last previously. Note that only non-floating integer types such as `short, integer, long` are supported for signed values. For float/double, we cannot simply prepend a sign byte given the way floating point values are stored in the IEEE 754 floating point representation. Additionally we also need to flip all the bits and convert them back to the original value on read, in order to support these types. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests ``` [info] - rocksdb range scan - with prefix scan - with colFamiliesEnabled=true (with changelog checkpointing) (164 milliseconds) [info] - rocksdb range scan - with prefix scan - with colFamiliesEnabled=true (without changelog checkpointing) (95 milliseconds) [info] - rocksdb range scan - with prefix scan - with colFamiliesEnabled=false (with changelog checkpointing) (155 milliseconds) [info] - rocksdb range scan - with prefix scan - with colFamiliesEnabled=false (without changelog checkpointing) (82 milliseconds) 12:55:54.184 WARN org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite: ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBStateStoreSuite, threads: rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-2 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true) ===== [info] Run completed in 8 seconds, 888 milliseconds. [info] Total number of tests run: 44 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 44, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 21 s, completed Mar 29, 2024, 12:55:54 PM ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #45778 from anishshri-db/task/SPARK-47653. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 03 April 2024, 07:54:23 UTC
6e3a425 [MINOR][DOCS] Replace backtick-formatted code with <code> tags inside configuration.md tables ### What changes were proposed in this pull request? Inside HTML tables in the configuration webpage `configuration.md`, some markdown-formatted `code` was not rendering correctly, with `` ` `` visible. Replace `` `foo` `` with `<code>foo</code>`. ### Why are the changes needed? For rendering the code blocks properly. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the user-facing documentation. ### How was this patch tested? Manually tested at https://github.com/apache/spark/pull/45731#issuecomment-2033499194 ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45731 from vadim/patch-1. Authored-by: Vadim Patsalo <86705+vadim@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 April 2024, 07:09:28 UTC
62f90ec [SPARK-47452][INFRA][FOLLOWUP] Enforce to install `six` to `Python 3.10` ### What changes were proposed in this pull request? This PR aims to enforce to install `six` to Python 3.10 because `Python 3.10` is missing `six` and causes `Pandas` detection failures in CIs. - https://github.com/apache/spark/actions/runs/8525063765/job/23373974516 - Note that `pandas` is visible in the installed package list, but it fails when PySpark detects it due to the missing `six`. ``` $ docker run -it --rm ghcr.io/apache/apache-spark-ci-image:master-8345361470 python3.9 -m pip freeze | grep six six==1.16.0 $ docker run -it --rm ghcr.io/apache/apache-spark-ci-image:master-8345361470 python3.10 -m pip freeze | grep six $ docker run -it --rm ghcr.io/apache/apache-spark-ci-image:master-8345361470 python3.11 -m pip freeze | grep six six==1.16.0 $ docker run -it --rm ghcr.io/apache/apache-spark-ci-image:master-8345361470 python3.12 -m pip freeze | grep six six==1.16.0 ``` - CI failure message example. - https://github.com/apache/spark/actions/runs/8525063765/job/23373974096 ``` Starting test(python3.10): pyspark.ml.tests.connect.test_connect_classification (temp output: /__w/spark/spark/python/target/370eb2c4-12f2-411f-96d1-f617f5d59528/python3.10__pyspark.ml.tests.connect.test_connect_classification__v6itdsxy.log) Traceback (most recent call last): File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/__w/spark/spark/python/pyspark/ml/tests/connect/test_connect_classification.py", line 37, in <module> class ClassificationTestsOnConnect(ClassificationTestsMixin, unittest.TestCase): NameError: name 'ClassificationTestsMixin' is not defined ``` ### Why are the changes needed? Since Python 3.10 is the default Python version of Ubuntu OS, the behavior is different. ``` RUN python3.10 -m pip install numpy pyarrow>=15.0.0 six==1.16.0 ... ... #20 0.766 Requirement already satisfied: six==1.16.0 in /usr/lib/python3/dist-packages (1.16.0) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Check the docker image built by this PR. - https://github.com/dongjoon-hyun/spark/actions/runs/8533625657/job/23376659246 ``` $ docker pull --platform amd64 ghcr.io/dongjoon-hyun/apache-spark-ci-image:master-8533625657 $ docker run -it --rm ghcr.io/dongjoon-hyun/apache-spark-ci-image:master-8533625657 python3.10 -m pip freeze | grep six six==1.16.0 ``` Run tests on new docker image. ``` $ docker run -it --rm -v $PWD:/spark ghcr.io/dongjoon-hyun/apache-spark-ci-image:master-8533625657 rootb7f5f56892b0:/# cd /spark rootb7f5f56892b0:/spark# python/run-tests --modules=pyspark-mllib,pyspark-ml,pyspark-ml-connect --parallelism=1 --python-executables=python3.10 Running PySpark tests. 
Output is in /spark/python/unit-tests.log Will test against the following Python executables: ['python3.10'] Will test the following Python modules: ['pyspark-mllib', 'pyspark-ml', 'pyspark-ml-connect'] python3.10 python_implementation is CPython python3.10 version is: Python 3.10.12 Starting test(python3.10): pyspark.ml.tests.connect.test_connect_classification (temp output: /spark/python/target/675eccdc-3c4b-4146-a58b-030302bdc6d7/python3.10__pyspark.ml.tests.connect.test_connect_classification__9habp0rh.log) Finished test(python3.10): pyspark.ml.tests.connect.test_connect_classification (159s) Starting test(python3.10): pyspark.ml.tests.connect.test_connect_evaluation (temp output: /spark/python/target/fbac93ba-c72d-40e4-acfe-f3ac01b4932a/python3.10__pyspark.ml.tests.connect.test_connect_evaluation__js11z0ux.log) Finished test(python3.10): pyspark.ml.tests.connect.test_connect_evaluation (36s) Starting test(python3.10): pyspark.ml.tests.connect.test_connect_feature (temp output: /spark/python/target/fdb8828e-4241-4e78-a7d6-b2a4beb3cfc1/python3.10__pyspark.ml.tests.connect.test_connect_feature__et5gr30f.log) Finished test(python3.10): pyspark.ml.tests.connect.test_connect_feature (30s) Starting test(python3.10): pyspark.ml.tests.connect.test_connect_function (temp output: /spark/python/target/e365e62f-a09b-483d-9101-fe9dfc0801f2/python3.10__pyspark.ml.tests.connect.test_connect_function__5e288azs.log) Finished test(python3.10): pyspark.ml.tests.connect.test_connect_function (24s) Starting test(python3.10): pyspark.ml.tests.connect.test_connect_pipeline (temp output: /spark/python/target/bdc167be-6d6e-4704-b840-cf5d23c4b21e/python3.10__pyspark.ml.tests.connect.test_connect_pipeline__63blw3o2.log) ... ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45832 from dongjoon-hyun/SPARK-47452-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 April 2024, 06:14:43 UTC
9a20794 [SPARK-47701][SQL][TESTS] Postgres: Add test for Composite and Range types ### What changes were proposed in this pull request? Add tests for Composite and Range types of Postgres. ### Why are the changes needed? Test improvements. ### Does this PR introduce _any_ user-facing change? No, test-only ### How was this patch tested? New tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45827 from yaooqinn/SPARK-47701. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 April 2024, 05:43:05 UTC
360a3f9 [SPARK-45733][PYTHON][TESTS][FOLLOWUP] Skip `pyspark.sql.tests.connect.client.test_client` if not should_test_connect ### What changes were proposed in this pull request? This is a follow-up of the following. - https://github.com/apache/spark/pull/43591 ### Why are the changes needed? This test requires `pandas` which is an optional dependency in Apache Spark. ``` $ python/run-tests --modules=pyspark-connect --parallelism=1 --python-executables=python3.10 --testnames 'pyspark.sql.tests.connect.client.test_client' Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log Will test against the following Python executables: ['python3.10'] Will test the following Python tests: ['pyspark.sql.tests.connect.client.test_client'] python3.10 python_implementation is CPython python3.10 version is: Python 3.10.13 Starting test(python3.10): pyspark.sql.tests.connect.client.test_client (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/216a8716-3a1f-4cf9-9c7c-63087f29f892/python3.10__pyspark.sql.tests.connect.client.test_client__tydue4ck.log) Traceback (most recent call last): File "/Users/dongjoon/.pyenv/versions/3.10.13/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/Users/dongjoon/.pyenv/versions/3.10.13/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/sql/tests/connect/client/test_client.py", line 137, in <module> class TestPolicy(DefaultPolicy): NameError: name 'DefaultPolicy' is not defined ``` ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Pass the CIs and manually test without `pandas`. ``` $ pip3 uninstall pandas $ python/run-tests --modules=pyspark-connect --parallelism=1 --python-executables=python3.10 --testnames 'pyspark.sql.tests.connect.client.test_client' Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log Will test against the following Python executables: ['python3.10'] Will test the following Python tests: ['pyspark.sql.tests.connect.client.test_client'] python3.10 python_implementation is CPython python3.10 version is: Python 3.10.13 Starting test(python3.10): pyspark.sql.tests.connect.client.test_client (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/acf07ed5-938a-4272-87e1-47e3bf8b988e/python3.10__pyspark.sql.tests.connect.client.test_client__sfdosnek.log) Finished test(python3.10): pyspark.sql.tests.connect.client.test_client (0s) ... 13 tests were skipped Tests passed in 0 seconds Skipped tests in pyspark.sql.tests.connect.client.test_client with python3.10: test_basic_flow (pyspark.sql.tests.connect.client.test_client.SparkConnectClientReattachTestCase) ... skip (0.002s) test_fail_and_retry_during_execute (pyspark.sql.tests.connect.client.test_client.SparkConnectClientReattachTestCase) ... skip (0.000s) test_fail_and_retry_during_reattach (pyspark.sql.tests.connect.client.test_client.SparkConnectClientReattachTestCase) ... skip (0.000s) test_fail_during_execute (pyspark.sql.tests.connect.client.test_client.SparkConnectClientReattachTestCase) ... skip (0.000s) test_channel_builder (pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... skip (0.000s) test_channel_builder_with_session (pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... skip (0.000s) test_interrupt_all (pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... 
skip (0.000s) test_is_closed (pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... skip (0.000s) test_properties (pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... skip (0.000s) test_retry (pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... skip (0.000s) test_retry_client_unit (pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... skip (0.000s) test_user_agent_default (pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... skip (0.000s) test_user_agent_passthrough (pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... skip (0.000s) ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45830 from dongjoon-hyun/SPARK-45733. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 April 2024, 05:30:08 UTC
344f640 [SPARK-47454][PYTHON][TESTS][FOLLOWUP] Skip `test_create_dataframe_from_pandas_with_day_time_interval` if pandas is not available ### What changes were proposed in this pull request? This is a follow-up of #45591 to skip `test_create_dataframe_from_pandas_with_day_time_interval` if `pandas` is not available. ### Why are the changes needed? The test requires `pandas` due to `import pandas as pd`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45828 from dongjoon-hyun/SPARK-47454. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 April 2024, 04:04:20 UTC
2c2a2ad [SPARK-47655][SS] Integrate timer with Initial State handling for state-v2 ### What changes were proposed in this pull request? We want to support timer feature during initial state handling. Currently, if grouping key introduced in initial state is never seen again in new input rows, those rows in initial state will never be emitted. We want to add a field in initial state handling function such that users can manipulate the timers with initial state rows. ### Why are the changes needed? These changes are needed to support timer with initial state. The changes are part of the work around adding new stateful streaming operator for arbitrary state mgmt that provides a bunch of new features listed in the SPIP JIRA here - https://issues.apache.org/jira/browse/SPARK-45939 ### Does this PR introduce _any_ user-facing change? Yes. We add a new `timerValues` field in `StatefulProcessorWithInitialState`: ``` /** * Function that will be invoked only in the first batch for users to process initial states. * * param key - grouping key * param initialState - A row in the initial state to be processed * param timerValues - instance of TimerValues that provides access to current processing/event * time if available */ def handleInitialState(key: K, initialState: S, timerValues: TimerValues): Unit ``` ### How was this patch tested? Add unit tests in `TransformWithStateWithInitialStateSuite` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45780 from jingz-db/timer-initState-integration. Authored-by: jingz-db <jing.zhan@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 03 April 2024, 03:21:29 UTC
49b7b6b [SPARK-47691][SQL] Postgres: Support multi-dimensional array on the write side ### What changes were proposed in this pull request? This pull request adds support for writing nested arrays as Postgres multi-dimensional arrays; the read side has not yet been implemented. ### Why are the changes needed? improve the Postgres data source ### Does this PR introduce _any_ user-facing change? yes, we now support `array(array(...))` types and the like, whereas _LEGACY_ERROR_TEMP_2082 was raised before ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45815 from yaooqinn/SPARK-47691. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 April 2024, 02:50:58 UTC
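A minimal sketch of the write path described in the entry above, assuming a local Postgres instance with a hypothetical URL, table name, and credentials (the read side for multi-dimensional arrays is still unimplemented):

```scala
// Sketch only: writes a nested array column (array<array<int>>) through the
// JDBC data source; before SPARK-47691 this raised _LEGACY_ERROR_TEMP_2082.
import org.apache.spark.sql.SparkSession

object NestedArrayWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pg-nested-array").getOrCreate()
    import spark.implicits._

    val df = Seq(Seq(Seq(1, 2), Seq(3, 4))).toDF("matrix") // array<array<int>> column

    df.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/testdb") // hypothetical
      .option("dbtable", "matrix_table")                        // hypothetical
      .option("user", "postgres")
      .option("password", "example")
      .mode("overwrite")
      .save()

    spark.stop()
  }
}
```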
55b5ff6 [SPARK-47669][SQL][CONNECT][PYTHON] Add `Column.try_cast` ### What changes were proposed in this pull request? Add `try_cast` function in Column APIs ### Why are the changes needed? for functionality parity ### Does this PR introduce _any_ user-facing change? yes ``` >>> from pyspark.sql import functions as sf >>> df = spark.createDataFrame( ... [(2, "123"), (5, "Bob"), (3, None)], ["age", "name"]) >>> df.select(df.name.try_cast("double")).show() +-----+ | name| +-----+ |123.0| | NULL| | NULL| +-----+ ``` ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45796 from zhengruifeng/connect_try_cast. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 03 April 2024, 00:15:08 UTC
27fdf96 [SPARK-47697][INFRA] Add Scala style check for invalid MDC usage ### What changes were proposed in this pull request? Add Scala style check for invalid MDC usage to avoid invalid MDC usage `s"Task ${MDC(TASK_ID, taskId)} failed"`, which should be `log"Task ${MDC(TASK_ID, taskId)} failed"`. ### Why are the changes needed? This makes development and PR review of the structured logging migration easier. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Manual test, verified it will throw errors on invalid MDC usage. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45823 from gengliangwang/style. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 April 2024, 22:45:23 UTC
d11d9cf [SPARK-47699][BUILD] Upgrade `gcs-connector` to 2.2.21 and add a note for 3.0.0 ### What changes were proposed in this pull request? This PR aims to upgrade `gcs-connector` to 2.2.21 and add a note for 3.0.0. ### Why are the changes needed? This PR aims to upgrade `gcs-connector` to bring the latest bug fixes. However, due to the following, we stick to use 2.2.21. - https://github.com/GoogleCloudDataproc/hadoop-connectors/issues/1114 - `gcs-connector` 2.2.21 has shaded Guava 32.1.2-jre. - https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/15c8ee41a15d6735442f36333f1d67792c93b9cf/pom.xml#L100 - `gcs-connector` 3.0.0 has shaded Guava 31.1-jre. - https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/667bf17291dbaa96a60f06df58c7a528bc4a8f79/pom.xml#L97 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. ``` $ dev/make-distribution.sh -Phadoop-cloud $ cd dist $ export KEYFILE=~/.ssh/apache-spark.json $ export EMAIL=$(jq -r '.client_email' < $KEYFILE) $ export PRIVATE_KEY_ID=$(jq -r '.private_key_id' < $KEYFILE) $ export PRIVATE_KEY="$(jq -r '.private_key' < $KEYFILE)" $ bin/spark-shell \ -c spark.hadoop.fs.gs.auth.service.account.email=$EMAIL \ -c spark.hadoop.fs.gs.auth.service.account.private.key.id=$PRIVATE_KEY_ID \ -c spark.hadoop.fs.gs.auth.service.account.private.key="$PRIVATE_KEY" Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 4.0.0-SNAPSHOT /_/ Using Scala version 2.13.13 (OpenJDK 64-Bit Server VM, Java 21.0.2) Type in expressions to have them evaluated. Type :help for more information. {"ts":"2024-04-02T13:08:31.513-0700","level":"WARN","msg":"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable","logger":"org.apache.hadoop.util.NativeCodeLoader"} Spark context Web UI available at http://localhost:4040 Spark context available as 'sc' (master = local[*], app id = local-1712088511841). Spark session available as 'spark'. scala> spark.read.text("gs://apache-spark-bucket/README.md").count() val res0: Long = 124 scala> spark.read.orc("examples/src/main/resources/users.orc").write.mode("overwrite").orc("gs://apache-spark-bucket/users.orc") scala> spark.read.orc("gs://apache-spark-bucket/users.orc").show() +------+--------------+----------------+ | name|favorite_color|favorite_numbers| +------+--------------+----------------+ |Alyssa| NULL| [3, 9, 15, 20]| | Ben| red| []| +------+--------------+----------------+ ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45824 from dongjoon-hyun/SPARK-47699. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 April 2024, 20:31:18 UTC
e6144e4 [SPARK-47695][BUILD] Upgrade AWS SDK v2 to 2.24.6 ### What changes were proposed in this pull request? This PR aims to upgrade AWS SDK v2 to 2.24.6. ### Why are the changes needed? Like HADOOP-19082 (Apache Hadoop 3.4.1), Apache Spark 4.0.0 had better use the latest one. - https://github.com/apache/hadoop/pull/6568 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45821 from dongjoon-hyun/SPARK-47695. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 April 2024, 18:05:03 UTC
db0975c [SPARK-47602][CORE] Resource managers: Migrate logError with variables to structured logging framework ### What changes were proposed in this pull request? The pr aims to: - migrate `logError` with variables in the `Resource managers` module to the `structured logging framework`. - add some test cases to `*LoggingSuite` - add a new UT, `MDCSuite` - change the type of `value` from `String` to `Any` (so that when using `MDC`, some code can be omitted, e.g. `String.valueOf(...)`, `x.toString`) ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. - Add & update new UTs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45808 from panbingkun/SPARK-47602. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 02 April 2024, 17:04:30 UTC
8d4e964 [SPARK-47684][SQL] Postgres: Map length unspecified bpchar to StringType ### What changes were proposed in this pull request? This PR maps length unspecified bpchar to StringType for Postgres. Length unspecified bpchar represents variable unlimited character string, and the PG JDBC reports its size with Int.MaxValue. W/o this patch, we will always do char padding to 2147483647. This is inefficient and risky. ### Why are the changes needed? suitable type mapping ### Does this PR introduce _any_ user-facing change? yes, length unspecified bpchar is mapped to StringType, before, CharType(2147483647) ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45810 from yaooqinn/SPARK-47684. Lead-authored-by: Kent Yao <yao@apache.org> Co-authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Kent Yao <yao@apache.org> 02 April 2024, 14:33:33 UTC
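For context, a minimal sketch of how the mapping change above surfaces to users, assuming a hypothetical Postgres database and a table containing a column declared as plain `bpchar` with no length:

```scala
// Sketch only: with SPARK-47684, a length-unspecified bpchar column is read
// as StringType rather than CharType(2147483647), so no char padding applies.
import org.apache.spark.sql.SparkSession

object BpcharReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pg-bpchar").getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/testdb") // hypothetical
      .option("dbtable", "bpchar_table")                        // hypothetical
      .option("user", "postgres")
      .option("password", "example")
      .load()

    df.printSchema() // the length-unspecified bpchar column now appears as `string`
    spark.stop()
  }
}
```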
ba98b7a [SPARK-47689][SQL] Do not wrap query execution error during data writing ### What changes were proposed in this pull request? It's quite confusing to report `TASK_WRITE_FAILED` error when the error was caused by input query execution. This PR updates the error wrapping code to not wrap with `TASK_WRITE_FAILED` if the error was from input query execution. ### Why are the changes needed? better error reporting ### Does this PR introduce _any_ user-facing change? yes, now people won't see `TASK_WRITE_FAILED` error if the error was from input query. ### How was this patch tested? updated tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45797 from cloud-fan/write-error. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 April 2024, 12:34:32 UTC
b1b1fde Revert "[SPARK-47684][SQL] Postgres: Map length unspecified bpchar to StringType" This reverts commit 22771a6a87e7b6bc4d037ee15ea1c21baef7a241. 02 April 2024, 12:24:08 UTC
03f4e45 [SPARK-47685][SQL] Restore the support for `Stream` type in `Dataset#groupBy` ### What changes were proposed in this pull request? When I reviewed the changes in SPARK-45685, I found an old user case that is no longer supported: ```scala Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): _*).sum("id") ``` ``` [info] - SPARK-38221: group by `Stream` of complex expressions should not fail *** FAILED *** (51 milliseconds) [info] org.apache.spark.SparkException: Task not serializable [info] at org.apache.spark.util.SparkClosureCleaner$.clean(SparkClosureCleaner.scala:45) [info] at org.apache.spark.SparkContext.clean(SparkContext.scala:2718) [info] at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:908) [info] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) [info] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:411) [info] at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:907) [info] at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:762) ... ``` Since this is a historical user usage, and although the `Stream` type has been deprecated after Scala 2.13.0, it has not been removed, so this PR restores the support for `Stream` type in `Dataset#groupBy`. ### Why are the changes needed? Restore the support for `Stream` type in `Dataset#groupBy` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Restored the test case for dataset group by `Stream`. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45811 from LuciferYang/SPARK-47685. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 02 April 2024, 09:37:46 UTC
a598f65 [SPARK-47664][PYTHON][CONNECT][TESTS][FOLLOW-UP] Add more tests ### What changes were proposed in this pull request? Add more tests ### Why are the changes needed? for test coverage, to address https://github.com/apache/spark/pull/45788#discussion_r1546663788 ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45809 from zhengruifeng/col_name_val_test. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 02 April 2024, 08:33:32 UTC
22771a6 [SPARK-47684][SQL] Postgres: Map length unspecified bpchar to StringType ### What changes were proposed in this pull request? This PR maps length unspecified bpchar to StringType for Postgres. Length unspecified bpchar represents variable unlimited character string, and the PG JDBC reports its size with Int.MaxValue. W/o this patch, we will always do char padding to 2147483647. This is inefficient and risky. ### Why are the changes needed? suitable type mapping ### Does this PR introduce _any_ user-facing change? yes, length unspecified bpchar is mapped to StringType, before, CharType(2147483647) ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45810 from yaooqinn/SPARK-47684. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 02 April 2024, 08:30:03 UTC
e2e6e09 [SPARK-47634][SQL] Add legacy support for disabling map key normalization ### What changes were proposed in this pull request? Added a `DISABLE_MAP_KEY_NORMALIZATION` option in `SQLConf` to allow for legacy creation of a map without key normalization (keys `0.0` and `-0.0`) in `ArrayBasedMapBuilder`. ### Why are the changes needed? As a legacy fallback option. ### Does this PR introduce _any_ user-facing change? New `DISABLE_MAP_KEY_NORMALIZATION` config option. ### How was this patch tested? New UT proposed in this PR ### Was this patch authored or co-authored using generative AI tooling? No Closes #45760 from stevomitric/stevomitric/normalize-conf. Authored-by: Stevo Mitric <stevo.mitric@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 April 2024, 08:28:27 UTC
eb9b126 [SPARK-47663][CORE][TESTS] Add end-to-end test for task limiting according to different CPU and GPU configurations ### What changes were proposed in this pull request? Add an end-to-end unit test to ensure that the number of tasks is calculated correctly according to the different task CPU amount and task GPU amount. ### Why are the changes needed? To increase the test coverage. More details can be found at https://github.com/apache/spark/pull/45528#discussion_r1545905575 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The CI can pass. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45794 from wbo4958/end2end-test. Authored-by: Bobby Wang <wbo4958@gmail.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 02 April 2024, 07:30:10 UTC
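A small illustration of the concurrency rule the test above exercises, with hypothetical resource amounts; per-executor task slots are bounded by both the CPU budget and the GPU budget:

```scala
// Sketch only: the scheduler can run at most
//   min(executor cores / task cpus, executor gpu amount / task gpu amount)
// concurrent tasks per executor.
import org.apache.spark.SparkConf

object TaskSlotSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.executor.cores", "4")
      .set("spark.task.cpus", "1")                      // 4 slots by CPU
      .set("spark.executor.resource.gpu.amount", "1")
      .set("spark.task.resource.gpu.amount", "0.5")     // 2 slots by GPU
    // Effective concurrency per executor: min(4 / 1, 1 / 0.5) = 2 tasks.
    println(conf.toDebugString)
  }
}
```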
00162b8 [SPARK-47686][SQL][TESTS] Use `=!=` instead of `!==` in `JoinHintSuite` ### What changes were proposed in this pull request? This pr use `=!=` instead of `!==` in `JoinHintSuite`. `!==` is a deprecated API since 2.0.0, and its test already exists in `DeprecatedAPISuite`. ### Why are the changes needed? Clean up deprecated API usage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #45812 from LuciferYang/SPARK-47686. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 April 2024, 06:45:00 UTC
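For readers unfamiliar with the operators involved, a minimal sketch (hypothetical data) of the non-deprecated inequality operator that the suite above now uses:

```scala
// Sketch only: `=!=` is the supported Column inequality operator; `!==` has
// been deprecated since Spark 2.0.0 and is kept only for compatibility.
import org.apache.spark.sql.SparkSession

object ColumnInequalitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ne-example").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
    df.filter($"id" =!= 1).show() // keeps only the row with id = 2
    spark.stop()
  }
}
```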
c3da260 [SPARK-47665][SQL] Use SMALLINT to Write ShortType to MySQL ### What changes were proposed in this pull request? This PR uses MySQL SMALLINT to write ShortType to MySQL, as we read SMALLINT as ShortType. ### Why are the changes needed? SMALLINT is a better fit for ShortType and makes the read/write type mapping a proper round trip. ### Does this PR introduce _any_ user-facing change? yes, type mapping changes ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45789 from yaooqinn/SPARK-47665. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 02 April 2024, 04:58:05 UTC
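A minimal sketch of the write path affected by the mapping change above, assuming a hypothetical MySQL instance, table, and credentials:

```scala
// Sketch only: with SPARK-47665, a ShortType column is created as SMALLINT on
// the MySQL side, matching how SMALLINT is read back as ShortType.
import org.apache.spark.sql.SparkSession

object ShortTypeWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("mysql-short").getOrCreate()
    import spark.implicits._

    val df = Seq(1.toShort, 2.toShort).toDF("v") // ShortType column

    df.write
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/testdb") // hypothetical
      .option("dbtable", "short_values")                   // hypothetical
      .option("user", "root")
      .option("password", "example")
      .mode("overwrite")
      .save()

    spark.stop()
  }
}
```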
e983349 [SPARK-47679][SQL] Use `HiveConf.getConfVars` or Hive conf names directly ### What changes were proposed in this pull request? This PR aims to use `HiveConf.getConfVars` or Hive config names directly to be robust against Hive incompatibilities. ### Why are the changes needed? Apache Hive 4.0.0 introduced incompatible changes to the `ConfVars` enum via HIVE-27925. - https://github.com/apache/hive/pull/4919 - https://github.com/apache/hive/pull/5107 Using `HiveConf.getConfVars` or config names directly is a more robust way to handle this incompatibility. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45804 from dongjoon-hyun/SPARK-47679. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 April 2024, 04:17:56 UTC
1fd3089 [SPARK-47551][SQL] Add variant_get expression ### What changes were proposed in this pull request? This PR adds a new `VariantGet` expression that extracts variant value and casts it to a concrete type. It is exposed as two SQL expressions, `variant_get` and `try_variant_get`. If the extracted path doesn't exist in the source variant value, they should both return null. The difference is at the cast step: when the cast fails,`variant_get` should throw an exception, and `try_variant_get` should return null. The cast behavior is NOT affected by the `spark.sql.ansi.enabled` flag: `variant_get` always has the ANSI cast semantics, while `try_variant_get` always has the TRY cast semantics. An example is that casting a variant long to an int never silently overflows and produces the wrapped int value, while casting a long to an int may silently overflow in LEGACY mode. The current path extraction only supports array index access and case-sensitive object key access. Usage examples: ``` > SELECT variant_get(parse_json('{"a": 1}'), '$.a', 'int'); 1 > SELECT variant_get(parse_json('{"a": 1}'), '$.b', 'int'); NULL > SELECT variant_get(parse_json('[1, "2"]'), '$[1]', 'string'); 2 > SELECT variant_get(parse_json('[1, "2"]'), '$[2]', 'string'); NULL > SELECT variant_get(parse_json('[1, "hello"]'), '$[1]'); -- when the target type is not specified, it returns variant by default (i.e., only extracts a sub-variant without cast) "hello" > SELECT try_variant_get(parse_json('[1, "hello"]'), '$[1]', 'int'); -- "hello" cannot be cast into int NULL ``` ### How was this patch tested? Unit tests. Closes #45708 from chenhao-db/variant_get. Authored-by: Chenhao Li <chenhao.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 April 2024, 01:35:23 UTC
43cedc0 [SPARK-46812][CONNECT][PYTHON] Make mapInPandas / mapInArrow support ResourceProfile ### What changes were proposed in this pull request? Support stage-level scheduling for PySpark Connect DataFrame APIs (mapInPandas and mapInArrow). ### Why are the changes needed? https://github.com/apache/spark/pull/44852 added ResourceProfile support in mapInPandas/mapInArrow for SQL, so it is the right time to enable it for Connect. ### Does this PR introduce _any_ user-facing change? Yes, users can pass a ResourceProfile to mapInPandas/mapInArrow through the Connect PySpark client. ### How was this patch tested? Pass the CIs and manual tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45232 from wbo4958/connect-rp. Authored-by: Bobby Wang <wbo4958@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 April 2024, 00:11:23 UTC
58f44e8 [SPARK-47643][SS][PYTHON] Add pyspark test for python streaming source ### What changes were proposed in this pull request? Add a PySpark end-to-end test for the Python streaming data source in a pure Python environment. ### Why are the changes needed? Currently there are only Scala tests for the Python streaming data source, which is not sufficient. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Test change. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45768 from chaoqin-li1123/python_test. Authored-by: Chaoqin Li <chaoqin.li@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 April 2024, 00:04:43 UTC
128f74b [SPARK-47676][BUILD] Clean up the removed `VersionsSuite` references ### What changes were proposed in this pull request? This PR aims to clean up the removed `VersionsSuite` reference. ### Why are the changes needed? At Apache Spark 3.3.0, `VersionsSuite` is removed via SPARK-38036 . - https://github.com/apache/spark/pull/35335 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45800 from dongjoon-hyun/SPARK-47676. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 April 2024, 23:49:39 UTC
72fe58d [SPARK-47675][K8S][TESTS] Use AWS SDK v2 `2.23.19` in K8s IT ### What changes were proposed in this pull request? This PR aims to use AWS SDK v2 `2.23.19` in `kubernetes/integration-tests` which is consistent with SPARK-45393. ### Why are the changes needed? Since SPARK-45393, Apache Spark uses `software.amazon.awssdk:bundle:2.23.19` which is higher than the current value of `aws.java.sdk.v2.version`, `2.20.160`. https://github.com/apache/spark/blob/e86c499b008ccc96b49f3fb9343ce67ff642c204/dev/deps/spark-deps-hadoop-3-hive-2.3#L34 ### Does this PR introduce _any_ user-facing change? No. This property is used in the K8s integration test only. ### How was this patch tested? No. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45799 from dongjoon-hyun/SPARK-47675. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 April 2024, 22:13:21 UTC
e86c499 [SPARK-47674][CORE] Enable `spark.metrics.appStatusSource.enabled` by default ### What changes were proposed in this pull request? This PR aims to enable `spark.metrics.appStatusSource.enabled` by default. ### Why are the changes needed? `spark.metrics.appStatusSource.enabled` was introduced at `Apache Spark 3.0.0` and has been used usefully in the production in order to expose app status. We had better enable it by default in `Apache Spark 4.0.0`. ### Does this PR introduce _any_ user-facing change? This will expose additional metrics. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45798 from dongjoon-hyun/SPARK-47674. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 April 2024, 20:16:35 UTC
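A minimal sketch of how a user can restore the previous behavior now that the source above is on by default; the config key is the one named in the entry above:

```scala
// Sketch only: opt out of the app-status metrics source that SPARK-47674
// enables by default.
import org.apache.spark.sql.SparkSession

object DisableAppStatusMetricsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("metrics-opt-out")
      .config("spark.metrics.appStatusSource.enabled", "false") // restore the pre-4.0 default
      .getOrCreate()

    spark.range(10).count()
    spark.stop()
  }
}
```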
1481197 [SPARK-47659][CORE][TESTS] Improve `*LoggingSuite*` ### What changes were proposed in this pull request? The pr aims to improve the UTs related to structured logging, including: `LoggingSuiteBase`, `StructuredLoggingSuite` and `PatternLoggingSuite`. ### Why are the changes needed? Enhance readability and make the code more elegant. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Manual test. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45784 from panbingkun/SPARK-47659. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 01 April 2024, 17:16:49 UTC
e9f204a [SPARK-46840][SQL][TESTS] Add `CollationBenchmark` ### What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-46840 [Collation Support in Spark.docx](https://github.com/apache/spark/files/14551958/Collation.Support.in.Spark.docx) ### Why are the changes needed? Work is underway to introduce collation concept into Spark. There is a need to build out a benchmarking suite to allow engineers to address performance impact. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GHA 'Run Benchmarks' ran on this, for both JDK 17 and JDK 21 In addition, both the author and dbatomic tested locally on personal computers: `build/sbt "sql/Test/runMain org.apache.spark.sql.execution.benchmark.CollationBenchmark"` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45453 from GideonPotok/spark_46840. Authored-by: GideonPotok <g.potok4@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 April 2024, 14:29:28 UTC
72c619e [SPARK-47666][SQL] Fix NPE when reading mysql bit array as LongType ### What changes were proposed in this pull request? This PR fixes NPE when reading mysql bit array as LongType ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45790 from yaooqinn/SPARK-47666. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 01 April 2024, 10:14:55 UTC
968cba2 [SPARK-47664][PYTHON][CONNECT] Validate the column name with cached schema ### What changes were proposed in this pull request? Improve the column name validation, doing our best to avoid an extra RPC. ### Why are the changes needed? The existing validation contains two parts: 1. check whether the column name is in `self.columns` <- client-side validation; 2. if step 1 fails, validate with an additional RPC `df.select(...)` <- RPC; The client-side validation is too simple, and this PR aims to improve it to cover more cases: 1. backticks: ``` '`a`' ``` 2. nested fields: ``` 'a.b.c' ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci, added ut ### Was this patch authored or co-authored using generative AI tooling? no Closes #45788 from zhengruifeng/column_name_validate. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 01 April 2024, 09:33:30 UTC
cf02b1a [SPARK-47569][SQL] Disallow comparing variant ### What changes were proposed in this pull request? It adds type-checking rules to disallow comparing variant values (including group by a variant column). We may support comparing variant values in the future, but since we don't have a proper comparison implementation at this point, they should be disallowed on the user surface. ### How was this patch tested? Unit tests. Closes #45726 from chenhao-db/SPARK-47569. Authored-by: Chenhao Li <chenhao.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 April 2024, 07:42:02 UTC
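A short sketch of the user-visible effect of the rule above, assuming the `parse_json` expression from the variant work referenced elsewhere in this log; the exact error class raised by the new type-check rule is not reproduced here:

```scala
// Sketch only: comparing or grouping by VARIANT values is expected to be
// rejected during analysis instead of failing in a later phase.
import org.apache.spark.sql.SparkSession

object VariantComparisonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("variant-compare").getOrCreate()

    // Both statements are expected to fail analysis with a type-check error.
    try {
      spark.sql("""SELECT parse_json('{"a": 1}') = parse_json('{"a": 1}')""").show()
    } catch {
      case e: Exception => println(s"comparison rejected: ${e.getMessage.take(120)}")
    }
    try {
      spark.sql("SELECT v, count(*) FROM (SELECT parse_json('1') AS v) GROUP BY v").show()
    } catch {
      case e: Exception => println(s"group-by rejected: ${e.getMessage.take(120)}")
    }
    spark.stop()
  }
}
```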
535734e [SPARK-47431][SQL][FOLLOWUP] Do not access SQLConf in Literal.apply ### What changes were proposed in this pull request? Refactor of `literals.scala`. ### Why are the changes needed? `Literal` expression is too low-level and should not access `SQLConf`, otherwise it's hard to reason about when/where the default collation is applied. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests already available from initial PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45750 from mihailom-db/SPARK-47431-FOLLOWUP. Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 April 2024, 07:19:12 UTC
a5ca586 [SPARK-47654][INFRA] Structured logging framework: support log concatenation ### What changes were proposed in this pull request? Support the log concatenation in the structured logging framework. For example ``` log"${MDC(CONFIG, SHUFFLE_MAPOUTPUT_MIN_SIZE_FOR_BROADCAST.key)} " + log"(${MDC(MIN_SIZE, minSizeForBroadcast.toString)} bytes) " + log"must be <= spark.rpc.message.maxSize (${MDC(MAX_SIZE, maxRpcMessageSize.toString)} " + log"bytes) to prevent sending an rpc message that is too large." ``` ### Why are the changes needed? Although most of the Spark logs are short, we need this convenient syntax when handling long logs with multiple variables. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? Yes, GitHub copilot Closes #45779 from gengliangwang/logConcat. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 01 April 2024, 07:15:24 UTC
4c1405d [SPARK-47645][BUILD][CORE][SQL][YARN] Make Spark build with `-release` instead of `-target` ### What changes were proposed in this pull request? This pr makes the following changes to allow Spark to build with `-release` instead of `-target`: 1. Use `MethodHandle` instead of direct calls to `sun.security.action.GetBooleanAction` and `sun.util.calendar.ZoneInfo`, because they are not `exports` APIs. 2. Use `Channels.newReader` instead of `StreamDecoder.forDecoder`, because `StreamDecoder.forDecoder` is also not an `exports` API. ```java public static Reader newReader(ReadableByteChannel ch, CharsetDecoder dec, int minBufferCap) { Objects.requireNonNull(ch, "ch"); return StreamDecoder.forDecoder(ch, dec.reset(), minBufferCap); } ``` 3. Adjusted the import of `java.io._` in `yarn/Client.scala` to fix the compilation error: ``` Error: ] /home/runner/work/spark/spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:20: object FileSystem is not a member of package java.io ``` 4. Replaced `-target` with `-release` in `pom.xml` and `SparkBuild.scala`, and removed the `-source` option, because using `-release` is sufficient. 5. Upgrade `scala-maven-plugin` from 4.7.1 to 4.8.1 to fix the error `[ERROR] -release cannot be less than -target` when executing `build/mvn clean install -DskipTests -Djava.version=21` ### Why are the changes needed? Since Scala 2.13.9, the compile option `-target` has been deprecated; it is recommended to use `-release` instead: - https://github.com/scala/scala/pull/9982 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #45716 from LuciferYang/scala-maven-plugin-491. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 April 2024, 04:27:33 UTC
18b582c [SPARK-47662][SQL][DOCS] Add User Document for Mapping Spark SQL Data Types to MySQL ### What changes were proposed in this pull request? Following #45736, we add the User Document for Mapping Spark SQL Data Types to MySQL ### Why are the changes needed? doc improvement ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? doc build ![image](https://github.com/apache/spark/assets/8326978/e7d1aa1a-3bcf-45ad-9848-233913830d07) ### Was this patch authored or co-authored using generative AI tooling? no Closes #45787 from yaooqinn/SPARK-47662. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 April 2024, 04:19:10 UTC
ef4c27b [SPARK-47658][BUILD] Upgrade `tink` to 1.12.0 ### What changes were proposed in this pull request? The pr aims to upgrade `tink` from `1.9.0` to `1.12.0`. ### Why are the changes needed? The last update occurred 11 months ago, as follows: https://github.com/apache/spark/pull/40878. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45783 from panbingkun/SPARK-47658. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 March 2024, 22:42:04 UTC
1bc812b [SPARK-47656][BUILD] Upgrade commons-io to 2.16.0 ### What changes were proposed in this pull request? The pr aims to upgrade `commons-io` from `2.15.1` to `2.16.0`. ### Why are the changes needed? 1. 2.15.1 vs 2.16.0: https://github.com/apache/commons-io/compare/rel/commons-io-2.15.1...rel/commons-io-2.16.0 2. The new version fixes some bugs: https://github.com/apache/commons-io/pull/525 https://github.com/apache/commons-io/pull/521 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45781 from panbingkun/SPARK-47656. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 March 2024, 22:40:06 UTC
11d76c9 [SPARK-47576][INFRA] Implement logInfo API in structured logging framework ### What changes were proposed in this pull request? Implement the logInfo API in the structured logging framework. Also, revise the test case names in `PatternLoggingSuite` to make them more reasonable. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #45777 from gengliangwang/logInfo. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 29 March 2024, 20:10:40 UTC
db14be8 [SPARK-47637][SQL] Use errorCapturingIdentifier in more places ### What changes were proposed in this pull request? errorCapturingIdentifier parses identifiers that include '-' in order to raise INVALID_IDENTIFIER errors instead of SYNTAX_ERROR for non-delimited identifiers containing a hyphen. It is meant to be used wherever the context is not that of an expression. This PR switches a few identifier rules that were missed over to that rule. ### Why are the changes needed? Improve error messages for undelimited identifiers with a hyphen. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests in ErrorParserSuite.scala ### Was this patch authored or co-authored using generative AI tooling? No Closes #45764 from srielau/SPARK-47637-errorCapturingIdentifier. Authored-by: Serge Rielau <serge@rielau.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 29 March 2024, 18:54:20 UTC
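A tiny sketch of the error-message improvement being extended above, using a hypothetical hyphenated table name:

```scala
// Sketch only: an unquoted identifier containing a hyphen should surface the
// clearer INVALID_IDENTIFIER error rather than a generic syntax error; back
// quoting the identifier remains the fix.
import org.apache.spark.sql.SparkSession

object HyphenIdentifierSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("hyphen-id").getOrCreate()

    try {
      spark.sql("SELECT * FROM my-table") // hypothetical table name with a hyphen
    } catch {
      case e: Exception => println(e.getMessage.take(200)) // expected to mention INVALID_IDENTIFIER
    }
    spark.stop()
  }
}
```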
d182810 [SPARK-47575][INFRA] Implement logWarning API in structured logging framework ### What changes were proposed in this pull request? Implement logWarning API in structured logging framework. Also, refactor the logging test suites to reduce duplicated code. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45770 from gengliangwang/logWarning. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 29 March 2024, 18:13:21 UTC
48def2b [SPARK-47648][SQL][TESTS] Use `checkError()` to check exceptions in `[CSV|Json|Xml]Suite` and `[Csv|Json]FunctionsSuite` ### What changes were proposed in this pull request? The pr aims to: - use `checkError()` to check exceptions in `[CSV|Json|Xml]Suite` and `[Csv|Json]FunctionsSuite` - fix a typo in `CSVSuite`. ### Why are the changes needed? - The changes improve the error framework. - Fix a typo in `CSVSuite`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45774 from panbingkun/checkError_csv_json_xml. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 March 2024, 17:33:58 UTC
12155e8 [SPARK-47647][SQL] Make MySQL data source able to read bit(n>1) as BinaryType like Postgres ### What changes were proposed in this pull request? Make the MySQL data source able to read bit(n>1) as BinaryType, like Postgres. This appears to be unfinished work left by the original author, per the existing code comment: `// This could instead be a BinaryType if we'd rather return bit-vectors of up to 64 byte arrays instead of longs.` A property spark.sql.legacy.mysql.bitArrayMapping.enabled is added to restore the old behavior. ### Why are the changes needed? Make the behavior consistent among different JDBC data sources. ### Does this PR introduce _any_ user-facing change? yes, type mapping changes ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? Closes #45773 from yaooqinn/SPARK-47647. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 March 2024, 16:08:34 UTC
5318846 [SPARK-47641][SQL] Improve the performance for `UnaryMinus` and `Abs` ### What changes were proposed in this pull request? The pr aims to improve the `performance` for `UnaryMinus` and `Abs`. ### Why are the changes needed? We can further `improve the performance` of `UnaryMinus` and `Abs` by the following suggestions: <img width="905" alt="image" src="https://github.com/apache/spark/assets/15246973/456b142d-a15d-408e-8aad-91b53841fc16"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Manually test. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45766 from panbingkun/improve_UnaryMinus. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 March 2024, 16:05:19 UTC
5d5611e [SPARK-47642][BUILD] Exclude dependencies related to `org.junit.jupiter` and `org.junit.platform` from `jmock-junit5` ### What changes were proposed in this pull request? This pr exclude dependencies related to `org.junit.jupiter` and `org.junit.platform` from `jmock-junit5` to avoid using the wrong version of junit5 when Maven tests on `kafka-0-10` and `kafka-0-10-sql`. ### Why are the changes needed? Avoid using the wrong version of junit5 when Maven tests on `kafka-0-10` and `kafka-0-10-sql`. - https://github.com/apache/spark/actions/runs/8467689881/job/23199006729 ``` /home/runner/work/spark/spark/connector/kafka-0-10/src/test/java/org/apache/spark/streaming/kafka010/JavaDirectKafkaStreamSuite.java:51: warning: [deprecation] JavaStreamingContext in org.apache.spark.streaming.api.java has been deprecated ssc = new JavaStreamingContext(sparkConf, Durations.milliseconds(200)); ^ 5 warnings. Mar 28, 2024 1:48:00 PM org.junit.platform.launcher.core.DefaultLauncher handleThrowable WARNING: TestEngine with ID 'junit-jupiter' failed to discover tests java.lang.NoClassDefFoundError: org/junit/platform/engine/support/discovery/SelectorResolver at org.junit.jupiter.engine.JupiterTestEngine.discover(JupiterTestEngine.java:69) at org.junit.platform.launcher.core.DefaultLauncher.discoverEngineRoot(DefaultLauncher.java:177) at org.junit.platform.launcher.core.DefaultLauncher.discoverRoot(DefaultLauncher.java:164) at org.junit.platform.launcher.core.DefaultLauncher.discover(DefaultLauncher.java:120) at org.apache.maven.surefire.junitplatform.LazyLauncher.discover(LazyLauncher.java:50) at org.apache.maven.surefire.junitplatform.TestPlanScannerFilter.accept(TestPlanScannerFilter.java:52) at org.apache.maven.surefire.api.util.DefaultScanResult.applyFilter(DefaultScanResult.java:87) at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.scanClasspath(JUnitPlatformProvider.java:142) at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:122) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:385) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162) at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:507) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:495) Caused by: java.lang.ClassNotFoundException: org.junit.platform.engine.support.discovery.SelectorResolver at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) ... 13 more ... 
Warning: /home/runner/work/spark/spark/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala:524: class DefaultPartitioner in package internals is deprecated Applicable -Wconf / nowarn filters for this warning: msg=<part of the message>, cat=deprecation, site=org.apache.spark.sql.kafka010.KafkaTestUtils.producerConfiguration, origin=org.apache.kafka.clients.producer.internals.DefaultPartitioner Warning: [WARNING] 10 warnings found Mar 28, 2024 1:49:57 PM org.junit.platform.launcher.core.DefaultLauncher handleThrowable WARNING: TestEngine with ID 'junit-jupiter' failed to discover tests java.lang.NoClassDefFoundError: org/junit/platform/engine/support/discovery/SelectorResolver at org.junit.jupiter.engine.JupiterTestEngine.discover(JupiterTestEngine.java:69) at org.junit.platform.launcher.core.DefaultLauncher.discoverEngineRoot(DefaultLauncher.java:177) at org.junit.platform.launcher.core.DefaultLauncher.discoverRoot(DefaultLauncher.java:164) at org.junit.platform.launcher.core.DefaultLauncher.discover(DefaultLauncher.java:120) at org.apache.maven.surefire.junitplatform.LazyLauncher.discover(LazyLauncher.java:50) at org.apache.maven.surefire.junitplatform.TestPlanScannerFilter.accept(TestPlanScannerFilter.java:52) at org.apache.maven.surefire.api.util.DefaultScanResult.applyFilter(DefaultScanResult.java:87) at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.scanClasspath(JUnitPlatformProvider.java:142) at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:122) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:385) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162) at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:507) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:495) Caused by: java.lang.ClassNotFoundException: org.junit.platform.engine.support.discovery.SelectorResolver at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) ... 13 more ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual checked: ``` build/mvn clean install -pl connector/kafka-0-10-sql -am -DskipTests build/mvn test -pl connector/kafka-0-10,connector/kafka-0-10-sql ``` Test cases can be discovered normally, no longer exist information related to `Caused by: java.lang.ClassNotFoundException: org.junit.platform.engine.support.discovery.SelectorResolver`. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45767 from LuciferYang/SPARK-47642. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 March 2024, 15:57:48 UTC
63529bf [SPARK-47543][CONNECT][PYTHON][TESTS][FOLLOW-UP] Skip the test if pandas and PyArrow are unavailable ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/45699 that adds pandas/arrow check (and skips the test if not installed). Currently, scheduled PyPy build is broken https://github.com/apache/spark/actions/runs/8469356110/job/23208888581. ### Why are the changes needed? To make the PySpark tests runnable without optional dependencies. To fix up the scheduled build for PyPy. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45772 from HyukjinKwon/SPARK-47543. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 March 2024, 09:31:12 UTC
d709e20 [SPARK-47646][SQL] Make try_to_number return NULL for malformed input ### What changes were proposed in this pull request? This PR proposes to add NULL check after parsing the number so the output can be safely null for `try_to_number` expression. ```scala import org.apache.spark.sql.functions._ val df = spark.createDataset(spark.sparkContext.parallelize(Seq("11"))) df.select(try_to_number($"value", lit("$99.99"))).show() ``` ``` java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.types.Decimal.toPlainString()" because "<local7>" is null at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:894) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:894) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:368) at org.apache.spark.rdd.RDD.iterator(RDD.scala:332) ``` ### Why are the changes needed? To fix the bug, and let `try_to_number` return `NULL` for malformed input as designed. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug. Previously, `try_to_number` failed with NPE. ### How was this patch tested? Unittest was added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45771 from HyukjinKwon/SPARK-47646. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 March 2024, 08:38:10 UTC
609bd48 [SPARK-47572][SQL] Enforce Window partitionSpec is orderable ### What changes were proposed in this pull request? In the `Window` node, both `partitionSpec` and `orderSpec` must be orderable, but the current type check only verifies `orderSpec` is orderable. This can cause an error in later optimizing phases. Given a query: ``` with t as (select id, map(id, id) as m from range(0, 10)) select rank() over (partition by m order by id) from t ``` Before the PR, it fails with an `INTERNAL_ERROR`: ``` org.apache.spark.SparkException: [INTERNAL_ERROR] grouping/join/window partition keys cannot be map type. SQLSTATE: XX000 at org.apache.spark.SparkException$.internalError(SparkException.scala:92) at org.apache.spark.SparkException$.internalError(SparkException.scala:96) at org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.needNormalize(NormalizeFloatingNumbers.scala:103) at org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.org$apache$spark$sql$catalyst$optimizer$NormalizeFloatingNumbers$$needNormalize(NormalizeFloatingNumbers.scala:94) ... ``` After the PR, it fails with a `EXPRESSION_TYPE_IS_NOT_ORDERABLE`, which is expected: ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [EXPRESSION_TYPE_IS_NOT_ORDERABLE] Column expression "m" cannot be sorted because its type "MAP<BIGINT, BIGINT>" is not orderable. SQLSTATE: 42822; line 2 pos 53; Project [RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4] +- Project [id#1L, m#0, RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4, RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4] +- Window [rank(id#1L) windowspecdefinition(m#0, id#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4], [m#0], [id#1L ASC NULLS FIRST] +- Project [id#1L, m#0] +- SubqueryAlias t +- SubqueryAlias t +- Project [id#1L, map(id#1L, id#1L) AS m#0] +- Range (0, 10, step=1, splits=None) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52) ... ``` ### How was this patch tested? Unit test. Closes #45730 from chenhao-db/SPARK-47572. Authored-by: Chenhao Li <chenhao.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 March 2024, 08:18:46 UTC
5259dcc [SPARK-47644][PYTHON][DOCS] Refine docstrings of try_* ### What changes were proposed in this pull request? This PR refines docstring of `try_*` functions with more descriptive examples. ### Why are the changes needed? For better API reference documentation. ### Does this PR introduce _any_ user-facing change? Yes, it fixes user-facing documentation. ### How was this patch tested? Manually tested. GitHub Actions should verify them. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45769 from HyukjinKwon/SPARK-47644. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 March 2024, 07:28:34 UTC
874d033 [SPARK-47574][INFRA] Introduce Structured Logging Framework ### What changes were proposed in this pull request? Introduce Structured Logging Framework as per [SPIP: Structured Logging Framework for Apache Spark](https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing) . * The default logging output format will be json lines. For example ``` { "ts":"2023-03-12T12:02:46.661-0700", "level":"ERROR", "msg":"Cannot determine whether executor 289 is alive or not", "context":{ "executor_id":"289" }, "exception":{ "class":"org.apache.spark.SparkException", "msg":"Exception thrown in awaitResult", "stackTrace":"..." }, "source":"BlockManagerMasterEndpoint" } ``` * Introduce a new configuration `spark.log.structuredLogging.enabled` to set the default log4j configuration. It is true by default. Users can disable it to get plain text log outputs. * The change will start with the `logError` method. Example changes on the API: from ``` logError(s"Cannot determine whether executor $executorId is alive or not.", e) ``` to ``` logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, executorId)} is alive or not.", e) ``` ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. This transition will change the format of the default log output from plain text to JSON lines, making it more analyzable. ### Does this PR introduce _any_ user-facing change? Yes, the default log output format will be json lines instead of plain text. User can restore the default plain text output when disabling configuration `spark.log.structuredLogging.enabled`. If a user is a customized log4j configuration, there is no changes in the log output. ### How was this patch tested? New Unit tests ### Was this patch authored or co-authored using generative AI tooling? Yes, some of the code comments are from github copilot Closes #45729 from gengliangwang/LogInterpolator. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 29 March 2024, 05:58:51 UTC
a8b247e [SPARK-47629][INFRA] Add `common/variant` and `connector/kinesis-asl` to maven daily test module list ### What changes were proposed in this pull request? This pr adds `common/variant` and `connector/kinesis-asl` to the Maven daily test module list. ### Why are the changes needed? Keep the module list for the Maven daily tests in sync. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Monitor GA after merge ### Was this patch authored or co-authored using generative AI tooling? No Closes #45754 from LuciferYang/SPARK-47629. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 29 March 2024, 05:26:40 UTC
0b844e5 [SPARK-47568][SS] Fix race condition between maintenance thread and load/commit for snapshot files ### What changes were proposed in this pull request? This PR fixes a race condition between the maintenance thread and the task thread when change-log checkpointing is enabled, and ensures all snapshots are valid. 1. The maintenance thread currently relies on the class variable lastSnapshot to find the latest checkpoint and uploads it to DFS. This checkpoint can be modified at commit time by the task thread if a new snapshot is created. 2. The task thread was not resetting lastSnapshot at load time, which can result in newer snapshots (if an old version is loaded) being considered valid and uploaded to DFS. This results in VersionIdMismatch errors. ### Why are the changes needed? These are logical bugs which can cause `VersionIdMismatch` errors, forcing users to discard the snapshot and restart the query. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test cases. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45724 from sahnib/rocks-db-fix. Authored-by: Bhuwan Sahni <bhuwan.sahni@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 29 March 2024, 04:23:15 UTC
b314672 [SPARK-47511][SQL] Canonicalize With expressions by re-assigning IDs ### What changes were proposed in this pull request? This PR changes the canonicalization of `With` expressions to re-assign IDs starting from 0. ### Why are the changes needed? This ensures that the canonicalization of `With` expressions is consistent. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45649 from kelvinjian-db/SPARK-47511-with-canonicalization. Authored-by: Kelvin Jiang <kelvin.jiang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 March 2024, 03:09:30 UTC
a57f94b [SPARK-47638][PS][CONNECT] Skip column name validation in PS ### What changes were proposed in this pull request? Skip column name validation in PS ### Why are the changes needed? `scol_for` is an internal method, not exposed to users, so this eager validation seems unnecessary. When a bad column name is used: before this change, `scol_for` immediately fails; after this change, the `scol_for` call is silent and the failure happens later at analysis (e.g. dtypes/schema) or execution. ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #45752 from zhengruifeng/test_avoid_validation. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 29 March 2024, 02:51:56 UTC
41cbd56 [SPARK-47525][SQL] Support subquery correlation joining on map attributes ### What changes were proposed in this pull request? Currently, when a subquery is correlated on a condition like `outer_map[1] = inner_map[1]`, DecorrelateInnerQuery may generate a join on the map itself, which is unsupported, so the query cannot run - for example: ``` select * from x where (select sum(y2) from y where xm[1] = ym[1]) > 2; org.apache.spark.sql.AnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE] Unsupported subquery expression: Correlated column reference 'x.xm' cannot be map type. ``` However, if we rewrite the query to pull out the map access `outer_map[1]` into the outer plan, it succeeds. See the comments in the code at PullOutNestedDataOuterRefExpressions for more details and an example of the rewrite. ### Why are the changes needed? Enable queries to run successfully ### Does this PR introduce _any_ user-facing change? Yes, enables queries to run that previously errored. ### How was this patch tested? Add tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45673 from jchen5/subquery-nested-expr. Authored-by: Jack Chen <jack.chen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 March 2024, 02:51:37 UTC
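To illustrate the rewrite, here is a sketch in Scala that manually pulls the map access into the outer plan, assuming a `SparkSession` named `spark` and the `x`/`y` tables from the example above; the actual plan produced by `PullOutNestedDataOuterRefExpressions` may differ in its details.

```scala
// Sketch only: manually applying the rewrite that the new rule automates.
// Assumes tables x(xm MAP<INT, INT>, ...) and y(ym MAP<INT, INT>, y2 INT).

// Before: the subquery correlates on the map elements directly, which used to
// fail with UNSUPPORTED_CORRELATED_REFERENCE_DATA_TYPE.
spark.sql(
  """SELECT * FROM x
    |WHERE (SELECT sum(y2) FROM y WHERE xm[1] = ym[1]) > 2""".stripMargin)

// Conceptual rewrite: project the map access on the outer side first, so the
// correlated reference is a plain INT column (xm_1) instead of a map.
// (The helper column would not surface in the rule's final output; it is kept
// visible here only for clarity.)
spark.sql(
  """SELECT * FROM (SELECT *, xm[1] AS xm_1 FROM x) x_out
    |WHERE (SELECT sum(y2) FROM y WHERE x_out.xm_1 = ym[1]) > 2""".stripMargin)
```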
fde7a7f [SPARK-47546][SQL][FOLLOWUP] Improve validation when reading Variant from Parquet using non-vectorized reader ### What changes were proposed in this pull request? Follow-up to https://github.com/apache/spark/pull/45703, which added validation for the Parquet vectorized reader. When the vectorized reader is disabled, we do not go through the code path that was modified in that PR. This PR adds the same checks for when the parquet-mr reader is used. ### Why are the changes needed? Improved error checking. ### Does this PR introduce _any_ user-facing change? No, Variant has not yet been released. ### How was this patch tested? Modified the previous test to also run with the vectorized reader disabled. It fails without the changes in this PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45755 from cashmand/SPARK-47546-parquet-mr. Authored-by: cashmand <david.cashman@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 March 2024, 02:49:20 UTC
6cf5f95 [SPARK-47623][PYTHON][CONNECT][TESTS] Use `QuietTest` in parity tests ### What changes were proposed in this pull request? 1. add `def quiet` in both `ReusedPySparkTestCase` and `ReusedConnectTestCase`; 2. replace existing `QuietTest(self.sc)` with `self.quiet()`; 3. remove unnecessary overridden parity tests; ### Why are the changes needed? There are a lot of tests not enabled in parity tests due to the usage of `sc` in `with QuietTest(self.sc)`; some of them were broken into `def check_` functions that are called from the parity tests. However, there are still such tests not enabled. This PR aims to enable `QuietTest` within parity tests and make it easier to enable the remaining tests (`def check_` no longer needed). ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #45747 from zhengruifeng/py_test_sc. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 March 2024, 23:54:13 UTC
009b50e [SPARK-47631][SQL] Remove unused `SQLConf.parquetOutputCommitterClass` method ### What changes were proposed in this pull request? This PR aims to remove the unused `SQLConf.parquetOutputCommitterClass` method. ### Why are the changes needed? This method has never been used in Apache Spark 3.x. ### Does this PR introduce _any_ user-facing change? No. `o.a.s.sql.internal.SQLConf` is an internal package. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45757 from dongjoon-hyun/SPARK-47631. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 March 2024, 23:51:15 UTC
7f1eadc [SPARK-47366][PYTHON] Add pyspark and dataframe parse_json aliases ### What changes were proposed in this pull request? Added the `parse_json` function alias for the PySpark and DataFrame APIs. ### Why are the changes needed? Improves usability of the `parse_json` function. ### Does this PR introduce _any_ user-facing change? Before this change, the following would not be possible: ``` df.select(parse_json(df.json)).collect() ``` ### How was this patch tested? Added unit tests and manual testing. ### Was this patch authored or co-authored using generative AI tooling? no Closes #45741 from gene-db/variant-py-fns. Lead-authored-by: Gene Pang <gene.pang@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 March 2024, 23:50:44 UTC
ca00013 [SPARK-47635][K8S] Use Java `21` instead of `21-jre` image in K8s Dockerfile ### What changes were proposed in this pull request? This PR aims to use Java 21 instead of 21-jre in the K8s Dockerfile. ### Why are the changes needed? Since Apache Spark 3.5.0, SPARK-44153 started to use `jmap` like the following. https://github.com/apache/spark/blob/c832e2ac1d04668c77493577662c639785808657/core/src/main/scala/org/apache/spark/util/Utils.scala#L2030 ``` $ docker run -it --rm azul/zulu-openjdk:21-jre jmap docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "jmap": executable file not found in $PATH: unknown. ``` ``` $ docker run -it --rm azul/zulu-openjdk:21 jmap | head -n2 Usage: jmap -clstats <pid> ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45761 from dongjoon-hyun/SPARK-47635. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 March 2024, 23:33:23 UTC
c832e2a [SPARK-47492][SQL] Widen whitespace rules in lexer ### What changes were proposed in this pull request? In this PR we extend the lexer's understanding of whitespace (what separates tokens) from the ASCII <SPACE>, <LF>, <TAB>, and <CR> to the various Unicode flavors of "space" such as "narrow" and "wide". ### Why are the changes needed? SQL statements are frequently copy-pasted from various sources. Many of these sources are "rich text" and based on Unicode. When doing so, it is inevitable that non-ASCII whitespace characters are copied. Today this often results in incomprehensible syntax errors. Incomprehensible, because the error message prints the "bad" whitespace just like an ASCII whitespace. So the user stands little chance of finding the root cause unless they use editor options to highlight non-ASCII space or, by sheer luck, happen to remove the whitespace. So in this PR we acknowledge the reality and stop "discriminating" against non-ASCII whitespace. ### Does this PR introduce _any_ user-facing change? Queries that used to fail with a syntax error now succeed. ### How was this patch tested? Added a new set of unit tests in SparkSQLParserSuite ### Was this patch authored or co-authored using generative AI tooling? No Closes #45620 from srielau/SPARK-47492-Widen-whitespace-rules-in-lexer. Lead-authored-by: Serge Rielau <srielau@users.noreply.github.com> Co-authored-by: Serge Rielau <serge@rielau.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 28 March 2024, 22:51:32 UTC
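As a small illustration, assuming a `SparkSession` named `spark` and assuming the no-break space is among the newly accepted characters (this is a sketch, not one of the new SparkSQLParserSuite tests), a statement whose tokens are separated by U+00A0, as often happens after copy-pasting from rich text, now parses like any other:

```scala
// Illustrative only: the separators below are Unicode NO-BREAK SPACE (U+00A0),
// not ASCII spaces. With the widened whitespace rules the lexer treats them as
// ordinary token separators instead of raising a confusing syntax error.
val nbsp = "\u00A0"
val copiedFromRichText = s"SELECT${nbsp}1${nbsp}AS${nbsp}col"
spark.sql(copiedFromRichText).show()
// +---+
// |col|
// +---+
// |  1|
// +---+
```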