bafce5d [SPARK-48761][SQL] Introduce clusterBy DataFrameWriter API for Scala ### What changes were proposed in this pull request? Introduce a new `clusterBy` DataFrame API in Scala. This PR adds the API for both the DataFrameWriter V1 and V2, as well as Spark Connect. ### Why are the changes needed? Introduce more ways for users to interact with clustered tables. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new `clusterBy` DataFrame API in Scala to allow specifying the clustering columns when writing DataFrames. ### How was this patch tested? New unit tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47301 from zedtang/clusterby-scala-api. Authored-by: Jiaheng Tang <jiaheng.tang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 July 2024, 04:58:11 UTC
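A hedged sketch of how the new writer API might look from a spark-shell style Scala session; the table name, columns, and `df` are illustrative, the target table format must support clustering, and the signature is assumed from the PR description:

```scala
// DataFrameWriter V1: declare clustering columns when writing a table.
df.write
  .clusterBy("region", "event_date")   // assumed signature: clusterBy(colName: String, colNames: String*)
  .saveAsTable("events_clustered")

// DataFrameWriterV2 equivalent via writeTo.
df.writeTo("events_clustered_v2")
  .clusterBy("region", "event_date")
  .create()
```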
8c625ea [SPARK-48943][SQL][TESTS][FOLLOWUP] Fix the `h2` filter push-down test case failure with ANSI mode off ### What changes were proposed in this pull request? This PR aims to fix the `h2` filter push-down test case failure with ANSI mode off. ### Why are the changes needed? Fix test failure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test of the whole `JDBCV2Suite` with ANSI mode off and on. 1. Method One: with IDEA. - ANSI mode off: with `SPARK_ANSI_SQL_MODE=false` <img width="1066" alt="image" src="https://github.com/user-attachments/assets/13ec8ff4-0699-4f3e-95c4-74f53d9824fe"> - ANSI mode on: without `SPARK_ANSI_SQL_MODE` env variable <img width="1066" alt="image" src="https://github.com/user-attachments/assets/8434bf0c-b332-4663-965c-0d17d60da78a"> 2. Method Two: with commands. - ANSI mode off ``` SPARK_ANSI_SQL_MODE=false $ build/sbt > project sql > testOnly org.apache.spark.sql.jdbc.JDBCV2Suite ``` - ANSI mode on ``` UNSET SPARK_ANSI_SQL_MODE $ build/sbt > project sql > testOnly org.apache.spark.sql.jdbc.JDBCV2Suite ``` Test results: 1. The issue on current `master` branch - with `SPARK_ANSI_SQL_MODE=false`, test failed - without `SPARK_ANSI_SQL_MODE` env variable, test passed 2. Fixed with new test code - with `SPARK_ANSI_SQL_MODE=false`, test passed - without `SPARK_ANSI_SQL_MODE` env variable, test passed ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47472 from wayneguow/fix_h2. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 July 2024, 00:53:32 UTC
5d787e2 [SPARK-47764][FOLLOW-UP] Change to use ShuffleDriverComponents.removeShuffle to remove shuffle properly ### What changes were proposed in this pull request? This is a follow-up for https://github.com/apache/spark/pull/45930, where we introduced ShuffleCleanupMode and implemented cleaning up of shuffle dependencies. There was a bug where `ShuffleManager.unregisterShuffle` was used on Driver, and in non-local mode it is not effective at all. This change fixed the bug by changing to use `ShuffleDriverComponents.removeShuffle` instead. ### Why are the changes needed? This is to address the comments in https://github.com/apache/spark/pull/45930#discussion_r1584223064 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated unit tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46302 from bozhang2820/spark-47764-1. Authored-by: Bo Zhang <bo.zhang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 July 2024, 23:49:15 UTC
34e65a8 [SPARK-48990][SQL] Unified variable related SQL syntax keywords ### What changes were proposed in this pull request? The pr aims to unify `variable`-related `SQL syntax` keywords, enabling the syntax `DECLARE (OR REPLACE)? ...` and `DROP TEMPORARY ...` to support the keyword `VAR` (not only `VARIABLE`). ### Why are the changes needed? When `setting` variables, we support `(VARIABLE | VAR)`, but when `declaring` and `dropping` variables, we only support the keyword `VARIABLE` (not the keyword `VAR`) <img width="597" alt="image" src="https://github.com/user-attachments/assets/07084fef-4080-4410-a74c-e6001ae0a8fa"> https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L68-L72 https://github.com/apache/spark/blob/285489b0225004e918b6e937f7367e492292815e/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4#L218-L220 The syntax seems a bit weird and gives end-users an inconsistent experience with variable-related SQL, so I propose to `unify` it. ### Does this PR introduce _any_ user-facing change? Yes, it enables end-users to use variable-related SQL with `consistent` keywords. ### How was this patch tested? Updated existing UTs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47469 from panbingkun/SPARK-48990. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 July 2024, 14:52:04 UTC
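A minimal sketch of the unified keywords, run through `spark.sql` from Scala; the statements follow the grammar described above and assume a Spark build with SQL session variables enabled:

```scala
spark.sql("DECLARE OR REPLACE VAR my_var INT DEFAULT 0")  // VAR is now accepted here, not only VARIABLE
spark.sql("SET VAR my_var = 42")                          // SET already accepted both keywords
spark.sql("DROP TEMPORARY VAR my_var")                    // DROP now accepts VAR as well
```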
d68cde8 [SPARK-48991][SQL] Move path initialization into try-catch block in FileStreamSink.hasMetadata ### What changes were proposed in this pull request? This pull request proposes to move path initialization into the try-catch block in FileStreamSink.hasMetadata. Then, exceptions from invalid paths can be handled consistently like other path-related exceptions in the current try-catch block. As a result, the errors fall into the correct code branches to be handled ### Why are the changes needed? bugfix for improperly handled exceptions in FileStreamSink.hasMetadata ### Does this PR introduce _any_ user-facing change? no, an invalid path is still invalid, but fails in the correct places ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #47471 from yaooqinn/SPARK-48991. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 24 July 2024, 11:47:29 UTC
8597b78 [SPARK-48988][ML] Make `DefaultParamsReader/Writer` handle metadata with spark session ### What changes were proposed in this pull request? `DefaultParamsReader/Writer` handle metadata with spark session ### Why are the changes needed? In existing ML implementations, when loading/saving a model, it loads/saves the metadata with `SparkContext` then loads/saves the coefficients with `SparkSession`. This PR aims to also load/save the metadata with `SparkSession`, by introducing new helper functions. - Note I: third-party libraries (e.g. [xgboost](https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/org/apache/spark/ml/util/XGBoostReadWrite.scala#L38-L53)) likely depend on the existing implementation of saveMetadata/loadMetadata, so we cannot simply remove them even though they are `private[ml]`. - Note II: this PR only handles `loadMetadata` and `saveMetadata`; there are similar cases for meta algorithms and param read/write, but I want to ignore the remaining part first, to avoid touching too many files in a single PR. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47467 from zhengruifeng/ml_load_with_spark. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 July 2024, 10:13:32 UTC
0c9b072 [SPARK-48833][SQL][VARIANT] Support variant in `InMemoryTableScan` ### What changes were proposed in this pull request? Adds support for the variant type in `InMemoryTableScan`, i.e. `df.cache()`, by supporting writing variant values to an in-memory buffer. ### Why are the changes needed? Prior to this PR, calling `df.cache()` on a df that has a variant would fail because `InMemoryTableScan` does not support reading variant types. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? added UTs ### Was this patch authored or co-authored using generative AI tooling? no Closes #47252 from richardc-db/variant_dfcache_support. Authored-by: Richard Chen <r.chen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 July 2024, 10:02:38 UTC
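A hedged Scala sketch of the newly supported pattern, assuming a session where `parse_json` produces a variant column:

```scala
// Caching a DataFrame with a variant column; before this change df.cache() failed
// because InMemoryTableScan could not read variant values back from the in-memory buffer.
val df = spark.sql("""SELECT parse_json('{"a": 1, "b": "two"}') AS v""")
df.cache()
df.count()   // materializes the cache, now including the variant column
```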
22eb6c4 [SPARK-48567][SS][FOLLOWUP] StreamingQuery.lastProgress should return the actual StreamingQueryProgress This reverts commit d067fc6c1635dfe7730223021e912e78637bb791, which reverted 042804ad545c88afe69c149b25baea00fc213708, essentially bringing it back. 042804ad545c88afe69c149b25baea00fc213708 failed the 3.5 client <> 4.0 server test, but it was decided to turn that test off for cross-version testing in https://github.com/apache/spark/pull/47468 ### What changes were proposed in this pull request? This PR is created after discussion in this closed one: https://github.com/apache/spark/pull/46886 I was trying to fix a bug (in connect, query.lastProgress doesn't have `numInputRows`, `inputRowsPerSecond`, and `processedRowsPerSecond`), and we reached the conclusion that what is proposed in this PR should be the ultimate fix. In python, for both classic spark and spark connect, the return type of `lastProgress` is `Dict` (and `recentProgress` is `List[Dict]`), but in scala it's the actual `StreamingQueryProgress` object: https://github.com/apache/spark/blob/1a5d22aa2ffe769435be4aa6102ef961c55b9593/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala#L94-L101 This API discrepancy brings some confusion; in Scala, users can do `query.lastProgress.batchId`, while in Python they have to do `query.lastProgress["batchId"]`. This PR makes `StreamingQuery.lastProgress` return the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` return `List[StreamingQueryProgress]`). To prevent a breaking change, we extend `StreamingQueryProgress` to be a subclass of `dict`, so existing code accessing it using dictionary methods (e.g. `query.lastProgress["id"]`) is still functional. ### Why are the changes needed? API parity ### Does this PR introduce _any_ user-facing change? Yes, now `StreamingQuery.lastProgress` returns the actual `StreamingQueryProgress` (and `StreamingQuery.recentProgress` returns `List[StreamingQueryProgress]`). ### How was this patch tested? Added unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #47470 from WweiL/bring-back-lastProgress. Authored-by: Wei Liu <wei.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 July 2024, 09:57:45 UTC
239d77b [SPARK-48338][SQL] Check variable declarations ### What changes were proposed in this pull request? Checking whether variable declarations appear only at the beginning of the BEGIN END block. ### Why are the changes needed? The SQL standard states that variables can be declared only immediately after BEGIN. ### Does this PR introduce _any_ user-facing change? Users will get an error if they try to declare a variable in a scope that is not started with BEGIN and ended with END, or if the declarations are not immediately after BEGIN. ### How was this patch tested? Tests are in SqlScriptingParserSuite. There are 2 tests for now: one where declarations are correctly written and one where declarations are not immediately after BEGIN. There is a TODO to write a test where the declaration is located in a scope that is not BEGIN END. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47404 from momcilomrk-db/check_variable_declarations. Authored-by: Momcilo Mrkaic <momcilo.mrkaic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 July 2024, 07:19:08 UTC
4de4ed1 [SPARK-48935][SQL][TESTS] Make `checkEvaluation` directly check the `Collation` expression itself in UT ### What changes were proposed in this pull request? The pr aims to: - make `checkEvaluation` directly check the `Collation` expression itself in UT, rather than `Collation(...).replacement`. - fix a missing check in a UT. ### Why are the changes needed? When checking a `RuntimeReplaceable` expression in a UT, there is no need to write `checkEvaluation(Collation(Literal("abc")).replacement, "UTF8_BINARY")`, because it has already undergone a similar replacement internally. https://github.com/apache/spark/blob/1a428c1606645057ef94ac8a6cadbb947b9208a6/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala#L75 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Updated existing UTs. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47401 from panbingkun/SPARK-48935. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 July 2024, 06:44:11 UTC
090ad9f [SPARK-48961][PYTHON] Make the parameter naming of `PySparkException` consistent with JVM ### What changes were proposed in this pull request? This PR proposes to make the parameter naming of `PySparkException` consistent with JVM ### Why are the changes needed? The parameter names of `PySparkException` are different from `SparkException` so there is an inconsistency when searching those parameters from error logs. SparkException: https://github.com/apache/spark/blob/6508b1f5e18731359354af0a7bcc1382bc4f356b/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L27-L33 PySparkException: https://github.com/apache/spark/blob/6508b1f5e18731359354af0a7bcc1382bc4f356b/python/pyspark/errors/exceptions/base.py#L29-L40 ### Does this PR introduce _any_ user-facing change? The error parameter names are changed from: - `error_class` -> `errorClass` - `message_parameters` -> `messageParameters` - `query_contexts` -> `context` ### How was this patch tested? The existing CI should pass ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47436 from itholic/SPARK-48961. Authored-by: Haejoon Lee <haejoon.lee@databricks.com> Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com> 24 July 2024, 05:35:53 UTC
4e20a2a [SPARK-48931][SS] Reduce Cloud Store List API cost for state store maintenance task ### What changes were proposed in this pull request? Currently, during the state store maintenance process, we find which old version files of the **RocksDB** state store to delete by listing all existing snapshotted version files in the checkpoint directory every 1 minute by default. The frequent list calls in the cloud can result in high costs. To address this concern and reduce the cost associated with state store maintenance, we should aim to minimize the frequency of listing object stores inside the maintenance task. To minimize the frequency, we will try to accumulate versions to delete and only call list when the number of versions to delete reaches a configured threshold. The changes include: 1. Adding new configuration variable `ratioExtraVersionsAllowedInCheckpoint` in **SQLConf**. This determines the ratio of extra versions files we want to retain in the checkpoint directory compared to number of versions to retain for rollbacks (`minBatchesToRetain`). 2. Using this config and `minBatchesToRetain`, set `minVersionsToDelete` config inside **StateStoreConf** to `minVersionsToDelete = ratioExtraVersionsAllowedInCheckpoint * minBatchesToRetain.` 3. Using `minSeenVersion` and `maxSeenVersion` variables in **RocksDBFileManager** to estimate min/max version present in directory and control deletion frequency. This is done by ensuring number of stale versions to delete is at least `minVersionsToDelete` ### Why are the changes needed? Currently, maintenance operations like snapshotting, purging, deletion, and file management is done asynchronously for each data partition. We want to shift away periodic deletion and instead rely on the estimated number of files in the checkpoint directory to reduce list calls and introduce batch deletion. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47393 from riyaverm-db/reduce-cloud-store-list-api-cost-in-maintenance. Authored-by: Riya Verma <riya.verma@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 24 July 2024, 05:03:17 UTC
f828146 [SPARK-48975][PROTOBUF] Remove unnecessary `ScalaReflectionLock` definition from `protobuf` ### What changes were proposed in this pull request? This PR removes the unused object definition `ScalaReflectionLock` from the `protobuf` module. `ScalaReflectionLock` is a definition at the access scope of `protobuf` package, which was defined in SPARK-40654 | https://github.com/apache/spark/pull/37972 and become unused in SPARK-41639 | https://github.com/apache/spark/pull/39147. ### Why are the changes needed? Clean up unused definitions. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #47459 from LuciferYang/remove-ScalaReflectionLock. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 July 2024, 04:22:19 UTC
881b214 [SPARK-48976][SQL][DOCS] Improve the docs related to `variable` ### What changes were proposed in this pull request? The pr aims to improve the docs related to `variable`, includes: - `docs/sql-ref-syntax-aux-set-var.md`, show the `primitive` error messages. - `docs/sql-ref-syntax-ddl-declare-variable.md`, add usage of `DECLARE OR REPLACE`. - `docs/sql-ref-syntax-ddl-drop-variable.md`, show the `primitive` error messages and fix `typo`. ### Why are the changes needed? Only improve docs. ### Does this PR introduce _any_ user-facing change? Yes, make end-user docs clearer. ### How was this patch tested? Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47460 from panbingkun/SPARK-48976. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 July 2024, 03:21:33 UTC
3fdad6a [SPARK-48981] Fix simpleString method of StringType in pyspark for collations ### What changes were proposed in this pull request? Fixing a bug where, because of the different way string interpolation works in Python, we had an accidental dollar sign in the string value. ### Why are the changes needed? To be consistent with the Scala code. ### Does this PR introduce _any_ user-facing change? A different value will be shown to the user. ### How was this patch tested? Unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47463 from stefankandic/fixPythonToString. Authored-by: Stefan Kandic <stefan.kandic@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 July 2024, 03:20:56 UTC
b9447db [SPARK-48987][INFRA] Make `curl` retry 3 times in `bin/mvn` ### What changes were proposed in this pull request? The pr aims to make `curl` retry `3 times` in `bin/mvn`. ### Why are the changes needed? Avoid the following issues: https://github.com/panbingkun/spark/actions/runs/10067831390/job/27832101470 <img width="993" alt="image" src="https://github.com/user-attachments/assets/3fa9a59a-82cb-4e99-b9f7-d128f051d340"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Continuous manual observation. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47465 from panbingkun/SPARK-48987. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 July 2024, 03:19:06 UTC
877c3f2 [SPARK-48974][SQL][SS][ML][MLLIB] Use `SparkSession.implicits` instead of `SQLContext.implicits` ### What changes were proposed in this pull request? This PR replaces `SQLContext.implicits` with `SparkSession.implicits` in the Spark codebase. ### Why are the changes needed? Reduce the usage of code from `SQLContext` within the internal code of Spark. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #47457 from LuciferYang/use-sparksession-implicits. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 24 July 2024, 02:41:07 UTC
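An illustration of the substitution (not the actual Spark-internal diff), assuming a spark-shell style `spark` session:

```scala
// Old style, via the legacy SQLContext handle:
//   import spark.sqlContext.implicits._
// New style, importing the equivalent implicits from SparkSession directly:
import spark.implicits._

val ds = Seq(1, 2, 3).toDS()   // toDS/toDF come from the imported implicits either way
```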
fdcf975 [SPARK-48414][PYTHON] Fix breaking change in python's `fromJson` ### What changes were proposed in this pull request? Fix breaking change in `fromJson` method by having default param values. ### Why are the changes needed? In order to not break clients that don't care about collations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46737 from stefankandic/fromJsonBreakingChange. Authored-by: Stefan Kandic <stefan.kandic@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 July 2024, 01:43:38 UTC
118167f [SPARK-48928] Log Warning for Calling .unpersist() on Locally Checkpointed RDDs ### What changes were proposed in this pull request? This pull request proposes logging a warning message when the `.unpersist()` method is called on RDDs that have been locally checkpointed. The goal is to inform users about the potential risks associated with unpersisting locally checkpointed RDDs without changing the current behavior of the method. ### Why are the changes needed? Local checkpointing truncates the lineage of an RDD, preventing it from being recomputed from its source. If a locally checkpointed RDD is unpersisted, it loses its data and cannot be regenerated, potentially leading to job failures if subsequent actions or transformations are attempted on the RDD (which was seen on some user workloads). Logging a warning message helps users avoid such pitfalls and aids in debugging. ### Does this PR introduce _any_ user-facing change? Yes, this PR adds a warning log message when .unpersist() is called on a locally checkpointed RDD, but it does not alter any existing behavior. ### How was this patch tested? This PR does not change any existing behavior and therefore no tests are added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47391 from mingkangli-db/warning_unpersist. Authored-by: Mingkang Li <mingkang.li@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> 23 July 2024, 19:41:47 UTC
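A short Scala sketch of the pitfall the new warning targets, assuming a spark-shell style `sc`:

```scala
val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.localCheckpoint()   // truncates the lineage; the data lives only in executor block storage
rdd.count()             // materializes the local checkpoint

rdd.unpersist()         // now logs a warning: the checkpointed data is dropped and cannot be
rdd.count()             // recomputed from the truncated lineage, so later actions may fail
```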
c69f02e [SPARK-48752][FOLLOWUP][PYTHON][DOCS] Use explicit name for line number in log ### What changes were proposed in this pull request? This PR follows up on https://github.com/apache/spark/pull/47145 to rename the log field. ### Why are the changes needed? `line_no` is not very intuitive, so we rename it to `line_number` explicitly. ### Does this PR introduce _any_ user-facing change? No API change, but the user-facing log message will be improved ### How was this patch tested? The existing CI should pass ### Was this patch authored or co-authored using generative AI tooling? No Closes #47437 from itholic/logger_followup. Authored-by: Haejoon Lee <haejoon.lee@databricks.com> Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com> 23 July 2024, 18:04:21 UTC
fba4c8c [SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in spark ML reader/writer ### What changes were proposed in this pull request? `SparkSession.getActiveSession` is thread-local session, but spark ML reader / writer might be executed in different threads which causes `SparkSession.getActiveSession` returning None. ### Why are the changes needed? It fixes the bug like: ``` spark = SparkSession.getActiveSession() > spark.createDataFrame( # type: ignore[union-attr] [(metadataJson,)], schema=["value"] ).coalesce(1).write.text(metadataPath) E AttributeError: 'NoneType' object has no attribute 'createDataFrame' ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47453 from WeichenXu123/SPARK-48970. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 23 July 2024, 11:19:28 UTC
285489b [SPARK-48957][SS] Return sub-classified error class on state store load for hdfs and rocksdb provider ### What changes were proposed in this pull request? Return sub-classified error class on state store load for hdfs and rocksdb provider ### Why are the changes needed? Without the change, all the higher level functions were seeing the exception and error class as `CANNOT_LOAD_STATE_STORE.UNCATEGORIZED` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Augmented unit tests ``` ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.common... [info] Run completed in 4 minutes, 12 seconds. [info] Total number of tests run: 176 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 176, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47431 from anishshri-db/task/SPARK-48957. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 23 July 2024, 08:18:52 UTC
71f3f1f [SPARK-48972][PYTHON] Unify the literal string handling in functions ### What changes were proposed in this pull request? Unify the literal string handling in functions ### Why are the changes needed? internal refactoring, to make code simple ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47454 from zhengruifeng/py_func_cleanup. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 23 July 2024, 04:12:06 UTC
a809740 [SPARK-48893][SQL][PYTHON][DOCS] Add some examples for `linearRegression` built-in functions ### What changes were proposed in this pull request? This PR aims to add some extra examples for `linearRegression` built-in functions. ### Why are the changes needed? - Align the use examples for this series of functions. - Allow users to better understand the usage of `linearRegression` related methods from sql built-in functions docs(https://spark.apache.org/docs/latest/api/sql/index.html). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA and Manual testing for new examples. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47343 from wayneguow/regr_series. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 23 July 2024, 02:27:21 UTC
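A few of the `regr_*` built-in functions covered by these docs, exercised via `spark.sql` from Scala with illustrative values:

```scala
spark.sql("""
  SELECT
    regr_slope(y, x)     AS slope,
    regr_intercept(y, x) AS intercept,
    regr_r2(y, x)        AS r2,
    regr_count(y, x)     AS n
  FROM VALUES (1.0, 1.0), (2.0, 2.1), (3.0, 2.9) AS t(x, y)
""").show()
```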
4cc41ea [SPARK-48943][TESTS] Upgrade `h2` to 2.3.230 and enhance the test coverage of behavior changes of `asin` and `acos` complying Standard SQL ### What changes were proposed in this pull request? This PR aims to upgrade `h2` from 2.2.220 to 2.3.230 and enhance the test coverage of behavior changes of `asin` and `acos` complying with Standard SQL. The details of the behavior changes are as follows: After this commit (https://github.com/h2database/h2database/commit/186647d4a35d05681febf4f53502b306aa6d511a), the behavior of `asin` and `acos` changed in h2, complying with Standard SQL and throwing exceptions directly when the argument is invalid (< -1d || > 1d). ### Why are the changes needed? 2.3.230 is the latest version of `h2`; there are a lot of bug fixes and improvements. Full change notes: https://www.h2database.com/html/changelog.html ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated an existing test case and added a new test case. Pass GA and manually test `JDBCV2Suite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47414 from wayneguow/upgrade_h2. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 23 July 2024, 02:21:32 UTC
0b6cb3e [SPARK-48914][SQL][TESTS] Add OFFSET operator as an option in the subquery generator ### What changes were proposed in this pull request? This adds the OFFSET operator to the subquery generator suite. ### Why are the changes needed? Completes the subquery generator functionality ### Does this PR introduce _any_ user-facing change? Previously, no subqueries with the OFFSET operator were tested; now the OFFSET operator is covered. ### How was this patch tested? Query test ### Was this patch authored or co-authored using generative AI tooling? No Closes #47375 from averyqi-db/offset_operator. Authored-by: Avery Qi <avery.qi@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 July 2024, 01:36:30 UTC
f70d41a [SPARK-48703][SQL][TESTS] Upgrade `mssql-jdbc` to 12.6.3.jre11 ### What changes were proposed in this pull request? The pr aims to upgrade `mssql-jdbc` from `12.6.2.jre11` to `12.6.3.jre11` ### Why are the changes needed? https://github.com/microsoft/mssql-jdbc/releases/tag/v12.6.3 Hotfix & Stable Release: - Fixed issue where TokenCredential class was required to be imported https://github.com/microsoft/mssql-jdbc/pull/2453 - Fixed timestamp string conversion regression https://github.com/microsoft/mssql-jdbc/pull/2455 - Fixed SQLServerCallableStatement default value regression https://github.com/microsoft/mssql-jdbc/pull/2456 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47075 from panbingkun/SPARK-48703. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 July 2024, 22:56:32 UTC
3b07315 [MINOR][INFRA] Add more know translations for contributors ### What changes were proposed in this pull request? Recognized these contributor translations ```diff +Yikf - Kaifei Yi +jackylee-ch - Junqing Li +liujiayi771 - Jiayi Liu +maheshk114 - Mahesh Kumar Behera +panbingkun - BingKun Pan +wForget - Zhen Wang +yaooqinn - Kent Yao +zhipengmao-db - Zhipeng Mao ``` ### Why are the changes needed? improvement for infra tooling ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Using translate-contributors.py ### Was this patch authored or co-authored using generative AI tooling? no Closes #47441 from yaooqinn/minor3. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 July 2024, 19:26:07 UTC
4772602 [SPARK-48929] Fix view internal error and clean up parser exception context ### What changes were proposed in this pull request? Replace the INVALID_VIEW_TEXT (SQLSTATE: XX000) error condition, which is raised when view expansion fails to parse, with the underlying error. The reason is that the underlying error has important information AND it is much more likely that view expansion fails because of a behavior change in the parser than some nefarious action or simple corruption. For example, the recent removal of `!` and `NOT` equivalence fails old views on Spark 4.0 if they exploit the unsupported syntax. As part of this change we also include the query context summary in the ParserException processing. This will expose the view name alongside the view query error message. ### Why are the changes needed? Better error handling, improving the ability of customers to do root cause analysis. ### Does this PR introduce _any_ user-facing change? We are now giving more error context in case of parser errors, and specifically in case of parser errors in view expansion ### How was this patch tested? Existing QA ### Was this patch authored or co-authored using generative AI tooling? No Closes #47405 from srielau/SPARK-48929-Fix-view-intrenal-error. Authored-by: Serge Rielau <serge@rielau.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 22 July 2024, 18:24:28 UTC
bc7a5c0 [SPARK-48958][BUILD] Upgrade `zstd-jni` to 1.5.6-4 ### What changes were proposed in this pull request? The pr aims to upgrade `zstd-jni` from `1.5.6-3` to `1.5.6-4`. ### Why are the changes needed? 1.v1.5.6-3 VS v1.5.6-4 https://github.com/luben/zstd-jni/compare/v1.5.6-3...v1.5.6-4 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47432 from panbingkun/SPARK-48958. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 July 2024, 16:01:12 UTC
c17fc5a [SPARK-48963][INFRA] Support `JIRA_ACCESS_TOKEN` in translate-contributors.py ### What changes were proposed in this pull request? Support JIRA_ACCESS_TOKEN in translate-contributors.py ### Why are the changes needed? Remove plaintext password in JIRA_PASSWORD environment variable to prevent password leakage ### Does this PR introduce _any_ user-facing change? no, infra only ### How was this patch tested? Ran translate-contributors.py with 3.5.2 RC ### Was this patch authored or co-authored using generative AI tooling? no Closes #47440 from yaooqinn/SPARK-48963. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 July 2024, 15:59:12 UTC
3af2ea9 [SPARK-48959][SQL] Make `NoSuchNamespaceException` extend `NoSuchDatabaseException` to restore the exception handling ### What changes were proposed in this pull request? Make `NoSuchNamespaceException` extend `NoSuchDatabaseException` ### Why are the changes needed? 1, https://github.com/apache/spark/pull/47276 made many SQL commands throw `NoSuchNamespaceException` instead of `NoSuchDatabaseException`; it is more than an end-user facing change, it is a breaking change which breaks the exception handling in third-party libraries in the ecosystem. 2, `NoSuchNamespaceException` and `NoSuchDatabaseException` actually share the same error class `SCHEMA_NOT_FOUND` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47433 from zhengruifeng/make_nons_nodb. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 July 2024, 14:14:30 UTC
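A hedged Scala sketch of why the subclassing matters for downstream exception handling; the triggering command is an assumption, the point is the catch clause:

```scala
import org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException

try {
  spark.sql("DESCRIBE NAMESPACE missing_ns").collect()
} catch {
  // Handlers written against NoSuchDatabaseException keep working even if the command
  // now throws NoSuchNamespaceException, because the latter extends the former again.
  case e: NoSuchDatabaseException => println(s"schema not found: ${e.getMessage}")
}
```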
be1afd5 [SPARK-48941][PYTHON][ML] Replace RDD read / write API invocation with Dataframe read / write API ### What changes were proposed in this pull request? PySpark ML: Replace RDD read / write API invocation with DataFrame read / write API ### Why are the changes needed? Follow-up of https://github.com/apache/spark/pull/47341 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47411 from WeichenXu123/SPARK-48909-follow-up. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 22 July 2024, 13:18:54 UTC
8816e9c [SPARK-48962][INFRA] Make the input parameters of `workflows/benchmark` selectable ### What changes were proposed in this pull request? The pr aims to make the `input parameters` of `workflows/benchmark` selectable. ### Why are the changes needed? - Before: <img width="311" alt="image" src="https://github.com/user-attachments/assets/da93ea8f-8791-4816-a5d9-f82c018fa819"> - After: https://github.com/panbingkun/spark/actions/workflows/benchmark.yml <img width="318" alt="image" src="https://github.com/user-attachments/assets/0b9b01a0-96f6-4630-98d9-7d2709aafcd0"> ### Does this PR introduce _any_ user-facing change? Yes, it is convenient for developers to run `workflows/benchmark`, transforming input values from free text only to selectable values. ### How was this patch tested? Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47438 from panbingkun/improve_workflow_dispatch. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 July 2024, 09:36:57 UTC
42d1479 [MINOR][DOCS] Fix some typos in `LZFBenchmark` ### What changes were proposed in this pull request? This PR aims to fix some typos in `LZFBenchmark`. ### Why are the changes needed? Fix typos and avoid confusion. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47435 from wayneguow/lzf. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 July 2024, 07:07:10 UTC
6508b1f [MINOR][PYTHON] Fix type hint for `from_utc_timestamp` and `to_utc_timestamp` ### What changes were proposed in this pull request? Fix type hint for `from_utc_timestamp` and `to_utc_timestamp` ### Why are the changes needed? the str type input should be treated as literal string, instead of column name ### Does this PR introduce _any_ user-facing change? doc change ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47429 from zhengruifeng/py_fix_hint_202407. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 July 2024, 23:53:31 UTC
df510ae [SPARK-48955][SQL] `ArrayCompact`'s datatype should be `containsNull = false` ### What changes were proposed in this pull request? `ArrayCompact`'s datatype should be `containsNull = false` ### Why are the changes needed? `ArrayCompact` - Removes null values from the array ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added test before: ``` scala> val df = spark.range(1).select(lit(Array(1,2,3)).alias("a")) val df: org.apache.spark.sql.DataFrame = [a: array<int>] scala> df.printSchema warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation` root |-- a: array (nullable = false) | |-- element: integer (containsNull = true) scala> df.select(array_compact(col("a"))).printSchema warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation` root |-- array_compact(a): array (nullable = false) | |-- element: integer (containsNull = true) ``` after ``` scala> df.select(array_compact(col("a"))).printSchema warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation` root |-- array_compact(a): array (nullable = false) | |-- element: integer (containsNull = false) ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47430 from zhengruifeng/sql_array_compact_data_type. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 July 2024, 23:52:55 UTC
2fb3c35 [SPARK-48891][SS] Refactor StateSchemaCompatibilityChecker to unify all state schema formats ### What changes were proposed in this pull request? Refactor StateSchemaCompatibilityChecker to unify all state schema formats ### Why are the changes needed? Needed to integrate future changes around state data source reader and schema evolution and consolidate these changes - Consolidates all state schema reader/writers in one place - Consolidates all validation logic through the same API ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests ``` 12:38:45.481 WARN org.apache.spark.sql.execution.streaming.state.StateSchemaCompatibilityCheckerSuite: ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.StateSchemaCompatibilityCheckerSuite, threads: rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), ForkJoinPool.commonPool-worker-2 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true) ===== [info] Run completed in 12 seconds, 565 milliseconds. [info] Total number of tests run: 30 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 30, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47359 from anishshri-db/task/SPARK-48891. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 21 July 2024, 23:23:10 UTC
15c98e0 [SPARK-48954] try_mod() replaces try_remainder() ### What changes were proposed in this pull request? For consistency, try_remainder() is renamed to try_mod(). This is Spark 4.0.0 only, so no config is needed. ### Why are the changes needed? To keep consistent naming. ### Does this PR introduce _any_ user-facing change? Yes, replaces try_remainder() with try_mod() ### How was this patch tested? Existing try_remainder() tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #47427 from srielau/SPARK-48954-try-mod. Authored-by: Serge Rielau <serge@rielau.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 21 July 2024, 08:47:46 UTC
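A minimal sketch of the renamed function via `spark.sql`; behaviour is assumed to match the old `try_remainder`:

```scala
spark.sql("SELECT try_mod(7, 3)").show()   // 1
spark.sql("SELECT try_mod(7, 0)").show()   // NULL instead of a divide-by-zero error under ANSI mode
```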
41f67dd [SPARK-48592][INFRA] Add structured logging style script and GitHub workflow ### What changes were proposed in this pull request? This PR checks for Scala logging messages using logInfo, logWarning, logError and containing variables without MDC wrapper Example error output: ``` [error] spark/mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:225:4 [error] Logging message should use Structured Logging Framework style, such as log"...${MDC(TASK_ID, taskId)..." Refer to the guidelines in the file `internal/Logging.scala`. ``` ### Why are the changes needed? This makes development and PR review of the structured logging migration easier. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test, verified it will throw errors on invalid logging messages. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47239 from asl3/structuredlogstylescript. Authored-by: Amanda Liu <amanda.liu@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 20 July 2024, 17:00:40 UTC
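A hedged sketch of the logging style the script enforces; the MDC wrapper follows the error message above, while the exact import paths and log key name are assumptions that may differ from the real `LogKeys` definitions:

```scala
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKeys.TASK_ID

class MyComponent extends Logging {
  def finish(taskId: Long): Unit = {
    // Plain interpolation such as s"Task $taskId finished" would fail the new style check;
    // variables must be wrapped in MDC so they are emitted as structured fields.
    logInfo(log"Task ${MDC(TASK_ID, taskId)} finished")
  }
}
```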
5785098 [SPARK-48938][PYTHON] Improve error messages when registering Python UDTFs ### What changes were proposed in this pull request? This PR improves the error messages when registering Python UDTFs. Before this PR: ```python class TestUDTF: ... spark.udtf.register("test_udtf", TestUDTF) ``` This fails with ``` AttributeError: type object "TestUDTF" has no attribute "evalType" ``` After this PR: ```python spark.udtf.register("test_udtf", TestUDTF) ``` Now we have a nicer error: ``` [CANNOT_REGISTER_UDTF] Cannot register the UDTF 'test_udtf': expected a 'UserDefinedTableFunction'. Please make sure the UDTF is correctly defined as a class, and then either wrap it in the `udtf()` function or annotate it with `udtf(...)`.` ``` ### Why are the changes needed? To improve usability. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing and new unit tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47408 from allisonwang-db/spark-48938-udtf-register-err-msg. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 July 2024, 00:54:06 UTC
1632568 [SPARK-48944][CONNECT] Unify the JSON-format schema handling in Connect Server ### What changes were proposed in this pull request? Simplify the JSON-format schema handling in Connect Server, by introducing a helper function `extractDataTypeFromJSON` ### Why are the changes needed? to unify the schema handling ### Does this PR introduce _any_ user-facing change? No, minor refactoring ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47415 from zhengruifeng/simplfy_from_json. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 July 2024, 00:53:25 UTC
69d0c6a [SPARK-48945][PYTHON] Simplify regex functions with `lit` ### What changes were proposed in this pull request? Simplify a group of function with `lit` ### Why are the changes needed? code clean up, these branchings are not necessary ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47417 from zhengruifeng/py_func_simplity_lit. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 July 2024, 00:53:04 UTC
c629dc0 [SPARK-48836] Integrate SQL schema with state schema/metadata ### What changes were proposed in this pull request? This PR makes the TWS operator write the "real" schema of the state variables, which is initialized on the executors, into the `StateSchemaV3` file written by the driver. We'll integrate the SQL schema of the state variables with this [StateSchemaV3 implementation PR](https://github.com/apache/spark/pull/47104). ### Why are the changes needed? When reloading the state after query restart, we'll need the schema/encoder of the state variables before restart. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Integration tests in `TransformWithStateSuite` and `TransformWithStateTTLSuite` that check that all state variable types have the correct schema. Existing integration tests in `TransformWith*State(TTL)Suite` for verifying SQL serialization is correct. Existing unit test suites & newly added unit suites in `ValueStateSuite`, `ListStateSuite`, `MapStateSuite`, `TimerSuite` for non-primitive types. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47257 from jingz-db/metadata-schema-compatible. Authored-by: jingz-db <jing.zhan@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 19 July 2024, 22:03:01 UTC
1fa53c7 [SPARK-48946][SQL] NPE in redact method when session is null ### What changes were proposed in this pull request? If we call the DataSourceV2ScanExecBase redact method from a thread that doesn't have a session in its thread local, we get an NPE. Getting stringRedactionPattern from conf could prevent this problem, as conf checks whether the session is null or not. We also use this in the DataSourceScanExec trait. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L93-L95 ### Why are the changes needed? To prevent an NPE when the session is null. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47419 from mikoszilard/SPARK-48946. Authored-by: Szilard Miko <smiko@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 19 July 2024, 14:58:47 UTC
af5eb08 [SPARK-47307][SQL][FOLLOWUP] Promote spark.sql.legacy.chunkBase64String.enabled from a legacy/internal config to a regular/public one ### What changes were proposed in this pull request? + Promote spark.sql.legacy.chunkBase64String.enabled from a legacy/internal config to a regular/public one. + Add test cases for unbase64 ### Why are the changes needed? Keep the same behavior as before. More details: https://github.com/apache/spark/pull/47303#issuecomment-2237785431 ### Does this PR introduce _any_ user-facing change? yes, revert behavior change introduced in #47303 ### How was this patch tested? existing unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #47410 from wForget/SPARK-47307_followup. Lead-authored-by: wforget <643348094@qq.com> Co-authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 19 July 2024, 07:58:22 UTC
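A hedged sketch of what the now-public config controls, assuming the chunking semantics described in the SPARK-47307 discussion:

```scala
// With chunking enabled, base64 output is split into 76-character lines (RFC 2045 style),
// restoring the previous chunked behaviour; with it disabled, a single unchunked string is returned.
spark.conf.set("spark.sql.legacy.chunkBase64String.enabled", "true")
spark.sql("SELECT base64(repeat('abc', 50))").show(truncate = false)
```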
3f6e2d6 [SPARK-48498][SQL][FOLLOWUP] do padding for char-char comparison ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/46832 to handle a missing case: char-char comparison. We should pad both sides if `READ_SIDE_CHAR_PADDING` is not enabled. ### Why are the changes needed? bug fix if people disable read side char padding ### Does this PR introduce _any_ user-facing change? No because it's a followup and the original PR is not released yet ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #47412 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 July 2024, 07:27:26 UTC
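A hedged sketch of the char-char comparison this follow-up pads, assuming read-side char padding is disabled via `spark.sql.readSideCharPadding`; table and column names are illustrative:

```scala
spark.conf.set("spark.sql.readSideCharPadding", "false")
spark.sql("CREATE TABLE chars_t (c5 CHAR(5), c10 CHAR(10)) USING parquet")
spark.sql("INSERT INTO chars_t VALUES ('abc', 'abc')")
// With the fix, both sides are padded to the longer CHAR length before comparison,
// so trailing spaces no longer make logically equal values compare as different.
spark.sql("SELECT * FROM chars_t WHERE c5 = c10").show()
```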
1645046 [SPARK-48940][BUILD] Upgrade `Arrow` to 17.0.0 ### What changes were proposed in this pull request? The pr aims to upgrade `arrow` from `16.1.0` to `17.0.0`. ### Why are the changes needed? The full release notes: https://arrow.apache.org/release/17.0.0.html ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47409 from panbingkun/SPARK-48940. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 19 July 2024, 04:49:12 UTC
c439bc8 [SPARK-48933][BUILD] Upgrade `protobuf-java` to `3.25.3` ### What changes were proposed in this pull request? The pr aims to upgrade `protobuf-java` from `3.25.1` to `3.25.3`. ### Why are the changes needed? - v3.25.1 VS v.3.25.3: https://github.com/protocolbuffers/protobuf/compare/v3.25.1...v3.25.3 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47397 from panbingkun/SPARK-48933. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 19 July 2024, 03:09:49 UTC
d3a9055 Revert "[SPARK-47307][DOCS][FOLLOWUP] Add a migration guide for the behavior change of base64 function" This reverts commit b2e0a4dfb8d265635ef6199d24be8fa036d25786. 19 July 2024, 02:26:21 UTC
110b558 [SPARK-48915][SQL][TESTS][FOLLOWUP] Add some uncovered predicates(!=, <, <=, >, >=) for correlation in `GeneratedSubquerySuite` ### What changes were proposed in this pull request? In PR #47386, we improved coverage of predicate types of scalar subqueries in the WHERE clause. As a follow-up, this PR aims to add some uncovered predicates (!=, <, <=, >, >=) for correlation in `GeneratedSubquerySuite`. ### Why are the changes needed? Better coverage of current subquery tests with correlation in `GeneratedSubquerySuite`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47399 from wayneguow/SPARK-48915_follow_up. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 19 July 2024, 00:21:30 UTC
a90cb0a [SPARK-48495][DOCS][FOLLOW-UP] Fix Table Markdown in Shredding.md Minor change that shouldn't require a Jira to fix the unbalanced row in the example of Shredding.md Closes #47407 from RussellSpitzer/patch-1. Authored-by: Russell Spitzer <russell.spitzer@GMAIL.COM> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 July 2024, 23:33:15 UTC
a54daa1 [SPARK-48934][SS] Python datetime types converted incorrectly for setting timeout in applyInPandasWithState ### What changes were proposed in this pull request? Fix the way applyInPandasWithState's setTimeoutTimestamp() handles an argument of datetime type ### Why are the changes needed? In applyInPandasWithState(), when state.setTimeoutTimestamp() is passed an argument of datetime.datetime type, it doesn't function as expected. Fix it. Also, fix another bug where VALUE_NOT_POSITIVE is reported; this issue triggers when the converted value is 0. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add unit test coverage for this scenario ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47398 from siying/state_set_timeout. Lead-authored-by: Siying Dong <siying.dong@databricks.com> Co-authored-by: Siying Dong <dong.sy@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 18 July 2024, 22:30:33 UTC
fab6d83 [SPARK-48921][SQL] ScalaUDF encoders in subquery should be resolved for MergeInto ### What changes were proposed in this pull request? We got a customer issue that a `MergeInto` query on Iceberg table works earlier but cannot work after upgrading to Spark 3.4. The error looks like ``` Caused by: org.apache.spark.SparkRuntimeException: Error while decoding: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to nullable on unresolved object upcast(getcolumnbyordinal(0, StringType), StringType, - root class: java.lang.String).toString. ``` The source table of `MergeInto` uses `ScalaUDF`. The error happens when Spark invokes the deserializer of input encoder of the `ScalaUDF` and the deserializer is not resolved yet. The encoders of ScalaUDF are resolved by the rule `ResolveEncodersInUDF` which will be applied at the end of analysis phase. During rewriting `MergeInto` to `ReplaceData` query, Spark creates an `Exists` subquery and `ScalaUDF` is part of the plan of the subquery. Note that the `ScalaUDF` is already resolved by the analyzer. Then, in `ResolveSubquery` rule which resolves the subquery, it will resolve the subquery plan if it is not resolved yet. Because the subquery containing `ScalaUDF` is resolved, the rule skips it so `ResolveEncodersInUDF` won't be applied on it. So the analyzed `ReplaceData` query contains a `ScalaUDF` with encoders unresolved that cause the error. This patch modifies `ResolveSubquery` so it will resolve subquery plan if it is not analyzed to make sure subquery plan is fully analyzed. This patch moves `ResolveEncodersInUDF` rule before rewriting `MergeInto` to make sure the `ScalaUDF` in the subquery plan is fully analyzed. ### Why are the changes needed? Fixing production query error. ### Does this PR introduce _any_ user-facing change? Yes, fixing user-facing issue. ### How was this patch tested? Manually test with `MergeInto` query and add an unit test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47380 from viirya/fix_subquery_resolve. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 July 2024, 20:00:14 UTC
723c1c5 [SPARK-48890][CORE][SS] Add Structured Streaming related fields to log4j ThreadContext ### What changes were proposed in this pull request? There are some special informations needed for structured streaming queries. Specifically, each query has a query_id and run_id. Also if using MicrobatchExecution (default), there is a batch_id. A (query_id, run_id, batch_id) identifies the microbatch the streaming query runs. Adding these field to a threadContext would help especially when there are multiple queries running. ### Why are the changes needed? Logging improvement ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Run a streaming query through spark-submit, here are sample logs (search for query_id, run_id, or batch_id): ``` {"ts":"2024-07-15T19:56:01.577Z","level":"INFO","msg":"Starting new streaming query.","context":{"query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"MicroBatchExecution"} {"ts":"2024-07-15T19:56:01.579Z","level":"INFO","msg":"Stream started from {}","context":{"query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","streaming_offsets_start":"{}"},"logger":"MicroBatchExecution"} {"ts":"2024-07-15T19:56:01.602Z","level":"INFO","msg":"Writing atomically to file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0 using temp file file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp","context":{"final_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","temp_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp"},"logger":"CheckpointFileManager"} {"ts":"2024-07-15T19:56:01.675Z","level":"INFO","msg":"Renamed temp file file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp to file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0","context":{"final_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","temp_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp"},"logger":"CheckpointFileManager"} {"ts":"2024-07-15T19:56:01.676Z","level":"INFO","msg":"Committed offsets for batch 0. 
Metadata OffsetSeqMetadata(0,1721073361582,HashMap(spark.sql.streaming.stateStore.providerClass -> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider, spark.sql.streaming.stateStore.rocksdb.formatVersion -> 5, spark.sql.streaming.statefulOperator.useStrictDistribution -> true, spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion -> 2, spark.sql.streaming.multipleWatermarkPolicy -> min, spark.sql.streaming.aggregation.stateFormatVersion -> 2, spark.sql.shuffle.partitions -> 200, spark.sql.streaming.join.stateFormatVersion -> 2, spark.sql.streaming.stateStore.compression.codec -> lz4))","context":{"batch_id":"0","offset_sequence_metadata":"OffsetSeqMetadata(0,1721073361582,HashMap(spark.sql.streaming.stateStore.providerClass -> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider, spark.sql.streaming.stateStore.rocksdb.formatVersion -> 5, spark.sql.streaming.statefulOperator.useStrictDistribution -> true, spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion -> 2, spark.sql.streaming.multipleWatermarkPolicy -> min, spark.sql.streaming.aggregation.stateFormatVersion -> 2, spark.sql.shuffle.partitions -> 200, spark.sql.streaming.join.stateFormatVersion -> 2, spark.sql.streaming.stateStore.compression.codec -> lz4))","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"MicroBatchExecution"} {"ts":"2024-07-15T19:56:02.074Z","level":"INFO","msg":"Code generated in 97.122375 ms","context":{"batch_id":"0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","total_time":"97.122375"},"logger":"CodeGenerator"} {"ts":"2024-07-15T19:56:02.125Z","level":"INFO","msg":"Start processing data source write support: MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.datasources.noop.NoopStreamingWrite$20ba1e29]. The input RDD has 1} partitions.","context":{"batch_id":"0","batch_write":"MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.datasources.noop.NoopStreamingWrite$20ba1e29]","count":"1","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"WriteToDataSourceV2Exec"} {"ts":"2024-07-15T19:56:02.129Z","level":"INFO","msg":"Starting job: start at NativeMethodAccessorImpl.java:0","context":{"batch_id":"0","call_site_short_form":"start at NativeMethodAccessorImpl.java:0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"SparkContext"} {"ts":"2024-07-15T19:56:02.135Z","level":"INFO","msg":"Got job 0 (start at NativeMethodAccessorImpl.java:0) with 1 output partitions","context":{"call_site_short_form":"start at NativeMethodAccessorImpl.java:0","job_id":"0","num_partitions":"1"},"logger":"DAGScheduler"} ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47340 from WweiL/structured-logging-streaming-id-aware. Authored-by: Wei Liu <wei.liu@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 18 July 2024, 16:52:56 UTC
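Since each structured log line in the entry above is a standalone JSON object whose `context` carries `query_id`, `run_id` and, once a batch starts, `batch_id`, a small post-processing script can isolate one query's activity from a shared driver log. The following is a minimal sketch, not part of the PR: the log file path and query id are placeholders, and it assumes one JSON object per line as in the samples.

```python
import json

# Hypothetical inputs: point these at your own driver log and query id.
LOG_PATH = "driver-log.jsonl"
QUERY_ID = "094ebe4a-30a3-4541-90af-ca238e4e6697"

with open(LOG_PATH) as f:
    for line in f:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip plain-text lines that are not structured logs
        ctx = record.get("context", {})
        if ctx.get("query_id") == QUERY_ID:
            # batch_id only appears after MicroBatchExecution starts a batch.
            print(ctx.get("batch_id", "-"), record.get("level", ""), record.get("msg", ""))
```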
83a9791 [SPARK-48388][SQL] Fix SET statement behavior for SQL Scripts ### What changes were proposed in this pull request? The `SET` statement is used to set config values and has a poorly designed grammar rule `#setConfiguration` that matches everything after `SET` - `SET .*?`. This conflicts with the usage of `SET` for setting session variables, and we needed to introduce the `SET (VAR | VARIABLE)` grammar rule to distinguish between setting config values and setting session variables - [SET VAR pull request](https://github.com/apache/spark/pull/40474). However, this is not per the SQL standard, so for SQL scripting ([JIRA](https://issues.apache.org/jira/browse/SPARK-48338)) we are opting to disable `SET` for configs and use it only for session variables. This lets us use plain `SET` for assigning values to session variables. Config values can still be set from SQL scripts using `EXECUTE IMMEDIATE`. This change simply reorders grammar rules to achieve the above behavior, and alters only the visitor functions where the rule name had to be changed or a completely new rule was added. ### Why are the changes needed? These changes resolve the issues the poorly designed `SET` statement causes for SQL scripts. ### Does this PR introduce _any_ user-facing change? No. This PR is part of a series of PRs that will introduce changes to the sql() API to add support for SQL scripting, but for now the API remains unchanged. In the future, the API will remain the same as well, but it will gain the ability to execute SQL scripts. ### How was this patch tested? Already existing tests should cover the changes. New tests for SQL scripts were added to: - `SqlScriptingParserSuite` - `SqlScriptingInterpreterSuite` ### Was this patch authored or co-authored using generative AI tooling? Closes #47272 from davidm-db/sql_scripting_set_statement. Authored-by: David Milicevic <david.milicevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 July 2024, 10:52:16 UTC
1a428c1 [SPARK-36680][SQL][FOLLOWUP] Files with options should be put into resolveDataSource function ### What changes were proposed in this pull request? When reading CSV, JSON and other files, pass the options parameter to the `rules.resolveDataSource` method so that the options take effect. This is a bug fix for [#46707](https://github.com/apache/spark/pull/46707) (cc szehon-ho). ### Why are the changes needed? For the following SQL, the options passed in do not take effect, because the `rules.resolveDataSource` method does not pass the options parameter while constructing the data source ``` SELECT * FROM csv.`/test/data.csv` WITH (`header` = true, 'delimiter' = '|') ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test in SQLQuerySuite ### Was this patch authored or co-authored using generative AI tooling? No Closes #47370 from logze/hint-options. Authored-by: lizongze <lizongze@xiaomi.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 July 2024, 07:58:38 UTC
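To make the fix above concrete, here is a minimal PySpark sketch of the same query; the path `/test/data.csv` and the options mirror the hypothetical example in the entry, and the expectation that `header` and `delimiter` are now honored is taken from the PR description rather than verified here.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With this fix the WITH (...) options reach the underlying CSV data source,
# so the '|' delimiter and the header row are applied while reading.
df = spark.sql(
    "SELECT * FROM csv.`/test/data.csv` WITH (`header` = true, 'delimiter' = '|')"
)
df.printSchema()
df.show()
```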
546da0d [MINOR][SQL][TESTS] Enable test case `testOrcAPI` in `JavaDataFrameReaderWriterSuite` ### What changes were proposed in this pull request? This PR enables the test case `testOrcAPI` in `JavaDataFrameReaderWriterSuite`. Because this test no longer depends on Hive classes, it can run like the other test cases in this suite. ### Why are the changes needed? Enable the test case `testOrcAPI` in `JavaDataFrameReaderWriterSuite` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #47400 from LuciferYang/minor-testOrcAPI. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 18 July 2024, 07:58:21 UTC
3b4c423 [SPARK-48752][PYTHON][CONNECT][DOCS] Introduce `pyspark.logger` for improved structured logging for PySpark ### What changes were proposed in this pull request? This PR introduces the `pyspark.logger` module to facilitate structured client-side logging for PySpark users. This module includes a `PySparkLogger` class that provides several methods for logging messages at different levels in a structured JSON format: - `PySparkLogger.info` - `PySparkLogger.warning` - `PySparkLogger.error` The logger can be easily configured to write logs to either the console or a specified file. ## DataFrame error log improvement This PR also improves the DataFrame API error logs by leveraging this new logging framework: ### **Before** We introduced structured logging from https://github.com/apache/spark/pull/45729, but PySpark log is still hard to figure out in the current structured log, because it is hidden and mixed within bunch of complex JVM stacktraces and it's also not very Python-friendly: ```json { "ts": "2024-06-28T10:53:48.528Z", "level": "ERROR", "msg": "Exception in task 7.0 in stage 0.0 (TID 7)", "context": { "task_name": "task 7.0 in stage 0.0 (TID 7)" }, "exception": { "class": "org.apache.spark.SparkArithmeticException", "msg": "[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"__truediv__\" was called from\n/.../spark/python/test_error_context.py:17\n", "stacktrace": [ { "class": "org.apache.spark.sql.errors.QueryExecutionErrors$", "method": "divideByZeroError", "file": "QueryExecutionErrors.scala", "line": 203 }, { "class": "org.apache.spark.sql.errors.QueryExecutionErrors", "method": "divideByZeroError", "file": "QueryExecutionErrors.scala", "line": -1 }, { "class": "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1", "method": "project_doConsume_0$", "file": null, "line": -1 }, { "class": "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1", "method": "processNext", "file": null, "line": -1 }, { "class": "org.apache.spark.sql.execution.BufferedRowIterator", "method": "hasNext", "file": "BufferedRowIterator.java", "line": 43 }, { "class": "org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1", "method": "hasNext", "file": "WholeStageCodegenEvaluatorFactory.scala", "line": 50 }, { "class": "org.apache.spark.sql.execution.SparkPlan", "method": "$anonfun$getByteArrayRdd$1", "file": "SparkPlan.scala", "line": 388 }, { "class": "org.apache.spark.rdd.RDD", "method": "$anonfun$mapPartitionsInternal$2", "file": "RDD.scala", "line": 896 }, { "class": "org.apache.spark.rdd.RDD", "method": "$anonfun$mapPartitionsInternal$2$adapted", "file": "RDD.scala", "line": 896 }, { "class": "org.apache.spark.rdd.MapPartitionsRDD", "method": "compute", "file": "MapPartitionsRDD.scala", "line": 52 }, { "class": "org.apache.spark.rdd.RDD", "method": "computeOrReadCheckpoint", "file": "RDD.scala", "line": 369 }, { "class": "org.apache.spark.rdd.RDD", "method": "iterator", "file": "RDD.scala", "line": 333 }, { "class": "org.apache.spark.scheduler.ResultTask", "method": "runTask", "file": "ResultTask.scala", "line": 93 }, { "class": "org.apache.spark.TaskContext", "method": "runTaskWithListeners", "file": "TaskContext.scala", "line": 171 }, { "class": "org.apache.spark.scheduler.Task", "method": "run", 
"file": "Task.scala", "line": 146 }, { "class": "org.apache.spark.executor.Executor$TaskRunner", "method": "$anonfun$run$5", "file": "Executor.scala", "line": 644 }, { "class": "org.apache.spark.util.SparkErrorUtils", "method": "tryWithSafeFinally", "file": "SparkErrorUtils.scala", "line": 64 }, { "class": "org.apache.spark.util.SparkErrorUtils", "method": "tryWithSafeFinally$", "file": "SparkErrorUtils.scala", "line": 61 }, { "class": "org.apache.spark.util.Utils$", "method": "tryWithSafeFinally", "file": "Utils.scala", "line": 99 }, { "class": "org.apache.spark.executor.Executor$TaskRunner", "method": "run", "file": "Executor.scala", "line": 647 }, { "class": "java.util.concurrent.ThreadPoolExecutor", "method": "runWorker", "file": "ThreadPoolExecutor.java", "line": 1136 }, { "class": "java.util.concurrent.ThreadPoolExecutor$Worker", "method": "run", "file": "ThreadPoolExecutor.java", "line": 635 }, { "class": "java.lang.Thread", "method": "run", "file": "Thread.java", "line": 840 } ] }, "logger": "Executor" } ``` ### **After** Now we can get a improved, simplified and also Python-friendly error log for DataFrame errors: ```json { "ts": "2024-06-28 19:53:48,563", "level": "ERROR", "logger": "DataFrameQueryContextLogger", "msg": "[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"__truediv__\" was called from\n/.../spark/python/test_error_context.py:17\n", "context": { "file": "/.../spark/python/test_error_context.py", "line_no": "17", "fragment": "__truediv__" "error_class": "DIVIDE_BY_ZERO" }, "exception": { "class": "Py4JJavaError", "msg": "An error occurred while calling o52.showString.\n: org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. 
SQLSTATE: 22012\n== DataFrame ==\n\"__truediv__\" was called from\n/Users/haejoon.lee/Desktop/git_repos/spark/python/test_error_context.py:22\n\n\tat org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:203)\n\tat org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)\n\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)\n\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)\n\tat org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)\n\tat org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)\n\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:896)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:369)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:333)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:146)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n\tat org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1007)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2479)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2498)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2523)\n\tat org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1052)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n\tat org.apache.spark.rdd.RDD.withScope(RDD.scala:412)\n\tat org.apache.spark.rdd.RDD.collect(RDD.scala:1051)\n\tat org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)\n\tat org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4449)\n\tat org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3393)\n\tat org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4439)\n\tat org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:599)\n\tat org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4437)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:154)\n\tat 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:263)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:118)\n\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:74)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:218)\n\tat org.apache.spark.sql.Dataset.withAction(Dataset.scala:4437)\n\tat org.apache.spark.sql.Dataset.head(Dataset.scala:3393)\n\tat org.apache.spark.sql.Dataset.take(Dataset.scala:3626)\n\tat org.apache.spark.sql.Dataset.getRows(Dataset.scala:294)\n\tat org.apache.spark.sql.Dataset.showString(Dataset.scala:330)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:568)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)\n\tat py4j.ClientServerConnection.run(ClientServerConnection.java:106)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n", "stacktrace": ["Traceback (most recent call last):", " File \"/Users/haejoon.lee/Desktop/git_repos/spark/python/pyspark/errors/exceptions/captured.py\", line 272, in deco", " return f(*a, **kw)", " File \"/Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/py4j/protocol.py\", line 326, in get_return_value", " raise Py4JJavaError(", "py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.", ": org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. 
SQLSTATE: 22012", "== DataFrame ==", "\"__truediv__\" was called from", "/Users/haejoon.lee/Desktop/git_repos/spark/python/test_error_context.py:22", "", "\tat org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:203)", "\tat org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)", "\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)", "\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)", "\tat org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)", "\tat org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)", "\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)", "\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)", "\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:896)", "\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)", "\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:369)", "\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:333)", "\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)", "\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)", "\tat org.apache.spark.scheduler.Task.run(Task.scala:146)", "\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)", "\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)", "\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)", "\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)", "\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)", "\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)", "\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)", "\tat java.base/java.lang.Thread.run(Thread.java:840)", "\tat org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1007)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2479)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2498)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2523)", "\tat org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1052)", "\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)", "\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)", "\tat org.apache.spark.rdd.RDD.withScope(RDD.scala:412)", "\tat org.apache.spark.rdd.RDD.collect(RDD.scala:1051)", "\tat org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)", "\tat org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4449)", "\tat org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3393)", "\tat org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4439)", "\tat org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:599)", "\tat org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4437)", "\tat 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:154)", "\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:263)", "\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:118)", "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)", "\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:74)", "\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:218)", "\tat org.apache.spark.sql.Dataset.withAction(Dataset.scala:4437)", "\tat org.apache.spark.sql.Dataset.head(Dataset.scala:3393)", "\tat org.apache.spark.sql.Dataset.take(Dataset.scala:3626)", "\tat org.apache.spark.sql.Dataset.getRows(Dataset.scala:294)", "\tat org.apache.spark.sql.Dataset.showString(Dataset.scala:330)", "\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)", "\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)", "\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)", "\tat java.base/java.lang.reflect.Method.invoke(Method.java:568)", "\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)", "\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)", "\tat py4j.Gateway.invoke(Gateway.java:282)", "\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)", "\tat py4j.commands.CallCommand.execute(CallCommand.java:79)", "\tat py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)", "\tat py4j.ClientServerConnection.run(ClientServerConnection.java:106)", "\tat java.base/java.lang.Thread.run(Thread.java:840)"] }, } ``` ### Why are the changes needed? **Before** Currently we don't have PySpark dedicated logging module so we have to manually set up and customize the Python logging module, for example: ```python logger = logging.getLogger("TestLogger") user = "test_user" action = "test_action" logger.info(f"User {user} takes an {action}") ``` This logs an information just in a following simple string: ``` INFO:TestLogger:User test_user takes an test_action ``` This is not very actionable, and it is hard to analyze not since it is not well-structured. Or we can use Log4j from JVM which resulting in excessively detailed logs as described in the above example, and this way even cannot be applied to Spark Connect. **After** We can simply import and use `PySparkLogger` with minimal setup: ```python from pyspark.logger import PySparkLogger logger = PySparkLogger.getLogger("TestLogger") user = "test_user" action = "test_action" logger.info(f"User {user} takes an {action}", user=user, action=action) ``` This logs an information in a following JSON format: ```json { "ts": "2024-06-28 19:44:19,030", "level": "WARNING", "logger": "TestLogger", "msg": "User test_user takes an test_action", "context": { "user": "test_user", "action": "test_action" }, } ``` **NOTE:** we can add as many keyword arguments as we want for each logging methods. These keyword arguments, such as `user` and `action` in the example, are included within the `"context"` field of the JSON log. This structure makes it easy to track and analyze the log. ### Does this PR introduce _any_ user-facing change? No API changes, but the PySpark client-side logging is improved. 
Also added user-facing documentation "Logging in PySpark": <img width="1395" alt="Screenshot 2024-07-16 at 5 40 41 PM" src="https://github.com/user-attachments/assets/c77236aa-1c6f-4b5b-ad14-26ccdc474f59"> Also added API reference: <img width="1417" alt="Screenshot 2024-07-16 at 5 40 58 PM" src="https://github.com/user-attachments/assets/6bb3fb23-6847-4086-8f4b-bcf9f4242724"> ### How was this patch tested? Added UTs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47145 from itholic/pyspark_logger. Authored-by: Haejoon Lee <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 July 2024, 05:04:11 UTC
3831d1d [SPARK-48623][CORE] Migrate FileAppender logs to structured logging ### What changes were proposed in this pull request? This PR migrates `src/main/scala/org/apache/spark/util/logging/FileAppender.scala` to comply with the scala style changes in https://github.com/apache/spark/pull/46947 ### Why are the changes needed? This makes development and PR review of the structured logging migration easier. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Tested by ensuring dev/scalastyle checks pass ### Was this patch authored or co-authored using generative AI tooling? No Closes #47394 from asl3/asl3/migratenewfiles. Authored-by: Amanda Liu <amanda.liu@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 18 July 2024, 04:10:51 UTC
41b37ae [SPARK-48900] Add `reason` field for `cancelJobGroup` and `cancelJobsWithTag` ### What changes were proposed in this pull request? This PR introduces the optional `reason` field for `cancelJobGroup` and `cancelJobsWithTag` in `SparkContext.scala`, while keeping the old APIs without the `reason`, similar to how `cancelJob` is implemented currently. ### Why are the changes needed? Today it is difficult to determine why a job, stage, or job group was canceled. We should leverage existing Spark functionality to provide a reason string explaining the cancellation cause, and should add new APIs to let us provide this reason when canceling job groups. **Details:** Since [SPARK-19549](https://issues.apache.org/jira/browse/SPARK-19549) "Allow providing reasons for stage/job cancelling" (Spark 2.2.0), Spark’s cancelJob and cancelStage methods accept an optional reason: String that is added to logging output and user-facing error messages when jobs or stages are canceled. In our internal calls to these methods, we should always supply a reason. For example, we should set an appropriate reason when the “kill” links are clicked in the Spark UI (see [code](https://github.com/apache/spark/blob/b14c1f036f8f394ad1903998128c05d04dd584a9/core/src/main/scala/org/apache/spark/ui/jobs/JobsTab.scala#L54C1-L55)). Other APIs currently lack a reason field. For example, cancelJobGroup and cancelJobsWithTag don’t provide any way to specify a reason, so we only see generic logs like “asked to cancel job group <group name>”. We should add the ability to pass in a group cancellation reason and thread it through into the scheduler’s logging and job failure reasons. This feature can be implemented in two PRs: 1. Modify the current SparkContext and its downstream APIs to add the reason string, such as cancelJobGroup and cancelJobsWithTag 2. Add reasons for all internal calls to these methods. **Note: This is the first of the two PRs to implement this new feature** ### Does this PR introduce _any_ user-facing change? Yes, it modifies the SparkContext API, allowing users to add an optional `reason: String` to `cancelJobsWithTag` and `cancelJobGroup`, while the old methods without the `reason` are also kept. This creates a more uniform interface where the user can supply an optional reason for all job/stage cancellation calls. ### How was this patch tested? New tests are added to `JobCancellationSuite` to test the reason fields for these calls. For the API changes in R and PySpark, tests are added to these files: - R/pkg/tests/fulltests/test_context.R - python/pyspark/tests/test_pin_thread.py ### Was this patch authored or co-authored using generative AI tooling? No Closes #47361 from mingkangli-db/reason_job_cancellation. Authored-by: Mingkang Li <mingkang.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 July 2024, 02:22:52 UTC
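A minimal PySpark sketch of the call shape described above; the group name, sleep-based job, and reason text are illustrative, and the two-argument `cancelJobGroup` call assumes the new optional `reason` parameter introduced by this PR (the old single-argument form keeps working).

```python
import time
from pyspark import InheritableThread
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

def slow_job():
    # Tag everything submitted from this thread with a job group.
    sc.setJobGroup("demo-group", "slow demo job", interruptOnCancel=True)
    try:
        sc.parallelize(range(4), 4).foreach(lambda _: time.sleep(60))
    except Exception as e:
        print("job cancelled:", type(e).__name__)

t = InheritableThread(target=slow_job)
t.start()
time.sleep(5)  # give the job time to start

# Assumed new overload from this PR: the reason string should show up in the
# scheduler's logs and in the job failure message instead of a generic text.
sc.cancelJobGroup("demo-group", "superseded by a newer run")
t.join()
```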
9d4ebf7 [SPARK-48915][SQL][TESTS] Add some uncovered predicates(!=, <=, >, >=) in test cases of `GeneratedSubquerySuite` ### What changes were proposed in this pull request? This PR aims to add some predicates (!=, <=, >, >=) which are not covered in the test cases of `GeneratedSubquerySuite`. ### Why are the changes needed? Better coverage of the current subquery tests in `GeneratedSubquerySuite`. For more information about subqueries in `postgresql`, refer to: https://www.postgresql.org/docs/current/functions-subquery.html#FUNCTIONS-SUBQUERY https://www.postgresql.org/docs/current/functions-comparisons.html#ROW-WISE-COMPARISON ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA and manual testing with `GeneratedSubquerySuite`. ![image](https://github.com/user-attachments/assets/4b265def-a7a9-405e-94ce-e9902efb79fa) ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47386 from wayneguow/SPARK-48915. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 July 2024, 02:15:25 UTC
2cd4a4a [SPARK-48932][BUILD] Upgrade `commons-lang3` to 3.15.0 ### What changes were proposed in this pull request? The pr aims to upgrade `commons-lang3` from `3.14.0` to `3.15.0` ### Why are the changes needed? - v3.14.0 VS v3.15.0 https://github.com/apache/commons-lang/compare/rel/commons-lang-3.14.0...rel/commons-lang-3.15.0 - The new version has brought some bug fixes, eg: https://github.com/apache/commons-lang/pull/1140 https://github.com/apache/commons-lang/pull/1151 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47396 from panbingkun/SPARK-48932. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 July 2024, 01:54:56 UTC
f26eeb0 [SPARK-48926][SQL][TESTS] Use `checkError` method to optimize exception check logic related to `UNRESOLVED_COLUMN` error classes ### What changes were proposed in this pull request? This PR aims to use the `checkError` method to optimize the exception check logic related to the `UNRESOLVED_COLUMN` error classes. ### Why are the changes needed? Unify the way error classes are checked by using the `checkError` method. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass related test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47389 from wayneguow/op_un_col. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 July 2024, 00:53:26 UTC
8b18914 [SPARK-48510][CONNECT][FOLLOW-UP-MK2] Fix for UDAF `toColumn` API when running tests in Maven ### What changes were proposed in this pull request? This PR follows https://github.com/apache/spark/pull/47368 as another attempt to fix the broken tests. The previous attempt failed due to an NPE, caused by `Iterator.iterate` generating an **infinite** flow of values. I can't reproduce the previous issue locally, so my fix is purely based on the error message: https://github.com/apache/spark/actions/runs/9974746135/job/27562881993. ### Why are the changes needed? Because the previous attempt failed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested locally. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47387 from xupefei/udaf-tocolumn-fixup-mk2. Authored-by: Paddy Xu <xupaddy@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 July 2024, 00:52:46 UTC
171c794 [SPARK-48924][PS] Add a pandas-like `make_interval` helper function ### What changes were proposed in this pull request? Add a pandas-like `make_interval` helper function ### Why are the changes needed? factor it out as a helper function to be reusable ### Does this PR introduce _any_ user-facing change? No, internal change only ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47385 from zhengruifeng/ps_simplify_make_interval. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 July 2024, 00:48:55 UTC
1e17c39 [SPARK-48930][CORE] Redact `awsAccessKeyId` by including `accesskey` pattern ### What changes were proposed in this pull request? This PR aims to redact `awsAccessKeyId` by including `accesskey` pattern. - **Apache Spark 4.0.0-preview1** There is no point to redact `fs.s3a.access.key` because the same value is exposed via `fs.s3.awsAccessKeyId` like the following. We need to redact all. ``` $ AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell ``` ![Screenshot 2024-07-17 at 12 45 44](https://github.com/user-attachments/assets/e3040c5d-3eb9-4944-a6d6-5179b7647426) ### Why are the changes needed? Since Apache Spark 1.1.0, `AWS_ACCESS_KEY_ID` is propagated like the following. However, Apache Spark does not redact them all consistently. - #450 https://github.com/apache/spark/blob/5d16c3134c442a5546251fd7c42b1da9fdf3969e/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L481-L486 ### Does this PR introduce _any_ user-facing change? Users may see more redactions on configurations whose name contains `accesskey` case-insensitively. However, those configurations are highly likely to be related to the credentials. ### How was this patch tested? Pass the CIs with the newly added test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47392 from dongjoon-hyun/SPARK-48930. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 July 2024, 22:57:03 UTC
ebf8da1 [SPARK-48927][CORE] Show the number of cached RDDs in `StoragePage` ### What changes were proposed in this pull request? This PR aims to show the number of cached RDDs in `StoragePage` like the other `Jobs` page or `Stages` page. ### Why are the changes needed? To improve the UX by providing additional summary information in a consistent way. **BEFORE** ![Screenshot 2024-07-17 at 09 46 44](https://github.com/user-attachments/assets/3e57bf91-e97d-404d-aeda-159ab9cb65e3) **AFTER** ![Screenshot 2024-07-17 at 09 46 01](https://github.com/user-attachments/assets/d416ea16-8255-48d8-ade4-624dcac8f46e) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47390 from dongjoon-hyun/SPARK-48927. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 July 2024, 19:54:10 UTC
5d16c31 [SPARK-48907][SQL] Fix the value `explicitTypes` in `COLLATION_MISMATCH.EXPLICIT` ### What changes were proposed in this pull request? The pr aims to - fix the value `explicitTypes` in `COLLATION_MISMATCH.EXPLICIT`. - use `checkError` to check exception in `CollationSQLExpressionsSuite` and `CollationStringExpressionsSuite`. ### Why are the changes needed? Only fix bug, eg: ``` SELECT concat_ws(' ', collate('Spark', 'UTF8_LCASE'), collate('SQL', 'UNICODE')) ``` - Before: ``` [COLLATION_MISMATCH.EXPLICIT] Could not determine which collation to use for string functions and operators. Error occurred due to the mismatch between explicit collations: `string collate UTF8_LCASE`.`string collate UNICODE`. Decide on a single explicit collation and remove others. SQLSTATE: 42P21 ``` <img width="747" alt="image" src="https://github.com/user-attachments/assets/4e026cb5-2875-4370-9bb9-878f0b607f41"> - After: ``` [COLLATION_MISMATCH.EXPLICIT] Could not determine which collation to use for string functions and operators. Error occurred due to the mismatch between explicit collations: [`string collate UTF8_LCASE`, `string collate UNICODE`]. Decide on a single explicit collation and remove others. SQLSTATE: 42P21 ``` <img width="738" alt="image" src="https://github.com/user-attachments/assets/86f489a2-9f2d-4f59-bdb1-95c051a93ee8"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated existed UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47365 from panbingkun/SPARK-48907. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 July 2024, 14:57:41 UTC
74ca836 [SPARK-48923][SQL][TESTS] Fix the incorrect logic of `CollationFactorySuite` ### What changes were proposed in this pull request? This PR aims to fix the incorrect logic of `CollationFactorySuite`. ### Why are the changes needed? It only fixes the test logic in `CollationFactorySuite`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated existing UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47382 from panbingkun/fix_CollationFactorySuite. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 July 2024, 14:12:32 UTC
41a2beb [SPARK-48865][SQL] Add try_url_decode function ### What changes were proposed in this pull request? Add a `try_url_decode` function that performs the same operation as `url_decode`, but returns a NULL value instead of raising an error if the decoding cannot be performed. ### Why are the changes needed? In Hive, we usually do URL decoding like `reflect('java.net.URLDecoder', 'decode', 'test%1')`, which returns a `NULL` value instead of raising an error if the decoding cannot be performed. Although Spark provides a `try_reflect` function for this, as commented in https://github.com/apache/spark/pull/34023#issuecomment-2113995703, the `reflect` function may prevent partition pruning from taking effect. So I propose to add a new `try_url_decode` function. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new function. ### How was this patch tested? Added tests and did manual testing. spark-sql: ![image](https://github.com/apache/spark/assets/17894939/0ffd3aa2-98f7-4af4-b478-67002b8b0d4b) pyspark: ![image](https://github.com/apache/spark/assets/17894939/d2c1926b-f9a0-422c-abc9-5f224d822811) ### Was this patch authored or co-authored using generative AI tooling? No Closes #47294 from wForget/try_url_decode. Lead-authored-by: wforget <643348094@qq.com> Co-authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 17 July 2024, 13:50:16 UTC
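A small PySpark sketch of the intended behavior described above, assuming a Spark build that includes the new function; the malformed input `'test%1'` is taken from the PR description, while the well-formed input and the expected output comment are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# url_decode would raise an error on the malformed 'test%1';
# try_url_decode returns NULL for it instead.
spark.sql("""
    SELECT
      try_url_decode('https%3A%2F%2Fspark.apache.org') AS decoded_ok,
      try_url_decode('test%1')                         AS decoded_bad
""").show(truncate=False)
# Expected: decoded_ok = 'https://spark.apache.org', decoded_bad = NULL
```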
3a24555 [SPARK-48889][SS] testStream to unload state stores before finishing ### What changes were proposed in this pull request? At the end of each testStream() call, unload all state stores from the executor. ### Why are the changes needed? Currently, after a test, we don't unload state stores or disable the maintenance task. So after a test, the maintenance task can run and fail because the checkpoint directory is already deleted. This might cause issues and fail the next test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests should pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47339 from siying/SPARK-48889. Authored-by: Siying Dong <siying.dong@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 17 July 2024, 03:26:59 UTC
b2e0a4d [SPARK-47307][DOCS][FOLLOWUP] Add a migration guide for the behavior change of base64 function ### What changes were proposed in this pull request? Follow up to #47303 Add a migration guide for the behavior change of `base64` function ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #47371 from wForget/SPARK-47307_doc. Authored-by: wforget <643348094@qq.com> Signed-off-by: allisonwang-db <allison.wang@databricks.com> 17 July 2024, 02:36:50 UTC
c9868b5 [SPARK-48510][CONNECT][FOLLOW-UP] Fix for UDAF `toColumn` API when running tests in Maven ### What changes were proposed in this pull request? This PR fixes an issue where the TypeTag look up during `udaf.toColumn` failed in Maven test env with the following error: > java.lang.IllegalArgumentException: Type tag defined in [JavaMirror with jdk.internal.loader.ClassLoaders$AppClassLoader1dbd16a6 of type class jdk.internal.loader.ClassLoaders$AppClassLoader with classpath [<unknown>] and parent being jdk.internal.loader.ClassLoaders$PlatformClassLoader6bd61f98 of type class jdk.internal.loader.ClassLoaders$PlatformClassLoader with classpath [<unknown>] and parent being primordial classloader with boot classpath [<unknown>]] cannot be migrated to another mirror [JavaMirror <ins>with java.net.URLClassLoader5a4041cc of type class java.net.URLClassLoader with classpath [file:/\<redacted\>/spark/connector/connect/client/jvm/target/scala-2.13/classes/,file:/\<redacted\>/spark/connector/connect/client/jvm/target/scala-2.13/test-classes/]</ins> and parent being jdk.internal.loader.ClassLoaders$AppClassLoader1dbd16a6 of type class jdk.internal.loader.ClassLoaders$AppClassLoader with classpath [<unknown>] and parent being jdk.internal.loader.ClassLoaders$PlatformClassLoader6bd61f98 of type class jdk.internal.loader.ClassLoaders$PlatformClassLoader with classpath [<unknown>] and parent being primordial classloader with boot classpath [<unknown>]]. The problem is caused by Maven adding a `URLClassLoader` on top of the original `AppClassLoader` (see the underlined texts in the above error message). This PR changes the mirror-matching logic from `eq` to `hasCommonAncestors`. ### Why are the changes needed? Previous logic fails in tests env. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47368 from xupefei/udaf-tocolumn-fixup. Authored-by: Paddy Xu <xupaddy@gmail.com> Signed-off-by: Haejoon Lee <haejoon.lee@databricks.com> 17 July 2024, 01:38:32 UTC
2ce5d72 [SPARK-48903][SS] Set the RocksDB last snapshot version correctly on remote load ### What changes were proposed in this pull request? Set the RocksDB last snapshot version correctly on remote load ### Why are the changes needed? Avoid creating full snapshot on every first batch after restart and also reset a snapshot that is likely no longer valid ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests ``` ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.common... [info] Run completed in 4 minutes, 40 seconds. [info] Total number of tests run: 176 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 176, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47363 from anishshri-db/task/SPARK-48903. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 17 July 2024, 01:21:28 UTC
c0f6db8 [SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe read / write API ### What changes were proposed in this pull request? This PR is a retry of https://github.com/apache/spark/pull/47328, which replaces the RDD API with the Dataset API for writing SparkR metadata; in addition, this PR removes `repartition(1)`. We don't actually need it when the input is a single row, since that creates only a single partition: https://github.com/apache/spark/blob/e5e751b98f9ef5b8640079c07a9a342ef471d75d/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala#L49-L57 ### Why are the changes needed? In order to leverage the Catalyst optimizer and SQL engine. For example, now we leverage UTF-8 encoding instead of plain JDK ser/de for strings. We have made similar changes in the past, e.g., https://github.com/apache/spark/pull/29063, https://github.com/apache/spark/pull/15813, https://github.com/apache/spark/pull/17255 and SPARK-19918. Also, we remove `repartition(1)` to avoid an unnecessary shuffle. With `repartition(1)`: ``` == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=6] +- LocalTableScan [_1#0] ``` Without `repartition(1)`: ``` == Physical Plan == LocalTableScan [_1#2] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI in this PR should verify the change. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47341 from HyukjinKwon/SPARK-48883-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 July 2024, 23:56:10 UTC
3755d51 [SPARK-48892][ML] Avoid per-row param read in `Tokenizer` ### What changes were proposed in this pull request? Inspired by https://github.com/apache/spark/pull/47258, I am checking other ML implementations, and find that we can also optimize `Tokenizer` in the same way ### Why are the changes needed? the function `createTransformFunc` is to build the udf for `UnaryTransformer.transform`: https://github.com/apache/spark/blob/d679dabdd1b5ad04b8c7deb1c06ce886a154a928/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L118 existing implementation read the params for each row. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI and manually tests: create test dataset ``` spark.range(1000000).select(uuid().as("uuid")).write.mode("overwrite").parquet("/tmp/regex_tokenizer.parquet") ``` duration ``` val df = spark.read.parquet("/tmp/regex_tokenizer.parquet") import org.apache.spark.ml.feature._ val tokenizer = new RegexTokenizer().setPattern("-").setInputCol("uuid") Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()) // warm up val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic ``` result (before this PR) ``` scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic val tic: Long = 1720613235068 val res5: Long = 50397 ``` result (after this PR) ``` scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic val tic: Long = 1720612871256 val res5: Long = 43748 ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47342 from zhengruifeng/opt_tokenizer. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 16 July 2024, 23:18:18 UTC
0e8d403 [SPARK-48909][ML][MLLIB] Uses SparkSession over SparkContext when writing metadata ### What changes were proposed in this pull request? This PR proposes to use SparkSession over SparkContext when writing metadata ### Why are the changes needed? See https://github.com/apache/spark/pull/47347#issuecomment-2229701812 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests should cover it. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47366 from HyukjinKwon/SPARK-48909. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 July 2024, 16:00:51 UTC
2e1a39c [SPARK-48896][ML][MLLIB] Avoid repartition when writing out the metadata ### What changes were proposed in this pull request? This PR proposes to remove `repartition(1)` when writing metadata in ML/MLlib. It already writes one file. ### Why are the changes needed? In order to remove unnecessary shuffle, see also https://github.com/apache/spark/pull/47341 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests should verify them. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47347 from HyukjinKwon/SPARK-48896. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 July 2024, 15:56:51 UTC
d7f633a [SPARK-48873][SQL] Use UnsafeRow in JSON parser ### What changes were proposed in this pull request? It uses `UnsafeRow` to represent struct result in the JSON parser. It saves memory compared to the current `GenericInternalRow`. The change is guarded by a flag and disabled by default. The benchmark shows that enabling the flag brings ~10% slowdown. This is basically expected because converting to `UnsafeRow` requires some work. The purpose of the PR is to provide an alternative to save memory. I did the following experiment. It generates a big `.gz` JSON file containing a single large array. Each array element is a struct with 50 string fields and will be parsed into a row by the JSON reader. ``` s = b'{"field00":null,"field01":"field01_<v>","field02":"field02_<v>","field03":"field03_<v>","field04":"field04_<v>","field05":"field05_<v>","field06":"field06_<v>","field07":"field07_<v>","field08":"field08_<v>","field09":"field09_<v>","field10":null,"field11":"field11_<v>","field12":"field12_<v>","field13":"field13_<v>","field14":"field14_<v>","field15":"field15_<v>","field16":"field16_<v>","field17":"field17_<v>","field18":"field18_<v>","field19":"field19_<v>","field20":null,"field21":"field21_<v>","field22":"field22_<v>","field23":"field23_<v>","field24":"field24_<v>","field25":"field25_<v>","field26":"field26_<v>","field27":"field27_<v>","field28":"field28_<v>","field29":"field29_<v>","field30":null,"field31":"field31_<v>","field32":"field32_<v>","field33":"field33_<v>","field34":"field34_<v>","field35":"field35_<v>","field36":"field36_<v>","field37":"field37_<v>","field38":"field38_<v>","field39":"field39_<v>","field40":null,"field41":"field41_<v>","field42":"field42_<v>","field43":"field43_<v>","field44":"field44_<v>","field45":"field45_<v>","field46":"field46_<v>","field47":"field47_<v>","field48":"field48_<v>","field49":"field49_<v>"}' import gzip def write(n): with gzip.open(f'json{n}.gz', 'w') as f: f.write(b'[') for i in range(n): if i != 0: f.write(b',') f.write(s.replace(b'<v>', str(i).encode('ascii'))) f.write(b']') write(100000) ``` Then it processes the file in Spark shell with the following command: ``` ./bin/spark-shell --conf spark.driver.memory=1g --conf spark.executor.memory=1g --master "local[1]" > val schema = "field00 string, field01 string, field02 string, field03 string, field04 string, field05 string, field06 string, field07 string, field08 string, field09 string, field10 string, field11 string, field12 string, field13 string, field14 string, field15 string, field16 string, field17 string, field18 string, field19 string, field20 string, field21 string, field22 string, field23 string, field24 string, field25 string, field26 string, field27 string, field28 string, field29 string, field30 string, field31 string, field32 string, field33 string, field34 string, field35 string, field36 string, field37 string, field38 string, field39 string, field40 string, field41 string, field42 string, field43 string, field44 string, field45 string, field46 string, field47 string, field48 string, field49 string" > spark.conf.set("spark.sql.json.useUnsafeRow", "false") > spark.read.schema(schema).option("multiline", "true").json("json100000.gz").selectExpr("sum(hash(struct(*)))").collect() ``` When the flag is off (the current behavior), the query can process 2.5e5 rows but fails to process 3e5 rows. When the flag is on, the query can process 8e5 rows but fails to process 9e5 rows. We can say this change reduces the memory consumption to about 1/3. ### Why are the changes needed? 
It reduces the memory requirement of JSON-related queries. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? A new JSON unit test with the config flag on. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47310 from chenhao-db/json_unsafe_row. Authored-by: Chenhao Li <chenhao.li@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 July 2024, 08:12:57 UTC
3923d7f [SPARK-48902][BUILD] Upgrade `commons-codec` to 1.17.1 ### What changes were proposed in this pull request? The pr aims to upgrade `commons-codec` from `1.17.0` to `1.17.1`. ### Why are the changes needed? The full release notes: https://commons.apache.org/proper/commons-codec/changes-report.html#a1.17.1 This version has fixed some bugs from the previous version, eg: - Md5Crypt now throws IllegalArgumentException on an invalid prefix ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47362 from panbingkun/SPARK-48902. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 July 2024, 06:47:56 UTC
d0a1c4d [SPARK-48846][PYTHON][DOCS][FOLLOWUP] Add a missing param doc in python api `partitioning` functions docs ### What changes were proposed in this pull request? Add a missing param in func docs of `partitioning.py`. ### Why are the changes needed? - Make python api docs better. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA and docs check. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47345 from wayneguow/py_f_docs. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 July 2024, 05:32:34 UTC
b9b7669 [SPARK-45190][SPARK-48897][PYTHON][CONNECT] Make `from_xml` support StructType schema ### What changes were proposed in this pull request? Make `from_xml` support StructType schema ### Why are the changes needed? StructType schema was supported in Spark Classic, but not in Spark Connect to address https://github.com/apache/spark/pull/43680#discussion_r1385332357 ### Does this PR introduce _any_ user-facing change? before: ``` from pyspark.sql.types import StructType, LongType import pyspark.sql.functions as sf data = [(1, '''<p><a>1</a></p>''')] df = spark.createDataFrame(data, ("key", "value")) schema = StructType().add("a", LongType()) df.select(sf.from_xml(df.value, schema)).show() --------------------------------------------------------------------------- AnalysisException Traceback (most recent call last) Cell In[1], line 7 ... AnalysisException: [PARSE_SYNTAX_ERROR] Syntax error at or near '{'. SQLSTATE: 42601 JVM stacktrace: org.apache.spark.sql.AnalysisException at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:278) at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:98) at org.apache.spark.sql.catalyst.parser.AbstractParser.parseDataType(parsers.scala:40) at org.apache.spark.sql.types.DataType$.$anonfun$fromDDL$1(DataType.scala:126) at org.apache.spark.sql.types.DataType$.parseTypeWithFallback(DataType.scala:145) at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:127) ``` after: ``` +---------------+ |from_xml(value)| +---------------+ | {1}| +---------------+ ``` ### How was this patch tested? added doctest and enabled unit tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #47355 from zhengruifeng/from_xml_struct. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 July 2024, 05:21:15 UTC
4a44a9e [SPARK-48884][PYTHON] Remove unused helper function `PythonSQLUtils.makeInterval` ### What changes were proposed in this pull request? Remove unused helper function `PythonSQLUtils.makeInterval` ### Why are the changes needed? As a followup cleanup of https://github.com/apache/spark/commit/bd14d6412a3124eecce1493fcad436280915ba71 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? NO Closes #47330 from zhengruifeng/py_sql_utils_cleanup. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 16 July 2024, 02:31:50 UTC
c0943b3 [SPARK-48885][SQL] Make some subclasses of RuntimeReplaceable override replacement to lazy val ### What changes were proposed in this pull request? This PR makes 8 subclasses of RuntimeReplaceable override replacement to lazy val to align with other 60+ members and avoid recreation of new expressions ```scala Value read (51 usages found) spark-catalyst_2.13 (50 usages found) AnyValue.scala (1 usage found) 54 override lazy val replacement: Expression = First(child, ignoreNulls) arithmetic.scala (1 usage found) 127 override lazy val replacement: Expression = child bitmapExpressions.scala (3 usages found) 52 override lazy val replacement: Expression = StaticInvoke( 85 override lazy val replacement: Expression = StaticInvoke( 134 override lazy val replacement: Expression = StaticInvoke( boolAggregates.scala (2 usages found) 39 override lazy val replacement: Expression = Min(child) 61 override lazy val replacement: Expression = Max(child) collationExpressions.scala (1 usage found) 123 override def replacement: Expression = { collectionOperations.scala (5 usages found) 168 override lazy val replacement: Expression = Size(child, legacySizeOfNull = false) 231 override lazy val replacement: Expression = ArrayContains(MapKeys(left), right) 1596 override lazy val replacement: Expression = new ArrayInsert(left, Literal(1), right) 1631 override lazy val replacement: Expression = new ArrayInsert(left, Literal(-1), right) 5203 override lazy val replacement: Expression = ArrayFilter(child, lambda) CountIf.scala (1 usage found) 42 override lazy val replacement: Expression = Count(new NullIf(child, Literal.FalseLiteral)) datetimeExpressions.scala (2 usages found) 2070 override lazy val replacement: Expression = format.map { f => 2145 override lazy val replacement: Expression = format.map { f => linearRegression.scala (5 usages found) 45 override lazy val replacement: Expression = Count(Seq(left, right)) 79 override lazy val replacement: Expression = 114 override lazy val replacement: Expression = 176 override lazy val replacement: Expression = 232 override lazy val replacement: Expression = misc.scala (3 usages found) 294 override lazy val replacement: Expression = StaticInvoke( 397 override lazy val replacement: Expression = StaticInvoke( 475 override lazy val replacement: Expression = StaticInvoke( percentiles.scala (2 usages found) 346 override def replacement: Expression = percentile 365 override def replacement: Expression = percentile regexpExpressions.scala (3 usages found) 262 override lazy val replacement: Expression = Like(Lower(left), Lower(right), escapeChar) 1034 override lazy val replacement: Expression = 1072 override lazy val replacement: Expression = stringExpressions.scala (14 usages found) 561 override lazy val replacement = 723 override lazy val replacement: Expression = Invoke(input, "isValid", BooleanType) 770 override lazy val replacement: Expression = Invoke(input, "makeValid", input.dataType) 810 override lazy val replacement: Expression = StaticInvoke( 859 override lazy val replacement: Expression = StaticInvoke( 1854 override lazy val replacement: Expression = StaticInvoke( 2246 override lazy val replacement: Expression = If( 2284 override lazy val replacement: Expression = Substring(str, Literal(1), len) 2713 override def replacement: Expression = StaticInvoke( 2940 override def replacement: Expression = StaticInvoke( 3004 override val replacement: Expression = StaticInvoke( 3075 override lazy val replacement: Expression = if (fmt == null) { 3473 override lazy val 
replacement: Expression = 3533 override lazy val replacement: Expression = StaticInvoke( toFromAvroSqlFunctions.scala (2 usages found) 96 override def replacement: Expression = { 168 override def replacement: Expression = { urlExpressions.scala (2 usages found) 55 override def replacement: Expression = 92 override def replacement: Expression = variantExpressions.scala (3 usages found) 58 override lazy val replacement: Expression = StaticInvoke( 100 override lazy val replacement: Expression = StaticInvoke( 635 override lazy val replacement: Expression = StaticInvoke( spark-examples_2.13 (1 usage found) AgeExample.scala (1 usage found) 27 override lazy val replacement: Expression = SubtractDates(CurrentDate(), birthday) ``` ### Why are the changes needed? Improve RuntimeReplaceable implementations ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? existing tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #47333 from yaooqinn/SPARK-48885. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 16 July 2024, 02:18:12 UTC
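A simplified, self-contained Scala sketch (not Spark's actual classes) of why overriding `replacement` as a `lazy val` helps: a `def` rebuilds the replacement expression on every access, while a `lazy val` builds it once and caches it.

```scala
// Toy model of the pattern; Expression, Leaf and Count are stand-ins, not Spark's real types.
trait Expression
case class Leaf(name: String) extends Expression
case class Count(child: Expression) extends Expression { println(s"built Count($child)") }

trait Replaceable { def replacement: Expression }

// Overriding with `def`: a fresh Count is constructed on every access.
class ViaDef extends Replaceable { override def replacement: Expression = Count(Leaf("c")) }

// Overriding with `lazy val`: Count is constructed once, then reused.
class ViaLazyVal extends Replaceable { override lazy val replacement: Expression = Count(Leaf("c")) }

object Demo extends App {
  val d = new ViaDef;     d.replacement; d.replacement // prints "built Count(...)" twice
  val l = new ViaLazyVal; l.replacement; l.replacement // prints "built Count(...)" once
}
```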
d8820a0 [SPARK-47172][DOCS][FOLLOWUP] Fix spark.network.crypto.cipher since-version field on security page ### What changes were proposed in this pull request? Given that SPARK-47172 was an improvement but got merged into 3.4/3.5, we need to fix the since version to eliminate misunderstandings. ### Why are the changes needed? Doc fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Doc build. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47353 from yaooqinn/SPARK-47172. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 16 July 2024, 02:13:44 UTC
23080ac [SPARK-45155][CONNECT] Add API Docs for Spark Connect JVM/Scala Client This PR is based on https://github.com/apache/spark/pull/42911. ### What changes were proposed in this pull request? - Enables Scala and Java Unidoc generation for the `connectClient` project. - Generates docs and moves them to the `docs/api/connect` folder. Some methods' documentation in the connect directory had to be modified to remove references to avoid javadoc generation failures. **References API docs in the main index page and the global floating header will be added in a later PR.** ### Why are the changes needed? Increasing scope of documentation for the Spark Connect JVM/Scala Client project. ### Does this PR introduce _any_ user-facing change? Nope. ### How was this patch tested? Manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47332 from xupefei/connnect-doc-web. Authored-by: Paddy Xu <xupaddy@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 July 2024, 00:05:31 UTC
5dafa4c [SPARK-48350][SQL] Introduction of Custom Exceptions for Sql Scripting ### What changes were proposed in this pull request? Previous PRs introduced basic changes for SQL Scripting. This PR is a follow-up to introduce custom exceptions that can arise while using SQL Scripting language. ### Why are the changes needed? The intent is to add precise errors for various SQL scripting concepts. ### Does this PR introduce any user-facing change? Users will now see specific SQL Scripting language errors. ### How was this patch tested? There are tests for newly introduced parser changes: SqlScriptingParserSuite - unit tests for execution nodes. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47147 from miland-db/sql_batch_custom_errors. Lead-authored-by: Milan Dankovic <milan.dankovic@databricks.com> Co-authored-by: David Milicevic <david.milicevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 July 2024, 23:37:54 UTC
5ff6a52 [SPARK-48886][SS] Add version info to changelog v2 to allow for easier evolution ### What changes were proposed in this pull request? Add version info to changelog v2 to allow for easier evolution ### Why are the changes needed? Currently the changelog file format does not add the version info. With format v2, we propose to add this to the changelog file itself to make future evolution easier. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Augmented unit tests ``` ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.common... [info] Run completed in 4 minutes, 23 seconds. [info] Total number of tests run: 176 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 176, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47336 from anishshri-db/task/SPARK-48886. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 15 July 2024, 22:57:17 UTC
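A conceptual Scala sketch of the idea described above: write a version marker at the head of the changelog file so readers can dispatch on it as the format evolves. The file layout, method names, and version constant here are assumptions for illustration, not the actual state-store format.

```scala
// Conceptual sketch only; Spark's real changelog reader/writer live in the RocksDB
// state store code and use their own binary layout.
import java.io.{DataInputStream, DataOutputStream, FileInputStream, FileOutputStream}

object ChangelogVersionSketch {
  val Version = 2 // hypothetical version constant for "changelog v2"

  def write(path: String, entries: Seq[(String, String)]): Unit = {
    val out = new DataOutputStream(new FileOutputStream(path))
    try {
      out.writeInt(Version)                                   // version header written first
      entries.foreach { case (k, v) => out.writeUTF(k); out.writeUTF(v) }
    } finally out.close()
  }

  def readVersion(path: String): Int = {
    val in = new DataInputStream(new FileInputStream(path))
    try in.readInt() finally in.close()                       // readers can branch on this
  }
}
```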
40899c1 [SPARK-48899][K8S] Fix `ENV` key value format in K8s Dockerfiles ### What changes were proposed in this pull request? This PR aims to fix `ENV` key value format in K8s Dockerfiles. ### Why are the changes needed? To follow the Docker guideline to fix the following legacy format. - https://docs.docker.com/reference/build-checks/legacy-key-value-format/ ``` - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47357 from dongjoon-hyun/SPARK-48899. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 July 2024, 20:11:06 UTC
b6c0525 [MINOR][SQL][TESTS] Fix compilation warning `adaptation of an empty argument list by inserting () is deprecated` ### What changes were proposed in this pull request? The pr aims to fix compilation warning: `adaptation of an empty argument list by inserting () is deprecated` ### Why are the changes needed? Fix compilation warning. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually check. Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47350 from panbingkun/ParquetCommitterSuite_deprecated. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 15 July 2024, 09:11:17 UTC
95de02e [SPARK-48441][SQL] Fix StringTrim behaviour for non-UTF8_BINARY collations ### What changes were proposed in this pull request? String searching in UTF8_LCASE now works on character-level, rather than on byte-level. For example: `ltrim("İ", "i")` now returns `"İ"`, because there exist **no characters** in `"İ"`, starting from the left, such that lowercased version of those characters are equal to `"i"`. Note, however, that there is a byte subsequence of `"İ"` such that lowercased version of that UTF-8 byte sequence equals to `"i"` (so the new behaviour is different than the old behaviour). Also, translation for ICU collations works by repeatedly trimming the longest possible substring that matches a character in the trim string, starting from the left side of the input string, until trimming is done. ### Why are the changes needed? Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_LCASE (see example above). ### Does this PR introduce _any_ user-facing change? Yes, behaviour of `trim*` expressions is changed for collated strings for edge cases with one-to-many case mapping. ### How was this patch tested? New unit tests in `CollationSupportSuite` and new e2e sql tests in `CollationStringExpressionsSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46762 from uros-db/alter-trim. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 15 July 2024, 08:33:53 UTC
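A plain Scala illustration (JVM string behavior, not Spark code) of why the `ltrim("İ", "i")` edge case exists: with the default (non-Turkish) locale, the full lowercase mapping of "İ" (U+0130) is two code points, so a byte- or prefix-level match sees a leading "i" even though no single character of "İ" lowercases to exactly "i".

```scala
// Shown only to motivate the character-level fix described above.
val s = "\u0130"                 // "İ" (LATIN CAPITAL LETTER I WITH DOT ABOVE)
val lower = s.toLowerCase        // full case mapping: "i" + U+0307 (combining dot above)
println(lower.length)            // 2
println(lower.startsWith("i"))   // true: naive lowercased prefix matching would trim the "i"
println(lower == "i")            // false: the lowercased character as a whole is not "i",
                                 // so character-level trimming under UTF8_LCASE keeps "İ"
```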
9bff2c8 [SPARK-48463][ML] Make StringIndexer support nested input columns ### What changes were proposed in this pull request? Make StringIndexer support nested input columns ### Why are the changes needed? User demand. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Unit tests. ### Was this patch authored or co-authored using generative AI tooling? Closes #47283 from WeichenXu123/SPARK-48463. Lead-authored-by: Weichen Xu <weichen.xu@databricks.com> Co-authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 15 July 2024, 07:19:59 UTC
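A hedged Scala sketch of what this enables, assuming a spark-shell session and that nested fields are addressed with the usual dotted path; the DataFrame and column names are made up for illustration:

```scala
// Sketch under the assumption that setInputCol accepts a dotted path to a nested field.
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.functions.struct
import spark.implicits._

val df = spark.range(0, 4)
  .selectExpr("id", "CASE WHEN id % 2 = 0 THEN 'a' ELSE 'b' END AS category")
  .select($"id", struct($"category").as("info"))   // nest the string column in a struct

new StringIndexer()
  .setInputCol("info.category")                    // nested input column (the new capability)
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)
  .show()
```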
d343b23 [SPARK-48894][TESTS] Upgrade `docker-java` to 3.4.0 ### What changes were proposed in this pull request? This PR aims to upgrade `docker-java` to 3.4.0. ### Why are the changes needed? There are some improvements, such as: - Enhancements: Enable protocol configuration of SSLContext (https://github.com/docker-java/docker-java/pull/2337) - Bug Fixes: Consider already existing images as successful pulls (https://github.com/docker-java/docker-java/pull/2335) Full release notes: https://github.com/docker-java/docker-java/releases/tag/3.4.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47344 from wayneguow/SPARK-48894. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> 15 July 2024, 06:57:14 UTC
0855473 [SPARK-48880][CORE] Avoid throwing NullPointerException if driver plugin fails to initialize ### What changes were proposed in this pull request? This PR skips clearing the memoryStore if the memoryManager is null. This can happen if the driver plugin fails to initialize, since the MemoryManager is initialized after the DriverPlugin. ### Why are the changes needed? Before this change it would throw: ``` {"class":"java.lang.NullPointerException","msg":"Cannot invoke \"org.apache.spark.memory.MemoryManager.maxOnHeapStorageMemory()\" because \"this.memoryManager\" is null","stacktrace":[{"class":"org.apache.spark.storage.memory.MemoryStore","method":"maxMemory","file":"MemoryStore.scala","line":110}, {"class":"org.apache.spark.storage.memory.MemoryStore","method":"<init>","file":"MemoryStore.scala","line":113}, {"class":"org.apache.spark.storage.BlockManager","method":"memoryStore$lzycompute","file":"BlockManager.scala","line":234}, {"class":"org.apache.spark.storage.BlockManager","method":"memoryStore","file":"BlockManager.scala","line":233}, {"class":"org.apache.spark.storage.BlockManager","method":"stop","file":"BlockManager.scala","line":2167}, {"class":"org.apache.spark.SparkEnv","method":"stop","file":"SparkEnv.scala","line":118}, {"class":"org.apache.spark.SparkContext","method":"$anonfun$stop$25","file":"SparkContext.scala","line":2369}, {"class":"org.apache.spark.util.Utils$","method":"tryLogNonFatalError","file":"Utils.scala","line":1299}, {"class":"org.apache.spark.SparkContext","method":"stop","file":"SparkContext.scala","line":2369} ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47321 from ulysses-you/minor. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com> 15 July 2024, 03:26:13 UTC
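A minimal Scala sketch of the defensive pattern described above, under the assumption that shutdown just needs to tolerate a partially initialized environment; this is not the actual BlockManager code.

```scala
// Illustration only: guard cleanup work that depends on a component which may never
// have been initialized, e.g. when a driver plugin fails before the MemoryManager exists.
class ShutdownExample(memoryManager: Option[AnyRef]) {
  private val memoryStore = scala.collection.mutable.Map.empty[String, Array[Byte]]

  def stop(): Unit = {
    // Skip the memoryStore cleanup entirely when the memoryManager was never set up.
    if (memoryManager.isDefined) {
      memoryStore.clear()
    }
  }
}
```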
23b2849 [SPARK-48888][SS] Remove snapshot creation based on changelog ops size ### What changes were proposed in this pull request? Remove snapshot creation based on changelog ops size ### Why are the changes needed? Current mechanism to create snapshot is based on num batches or num ops in changelog. However, the latter is not configurable and might not be analogous to large snapshot sizes in all cases leading to variance in e2e latency. Hence, removing this condition for now. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Augmented unit tests ``` ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.common... [info] Run completed in 5 minutes, 7 seconds. [info] Total number of tests run: 176 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 176, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 332 s (05:32), completed Jul 12, 2024, 2:46:44 PM ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #47338 from anishshri-db/task/SPARK-48888. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 15 July 2024, 03:14:24 UTC
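A before/after Scala sketch of the trigger logic described above, with invented names; the real condition lives in the RocksDB state store provider.

```scala
// Hypothetical names for illustration; only the shape of the condition matters.
final case class ChangelogStats(batchesSinceSnapshot: Int, opsInChangelog: Long)

// Before: snapshot when enough batches have elapsed OR the changelog holds "too many" ops
// (the second threshold was not configurable and did not reliably track snapshot size).
def shouldSnapshotBefore(s: ChangelogStats, minBatches: Int, maxOps: Long): Boolean =
  s.batchesSinceSnapshot >= minBatches || s.opsInChangelog >= maxOps

// After: snapshot purely on the batch interval.
def shouldSnapshotAfter(s: ChangelogStats, minBatches: Int): Boolean =
  s.batchesSinceSnapshot >= minBatches
```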
effd4d8 [SPARK-48834][SQL] Disable variant input/output to python scalar UDFs, UDTFs, UDAFs during query compilation ### What changes were proposed in this pull request? Throws an exception if a variant is the input/output type to/from a Python UDF, UDAF, or UDTF. ### Why are the changes needed? Currently, variant input/output types to scalar UDFs either fail during execution or return a `net.razorvine.pickle.objects.ClassDictConstructor` to the user's Python code. For a better UX, we should fail during query compilation instead, and block returning `ClassDictConstructor` to user code, since we eventually want to return `VariantVal`s to user code. ### Does this PR introduce _any_ user-facing change? Yes - attempting to use variants in Python UDFs now throws an exception rather than returning a `ClassDictConstructor` as before. However, we want to make this change now because we eventually want to return `VariantVal`s to user code and do not want users relying on the current behavior. ### How was this patch tested? Added UTs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47253 from richardc-db/variant_scalar_udfs. Authored-by: Richard Chen <r.chen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 July 2024, 02:02:55 UTC
206cc1a [SPARK-48613][SQL] SPJ: Support auto-shuffle one side + less join keys than partition keys ### What changes were proposed in this pull request? This is the final planned SPJ scenario: auto-shuffle one side + less join keys than partition keys. Background: - Auto-shuffle works by creating ShuffleExchange for the non-partitioned side, with a clone of the partitioned side's KeyGroupedPartitioning. - "Less join key than partition key" works by 'projecting' all partition values by join keys (ie, keeping only partition columns that are join columns). It makes a target KeyGroupedShuffleSpec with 'projected' partition values, and then pushes this down to BatchScanExec. The BatchScanExec then 'groups' its projected partition value (except in the skew case but that's a different story..). This combination is hard because the SPJ planning calls is spread in several places in this scenario. Given two sides, a non-partitioned side and a partitioned side, and the join keys are only a subset: 1. EnsureRequirements creates the target KeyGroupedShuffleSpec from the join's required distribution (ie, using only the join keys, not all partition keys). 2. EnsureRequirements copies this to the non-partitoned side's KeyGroupedPartition (for the auto-shuffle case) 3. BatchScanExec groups the partitions (for the partitioned side), including by join keys (if they differ from partition keys). Take the example partition columns (id, name) , and partition values: (1, "bob"), (2, "alice"), (2, "sam"). Projection leaves us (1, 2, 2), and the final grouped partition values are (1, 2). The problem is, that the two sides of the join do not match at all times. After the steps 1 and 2, the partitioned side has the 'projected' partition values (1, 2, 2), and the non-partitioned side creates a matching KeyGroupedPartitioning (1, 2, 2) for ShuffleExechange. But on step 3, the BatchScanExec for partitioned side groups the partitions to become (1, 2), but the non-partitioned side does not group and still retains (1, 2, 2) partitions. This leads to following assert error from the join: ``` requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions. java.lang.IllegalArgumentException: requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions. at scala.Predef$.require(Predef.scala:337) at org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.<init>(partitioning.scala:550) at org.apache.spark.sql.execution.joins.ShuffledJoin.outputPartitioning(ShuffledJoin.scala:49) at org.apache.spark.sql.execution.joins.ShuffledJoin.outputPartitioning$(ShuffledJoin.scala:47) at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:39) at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$ensureDistributionAndOrdering$1(EnsureRequirements.scala:66) at scala.collection.immutable.Vector1.map(Vector.scala:2140) at scala.collection.immutable.Vector1.map(Vector.scala:385) at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:65) at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:657) at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:632) ``` The fix is to do the de-duplication in first pass. 1. 
Pushing down join keys to the BatchScanExec to return a de-duped outputPartitioning (partitioned side) 2. Creating the non-partitioned side's KeyGroupedPartitioning with de-duped partition keys (non-partitioned side). ### Why are the changes needed? This is the last planned SPJ scenario that is not yet supported. ### How was this patch tested? Updated the existing unit test in KeyGroupedPartitionSuite. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47064 from szehon-ho/spj_less_join_key_auto_shuffle. Authored-by: Szehon Ho <szehon.apache@gmail.com> Signed-off-by: Chao Sun <chao@openai.com> 14 July 2024, 23:53:32 UTC
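A plain Scala sketch of the projection and grouping described above, using the commit's own example values; this models only the partition-value bookkeeping, not Spark's planner.

```scala
// Partition columns are (id, name); the join key is only id.
val partitionValues = Seq((1, "bob"), (2, "alice"), (2, "sam"))

// 'Project' partition values down to the join keys...
val projected = partitionValues.map { case (id, _) => id }   // Seq(1, 2, 2)

// ...then group them, which is what the partitioned side's BatchScanExec does.
val grouped = projected.distinct                              // Seq(1, 2)

// The fix makes the shuffled (non-partitioned) side target the grouped values (1, 2)
// rather than the ungrouped (1, 2, 2), so both sides agree on numPartitions.
println(s"projected=$projected grouped=$grouped")
```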
e07dc5a [SPARK-48714][SPARK-48794][FOLLOW-UP][PYTHON][DOCS] Add `mergeInto` to API reference ### What changes were proposed in this pull request? Add `mergeInto` to API reference ### Why are the changes needed? this feature was missing in doc ### Does this PR introduce _any_ user-facing change? yes, doc change ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #47329 from zhengruifeng/py_doc_merge_into. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 July 2024, 23:31:23 UTC
c4766a2 [SPARK-48895][R][INFRA] Use R 4.4.1 in `windows` R GitHub Action job ### What changes were proposed in this pull request? This PR aims to use R 4.4.1 in the `windows` R GitHub Action job. ### Why are the changes needed? R 4.4.1 is the latest release, which was released on 2024-06-14. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47346 from dongjoon-hyun/SPARK-48895. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 July 2024, 23:30:56 UTC