https://github.com/apache/spark

c09039a [SPARK-48597][SQL] Introduce a marker for isStreaming property in text representation of logical plan ### What changes were proposed in this pull request? This PR proposes to introduce a marker for the isStreaming property in the text representation of the logical plan. The marker will be `~`, alongside `!` (invalid) and `'` (unresolved). This PR keeps the prefix marker to a single character (as opposed to allowing up to two characters). This is acceptable in practice, since the marker for isStreaming is mainly useful when looking into a plan that has already been analyzed - that said, it's unlikely that we would need to see one of the existing markers and the streaming marker at the same time. ### Why are the changes needed? This makes it much easier to track down query optimizer issues happening with streaming queries. For example, here is the rule application which triggered [SPARK-47305](https://issues.apache.org/jira/browse/SPARK-47305): ``` === Applying Rule org.apache.spark.sql.catalyst.optimizer.PruneFilters === WriteToMicroBatchDataSource MemorySink, 49358516-bb57-4f10-badf-00c9f5e2484b, Append, 1 WriteToMicroBatchDataSource MemorySink, 49358516-bb57-4f10-badf-00c9f5e2484b, Append, 1 +- Project [value#45] +- Project [value#45] +- Join Inner +- Join Inner :- Project [value#45] :- Project [value#45] : +- StreamingDataSourceV2ScanRelation[value#45] MemoryStreamDataSource : +- StreamingDataSourceV2ScanRelation[value#45] MemoryStreamDataSource +- Project +- Project ! +- Filter false +- LocalRelation <empty>, [id#54L] ! +- Range (1, 5, step=1, splits=Some(2)) ``` The bug in SPARK-47305 was that the LocalRelation above was incorrectly marked as `streaming=true` where it should be `streaming=false`. There is no notion of the isStreaming flag in the text representation of LocalRelation, hence from the text plan we would never know the rule had a bug. Even if we showed the value of isStreaming in LocalRelation, the depth of the subtree could be huge in practice, and it is not friendly to go down to the leaf node to figure out the isStreaming value of the entire subtree. After this PR, the above rule information will be changed as below: ``` === Applying Rule org.apache.spark.sql.catalyst.optimizer.PruneFilters === ~WriteToMicroBatchDataSource MemorySink, dcf936d0-8580-47e9-b898-4fffffc21db5, Append, 1 ~WriteToMicroBatchDataSource MemorySink, dcf936d0-8580-47e9-b898-4fffffc21db5, Append, 1 +- ~Project [value#45] +- ~Project [value#45] +- ~Join Inner +- ~Join Inner :- ~Project [value#45] :- ~Project [value#45] : +- StreamingDataSourceV2ScanRelation[value#45] MemoryStreamDataSource : +- StreamingDataSourceV2ScanRelation[value#45] MemoryStreamDataSource ! +- Project +- ~Project ! +- Filter false +- ~LocalRelation <empty>, [id#54L] ! +- Range (1, 5, step=1, splits=Some(2)) ``` Now it is obvious that the isStreaming flag of the leaf node has changed. Also, to check the isStreaming flag of the children of a Join, we only need to look at the first node of each child subtree, instead of going down to the leaf nodes. ### Does this PR introduce _any_ user-facing change? Yes, since the textual representation of the logical plan will change a bit. But it only applies to streaming Datasets, and the textual representation of the logical plan is arguably not a public API. (Keeping backward compatibility of the text is technically very hard.) ### How was this patch tested? Existing UTs for regression testing on batch and streaming queries. For streaming queries, this PR updated the golden file to match the change.
### Was this patch authored or co-authored using generative AI tooling? No. Closes #46953 from HeartSaVioR/SPARK-48597. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 June 2024, 00:18:34 UTC
347f9c6 [SPARK-48302][PYTHON] Preserve nulls in map columns in PyArrow Tables ### What changes were proposed in this pull request? This is a small follow-up to #46529. It fixes a known issue affecting PyArrow Tables passed to `spark.createDataFrame()`. After this PR, if the user is running PyArrow 17.0.0 or higher, null values in MapArray columns containing nested fields or timestamps will be preserved. ### Why are the changes needed? Before this PR, null values in MapArray columns containing nested fields or timestamps are replaced by empty lists when a PyArrow Table is passed to `spark.createDataFrame()`. ### Does this PR introduce _any_ user-facing change? It prevents loss of nulls in the case described above. There are no other user-facing changes. ### How was this patch tested? A test is included. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46837 from ianmcook/SPARK-48302. Authored-by: Ian Cook <ianmcook@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 June 2024, 05:12:03 UTC
0775ea7 [SPARK-48611][CORE] Log TID for input split in HadoopRDD and NewHadoopRDD ### What changes were proposed in this pull request? Log `TID` for "input split" in `HadoopRDD` and `NewHadoopRDD` ### Why are the changes needed? This change benefits both the structured-logging-enabled and -disabled cases. When structured logging is disabled and executor cores > 1, the logs of tasks are mixed in stdout, something like ``` 24/06/12 21:40:10 INFO Executor: Running task 26.0 in stage 2.0 (TID 10) 24/06/12 21:40:10 INFO Executor: Running task 27.0 in stage 2.0 (TID 11) 24/06/12 21:40:11 INFO HadoopRDD: Input split: hdfs://.../part-00025-53bc40ae-399f-4291-b5ac-617c980deb86-c000:0+124138257 24/06/12 21:40:11 INFO HadoopRDD: Input split: hdfs://.../part-00045-53bc40ae-399f-4291-b5ac-617c980deb86-c000:0+121726684 ``` and it is hard to tell which file is read by which task because they run in parallel. If something goes wrong, the log prints the `TID` and the exception stack trace; the error may be related to the input data. Sometimes the exception message is clear enough to show which file the input data comes from, but sometimes not, and in the latter case the current log is not clear enough to let us identify the bad file quickly. ``` 24/06/12 21:40:18 ERROR Executor: Exception in task 27.0 in stage 2.0 (TID 11) (... exception message) (... stacktraces) ``` When structured logging is enabled, exposing TID as a LogKey makes the logs more selective. ### Does this PR introduce _any_ user-facing change? Yes, it supplies additional information in logs. ### How was this patch tested? Review, as it only touches log contents. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46966 from pan3793/SPARK-48611. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 15 June 2024, 03:21:09 UTC
8ee8aba [SPARK-48621][SQL] Fix Like simplification in Optimizer for collated strings ### What changes were proposed in this pull request? Enable `LikeSimplification` optimizer rule for collated strings. ### Why are the changes needed? Optimize how `Like` expression works with collated strings and ensure collation awareness when replacing `Like` expressions with `StartsWith` / `EndsWith` / `Contains` / `EqualTo` under special conditions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New e2e sql tests in `CollationSQLRegexpSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46976 from uros-db/like-simplification. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Kent Yao <yao@apache.org> 14 June 2024, 09:32:16 UTC
aa4bfb0 Revert "[SPARK-48591][PYTHON] Simplify the if-else branches with `F.lit`" revert https://github.com/apache/spark/pull/46946 since it may cause circular import issue ``` File "/home/jenkins/python/pyspark/sql/connect/functions/__init__.py", line 20, in <module> from pyspark.sql.connect.functions.builtin import * # noqa: F401,F403 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jenkins/python/pyspark/sql/connect/functions/builtin.py", line 60, in <module> from pyspark.sql.connect.udf import _create_py_udf File "/home/jenkins/python/pyspark/sql/connect/udf.py", line 38, in <module> from pyspark.sql.connect.column import Column ImportError: cannot import name 'Column' from partially initialized module 'pyspark.sql.connect.column' (most likely due to a circular import) (/home/jenkins/python/pyspark/sql/connect/column.py) Had test failures in delta.connect.tests.test_deltatable with python; see logs. ``` Closes #46985 from zhengruifeng/revert_simplify_column. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 14 June 2024, 09:01:06 UTC
2d2bedf [SPARK-48056][CONNECT][FOLLOW-UP] Scala Client re-execute plan if a SESSION_NOT_FOUND error is raised and no partial response was received ### What changes were proposed in this pull request? This change lets a Scala Spark Connect client reattempt execution of a plan when it receives a SESSION_NOT_FOUND error from the Spark Connect service if it has not received any partial responses. This is a Scala version of the previous fix of the same issue - https://github.com/apache/spark/pull/46297. ### Why are the changes needed? Spark Connect clients often get a spurious error from the Spark Connect service if the service is busy or the network is congested. This error leads to a situation where the client immediately attempts to reattach without the service being aware of the client; this leads to a query failure. ### Does this PR introduce _any_ user-facing change? Previously, a Scala Spark Connect client would fail with the error code "INVALID_HANDLE.SESSION_NOT_FOUND" in the very first attempt to make a request to the service, but with this change, the client will automatically retry. ### How was this patch tested? Attached unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46971 from changgyoopark-db/SPARK-48056. Authored-by: Changgyoo Park <changgyoo.park@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 14 June 2024, 08:31:37 UTC
dd8b05f [SPARK-42252][CORE] Add `spark.shuffle.localDisk.file.output.buffer` and deprecate `spark.shuffle.unsafe.file.output.buffer` ### What changes were proposed in this pull request? Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config spark.shuffle.localDisk.file.output.buffer instead. ### Why are the changes needed? The old config was designed to be used in UnsafeShuffleWriter, but it is now used by all local shuffle writers through LocalDiskShuffleMapOutputWriter, introduced by #25007. ### Does this PR introduce _any_ user-facing change? The old config still works, but users are advised to use the new one. ### How was this patch tested? Passed existing tests. Closes #39819 from wayneguow/shuffle_output_buffer. Authored-by: wayneguow <guow93@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> 14 June 2024, 07:11:33 UTC
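A minimal Scala sketch of how the renamed key would be set; the `64k` value below is purely illustrative, not a recommended size:

```scala
import org.apache.spark.SparkConf

// Sketch only: the new key replaces the deprecated one; the buffer size shown is arbitrary.
val conf = new SparkConf()
  .set("spark.shuffle.localDisk.file.output.buffer", "64k")   // new, preferred key
  // .set("spark.shuffle.unsafe.file.output.buffer", "64k")   // deprecated key, still honored
```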
878de00 [SPARK-48626][CORE] Change the scope of object LogKeys as private in Spark ### What changes were proposed in this pull request? Change the scope of object LogKeys as private in Spark. ### Why are the changes needed? LogKeys are internal and developing. Making it private can avoid future confusion or compiling failures. This is suggested by pan3793 in https://github.com/apache/spark/pull/46947#issuecomment-2167164424 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #46983 from gengliangwang/changeScope. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: yangjie01 <yangjie01@baidu.com> 14 June 2024, 07:08:41 UTC
3831886 [SPARK-48625][BUILD] Upgrade `mssql-jdbc` to 12.6.2.jre11 ### What changes were proposed in this pull request? Upgrade `mssql-jdbc` to 12.6.2.jre11 ### Why are the changes needed? There are some issue fixes and enhancements: https://github.com/microsoft/mssql-jdbc/releases/tag/v12.6.2 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46981 from wayneguow/mssql-jdbc. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> 14 June 2024, 05:54:54 UTC
157b1e3 [SPARK-48612][SQL][SS] Cleanup deprecated api usage related to commons-pool2 ### What changes were proposed in this pull request? This pr make the following changes - o.a.c.pool2.impl.BaseObjectPoolConfig#setMinEvictableIdleTime -> o.a.c.pool2.impl.BaseObjectPoolConfig#setMinEvictableIdleDuration - o.a.c.pool2.impl.BaseObjectPoolConfig#setSoftMinEvictableIdleTime -> o.a.c.pool2.impl.BaseObjectPoolConfig#setSoftMinEvictableIdleDuration to fix the following compilation warnings related to 'commons-pool2': ``` [WARNING] [Warn] /Users/yangjie01/SourceCode/git/spark-mine-13/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/InternalKafkaConsumerPool.scala:186: method setMinEvictableIdleTime in class BaseObjectPoolConfig is deprecated Applicable -Wconf / nowarn filters for this warning: msg=<part of the message>, cat=deprecation, site=org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool.PoolConfig.init, origin=org.apache.commons.pool2.impl.BaseObjectPoolConfig.setMinEvictableIdleTime [WARNING] [Warn] /Users/yangjie01/SourceCode/git/spark-mine-13/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/InternalKafkaConsumerPool.scala:187: method setSoftMinEvictableIdleTime in class BaseObjectPoolConfig is deprecated Applicable -Wconf / nowarn filters for this warning: msg=<part of the message>, cat=deprecation, site=org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumerPool.PoolConfig.init, origin=org.apache.commons.pool2.impl.BaseObjectPoolConfig.setSoftMinEvictableIdleTime ``` The fix refers to: - https://github.com/apache/commons-pool/blob/e5c44f5184a55a58fef4a1efec8124d162a348bd/src/main/java/org/apache/commons/pool2/impl/BaseObjectPoolConfig.java#L765-L789 - https://github.com/apache/commons-pool/blob/e5c44f5184a55a58fef4a1efec8124d162a348bd/src/main/java/org/apache/commons/pool2/impl/BaseObjectPoolConfig.java#L815-L839 ```java /** * Sets the value for the {code minEvictableIdleTime} configuration attribute for pools created with this configuration instance. * * param minEvictableIdleTime The new setting of {code minEvictableIdleTime} for this configuration instance * see GenericObjectPool#getMinEvictableIdleDuration() * see GenericKeyedObjectPool#getMinEvictableIdleDuration() * since 2.10.0 * deprecated Use {link #setMinEvictableIdleDuration(Duration)}. */ Deprecated public void setMinEvictableIdleTime(final Duration minEvictableIdleTime) { this.minEvictableIdleDuration = PoolImplUtils.nonNull(minEvictableIdleTime, DEFAULT_MIN_EVICTABLE_IDLE_TIME); } /** * Sets the value for the {code minEvictableIdleTime} configuration attribute for pools created with this configuration instance. * * param minEvictableIdleTime The new setting of {code minEvictableIdleTime} for this configuration instance * see GenericObjectPool#getMinEvictableIdleDuration() * see GenericKeyedObjectPool#getMinEvictableIdleDuration() * since 2.12.0 */ public void setMinEvictableIdleDuration(final Duration minEvictableIdleTime) { this.minEvictableIdleDuration = PoolImplUtils.nonNull(minEvictableIdleTime, DEFAULT_MIN_EVICTABLE_IDLE_TIME); } /** * Sets the value for the {code softMinEvictableIdleTime} configuration attribute for pools created with this configuration instance. 
* * param softMinEvictableIdleTime The new setting of {code softMinEvictableIdleTime} for this configuration instance * see GenericObjectPool#getSoftMinEvictableIdleDuration() * see GenericKeyedObjectPool#getSoftMinEvictableIdleDuration() * since 2.10.0 * deprecated Use {link #setSoftMinEvictableIdleDuration(Duration)}. */ Deprecated public void setSoftMinEvictableIdleTime(final Duration softMinEvictableIdleTime) { this.softMinEvictableIdleDuration = PoolImplUtils.nonNull(softMinEvictableIdleTime, DEFAULT_SOFT_MIN_EVICTABLE_IDLE_TIME); } /** * Sets the value for the {code softMinEvictableIdleTime} configuration attribute for pools created with this configuration instance. * * param softMinEvictableIdleTime The new setting of {code softMinEvictableIdleTime} for this configuration instance * see GenericObjectPool#getSoftMinEvictableIdleDuration() * see GenericKeyedObjectPool#getSoftMinEvictableIdleDuration() * since 2.12.0 */ public void setSoftMinEvictableIdleDuration(final Duration softMinEvictableIdleTime) { this.softMinEvictableIdleDuration = PoolImplUtils.nonNull(softMinEvictableIdleTime, DEFAULT_SOFT_MIN_EVICTABLE_IDLE_TIME); } ``` ### Why are the changes needed? Cleanup deprecated api usage related to `commons-pool2` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #46967 from LuciferYang/commons-pool2-deprecated. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 14 June 2024, 04:36:22 UTC
75fff90 [SPARK-45685][SQL][FOLLOWUP] Add handling for `Stream` where `LazyList.force` is called ### What changes were proposed in this pull request? Following the suggestion in https://github.com/apache/spark/pull/43563#pullrequestreview-2114900378, this PR adds handling for Stream where LazyList.force is called ### Why are the changes needed? Even though `Stream` is deprecated in 2.13, it is not _removed_, and thus it is possible that some parts of Spark / Catalyst (or third-party code) might continue to pass around `Stream` instances. Hence, we should restore the call to `Stream.force` where `.force` is called on `LazyList`, to avoid losing the eager materialization for Streams that happen to flow to these call sites. This is also a compatibility guarantee. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added some new tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #46970 from LuciferYang/SPARK-45685-FOLLOWUP. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 14 June 2024, 04:30:44 UTC
0b214f1 [MINOR][DOCS][TESTS] Update repo name and link from `parquet-mr` to `parquet-java` ### What changes were proposed in this pull request? This pr replaces parquet related repo name from `parquet-mr` to `parquet-java` and repo link from `https://github.com/apache/parquet-mr` to `https://github.com/apache/parquet-java`. ### Why are the changes needed? The upstream repo name has made a change with [INFRA-25802](https://issues.apache.org/jira/browse/INFRA-25802), [PARQUET-2475](https://issues.apache.org/jira/browse/PARQUET-2475), it's better to update with the latest name and link. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46963 from wayneguow/parquet. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> 14 June 2024, 02:33:49 UTC
70bdcc9 [MINOR][DOCS] Fix metrics info of shuffle service ### What changes were proposed in this pull request? Fix metrics info of shuffleService ### Why are the changes needed? Documentation is outdated. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No need. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46973 from j7nhai/doc. Authored-by: j7nhai <j7nhai@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> 14 June 2024, 02:10:02 UTC
be154a3 [SPARK-48622][SQL] get SQLConf once when resolving column names ### What changes were proposed in this pull request? `SQLConf.caseSensitiveAnalysis` is currently being retrieved for every column when resolving column names. This is expensive if there are many columns. We can instead retrieve it once before the loop, and reuse the result. ### Why are the changes needed? Performance improvement. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Profiles of adding 1 column on an empty 10k column table (hms-parquet): Before (55s): <img width="1728" alt="Screenshot 2024-06-10 at 12 17 49 PM" src="https://github.com/databricks/runtime/assets/169104436/58de6a56-943e-465a-9005-ae98f960779e"> After (13s): <img width="1726" alt="Screenshot 2024-06-10 at 12 17 04 PM" src="https://github.com/databricks/runtime/assets/169104436/e9bdabc4-6e29-4012-bb01-103fa0b640fc"> ### Was this patch authored or co-authored using generative AI tooling? No Closes #46979 from andrewxue-db/andrewxue-db/spark-48622. Authored-by: Andrew Xue <andrew.xue@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> 14 June 2024, 02:04:24 UTC
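A self-contained Scala sketch of the optimization pattern described in the entry above; the names are simplified stand-ins for the actual Catalyst resolution code, with `caseSensitive` playing the role of `SQLConf.caseSensitiveAnalysis` read once instead of per column:

```scala
// Simplified stand-in for column-name resolution: the conf value is fetched once by the
// caller and reused for every lookup, instead of being re-read inside the loop.
def resolveColumns(
    existing: Seq[String],
    toResolve: Seq[String],
    caseSensitive: Boolean): Seq[Option[String]] = {
  val resolver: (String, String) => Boolean =
    if (caseSensitive) _ == _ else _.equalsIgnoreCase(_)
  toResolve.map(name => existing.find(resolver(_, name)))
}
```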
08e741b [SPARK-48604][SQL] Replace deprecated `new ArrowType.Decimal(precision, scale)` method call ### What changes were proposed in this pull request? This pr replaces deprecated classes and methods of `arrow-vector` called in Spark: - `Decimal(int precision, int scale)` -> `Decimal( JsonProperty("precision") int precision, JsonProperty("scale") int scale, JsonProperty("bitWidth") int bitWidth )` All `arrow-vector` related Spark classes, I made a double check, only in `ArrowUtils` there is a deprecated method call. ### Why are the changes needed? Clean up deprecated API usage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46961 from wayneguow/deprecated_arrow. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 13 June 2024, 10:11:30 UTC
bdcb79f [SPARK-48543][SS] Track state row validation failures using explicit error class ### What changes were proposed in this pull request? Track state row validation failures using explicit error class ### Why are the changes needed? We want to track these exceptions explicitly since they could be indicative of underlying corruptions/data loss scenarios. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests ``` 13:06:32.803 INFO org.apache.spark.util.ShutdownHookManager: Deleting directory /Users/anish.shrigondekar/spark/spark/target/tmp/spark-6d90d3f3-0f37-48b8-8506-a8cdee3d25d7 [info] Run completed in 9 seconds, 861 milliseconds. [info] Total number of tests run: 4 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #46885 from anishshri-db/task/SPARK-48543. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 13 June 2024, 07:47:49 UTC
b8c7aee [SPARK-48609][BUILD] Upgrade `scala-xml` to 2.3.0 ### What changes were proposed in this pull request? The pr aims to upgrade `scala-xml` from `2.2.0` to `2.3.0` ### Why are the changes needed? The full release notes: https://github.com/scala/scala-xml/releases/tag/v2.3.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46964 from panbingkun/SPARK-48609. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 13 June 2024, 06:45:28 UTC
78fd4e3 [SPARK-48584][SQL][FOLLOWUP] Improve the unescapePathName ### What changes were proposed in this pull request? This PR follows up https://github.com/apache/spark/pull/46938 and improve the `unescapePathName`. ### Why are the changes needed? Improve the `unescapePathName` by cut off slow path. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? GA. ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #46957 from beliefer/SPARK-48584_followup. Authored-by: beliefer <beliefer@163.com> Signed-off-by: beliefer <beliefer@163.com> 13 June 2024, 06:41:49 UTC
ea2bca7 [SPARK-48602][SQL] Make csv generator support different output style with spark.sql.binaryOutputStyle ### What changes were proposed in this pull request? In SPARK-47911, we introduced a universal BinaryFormatter to make binary output consistent across all clients, such as beeline, spark-sql, and spark-shell, for both primitive and nested binaries. But unfortunately, `to_csv` and the `csv writer` have interceptors for binary output which are hard-coded to use `SparkStringUtils.getHexString`. In this PR we make them configurable as well. ### Why are the changes needed? Feature parity. ### Does this PR introduce _any_ user-facing change? Yes, we have made spark.sql.binaryOutputStyle work for CSV, but the AS-IS behavior is kept. ### How was this patch tested? New tests. ### Was this patch authored or co-authored using generative AI tooling? no Closes #46956 from yaooqinn/SPARK-48602. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 13 June 2024, 03:50:41 UTC
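A rough usage sketch of the config now honored by the CSV path; the `"HEX"` value is an assumption about one of the styles introduced by SPARK-47911, so check the SQL config docs for the actual accepted values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Assumption: "HEX" is an accepted style name; consult spark.sql.binaryOutputStyle docs.
spark.conf.set("spark.sql.binaryOutputStyle", "HEX")
// The binary field inside the struct is now rendered by to_csv according to the configured style.
spark.sql("SELECT to_csv(named_struct('b', X'4142'))").show()
```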
fd045c9 [SPARK-48583][SQL][TESTS] Replace deprecated classes and methods of `commons-io` called in Spark ### What changes were proposed in this pull request? This pr replaces deprecated classes and methods of `commons-io` called in Spark: - `writeStringToFile(final File file, final String data)` -> `writeStringToFile(final File file, final String data, final Charset charset)` - `CountingInputStream` -> `BoundedInputStream` ### Why are the changes needed? Clean up deprecated API usage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed related test cases in `UDFXPathUtilSuite` and `XmlSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46935 from wayneguow/deprecated. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 13 June 2024, 03:03:45 UTC
3988548 [SPARK-48593][PYTHON][CONNECT] Fix the string representation of lambda function ### What changes were proposed in this pull request? Fix the string representation of lambda function ### Why are the changes needed? I happen to hit this bug ### Does this PR introduce _any_ user-facing change? yes before ``` In [2]: array_sort("data", lambda x, y: when(x.isNull() | y.isNull(), lit(0)).otherwise(length(y) - length(x))) Out[2]: --------------------------------------------------------------------------- TypeError Traceback (most recent call last) File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/core/formatters.py:711, in PlainTextFormatter.__call__(self, obj) 704 stream = StringIO() 705 printer = pretty.RepresentationPrinter(stream, self.verbose, 706 self.max_width, self.newline, 707 max_seq_length=self.max_seq_length, 708 singleton_pprinters=self.singleton_printers, 709 type_pprinters=self.type_printers, 710 deferred_pprinters=self.deferred_printers) --> 711 printer.pretty(obj) 712 printer.flush() 713 return stream.getvalue() File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/lib/pretty.py:411, in RepresentationPrinter.pretty(self, obj) 408 return meth(obj, self, cycle) 409 if cls is not object \ 410 and callable(cls.__dict__.get('__repr__')): --> 411 return _repr_pprint(obj, self, cycle) 413 return _default_pprint(obj, self, cycle) 414 finally: File ~/.dev/miniconda3/envs/spark_dev_312/lib/python3.12/site-packages/IPython/lib/pretty.py:779, in _repr_pprint(obj, p, cycle) 777 """A pprint that just redirects to the normal repr function.""" 778 # Find newlines and replace them with p.break_() --> 779 output = repr(obj) 780 lines = output.splitlines() 781 with p.group(): File ~/Dev/spark/python/pyspark/sql/connect/column.py:441, in Column.__repr__(self) 440 def __repr__(self) -> str: --> 441 return "Column<'%s'>" % self._expr.__repr__() File ~/Dev/spark/python/pyspark/sql/connect/expressions.py:626, in UnresolvedFunction.__repr__(self) 624 return f"{self._name}(distinct {', '.join([str(arg) for arg in self._args])})" 625 else: --> 626 return f"{self._name}({', '.join([str(arg) for arg in self._args])})" File ~/Dev/spark/python/pyspark/sql/connect/expressions.py:962, in LambdaFunction.__repr__(self) 961 def __repr__(self) -> str: --> 962 return f"(LambdaFunction({str(self._function)}, {', '.join(self._arguments)})" TypeError: sequence item 0: expected str instance, UnresolvedNamedLambdaVariable found ``` after ``` In [2]: array_sort("data", lambda x, y: when(x.isNull() | y.isNull(), lit(0)).otherwise(length(y) - length(x))) Out[2]: Column<'array_sort(data, LambdaFunction(CASE WHEN or(isNull(x_0), isNull(y_1)) THEN 0 ELSE -(length(y_1), length(x_0)) END, x_0, y_1))'> ``` ### How was this patch tested? CI, added test ### Was this patch authored or co-authored using generative AI tooling? No Closes #46948 from zhengruifeng/fix_string_rep_lambda. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 June 2024, 23:51:10 UTC
c059c84 [SPARK-48421][SQL] SPJ: Add documentation ### What changes were proposed in this pull request? Add docs for SPJ ### Why are the changes needed? There are no docs describing SPJ, even though it is mentioned in migration notes: https://github.com/apache/spark/pull/46673 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Checked the new text ### Was this patch authored or co-authored using generative AI tooling? No Closes #46745 from szehon-ho/doc_spj. Authored-by: Szehon Ho <szehon.apache@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 June 2024, 23:50:15 UTC
0bbd049 [SPARK-48591][PYTHON] Simplify the if-else branches with `F.lit` ### What changes were proposed in this pull request? Simplify the if-else branches with `F.lit` which accept both Column and non-Column input ### Why are the changes needed? code clean up ### Does this PR introduce _any_ user-facing change? No, internal minor refactor ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #46946 from zhengruifeng/column_simplify. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 June 2024, 23:49:29 UTC
d1d29c9 [SPARK-48598][PYTHON][CONNECT] Propagate cached schema in dataframe operations ### What changes were proposed in this pull request? Propagate cached schema in dataframe operations: - DataFrame.alias - DataFrame.coalesce - DataFrame.repartition - DataFrame.repartitionByRange - DataFrame.dropDuplicates - DataFrame.distinct - DataFrame.filter - DataFrame.where - DataFrame.limit - DataFrame.sort - DataFrame.sortWithinPartitions - DataFrame.orderBy - DataFrame.sample - DataFrame.hint - DataFrame.randomSplit - DataFrame.observe ### Why are the changes needed? to avoid unnecessary RPCs if possible ### Does this PR introduce _any_ user-facing change? No, optimization only ### How was this patch tested? added tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #46954 from zhengruifeng/py_connect_propagate_schema. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 June 2024, 23:48:24 UTC
2d0b122 [SPARK-48594][PYTHON][CONNECT] Rename `parent` field to `child` in `ColumnAlias` ### What changes were proposed in this pull request? Rename `parent` field to `child` in `ColumnAlias` ### Why are the changes needed? it should be `child` other than `parent`, to be consistent with both other expressions in `expressions.py` and the Scala side. ### Does this PR introduce _any_ user-facing change? No, it is just an internal change ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #46949 from zhengruifeng/minor_column_alias. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 June 2024, 23:47:40 UTC
b5e1b79 [SPARK-48596][SQL] Perf improvement for calculating hex string for long ### What changes were proposed in this pull request? This pull request optimizes the `Hex.hex(num: Long)` method by removing leading zeros, thus eliminating the need to copy the array to remove them afterward. ### Why are the changes needed? - Unit tests added - Did a benchmark locally (30~50% speedup) ```scala Hex Long Tests: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ Legacy 1062 1094 16 9.4 106.2 1.0X New 739 807 26 13.5 73.9 1.4X ``` ```scala object HexBenchmark extends BenchmarkBase { override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { val N = 10_000_000 runBenchmark("Hex") { val benchmark = new Benchmark("Hex Long Tests", N, 10, output = output) val range = 1 to 12 benchmark.addCase("Legacy") { _ => (1 to N).foreach(x => range.foreach(y => hexLegacy(x - y))) } benchmark.addCase("New") { _ => (1 to N).foreach(x => range.foreach(y => Hex.hex(x - y))) } benchmark.run() } } def hexLegacy(num: Long): UTF8String = { // Extract the hex digits of num into value[] from right to left val value = new Array[Byte](16) var numBuf = num var len = 0 do { len += 1 // Hex.hexDigits need to be seen here value(value.length - len) = Hex.hexDigits((numBuf & 0xF).toInt) numBuf >>>= 4 } while (numBuf != 0) UTF8String.fromBytes(java.util.Arrays.copyOfRange(value, value.length - len, value.length)) } } ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? no Closes #46952 from yaooqinn/SPARK-48596. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 12 June 2024, 12:23:03 UTC
a3625a9 [SPARK-48595][CORE] Cleanup deprecated api usage related to `commons-compress` ### What changes were proposed in this pull request? This pr use `org.apache.commons.io.output.CountingOutputStream` instead of `org.apache.commons.compress.utils.CountingOutputStream` to fix the following compilation warnings related to 'commons-compress': ``` [WARNING] [Warn] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala:308: class CountingOutputStream in package utils is deprecated Applicable -Wconf / nowarn filters for this warning: msg=<part of the message>, cat=deprecation, site=org.apache.spark.deploy.history.RollingEventLogFilesWriter.countingOutputStream, origin=org.apache.commons.compress.utils.CountingOutputStream [WARNING] [Warn] /Users/yangjie01/SourceCode/git/spark-mine-13/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala:351: class CountingOutputStream in package utils is deprecated Applicable -Wconf / nowarn filters for this warning: msg=<part of the message>, cat=deprecation, site=org.apache.spark.deploy.history.RollingEventLogFilesWriter.rollEventLogFile.$anonfun, origin=org.apache.commons.compress.utils.CountingOutputStream ``` The fix refers to: https://github.com/apache/commons-compress/blob/95727006cac0892c654951c4e7f1db142462f22a/src/main/java/org/apache/commons/compress/utils/CountingOutputStream.java#L25-L33 ``` /** * Stream that tracks the number of bytes read. * * since 1.3 * NotThreadSafe * deprecated Use {link org.apache.commons.io.output.CountingOutputStream}. */ Deprecated public class CountingOutputStream extends FilterOutputStream { ``` ### Why are the changes needed? Cleanup deprecated api usage related to `commons-compress` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #46950 from LuciferYang/SPARK-48595. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Kent Yao <yao@apache.org> 12 June 2024, 09:11:22 UTC
da81d8e [SPARK-48584][SQL] Perf improvement for unescapePathName ### What changes were proposed in this pull request? This PR improves perf for unescapePathName with an algorithm briefly described as follows: - If a path contains no '%', or contains '%' only at `position > path.length-2`, we return the original string instead of creating a new StringBuilder and appending char by char - Otherwise, we loop with 2 indices: `plaintextStartIdx`, which starts from 0 and then points to the next char after resolving `%xx`, and `plaintextEndIdx`, which points to the next `'%'`. `plaintextStartIdx` moves to `plaintextEndIdx + 3` if `%xx` is valid, or moves to `plaintextEndIdx + 1` if `%xx` is invalid. - Instead of using Integer.parseInt with error capture, we identify the high and low hex characters manually. ### Why are the changes needed? performance improvement for hotspots ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - new tests in ExternalCatalogUtilsSuite - Benchmark results (9-11x faster) ### Was this patch authored or co-authored using generative AI tooling? no Closes #46938 from yaooqinn/SPARK-48584. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 12 June 2024, 08:39:49 UTC
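A self-contained Scala sketch of the two-index approach described above; this is an illustration of the idea, not the actual code in ExternalCatalogUtils:

```scala
// Illustrative only: fast path returns the input when no decodable %xx escape can exist;
// otherwise walk with two indices and decode %xx pairs manually (no Integer.parseInt).
def unescapePathName(path: String): String = {
  def hexValue(c: Char): Int =
    if (c >= '0' && c <= '9') c - '0'
    else if (c >= 'A' && c <= 'F') c - 'A' + 10
    else if (c >= 'a' && c <= 'f') c - 'a' + 10
    else -1

  val firstPercent = path.indexOf('%')
  if (firstPercent < 0 || firstPercent > path.length - 3) return path

  val sb = new java.lang.StringBuilder(path.length)
  var plaintextStartIdx = 0
  while (plaintextStartIdx < path.length) {
    val plaintextEndIdx = path.indexOf('%', plaintextStartIdx)
    if (plaintextEndIdx < 0 || plaintextEndIdx > path.length - 3) {
      // No further decodable escape: copy the remainder verbatim.
      sb.append(path, plaintextStartIdx, path.length)
      plaintextStartIdx = path.length
    } else {
      sb.append(path, plaintextStartIdx, plaintextEndIdx)
      val hi = hexValue(path.charAt(plaintextEndIdx + 1))
      val lo = hexValue(path.charAt(plaintextEndIdx + 2))
      if (hi >= 0 && lo >= 0) {
        sb.append(((hi << 4) | lo).toChar)  // valid %xx: consume three characters
        plaintextStartIdx = plaintextEndIdx + 3
      } else {
        sb.append('%')                      // invalid escape: keep the '%' literally
        plaintextStartIdx = plaintextEndIdx + 1
      }
    }
  }
  sb.toString
}
```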
8870efc [SPARK-48581][BUILD] Upgrade dropwizard metrics to 4.2.26 ### What changes were proposed in this pull request? Upgrade dropwizard metrics to 4.2.26. ### Why are the changes needed? There are some bug fixes as belows: - Correction for the Jetty-12 QTP metrics by dkaukov in https://github.com/dropwizard/metrics/pull/4181 - Fix metrics for InstrumentedEE10Handler by zUniQueX in https://github.com/dropwizard/metrics/pull/3928 The full release notes: https://github.com/dropwizard/metrics/releases/tag/v4.2.26 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46932 from wayneguow/codahale. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> 12 June 2024, 07:41:40 UTC
334816a [SPARK-48411][SS][PYTHON] Add E2E test for DropDuplicateWithinWatermark ### What changes were proposed in this pull request? This PR adds a test for API DropDuplicateWithinWatermark in Python, which was previously missing. ### Why are the changes needed? Check the correctness of API DropDuplicateWithinWatermark. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed: ``` python/run-tests --testnames pyspark.sql.tests.streaming.test_streaming python/run-tests --testnames pyspark.sql.tests.connect.streaming.test_parity_streaming ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46740 from eason-yuchen-liu/DropDuplicateWithinWatermark_test. Authored-by: Yuchen Liu <yuchen.liu@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 12 June 2024, 03:45:38 UTC
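For readers unfamiliar with the API under test, here is a minimal Scala sketch of what `dropDuplicatesWithinWatermark` does; the Python test in the entry above exercises the same semantics through the PySpark API, and the rate source and column names below are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("dedup-sketch").getOrCreate()

// Illustrative streaming source: the rate source emits (timestamp, value) rows.
val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

// Duplicate `value`s are dropped as long as the duplicates arrive within the
// watermark delay of the first occurrence.
val deduped = events
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicatesWithinWatermark(Seq("value"))
```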
72df3cb [SPARK-48582][BUILD] Upgrade `braces` from 3.0.2 to 3.0.3 in ui-test ### What changes were proposed in this pull request? This pr aims to upgrade `braces` from 3.0.2 to 3.0.3 in ui-test. The original pr was submitted by `dependabot`: https://github.com/apache/spark/pull/46931 ### Why are the changes needed? The new version fix vulnerability https://security.snyk.io/vuln/SNYK-JS-BRACES-6838727 - https://github.com/micromatch/braces/commit/9f5b4cf47329351bcb64287223ffb6ecc9a5e6d3 The complete list of changes is as follows: - https://github.com/micromatch/braces/compare/3.0.2...3.0.3 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #46933 from LuciferYang/SPARK-48582. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Kent Yao <yao@apache.org> 12 June 2024, 02:14:38 UTC
82a84ed [SPARK-46937][SQL] Revert "[] Improve concurrency performance for FunctionRegistry" ### What changes were proposed in this pull request? Reverts https://github.com/apache/spark/pull/44976 as it breaks thread-safety ### Why are the changes needed? Fix thread-safety ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? no Closes #46940 from cloud-fan/revert. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 June 2024, 19:55:16 UTC
6107836 [SPARK-48576][SQL][FOLLOWUP] Rename UTF8_BINARY_LCASE to UTF8_LCASE ### What changes were proposed in this pull request? Renaming `UTF8_BINARY_LCASE` collation to `UTF8_LCASE` in leftover tests. ### Why are the changes needed? Due to a merge conflict, one additional test was using the old collation name. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46939 from uros-db/renaming-fix. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 June 2024, 17:38:36 UTC
aad6771 [SPARK-48576][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE ### What changes were proposed in this pull request? Renaming `UTF8_BINARY_LCASE` collation to `UTF8_LCASE`. ### Why are the changes needed? As part of the collation effort in Spark, we've moved away from byte-by-byte logic towards character-by-character logic, so what we used to call `UTF8_BINARY_LCASE` is now more precisely `UTF8_LCASE`. For example, string searching in UTF8_LCASE now works on character-level (rather than on byte-level), which is reflected in this PRs: https://github.com/apache/spark/pull/46511, https://github.com/apache/spark/pull/46589, https://github.com/apache/spark/pull/46682, https://github.com/apache/spark/pull/46761, https://github.com/apache/spark/pull/46762. In addition, string comparison also works on character-level now, as per the changes introduced in this PR: https://github.com/apache/spark/pull/46700. ### Does this PR introduce _any_ user-facing change? Yes, what was previously named `UTF8_BINARY_LCASE` collation, will from now on be named `UTF8_LCASE`. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46924 from uros-db/rename-lcase. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 June 2024, 17:26:30 UTC
224ba16 [SPARK-48556][SQL] Fix incorrect error message pointing to UNSUPPORTED_GROUPING_EXPRESSION ### What changes were proposed in this pull request? Following sequence of queries produces `UNSUPPORTED_GROUPING_EXPRESSION` error: ``` create table t1(a int, b int) using parquet; select grouping(a), dummy from t1 group by a with rollup; ``` However, the appropriate error should point the user to the invalid `dummy` column name. Fix the problem by deprioritizing `Grouping` and `GroupingID` nodes in plan which were not resolved and thus cause the unwanted error. ### Why are the changes needed? To fix the described issue. ### Does this PR introduce _any_ user-facing change? Yes, it displays proper error message to user instead of misleading one. ### How was this patch tested? Added test to `QueryCompilationErrorsSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46900 from nikolamand-db/SPARK-48556. Authored-by: Nikola Mandic <nikola.mandic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 June 2024, 17:01:32 UTC
583ab05 [SPARK-47415][SQL] Add collation support for Levenshtein expression ### What changes were proposed in this pull request? Introduce collation support for `levenshtein` string expression (pass-through). ### Why are the changes needed? Add collation support for Levenshtein expression in Spark. ### Does this PR introduce _any_ user-facing change? Yes, users should now be able to use collated strings within arguments for string function: levenshtein. ### How was this patch tested? E2e sql tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46788 from uros-db/levenshtein. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 June 2024, 16:42:01 UTC
df4156a [SPARK-48372][SPARK-45716][PYTHON][FOLLOW-UP] Remove unused helper method ### What changes were proposed in this pull request? followup of https://github.com/apache/spark/pull/46685, to remove unused helper method ### Why are the changes needed? method `_tree_string` is no longer needed ### Does this PR introduce _any_ user-facing change? No, internal change only ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #46936 from zhengruifeng/tree_string_followup. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 11 June 2024, 10:45:02 UTC
452c1b6 [SPARK-48551][SQL] Perf improvement for escapePathName ### What changes were proposed in this pull request? This PR improves perf for escapePathName with an algorithm briefly described as follows: - If a path contains no special characters, we return the original string instead of creating a new StringBuilder and appending char by char - If a path contains special characters, we locate the index IDX of the first special character, initialize the StringBuilder with [0, IDX) of the original string, and do hexadecimal escaping where necessary starting from IDX - An optimized char-to-hex function replaces `String.format` This adds a fast path for storage paths, or parts of them, that do not require escaping, avoiding the creation of a StringBuilder that appends per character. ### Why are the changes needed? performance improvement for hotspots ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - new tests in ExternalCatalogUtilsSuite - Benchmark results (9x faster) ### Was this patch authored or co-authored using generative AI tooling? no Closes #46894 from yaooqinn/SPARK-48551. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: yangjie01 <yangjie01@baidu.com> 11 June 2024, 07:55:32 UTC
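As with the unescape change further above, a self-contained sketch of the escape-side idea follows. The special-character set here is simplified and assumed for illustration (the real set comes from Hive path-escaping rules), and for brevity the hex formatting uses an interpolator rather than the optimized char-to-hex lookup the PR describes:

```scala
// Assumed, simplified special-character set; not the exact set Spark/Hive uses.
def needsEscaping(c: Char): Boolean =
  c < ' ' || "\"%':;=\\[]{}".indexOf(c) >= 0

// Illustrative only: fast path returns the input untouched when nothing needs escaping;
// otherwise copy the clean prefix [0, IDX) once and escape from IDX onward.
def escapePathName(path: String): String = {
  val firstSpecial = path.indexWhere(needsEscaping)
  if (firstSpecial < 0) return path

  val sb = new java.lang.StringBuilder(path.length + 16)
  sb.append(path, 0, firstSpecial)
  var i = firstSpecial
  while (i < path.length) {
    val c = path.charAt(i)
    if (needsEscaping(c)) sb.append('%').append(f"${c.toInt}%02X") else sb.append(c)
    i += 1
  }
  sb.toString
}
```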
53d65fd [SPARK-48565][UI] Fix thread dump display in UI ### What changes were proposed in this pull request? The thread dump display in the UI is not as pretty as before; this is a side effect introduced by SPARK-44863 ### Why are the changes needed? Restore the thread dump display in the UI. ### Does this PR introduce _any_ user-facing change? Yes, but it only affects the UI display. ### How was this patch tested? Current master: <img width="1545" alt="master-branch" src="https://github.com/apache/spark/assets/26535726/5c6fd770-467f-481c-a635-2855a2853633"> With this patch applied: <img width="1542" alt="Xnip2024-06-07_20-00-38" src="https://github.com/apache/spark/assets/26535726/3998c2aa-671f-4921-8444-b7bca8667202"> ### Was this patch authored or co-authored using generative AI tooling? No Closes #46916 from pan3793/SPARK-48565. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 11 June 2024, 03:28:50 UTC
1e4750e [SPARK-47500][PYTHON][CONNECT][FOLLOWUP] Restore error message for `DataFrame.select(None)` ### What changes were proposed in this pull request? the refactor PR https://github.com/apache/spark/pull/45636 changed the error message of `DataFrame.select(None)` from `PySparkTypeError` to `AssertionError`, this PR restore the previous error message ### Why are the changes needed? error message improvement ### Does this PR introduce _any_ user-facing change? yes, error message improvement ### How was this patch tested? added test ### Was this patch authored or co-authored using generative AI tooling? no Closes #46930 from zhengruifeng/py_restore_select_error. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 11 June 2024, 02:53:21 UTC
3fe6abd [SPARK-48563][BUILD] Upgrade `pickle` to 1.5 ### What changes were proposed in this pull request? This pr aims upgrade `pickle` from 1.3 to 1.5. ### Why are the changes needed? The new version include a new fix related to [empty bytes object construction](https://github.com/irmen/pickle/commit/badc8fe08c9e47b87df66b8a16c67010e3614e35) All changes from 1.3 to 1.5 are as follows: - https://github.com/irmen/pickle/compare/pickle-1.3...pickle-1.5 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #46913 from LuciferYang/pickle-1.5. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 11 June 2024, 02:36:32 UTC
5a2f374 [SPARK-48544][SQL] Reduce memory pressure of empty TreeNode BitSets ### What changes were proposed in this pull request? - Changed the `ineffectiveRules` variable of the `TreeNode` class to initialize lazily. This will reduce unnecessary driver memory pressure. ### Why are the changes needed? - Plans with large expression or operator trees are known to cause driver memory pressure; this is one step in alleviating that issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT covers behavior. Outwards facing behavior does not change. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46919 from n-young-db/ineffective-rules-lazy. Authored-by: Nick Young <nick.young@databricks.com> Signed-off-by: Josh Rosen <joshrosen@databricks.com> 10 June 2024, 18:16:11 UTC
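A minimal sketch of the lazy-initialization pattern being applied; the class and method names below are illustrative, not the exact `TreeNode` code:

```scala
import java.util.BitSet

// The BitSet is only allocated once a rule is actually marked ineffective,
// so nodes that never record anything pay no per-node allocation cost.
class RuleEffectivenessTracker {
  private var ineffectiveRules: BitSet = null  // previously allocated eagerly for every node

  def markRuleAsIneffective(ruleId: Int): Unit = {
    if (ineffectiveRules == null) {
      ineffectiveRules = new BitSet()
    }
    ineffectiveRules.set(ruleId)
  }

  def isRuleIneffective(ruleId: Int): Boolean =
    ineffectiveRules != null && ineffectiveRules.get(ruleId)
}
```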
ec6db63 [SPARK-48569][SS][CONNECT] Handle edge cases in query.name ### What changes were proposed in this pull request? 1. In Connect, when a streaming query name is not specified, its query.name should return None. Currently, without this patch, it returns an empty string. 2. In classic Spark, one cannot set a streaming query's name to an empty string. This check was missing in Spark Connect; this patch adds it back. ### Why are the changes needed? Edge case handling. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46920 from WweiL/SPARK-48569-query-name-None. Authored-by: Wei Liu <wei.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 10 June 2024, 18:06:43 UTC
3857a9d [SPARK-48410][SQL] Fix InitCap expression for UTF8_BINARY_LCASE & ICU collations ### What changes were proposed in this pull request? String titlecase conversion under UTF8_BINARY_LCASE and other ICU collations now works using the appropriate ICU default locale for character mapping, and uses ICU BreakIterator.getWordInstance to locate boundaries between words. ### Why are the changes needed? Similar Spark expressions such as Lower & Upper use the same interface (UCharacter) to perform collation-aware string transformation, and InitCap should offer a consistent way to titlecase strings across the collation space. ### Does this PR introduce _any_ user-facing change? Yes, InitCap should now work properly for all collations other than UTF8_BINARY. ### How was this patch tested? New and existing unit tests, as well as existing e2e sql tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46732 from uros-db/initcap-icu. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 June 2024, 16:18:17 UTC
61fd936 [SPARK-48403][SQL] Fix Lower & Upper expressions for UTF8_BINARY_LCASE & ICU collations ### What changes were proposed in this pull request? String lowercase/uppercase conversion in UTF8_BINARY_LCASE now works using ICU default locale, similar to how other ICU collations currently work in Spark. ### Why are the changes needed? All collations apart from UTF8_BINARY should use the same interface (UCharacter) that utilizes ICU toLowerCase/toUpperCase implementation, rather than mixing JVM & ICU implementations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests and e2e sql tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46720 from uros-db/lower-upper-initcap. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 June 2024, 16:14:16 UTC
1901669 [SPARK-48564][PYTHON][CONNECT] Propagate cached schema in set operations ### What changes were proposed in this pull request? Propagate cached schema in set operations ### Why are the changes needed? to avoid extra RPC to get the schema of result data frame ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #46915 from zhengruifeng/set_op_schema. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 10 June 2024, 16:10:21 UTC
d9394ee [SPARK-48560][SS][PYTHON] Make StreamingQueryListener.spark settable ### What changes were proposed in this pull request? This PR proposes to make StreamingQueryListener.spark settable ### Why are the changes needed? ```python from pyspark.sql.streaming.listener import StreamingQueryListener class MyListener(StreamingQueryListener): def __init__(self, spark): self.spark = spark def onQueryStarted(self, event): pass def onQueryProgress(self, event): pass def onQueryTerminated(self, event): pass MyListener(spark) ``` is broken from 3.5.0 after SPARK-42941. ### Does this PR introduce _any_ user-facing change? Yes, end users who implement `StreamingQueryListener` can add `spark` attribute in their implementation. ### How was this patch tested? Manually tested, and added a unittest. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46909 from HyukjinKwon/compat-spark-prop. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 June 2024, 15:36:05 UTC
24bce72 [SPARK-48012][SQL] SPJ: Support Transform Expressions for One Side Shuffle ### Why are the changes needed? Support SPJ one-side shuffle if the other side has a partition transform expression ### How was this patch tested? New unit test in KeyGroupedPartitioningSuite ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46255 from szehon-ho/spj_auto_bucket. Authored-by: Szehon Ho <szehon.apache@gmail.com> Signed-off-by: Chao Sun <chao@openai.com> 09 June 2024, 14:22:21 UTC
201df0d [MINOR][PYTHON][TESTS] Move a test out of parity tests ### What changes were proposed in this pull request? Move a test out of parity tests ### Why are the changes needed? it is not tested in Spark Classic, not a parity test ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46914 from zhengruifeng/move_a_non_parity_test. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 07 June 2024, 23:49:15 UTC
8911d59 [SPARK-46393][SQL][FOLLOWUP] Classify exceptions in JDBCTableCatalog.loadTable and Fix UT ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/46905, to fix some UTs on GA. ### Why are the changes needed? Fix UTs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested. Passes GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46912 from panbingkun/SPARK-46393_FOLLOWUP. Lead-authored-by: panbingkun <panbingkun@baidu.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 June 2024, 20:56:40 UTC
d81b1e3 [SPARK-48559][SQL] Fetch globalTempDatabase name directly without invoking initialization of GlobalTempViewManager ### What changes were proposed in this pull request? It is not necessary to create `GlobalTempViewManager` only to get the global temp database name. This PR updates the code to avoid this, as the global temp database name is just a config. ### Why are the changes needed? Avoid unnecessary RPC calls to check the existence of the global temp database. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #46907 from willwwt/master. Authored-by: Weitao Wen <weitao.wen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 June 2024, 08:53:57 UTC
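Since the name is just a SQL configuration value, it can be read directly; a small sketch, assuming the default value `global_temp`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// The global temp view database name is a config value (defaulting to "global_temp"),
// so no catalog lookup is needed just to learn it.
val globalTempDb = spark.conf.get("spark.sql.globalTempDatabase")
```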
87b0f59 [SPARK-48561][PS][CONNECT] Throw `PandasNotImplementedError` for unsupported plotting functions ### What changes were proposed in this pull request? Throw `PandasNotImplementedError` for unsupported plotting functions: - {Frame, Series}.plot.hist - {Frame, Series}.plot.kde - {Frame, Series}.plot.density - {Frame, Series}.plot(kind="hist", ...) - {Frame, Series}.plot(kind="hist", ...) - {Frame, Series}.plot(kind="density", ...) ### Why are the changes needed? the previous error message is confusing: ``` In [3]: psdf.plot.hist() /Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:1017: PandasAPIOnSparkAdviceWarning: The config 'spark.sql.ansi.enabled' is set to True. This can cause unexpected behavior from pandas API on Spark since pandas API on Spark follows the behavior of pandas, not SQL. warnings.warn(message, PandasAPIOnSparkAdviceWarning) [*********************************************-----------------------------------] 57.14% Complete (0 Tasks running, 1s, Scanned[*********************************************-----------------------------------] 57.14% Complete (0 Tasks running, 1s, Scanned[*********************************************-----------------------------------] 57.14% Complete (0 Tasks running, 1s, Scanned --------------------------------------------------------------------------- PySparkAttributeError Traceback (most recent call last) Cell In[3], line 1 ----> 1 psdf.plot.hist() File ~/Dev/spark/python/pyspark/pandas/plot/core.py:951, in PandasOnSparkPlotAccessor.hist(self, bins, **kwds) 903 def hist(self, bins=10, **kwds): 904 """ 905 Draw one histogram of the DataFrame’s columns. 906 A `histogram`_ is a representation of the distribution of data. (...) 949 >>> df.plot.hist(bins=12, alpha=0.5) # doctest: +SKIP 950 """ --> 951 return self(kind="hist", bins=bins, **kwds) File ~/Dev/spark/python/pyspark/pandas/plot/core.py:580, in PandasOnSparkPlotAccessor.__call__(self, kind, backend, **kwargs) 577 kind = {"density": "kde"}.get(kind, kind) 578 if hasattr(plot_backend, "plot_pandas_on_spark"): 579 # use if there's pandas-on-Spark specific method. --> 580 return plot_backend.plot_pandas_on_spark(plot_data, kind=kind, **kwargs) 581 else: 582 # fallback to use pandas' 583 if not PandasOnSparkPlotAccessor.pandas_plot_data_map[kind]: File ~/Dev/spark/python/pyspark/pandas/plot/plotly.py:41, in plot_pandas_on_spark(data, kind, **kwargs) 39 return plot_pie(data, **kwargs) 40 if kind == "hist": ---> 41 return plot_histogram(data, **kwargs) 42 if kind == "box": 43 return plot_box(data, **kwargs) File ~/Dev/spark/python/pyspark/pandas/plot/plotly.py:87, in plot_histogram(data, **kwargs) 85 psdf, bins = HistogramPlotBase.prepare_hist_data(data, bins) 86 assert len(bins) > 2, "the number of buckets must be higher than 2." ---> 87 output_series = HistogramPlotBase.compute_hist(psdf, bins) 88 prev = float("%.9f" % bins[0]) # to make it prettier, truncate. 
89 text_bins = [] File ~/Dev/spark/python/pyspark/pandas/plot/core.py:189, in HistogramPlotBase.compute_hist(psdf, bins) 183 for group_id, (colname, bucket_name) in enumerate(zip(colnames, bucket_names)): 184 # creates a Bucketizer to get corresponding bin of each value 185 bucketizer = Bucketizer( 186 splits=bins, inputCol=colname, outputCol=bucket_name, handleInvalid="skip" 187 ) --> 189 bucket_df = bucketizer.transform(sdf) 191 if output_df is None: 192 output_df = bucket_df.select( 193 F.lit(group_id).alias("__group_id"), F.col(bucket_name).alias("__bucket") 194 ) File ~/Dev/spark/python/pyspark/ml/base.py:260, in Transformer.transform(self, dataset, params) 258 return self.copy(params)._transform(dataset) 259 else: --> 260 return self._transform(dataset) 261 else: 262 raise TypeError("Params must be a param map but got %s." % type(params)) File ~/Dev/spark/python/pyspark/ml/wrapper.py:412, in JavaTransformer._transform(self, dataset) 409 assert self._java_obj is not None 411 self._transfer_params_to_java() --> 412 return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sparkSession) File ~/Dev/spark/python/pyspark/sql/connect/dataframe.py:1696, in DataFrame.__getattr__(self, name) 1694 def __getattr__(self, name: str) -> "Column": 1695 if name in ["_jseq", "_jdf", "_jmap", "_jcols", "rdd", "toJSON"]: -> 1696 raise PySparkAttributeError( 1697 error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name} 1698 ) 1700 if name not in self.columns: 1701 raise PySparkAttributeError( 1702 error_class="ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name} 1703 ) PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jdf` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. Visit https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession for creating regular Spark Session in detail. ``` after this PR: ``` In [3]: psdf.plot.hist() --------------------------------------------------------------------------- PandasNotImplementedError Traceback (most recent call last) Cell In[3], line 1 ----> 1 psdf.plot.hist() File ~/Dev/spark/python/pyspark/pandas/plot/core.py:957, in PandasOnSparkPlotAccessor.hist(self, bins, **kwds) 909 """ 910 Draw one histogram of the DataFrame’s columns. 911 A `histogram`_ is a representation of the distribution of data. (...) 954 >>> df.plot.hist(bins=12, alpha=0.5) # doctest: +SKIP 955 """ 956 if is_remote(): --> 957 return unsupported_function(class_name="pd.DataFrame", method_name="hist")() 959 return self(kind="hist", bins=bins, **kwds) File ~/Dev/spark/python/pyspark/pandas/missing/__init__.py:23, in unsupported_function.<locals>.unsupported_function(*args, **kwargs) 22 def unsupported_function(*args, **kwargs): ---> 23 raise PandasNotImplementedError( 24 class_name=class_name, method_name=method_name, reason=reason 25 ) PandasNotImplementedError: The method `pd.DataFrame.hist()` is not implemented yet. ``` ### Does this PR introduce _any_ user-facing change? yes, error message improvement ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #46911 from zhengruifeng/ps_plotting_unsupported. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 07 June 2024, 08:36:58 UTC
b7d9c31 Revert "[SPARK-46393][SQL][FOLLOWUP] Classify exceptions in JDBCTableCatalog.loadTable" This reverts commit 82b4ad2af64845503604da70ff02748c3969c991. 07 June 2024, 08:31:47 UTC
9491292 [SPARK-48548][BUILD] Add LICENSE/NOTICE for spark-core with shaded dependencies ### What changes were proposed in this pull request? The core module shipped with some bundled dependencies, it's better to add LICENSE/NOTICE to conform to the ASF policies. ### Why are the changes needed? ASF legal compliance ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass build ### Was this patch authored or co-authored using generative AI tooling? no Closes #46891 from yaooqinn/SPARK-48548. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 07 June 2024, 02:14:24 UTC
82b4ad2 [SPARK-46393][SQL][FOLLOWUP] Classify exceptions in JDBCTableCatalog.loadTable ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/44335, which missed handling `loadTable` ### Why are the changes needed? better error messages ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing test ### Was this patch authored or co-authored using generative AI tooling? no Closes #46905 from cloud-fan/jdbc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> 07 June 2024, 02:11:40 UTC
56b9b13 [SPARK-48552][SQL] multi-line CSV schema inference should also throw FAILED_READ_FILE ### What changes were proposed in this pull request? multi-line CSV uses `BinaryFileRDD` instead of `FileScanRDD`, and we need to replicate the error handling code from `FileScanRDD`. Currently we already replicate the handling of ignoring missing/corrupted files, and this PR replicates the error wrapping code. ### Why are the changes needed? to have a consistent error message ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? updated test ### Was this patch authored or co-authored using generative AI tooling? no Closes #46890 from cloud-fan/error. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 07 June 2024, 02:07:53 UTC
9de0a2e [SPARK-42944][FOLLOWUP][SS][CONNECT] Reenable ApplyInPandasWithState tests ### What changes were proposed in this pull request? The tests for ApplyInPandasWithState were previously skipped in Connect because they use foreachBatch, which was not ready when that development was done. This PR reenables them. ### Why are the changes needed? Necessary tests ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test-only addition. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46853 from WweiL/apply-in-pandas-with-state-test. Authored-by: Wei Liu <wei.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 07 June 2024, 00:14:57 UTC
cf3051b [MINOR][PYTHON] add export class and fix doc typo for python streaming data source ### What changes were proposed in this pull request? Add SimpleDataSourceStreamReader to default export class. Fix the typo in python_data_source.dst ### Why are the changes needed? To improve user experience. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? Covered by existing test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46903 from chaoqin-li1123/fix_export. Authored-by: Chaoqin Li <chaoqin.li@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 07 June 2024, 00:10:47 UTC
3c4cb40 [SPARK-47952][CORE][CONNECT] Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn ### What changes were proposed in this pull request? 1. Add configuration `spark.connect.grpc.port.maxRetries` (default 0, no retries). [Before this PR]: The SparkConnectService would fail to start in case of port conflicts on Yarn. [After this PR]: Allow the internal GRPC server to retry new ports until it finds an available port before reaching the maxRetries. 2. Post SparkListenerEvent containing the location of the remote SparkConnectService on Yarn. [Before this PR]: We needed to manually find the final location (host and port) of the SparkConnectService on Yarn and then use the SparkConnect Client to connect. [After this PR]: The location will be posted via SparkListenerEvent `SparkListenerConnectServiceStarted` `SparkListenerConnectServiceEnd` Allowing users to register a listener to receive this event and expose it by some way like sending it to a third coordinator server. 3. Shutdown SparkPlugins before stopping the ListenerBus. [Before this PR]: If the SparkConnectService was launched in the SparkConnectPlugin way, currently the SparkPlugins would be shutdown after the stop of ListenerBus, causing events posted during the shutdown to not be delivered to the listener. [After this PR]: The SparkPlugins will be shutdown before the stop of ListenerBus, ensuring that the ListenerBus remains active during the shutdown and the listener can receive the SparkConnectService stop event. 4. Minor method refactoring for 1~3. ### Why are the changes needed? #User Story: Our data analysts and data scientists use Jupyter notebooks provisioned on Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark for interactively development via terminal under Yarn Client mode. However, Yarn Client mode consumes significant local memory if the job is heavy, and the total resource pool of k8s for notebooks is limited. To leverage the abundant resources of our Hadoop cluster for scalability purposes, we aim to utilize SparkConnect. This allows the driver on Yarn with SparkConnectService started and uses SparkConnect client to connect to the remote driver. To provide a seamless experience with one command startup for both server and client, we've wrapped the following processes in one script: 1) Start a local coordinator server (implemented by us internally, not in this PR) in the host of jupyter notebook. 2) Start SparkConnectServer by spark-submit via Yarn Cluster mode with user-input Spark configurations and the local coordinator server's address and port. Append an additional listener class in the configuration for SparkConnectService callback with the actual address and port on Yarn to the coordinator server. 3) Wait for the coordinator server to receive the address callback from the SparkConnectService on Yarn and export the real address. 4) Start the client (pyspark --remote $callback_address) with the remote address. Finally, a remote SparkConnect Server is started on Yarn with a local SparkConnect client connected. Users no longer need to start the server beforehand and connect to the remote server after they manually explore the address on Yarn. #Problem statement of this change: 1) The specified port for the SparkConnectService GRPC server might be occupied on the node of the Hadoop Cluster. To increase the success rate of startup, it needs to retry on conflicts rather than fail directly. 
2) Because the final binding port could be uncertain due to the retries in 1), and the remote address is also unpredictable on Yarn, we need to retrieve the address and port programmatically and inject it automatically on the start of 'pyspark --remote'. To get the address of SparkConnectService on Yarn programmatically, the SparkConnectService needs to communicate its location back to the launcher side. ### Does this PR introduce _any_ user-facing change? 1. Add configuration `spark.connect.grpc.port.maxRetries` to enable port retries until an available port is found before reaching the maximum number of retries. 2. The start and stop events of the SparkConnectService are observable through the SparkListener. Two new events have been introduced: - SparkListenerConnectServiceStarted: the SparkConnectService (with address and port) is online for serving - SparkListenerConnectServiceEnd: the SparkConnectService (with address and port) is offline ### How was this patch tested? By UT, and verified the feature in our production environment with our binary build ### Was this patch authored or co-authored using generative AI tooling? No Closes #46182 from TakawaAkirayo/SPARK-47952. Lead-authored-by: tatian <tatian@ebay.com> Co-authored-by: TakawaAkirayo <153728772+TakawaAkirayo@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 07 June 2024, 00:02:17 UTC
edb9236 [SPARK-48504][PYTHON][CONNECT][FOLLOW-UP] Code clean up ### What changes were proposed in this pull request? Code clean up ### Why are the changes needed? Code clean up ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #46898 from zhengruifeng/win_refactor. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 06 June 2024, 23:30:48 UTC
ce1b08f [SPARK-48553][PYTHON][CONNECT] Cache more properties ### What changes were proposed in this pull request? Cache more properties: - Dataframe.isEmpty - Dataframe.isLocal - Dataframe.inputFiles - Dataframe.semanticHash - Dataframe.explain - SparkSession.version ### Why are the changes needed? to avoid unnecessary RPCs ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46896 from zhengruifeng/df_cache_more. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 06 June 2024, 23:15:44 UTC
0f21df0 [SPARK-48286] Fix analysis of column with exists default expression - Add user facing error ### What changes were proposed in this pull request? FIRST CHANGE Pass the correct parameter list to `org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze` when it is invoked from `org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column`. The `org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze` method accepts 3 parameters: 1) the field to analyze, 2) the statement type (String), and 3) the metadata key (CURRENT_DEFAULT or EXISTS_DEFAULT). The method `org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column` passes `fieldToAnalyze` and `EXISTS_DEFAULT` as the second parameter, so it is treated as the statement type rather than the metadata key, and a different expression is analyzed. Pull requests where the original change was introduced: https://github.com/apache/spark/pull/40049 - Initial commit https://github.com/apache/spark/pull/44876 - Refactor that did not touch the issue https://github.com/apache/spark/pull/44935 - Another refactor that did not touch the issue SECOND CHANGE Add a user-facing exception when the default value is not foldable or resolved. Otherwise, the user would see the message "You hit a bug in Spark ...". ### Why are the changes needed? It is needed to pass the correct value to the `Column` object ### Does this PR introduce _any_ user-facing change? Yes, this is a bug fix: the existence default value now has the proper expression, whereas before this change the existence default value was actually the column's current default value. ### How was this patch tested? Unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #46594 from urosstan-db/SPARK-48286-Analyze-exists-default-expression-instead-of-current-default-expression. Lead-authored-by: Uros Stankovic <uros.stankovic@databricks.com> Co-authored-by: Uros Stankovic <155642965+urosstan-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 June 2024, 20:08:48 UTC
84fa052 [SPARK-48283][SQL] Modify string comparison for UTF8_BINARY_LCASE ### What changes were proposed in this pull request? String comparison and hashing in UTF8_BINARY_LCASE is now context-unaware, and uses ICU root locale rules to convert string to lowercase at code point level, taking into consideration special cases for one-to-many case mapping. For example: comparing "ΘΑΛΑΣΣΙΝΟΣ" and "θαλασσινοσ" under UTF8_BINARY_LCASE now returns true, because Greek final sigma is special-cased in the new comparison implementation. ### Why are the changes needed? 1. UTF8_BINARY_LCASE should use ICU root locale rules (instead of JVM) 2. comparing strings under UTF8_BINARY_LCASE should be context-insensitive ### Does this PR introduce _any_ user-facing change? Yes, comparing strings under UTF8_BINARY_LCASE will now give different results in two kinds of special cases (Turkish dotted letter "i" and Greek final letter "sigma"). ### How was this patch tested? Unit tests in `CollationSupportSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46700 from uros-db/lcase-casing. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 June 2024, 19:05:28 UTC
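A quick illustrative check of the Greek final-sigma example from this commit; using the `collate` SQL function this way is a sketch and not taken from the commit itself:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Under UTF8_BINARY_LCASE, "ΘΑΛΑΣΣΙΝΟΣ" and "θαλασσινοσ" should now compare
# equal, since the final sigma is special-cased in the lowercase mapping.
spark.sql(
    "SELECT collate('ΘΑΛΑΣΣΙΝΟΣ', 'UTF8_BINARY_LCASE') = "
    "collate('θαλασσινοσ', 'UTF8_BINARY_LCASE') AS equal"
).show()
```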
b5a4b32 [SPARK-48435][SQL] UNICODE collation should not support binary equality ### What changes were proposed in this pull request? CollationFactory has been updated to no longer mark UNICODE as collation that supportsBinaryCollation. To reflect these changes, various tests have been updated. However, some tests have been (temporarily) removed because StringTrim no longer supports UNICODE collation given the new UNICODE definition in CollationFactory. At this time, StringTrim expression only supports UTF8_BINARY & UTF8_BINARY_LCASE, but not ICU collations. This work is in progress (https://github.com/apache/spark/pull/46762), so we'll ensure appropriate test coverage with those changes. ### Why are the changes needed? UNICODE collation should not support binary collation. Note: in the future, we may want to consider a collation such as UNICODE_BINARY, which will support binary equality, but also maintain UNICODE ordering. ### Does this PR introduce _any_ user-facing change? Yes, UNICODE is no longer treated as a binary collation. This affects how equality works for UNICODE, and also which codepath is taken for various collation-aware string expression given UNICODE collated string arguments. ### How was this patch tested? Updated existing unit and e2e sql test for UNICODE collation. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46772 from uros-db/fix-unicode. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 June 2024, 19:01:58 UTC
3878b57 [SPARK-48526][SS] Allow passing custom sink to testStream() ### What changes were proposed in this pull request? Update `StreamTest:testStream()` to allow passing a custom sink. This allows writing better tests covering streaming sinks, in particular: - reusing a sink across calls to testStream. - passing a custom sink implementation. ### Why are the changes needed? Better testing infrastructure. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No Closes #46866 from johanl-db/allow-passing-custom-sink-stream-test. Authored-by: Johan Lasperas <johan.lasperas@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 June 2024, 18:19:53 UTC
9f4007f [SPARK-48546][SQL] Fix ExpressionEncoder after replacing NullPointerExceptions with proper error classes in AssertNotNull expression ### What changes were proposed in this pull request? In https://github.com/apache/spark/pull/46793, we replaced NullPointerExceptions with proper error classes in AssertNotNull expression. However, that PR forgot to update the `ExpressionEncoder` to check for these new error classes. This PR fixes it to make sure we use the new error classes in all cases. ### Why are the changes needed? See above ### Does this PR introduce _any_ user-facing change? Yes, see above ### How was this patch tested? This PR updates tests with the new error classes ### Was this patch authored or co-authored using generative AI tooling? No Closes #46888 from dtenedor/fix-expr-encoder. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 June 2024, 18:18:24 UTC
7cba1ab [SPARK-48554][INFRA] Use R 4.4.0 in `windows` R GitHub Action Windows job ### What changes were proposed in this pull request? This PR aims to use R 4.4.0 in the `windows` R GitHub Action job. ### Why are the changes needed? R 4.4.0 is the latest release, released on 2024-04-24. https://www.r-project.org/ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46897 from panbingkun/SPARK-48554. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 June 2024, 16:42:09 UTC
65db876 [SPARK-48513][SS] Add error class for state schema compatibility and minor refactoring ### What changes were proposed in this pull request? Add error class for state schema compatibility and minor refactoring ### Why are the changes needed? Add error class for state schema compatibility and minor refactoring so that these errors can be tracked using the NERF framework ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added new unit tests ``` [info] Run completed in 8 seconds, 250 milliseconds. [info] Total number of tests run: 29 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 29, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #46856 from anishshri-db/task/SPARK-48513. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 06 June 2024, 12:52:02 UTC
ab00533 [SPARK-47933][PYTHON][TESTS][FOLLOW-UP] Enable doctest `pyspark.sql.connect.column` ### What changes were proposed in this pull request? Enable doctest `pyspark.sql.connect.column` ### Why are the changes needed? test coverage ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? manually check: I manually broke some doctests in `Column`, then found `pyspark.sql.connect.column` didn't fail: ``` (spark_dev_312) ➜ spark git:(master) ✗ python/run-tests -k --python-executables python3 --testnames 'pyspark.sql.classic.column' Running PySpark tests. Output is in /Users/ruifeng.zheng/Dev/spark/python/unit-tests.log Will test against the following Python executables: ['python3'] Will test the following Python tests: ['pyspark.sql.classic.column'] python3 python_implementation is CPython python3 version is: Python 3.12.2 Starting test(python3): pyspark.sql.classic.column (temp output: /Users/ruifeng.zheng/Dev/spark/python/target/4bdd14b8-92ba-43ba-a7fb-655e6769aeb9/python3__pyspark.sql.classic.column__i2_c1zct.log) WARNING: Using incubator modules: jdk.incubator.vector Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). ********************************************************************** File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/column.py", line 385, in pyspark.sql.column.Column.contains Failed example: df.filter(df.name.contains('o')).collect() Differences (ndiff with -expected +actual): - [Row(age=5, name='Bobx')] ? - + [Row(age=5, name='Bob')] ********************************************************************** 1 of 2 in pyspark.sql.column.Column.contains ***Test Failed*** 1 failures. Had test failures in pyspark.sql.classic.column with python3; see logs. (spark_dev_312) ➜ spark git:(master) ✗ python/run-tests -k --python-executables python3 --testnames 'pyspark.sql.connect.column' Running PySpark tests. Output is in /Users/ruifeng.zheng/Dev/spark/python/unit-tests.log Will test against the following Python executables: ['python3'] Will test the following Python tests: ['pyspark.sql.connect.column'] python3 python_implementation is CPython python3 version is: Python 3.12.2 Starting test(python3): pyspark.sql.connect.column (temp output: /Users/ruifeng.zheng/Dev/spark/python/target/2acaff3c-ef1d-41eb-b63e-509f3e0192c0/python3__pyspark.sql.connect.column__66td62h9.log) Finished test(python3): pyspark.sql.connect.column (3s) Tests passed in 3 seconds ``` after this PR, it fails as expected: ``` (spark_dev_312) ➜ spark git:(master) ✗ python/run-tests -k --python-executables python3 --testnames 'pyspark.sql.connect.column' Running PySpark tests. Output is in /Users/ruifeng.zheng/Dev/spark/python/unit-tests.log Will test against the following Python executables: ['python3'] Will test the following Python tests: ['pyspark.sql.connect.column'] python3 python_implementation is CPython python3 version is: Python 3.12.2 Starting test(python3): pyspark.sql.connect.column (temp output: /Users/ruifeng.zheng/Dev/spark/python/target/390ff7ae-7683-425c-b0d2-ee336e1ad452/python3__pyspark.sql.connect.column__f69b3smc.log) WARNING: Using incubator modules: jdk.incubator.vector Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). org.apache.spark.SparkSQLException: [INVALID_CURSOR.DISCONNECTED] The cursor is invalid. The cursor has been disconnected by the server. 
SQLSTATE: HY109 at org.apache.spark.sql.connect.execution.ExecuteGrpcResponseSender.execute(ExecuteGrpcResponseSender.scala:281) at org.apache.spark.sql.connect.execution.ExecuteGrpcResponseSender$$anon$1.run(ExecuteGrpcResponseSender.scala:101) ********************************************************************** File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/column.py", line 385, in pyspark.sql.column.Column.contains Failed example: df.filter(df.name.contains('o')).collect() Expected: [Row(age=5, name='Bobx')] Got: [Row(age=5, name='Bob')] ********************************************************************** 1 of 2 in pyspark.sql.column.Column.contains ***Test Failed*** 1 failures. Had test failures in pyspark.sql.connect.column with python3; see logs. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #46895 from zhengruifeng/fix_connect_column_doc_test. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 06 June 2024, 11:51:11 UTC
8cb78a7 [SPARK-48550][PS] Directly use the parent Window class ### What changes were proposed in this pull request? Directly use the parent Window class ### Why are the changes needed? the `get_window_class` method is no longer needed ### Does this PR introduce _any_ user-facing change? no, refactor only ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #46892 from zhengruifeng/del_get_win_class. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 06 June 2024, 11:16:14 UTC
f4434c3 [SPARK-48540][CORE] Avoid ivy output loading settings to stdout ### What changes were proposed in this pull request? This PR aims to avoid ivy output loading settings to stdout. ### Why are the changes needed? Now `org.apache.spark.util.MavenUtils#getModuleDescriptor` will output the following string to stdout. This is due to the modified code order in SPARK-32596 . ``` :: loading settings :: url = jar:file:xxxx/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml ``` Stack trace ```java at org.apache.ivy.core.settings.IvySettings.load(IvySettings.java:404) at org.apache.ivy.core.settings.IvySettings.loadDefault(IvySettings.java:443) at org.apache.ivy.Ivy.configureDefault(Ivy.java:435) at org.apache.ivy.core.IvyContext.getDefaultIvy(IvyContext.java:201) at org.apache.ivy.core.IvyContext.getIvy(IvyContext.java:180) at org.apache.ivy.core.IvyContext.getSettings(IvyContext.java:216) at org.apache.ivy.core.module.status.StatusManager.getCurrent(StatusManager.java:40) at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.<init>(DefaultModuleDescriptor.java:206) at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.newDefaultInstance(DefaultModuleDescriptor.java:107) at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.newDefaultInstance(DefaultModuleDescriptor.java:66) at org.apache.spark.deploy.SparkSubmitUtils$.getModuleDescriptor(SparkSubmit.scala:1413) at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1460) at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:327) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:942) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:181) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? local test ### Was this patch authored or co-authored using generative AI tooling? No Closes #46882 from cxzl25/SPARK-48540. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Kent Yao <yao@apache.org> 06 June 2024, 06:35:52 UTC
b3700ac [SPARK-48539][BUILD][TESTS] Upgrade docker-java to 3.3.6 ### What changes were proposed in this pull request? Upgrades docker-java to 3.3.6 ### Why are the changes needed? There are some bug fixes and enhancements: https://github.com/docker-java/docker-java/releases/tag/3.3.6 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46881 from wayneguow/docker_upgrade. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> 06 June 2024, 06:16:03 UTC
966c3d9 [SPARK-47552][CORE][FOLLOWUP] Set spark.hadoop.fs.s3a.connection.establish.timeout to numeric ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/45710 . Some custom `FileSystem` implementations read the `hadoop.fs.s3a.connection.establish.timeout` config as numeric, and do not support the `30s` syntax. To make it safe, this PR proposes to set this conf to `30000` instead of `30s`. I checked the doc page and this config is milliseconds. ### Why are the changes needed? more compatible with custom `FileSystem` implementations. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? manual ### Was this patch authored or co-authored using generative AI tooling? no Closes #46874 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 June 2024, 03:49:03 UTC
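A small sketch of the more portable way to set this timeout, passing a plain millisecond value rather than the `30s` duration syntax; the config key is the one named in the commit, the session-builder usage is illustrative:

```python
from pyspark.sql import SparkSession

# 30000 ms instead of "30s", so FileSystem implementations that parse the
# value as a number do not choke on the duration suffix.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.connection.establish.timeout", "30000")
    .getOrCreate()
)
```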
31ce2db [SPARK-48538][SQL] Avoid HMS memory leak caused by bonecp ### What changes were proposed in this pull request? As described in [HIVE-15551](https://issues.apache.org/jira/browse/HIVE-15551), HMS will leak memory when directsql is enabled for the MySQL metastore DB. Although HIVE-15551 has been resolved already, the bug can still occur on our side as we have multiple hive versions supported. Considering that bonecp has been removed from hive since 4.0.0 and HikariCP is not supported by all hive versions we support, we replace bonecp with `DBCP` to avoid the memory leak ### Why are the changes needed? fix the memory leak of HMS ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Ran `org.apache.spark.sql.hive.execution.SQLQuerySuite` and it passes without linkage errors ### Was this patch authored or co-authored using generative AI tooling? no Closes #46879 from yaooqinn/SPARK-48538. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 06 June 2024, 02:22:24 UTC
d5c33c6 [SPARK-48307][SQL][FOLLOWUP] Allow outer references in un-referenced CTE relations ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/46617 . Subquery expression has a bunch of correlation checks which need to match certain plan shapes. We broke this by leaving `WithCTE` in the plan for un-referenced CTE relations. This PR fixes the issue by skipping CTE plan nodes in correlated subquery expression checks. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? no bug is not released yet. ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #46869 from cloud-fan/check. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 June 2024, 21:38:44 UTC
490a4b3 [SPARK-48498][SQL] Always do char padding in predicates ### What changes were proposed in this pull request? For some data sources, CHAR type padding is not applied on both the write and read sides (by disabling `spark.sql.readSideCharPadding`), as a different SQL flavor, which is similar to MySQL: https://dev.mysql.com/doc/refman/8.0/en/char.html However, there is a bug in Spark that we always pad the string literal when comparing CHAR type and STRING literals, which assumes the CHAR type columns are always padded, either on the write side or read side. This is not always true. This PR makes Spark always pad the CHAR type columns when comparing with string literals, to satisfy the CHAR type semantic. ### Why are the changes needed? bug fix if people disable read side char padding ### Does this PR introduce _any_ user-facing change? Yes. After this PR, comparing CHAR type with STRING literals follows the CHAR semantic, while before it mostly returns false. ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #46832 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 June 2024, 20:01:11 UTC
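A hedged sketch of the behavior described above, with assumed table and column names; it shows where the padding semantics matter rather than reproducing the exact tests:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With read-side char padding disabled, a CHAR(5) column may come back
# unpadded; the predicate below should still follow CHAR semantics because
# Spark now pads the CHAR side of the comparison.
spark.conf.set("spark.sql.readSideCharPadding", "false")
spark.sql("CREATE TABLE char_demo (c CHAR(5)) USING parquet")
spark.sql("INSERT INTO char_demo VALUES ('abc')")
spark.sql("SELECT * FROM char_demo WHERE c = 'abc'").show()
```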
34ac7de [SPARK-48536][PYTHON][CONNECT] Cache user specified schema in applyInPandas and applyInArrow ### What changes were proposed in this pull request? Cache user specified schema in applyInPandas and applyInArrow ### Why are the changes needed? to avoid extra RPCs ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? added tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #46877 from zhengruifeng/cache_schema_apply_in_x. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 05 June 2024, 12:42:00 UTC
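For context, a minimal `applyInPandas` call with an explicit schema string, which is the kind of user-specified schema this change caches on the Connect client; the example data and function are assumptions:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).selectExpr("id % 2 AS g", "id AS v")

def mean_v(pdf: pd.DataFrame) -> pd.DataFrame:
    # One output row per group with the group key and the mean of v.
    return pd.DataFrame({"g": [pdf["g"].iloc[0]], "mean_v": [pdf["v"].mean()]})

# The explicit schema string below is what can now be cached client-side,
# avoiding an extra analysis round trip.
df.groupBy("g").applyInPandas(mean_v, schema="g long, mean_v double").show()
```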
88b8dc2 [SPARK-46937][SQL][FOLLOWUP] Properly check registered function replacement ### What changes were proposed in this pull request? A followup of https://github.com/apache/spark/pull/44976. `ConcurrentHashMap#put` has different semantics from the Scala map, and it returns null if the key is new. We should update the checking code accordingly. ### Why are the changes needed? avoid incorrect warning messages ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? manual ### Was this patch authored or co-authored using generative AI tooling? no Closes #46876 from cloud-fan/log. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> 05 June 2024, 09:40:59 UTC
c4f720d [SPARK-48535][SS] Update config docs to indicate possibility of data loss/corruption issue if skip nulls for stream-stream joins config is enabled ### What changes were proposed in this pull request? Update config docs to indicate possibility of data loss/corruption issue if skip nulls for stream-stream joins config is enabled ### Why are the changes needed? Clarifying the implications of turning off this config after a certain Spark version ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A - config doc only change ### Was this patch authored or co-authored using generative AI tooling? No Closes #46875 from anishshri-db/task/SPARK-48535. Authored-by: Anish Shrigondekar <anish.shrigondekar@databricks.com> Signed-off-by: Kent Yao <yao@apache.org> 05 June 2024, 08:34:45 UTC
db527ac Revert "[SPARK-48505][CORE] Simplify the implementation of `Utils#isG1GC`" This reverts commit abbe301d7645217f22641cf3a5c41502680e65be. 05 June 2024, 05:20:30 UTC
adbfd17 [SPARK-48533][CONNECT][PYTHON][TESTS] Add test for cached schema ### What changes were proposed in this pull request? Add a test for the cached schema, make Spark Classic's mapInXXX also work within `SparkConnectSQLTestCase`, and add a new `contextmanager` for `os.environ` ### Why are the changes needed? test coverage ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? no Closes #46871 from zhengruifeng/test_cached_schema. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 June 2024, 05:11:13 UTC
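A hypothetical sketch of what an `os.environ` context manager for tests can look like; the helper name, the placeholder variable, and the exact behavior in the PR may differ:

```python
import os
from contextlib import contextmanager

@contextmanager
def environ(overrides):
    # Temporarily apply environment-variable overrides, restoring previous
    # values (or removing added keys) on exit.
    old = {k: os.environ.get(k) for k in overrides}
    os.environ.update({k: str(v) for k, v in overrides.items()})
    try:
        yield
    finally:
        for k, v in old.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v

# Example usage in a test ("MY_TEST_FLAG" is a placeholder name):
with environ({"MY_TEST_FLAG": "1"}):
    pass  # code that should see the override
```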
4075ce6 [SPARK-48374][PYTHON][TESTS][FOLLOW-UP] Explicitly enable ANSI mode for non-ANSI build ### What changes were proposed in this pull request? This PR proposes to explicitly set ANSI mode in `test_toArrow_error` test. ### Why are the changes needed? To make non-ANSI build passing https://github.com/apache/spark/actions/runs/9342888897/job/25711689943: ``` ====================================================================== FAIL [0.180s]: test_toArrow_error (pyspark.sql.tests.connect.test_parity_arrow.ArrowParityTests.test_toArrow_error) ---------------------------------------------------------------------- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/sql/tests/test_arrow.py", line 1207, in test_toArrow_error with self.assertRaises(ArithmeticException): AssertionError: ArithmeticException not raised ---------------------------------------------------------------------- Ran 88 tests in 17.797s ``` ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46872 from HyukjinKwon/SPARK-48374-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 June 2024, 03:12:17 UTC
a17ab57 [MINOR][DOCS] Fix a typo in core-migration-guide.md ### What changes were proposed in this pull request? Fix a typo in core-migration-guide.md: - agressively -> aggressively ### Why are the changes needed? Fix mistakes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46864 from wayneguow/typo. Authored-by: Wei Guo <guow93@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 June 2024, 00:07:24 UTC
33aa467 [SPARK-48523][DOCS] Add `grpc_max_message_size ` description to `client-connection-string.md` ### What changes were proposed in this pull request? The pr aims to - add `grpc_max_message_size` description to `client-connection-string.md` - rename `hostname` to `host`. - fix some typo. ### Why are the changes needed? - In PR https://github.com/apache/spark/pull/45842, we extract a `constant` as a `parameter` for the connect client, and we need to update the related doc. - Make the parameter names in our doc consistent with those in the code, In the doc, it is called `hostname`, but in the code, it is called `host` https://github.com/apache/spark/blob/d273fdf37bc291aadf8677305bda2a91b593219f/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClientParser.scala#L36 ### Does this PR introduce _any_ user-facing change? Yes, only for doc `client-connection-string.md`. ### How was this patch tested? Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46862 from panbingkun/SPARK-48523. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 June 2024, 23:55:05 UTC
e47ce47 [SPARK-48485][CONNECT][SS] Support interruptTag and interruptAll in streaming queries ### What changes were proposed in this pull request? This PR proposes to support interruptTag and interruptAll in streaming queries ### Why are the changes needed? In order to provide a way to interrupt streaming queries in batch. ### Does this PR introduce _any_ user-facing change? Yes, `spark.interruptTag` and `spark.interruptAll` cancel streaming queries. ### How was this patch tested? TBD ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46819 from HyukjinKwon/interrupt-all. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 June 2024, 23:53:16 UTC
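A rough usage sketch, assuming a Spark Connect session (the tag and interrupt APIs are session-scoped); the rate source and noop sink are just convenient placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tag the operations started from this point, then start a streaming query...
spark.addTag("nightly-run")
query = (
    spark.readStream.format("rate").load()
    .writeStream.format("noop").start()
)

# ...then cancel by tag, or cancel everything in the session. With this
# change, these calls also interrupt streaming queries, not just batch jobs.
spark.interruptTag("nightly-run")
spark.interruptAll()
```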
8a0927c [SPARK-48307][SQL] InlineCTE should keep not-inlined relations in the original WithCTE node ### What changes were proposed in this pull request? I noticed an outdated comment in the rule `InlineCTE` ``` // CTEs in SQL Commands have been inlined by `CTESubstitution` already, so it is safe to add // WithCTE as top node here. ``` This is not true anymore after https://github.com/apache/spark/pull/42036. It's not a big deal as we replace not-inlined CTE relations with `Repartition` during optimization, so it doesn't matter where we put the `WithCTE` node with not-inlined CTE relations, as it will disappear eventually. But it's still better to keep it at its original place, as third-party rules may be sensitive to the plan shape. ### Why are the changes needed? to keep the plan shape as much as possible after inlining CTE relations. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #46617 from cloud-fan/cte. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 June 2024, 22:04:22 UTC
c7caac9 [SPARK-47972][SQL][FOLLOWUP] Restrict CAST expression for collations ### What changes were proposed in this pull request? Removal of immutable Seq import. ### Why are the changes needed? This import was added with https://github.com/apache/spark/pull/46474, but in reality is changing behaviour of other AstBuilder.scala rules and because of this needs to be removed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests cover this. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46860 from mihailom-db/SPARK-47972-FOLLOWUP. Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 June 2024, 18:05:58 UTC
651f687 [SPARK-48531][INFRA] Fix `Black` target version to Python 3.9 ### What changes were proposed in this pull request? This PR aims to fix `Black` target version to `Python 3.9`. ### Why are the changes needed? Since SPARK-47993 dropped Python 3.8 support officially at Apache Spark 4.0.0, we had better update target version to `Python 3.9`. - #46228 `py39` is the version for `Python 3.9`. ``` $ black --help | grep target -t, --target-version [py33|py34|py35|py36|py37|py38|py39|py310|py311|py312] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with Python linter. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46867 from dongjoon-hyun/SPARK-48531. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 June 2024, 17:28:50 UTC
8b88f5a [SPARK-48522][BUILD] Update Stream Library to 2.9.8 and attach its NOTICE ### What changes were proposed in this pull request? Update Stream Library to 2.9.8 and attach its NOTICE ### Why are the changes needed? update dep and notice file ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46861 from yaooqinn/SPARK-48522. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: yangjie01 <yangjie01@baidu.com> 04 June 2024, 13:33:01 UTC
f4afa22 [SPARK-48506][CORE] Compression codec short names are case insensitive except for event logging ### What changes were proposed in this pull request? Compression codec short names (e.g. for map statuses, broadcasts, shuffle, and parquet/orc/avro outputs) are case-insensitive, except for event logging. Calling `org.apache.spark.io.CompressionCodec.getShortName` causes this issue. In this PR, we make `CompressionCodec.getShortName` handle case sensitivity correctly. ### Why are the changes needed? Feature parity ### Does this PR introduce _any_ user-facing change? Yes, spark.eventLog.compression.codec now accepts not only the lowercased form of lz4, lzf, snappy, and zstd, but also forms with any of the characters upcased. ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #46847 from yaooqinn/SPARK-48506. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: yangjie01 <yangjie01@baidu.com> 04 June 2024, 12:33:51 UTC
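An illustrative configuration showing the user-facing part, i.e. an uppercased codec short name now being accepted for event log compression; the session-builder usage is an assumption, the config keys are the standard Spark ones:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.compress", "true")
    # "ZSTD" is now treated the same as "zstd" for event logging.
    .config("spark.eventLog.compression.codec", "ZSTD")
    .getOrCreate()
)
```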
d273fdf [SPARK-48519][BUILD] Upgrade jetty to 11.0.21 ### What changes were proposed in this pull request? This PR aims to upgrade jetty from 11.0.20 to 11.0.21. ### Why are the changes needed? The new version brings some bug fixes, such as [Reduce ByteBuffer churning in HttpOutput](https://github.com/jetty/jetty.project/commit/fe94c9f8a40df49021b28280f708448870c5b420). The full release notes are as follows: - https://github.com/jetty/jetty.project/releases/tag/jetty-11.0.21 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #46843 from LuciferYang/jetty-11.0.21. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 04 June 2024, 11:08:40 UTC
90ee299 [SPARK-48518][CORE] Make LZF compression be able to run in parallel ### What changes were proposed in this pull request? This PR introduces a config that switches LZF compression to parallel mode using PLZFOutputStream. FYI, https://github.com/ning/compress?tab=readme-ov-file#parallel-processing ### Why are the changes needed? Improve performance ``` [info] OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5 [info] Apple M2 Max [info] Compress large objects: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ----------------------------------------------------------------------------------------------------------------------------- [info] Compression 1024 array values in 7 threads 12 13 1 0.1 11788.2 1.0X [info] Compression 1024 array values single-threaded 23 23 0 0.0 22512.7 0.5X ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? benchmark ### Was this patch authored or co-authored using generative AI tooling? no Closes #46858 from yaooqinn/SPARK-48518. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 04 June 2024, 10:58:33 UTC
02c6456 [SPARK-48512][PYTHON][TESTS] Refactor Python tests ### What changes were proposed in this pull request? Use withSQLConf in tests when it is appropriate. ### Why are the changes needed? Enforce good practice for setting config in test cases. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? existing UT ### Was this patch authored or co-authored using generative AI tooling? NO Closes #46852 from amaliujia/refactor_pyspark. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 04 June 2024, 09:50:29 UTC
8cebb9b [SPARK-46937][SQL] Improve concurrency performance for FunctionRegistry ### What changes were proposed in this pull request? This PR propose to improve concurrency performance for `FunctionRegistry`. ### Why are the changes needed? Currently, `SimpleFunctionRegistryBase` adopted the `mutable.Map` caching function infos. The `SimpleFunctionRegistryBase` guarded by this so as ensure security under multithreading. Because all the mutable state are related to `functionBuilders`, we can delegate security to `ConcurrentHashMap`. `ConcurrentHashMap ` has higher concurrency activity and responsiveness. After this change, `FunctionRegistry` have better perf than before. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? GA. The benchmark test. ``` object FunctionRegistryBenchmark extends BenchmarkBase { override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { runBenchmark("FunctionRegistry") { val iters = 1000000 val threadNum = 4 val functionRegistry = FunctionRegistry.builtin val names = FunctionRegistry.expressions.keys.toSeq val barrier = new CyclicBarrier(threadNum + 1) val threadPool = ThreadUtils.newDaemonFixedThreadPool(threadNum, "test-function-registry") val benchmark = new Benchmark("SimpleFunctionRegistry", iters, output = output) benchmark.addCase("only read") { _ => for (_ <- 1 to threadNum) { threadPool.execute(new Runnable { val random = new Random() override def run(): Unit = { barrier.await() for (_ <- 1 to iters) { val name = names(random.nextInt(names.size)) val fun = functionRegistry.lookupFunction(new FunctionIdentifier(name)) assert(fun.map(_.getName).get == name) functionRegistry.listFunction() } barrier.await() } }) } barrier.await() barrier.await() } benchmark.run() } } } ``` The benchmark output before this PR. ``` Java HotSpot(TM) 64-Bit Server VM 17.0.9+11-LTS-201 on Mac OS X 10.14.6 Intel(R) Core(TM) i5-5350U CPU 1.80GHz SimpleFunctionRegistry: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ only read 54858 55043 261 0.0 54858.1 1.0X ``` The benchmark output after this PR. ``` Java HotSpot(TM) 64-Bit Server VM 17.0.9+11-LTS-201 on Mac OS X 10.14.6 Intel(R) Core(TM) i5-5350U CPU 1.80GHz SimpleFunctionRegistry: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ only read 20202 20264 88 0.0 20202.1 1.0X ``` ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #44976 from beliefer/SPARK-46937. Authored-by: beliefer <beliefer@163.com> Signed-off-by: beliefer <beliefer@163.com> 04 June 2024, 08:23:49 UTC
abbe301 [SPARK-48505][CORE] Simplify the implementation of `Utils#isG1GC` ### What changes were proposed in this pull request? This PR changes to use the result of `ManagementFactory.getGarbageCollectorMXBeans` to determine whether G1GC is used. When G1GC is used, `ManagementFactory.getGarbageCollectorMXBeans` will return two instances of `GarbageCollectorExtImpl`, their names are `G1 Young Generation` and `G1 Old Generation` respectively. ### Why are the changes needed? Simplify the implementation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #46783 from LuciferYang/refactor-isG1GC. Lead-authored-by: yangjie01 <yangjie01@baidu.com> Co-authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: Kent Yao <yao@apache.org> 04 June 2024, 07:41:41 UTC
c852c4f [SPARK-48318][SQL] Enable hash join support for all collations (complex types) ### What changes were proposed in this pull request? Enable collation support for hash join on complex types. - Logical plan is rewritten in analysis to (recursively) replace all non-binary strings with CollationKey - CollationKey is a unary expression that transforms StringType to BinaryType - Collation keys allow correct & efficient string comparison under specific collation rules ### Why are the changes needed? Improve JOIN performance for complex types containing collated strings. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Unit tests for `CollationKey` in `CollationExpressionSuite` - E2e SQL tests for `RewriteCollationJoin` in `CollationSuite` - Various queries with JOIN in existing TPCDS collation test suite ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46722 from uros-db/hash-join-cmx. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 June 2024, 07:10:50 UTC
6475ddf [SPARK-48514][BUILD][K8S] Upgrade `kubernetes-client` to 6.13.0 ### What changes were proposed in this pull request? Upgrade kubernetes-client from 6.12.1 to 6.13.0 ### Why are the changes needed? Upgrade Fabric8 Kubernetes Model to Kubernetes v1.30.0 [Release log 6.13.0](https://github.com/fabric8io/kubernetes-client/releases/tag/v6.13.0) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46854 from bjornjorgensen/kubclient6.13.0. Authored-by: Bjørn Jørgensen <bjornjorgensen@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> 04 June 2024, 06:51:15 UTC
560c083 [SPARK-48482][PYTHON] dropDuplicates and dropDuplicatesWithinWatermark should accept variable length args ### What changes were proposed in this pull request? In Scala, `dropDuplicates` and `dropDuplicatesWithinWatermark` accept varargs, i.e. `df.dropDuplicates("id", "value")`. However, this is not supported in Python; users have to wrap the columns in a list. This PR fixes it. ### Why are the changes needed? Better API and a consistent Scala/Python experience ### Does this PR introduce _any_ user-facing change? Yes, now users can use varargs as parameters to `dropDuplicates` and `dropDuplicatesWithinWatermark` ### How was this patch tested? Added & modified existing unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #46817 from WweiL/dropDuplicates-accept-vararg. Authored-by: Wei Liu <wei.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 June 2024, 00:24:17 UTC
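A short sketch of the old and new call styles in Python; the example data is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (1, "a"), (2, "b")], ["id", "value"]
)

df.dropDuplicates(["id", "value"]).show()  # previously the only form in Python
df.dropDuplicates("id", "value").show()    # now also accepted, matching Scala
```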
6272c05 [SPARK-48508][CONNECT][PYTHON] Cache user specified schema in `DataFrame.{to, mapInPandas, mapInArrow}` ### What changes were proposed in this pull request? Cache user specified schema in `DataFrame.{to, mapInPandas, mapInArrow}` ### Why are the changes needed? to avoid extra RPC to get the schema ### Does this PR introduce _any_ user-facing change? no, it should only be an optimization ### How was this patch tested? CI ### Was this patch authored or co-authored using generative AI tooling? No Closes #46848 from zhengruifeng/py_user_define_schema. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 June 2024, 23:44:45 UTC
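For reference, a minimal `mapInPandas` call with a user-specified schema string, the case this change optimizes on the Connect client; the function and data are assumptions:

```python
from typing import Iterator
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def add_one(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Increment the id column in every incoming batch.
    for pdf in batches:
        yield pdf.assign(id=pdf["id"] + 1)

# The explicit "id long" schema is what can now be cached, avoiding an
# extra RPC to fetch the result schema.
spark.range(5).mapInPandas(add_one, schema="id long").show()
```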
cfb79d9 [SPARK-47933][PYTHON][FOLLOWUP] Correct the error message ### What changes were proposed in this pull request? Correct the error message Closes #46850 from zhengruifeng/nit_dispatch_col_method. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 June 2024, 23:43:58 UTC