https://github.com/apache/spark

Revision | Message | Commit Date
1eb558c Preparing Spark release v3.4.3-rc2 15 April 2024, 00:21:11 UTC
df3e8e4 Preparing development version 3.4.4-SNAPSHOT 14 April 2024, 23:37:08 UTC
025af02 Preparing Spark release v3.4.3-rc1 14 April 2024, 23:37:04 UTC
572e97a [SPARK-47844][BUILD][3.4] Upgrade ORC to 1.8.7 ### What changes were proposed in this pull request? This PR aims to upgrade ORC to 1.8.7. ### Why are the changes needed? To bring the latest bug fixes. - https://orc.apache.org/news/2024/04/14/ORC-1.8.7/ - https://github.com/apache/orc/releases/tag/v1.8.7 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46042 from dongjoon-hyun/orc187. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 14 April 2024, 20:48:23 UTC
6736024 [SPARK-47318][CORE][3.4] Adds HKDF round to AuthEngine key derivation to follow standard KEX practices ### What changes were proposed in this pull request? Backport of SPARK-47318 to v3.4.0. This change adds an additional pass through a key derivation function (KDF) to the key exchange protocol in `AuthEngine`. Currently, it uses the shared secret from a bespoke key negotiation protocol directly. This is an encoded X coordinate on the X25519 curve. It is atypical and not recommended to use that coordinate directly as a key, but rather to pass it to a KDF. Note, Spark now supports TLS for RPC calls. It is preferable to use that rather than the bespoke AES RPC encryption implemented by `AuthEngine` and `TransportCipher`. ### Why are the changes needed? This follows best practices of key negotiation protocols. The encoded X coordinate is not guaranteed to be uniformly distributed over the 32-byte key space. Rather, we pass it through an HKDF function to map it uniformly to a 16-byte key space. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests under: `build/sbt "network-common/test:testOnly"` Specifically: `build/sbt "network-common/test:testOnly org.apache.spark.network.crypto.AuthEngineSuite"` `build/sbt "network-common/test:testOnly org.apache.spark.network.crypto.AuthIntegrationSuite"` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46015 from sweisdb/SPARK-47318-v3.4.0. Lead-authored-by: sweisdb <60895808+sweisdb@users.noreply.github.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: Steve Weis <steve.weis@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 13 April 2024, 23:55:15 UTC
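A minimal sketch of the key-derivation idea described above, using Bouncy Castle's HKDF primitives purely for illustration (an assumption; this is not Spark's actual `AuthEngine` code): the raw X25519 shared secret is not used as the AES key directly but is first mapped through HKDF onto a 16-byte key.

```scala
import org.bouncycastle.crypto.digests.SHA256Digest
import org.bouncycastle.crypto.generators.HKDFBytesGenerator
import org.bouncycastle.crypto.params.HKDFParameters

// Derive a uniform 16-byte key from the raw shared secret instead of using it directly.
def deriveSessionKey(sharedSecret: Array[Byte], info: Array[Byte]): Array[Byte] = {
  val hkdf = new HKDFBytesGenerator(new SHA256Digest())
  hkdf.init(new HKDFParameters(sharedSecret, /* salt = */ null, info))
  val key = new Array[Byte](16)
  hkdf.generateBytes(key, 0, key.length)
  key
}
```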
d0fd730 [SPARK-47824][PS] Fix nondeterminism in pyspark.pandas.series.asof ### What changes were proposed in this pull request? Use the monotonically increasing ID as the sorting condition for `max_by` instead of a literal string. ### Why are the changes needed? https://github.com/apache/spark/pull/35191 had an error where the literal string `"__monotonically_increasing_id__"` was used as the tie-breaker in `max_by` instead of the actual ID. ### Does this PR introduce _any_ user-facing change? Fixes nondeterminism in `asof`. ### How was this patch tested? In some circumstances `//python:pyspark.pandas.tests.connect.series.test_parity_as_of` is sufficient to reproduce. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46018 from markj-db/SPARK-47824. Authored-by: Mark Jarvin <mark.jarvin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a0ccdf27e5ff30817b8f058f08f98d5b44bad2db) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 April 2024, 00:37:56 UTC
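A hedged Scala sketch of the tie-breaking issue behind this fix (column names and values are illustrative, not the pandas-on-Spark internals; a running SparkSession named `spark` is assumed): `max_by` must break ties on the actual monotonically increasing ID column, and passing the column name as a string literal makes every row tie, so the result becomes nondeterministic.

```scala
import org.apache.spark.sql.functions._

val df = spark.range(4).toDF("v")
  .withColumn("__monotonically_increasing_id__", monotonically_increasing_id())

// Buggy shape: the tie-breaker is a constant string, so all rows compare equal.
df.select(expr("max_by(v, '__monotonically_increasing_id__')")).show()

// Fixed shape: tie-break on the ID column itself.
df.select(expr("max_by(v, __monotonically_increasing_id__)")).show()
```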
b1ea200 [MINOR][DOCS] Make the link of spark properties with YARN more accurate ### What changes were proposed in this pull request? This PR propose to make the link of spark properties with YARN more accurate. ### Why are the changes needed? Currently, the link of `YARN Spark Properties` is just the page of `running-on-yarn.html`. We should add the anchor point. ### Does this PR introduce _any_ user-facing change? 'Yes'. More convenient for readers to read. ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #45994 from beliefer/accurate-yarn-link. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit aca3d1025e2d85c02737456bfb01163c87ca3394) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 April 2024, 03:34:02 UTC
e94bb50 [MINOR][DOCS] Clarify relation between grouping API and `spark.sql.execution.arrow.maxRecordsPerBatch` ### What changes were proposed in this pull request? This PR fixes the documentation of `spark.sql.execution.arrow.maxRecordsPerBatch` to clarify the relation between `spark.sql.execution.arrow.maxRecordsPerBatch` and grouping API such as `DataFrame(.cogroup).groupby.applyInPandas`. ### Why are the changes needed? To address confusion about them. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the user-facing SQL configuration page https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration ### How was this patch tested? CI in this PR should verify them. I ran linters. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45993 from HyukjinKwon/minor-doc-change. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6c8e4cfd6f3f95455b0d4479f2527d425349f1cf) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 11 April 2024, 02:26:16 UTC
05f72fe [SPARK-47774][INFRA][3.4] Remove redundant rules from `MimaExcludes` ### What changes were proposed in this pull request? This PR aims to remove redundant rules from `MimaExcludes` for Apache Spark 3.4.x. Previously, these rules were required due to the `dev/mima` limitation which is fixed at - https://github.com/apache/spark/pull/45938 ### Why are the changes needed? To minimize the exclusion rules for Apache Spark 3.4.x by removing the rules related to the following `private class`. - `DeployMessages` https://github.com/apache/spark/blob/d3c75540788cf4ce86558feb38c197fdc1c8300e/core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala#L34 - `ShuffleBlockFetcherIterator` https://github.com/apache/spark/blob/d3c75540788cf4ce86558feb38c197fdc1c8300e/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L85-L86 - `BlockManagerMessages` https://github.com/apache/spark/blob/d3c75540788cf4ce86558feb38c197fdc1c8300e/core/src/main/scala/org/apache/spark/storage/BlockManagerMessages.scala#L25 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45949 from dongjoon-hyun/SPARK-47774-3.4. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 April 2024, 07:52:58 UTC
d3c7554 [SPARK-47770][INFRA] Fix `GenerateMIMAIgnore.isPackagePrivateModule` to return `false` instead of failing ### What changes were proposed in this pull request? This PR aims to fix `GenerateMIMAIgnore.isPackagePrivateModule` to work correctly. For example, `Metadata` is a case class inside package private `DefaultParamsReader` class. Currently, MIMA fails at this class analysis. https://github.com/apache/spark/blob/f8e652e88320528a70e605a6a3cf986725e153a5/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L474-L485 The root cause is `isPackagePrivateModule` fails due to `scala.ScalaReflectionException`. We can simply make `isPackagePrivateModule` return `false` instead of failing. ``` Error instrumenting class:org.apache.spark.ml.util.DefaultParamsReader$Metadata Exception in thread "main" scala.ScalaReflectionException: type Serializable is not a class at scala.reflect.api.Symbols$SymbolApi.asClass(Symbols.scala:284) at scala.reflect.api.Symbols$SymbolApi.asClass$(Symbols.scala:284) at scala.reflect.internal.Symbols$SymbolContextApiImpl.asClass(Symbols.scala:99) at scala.reflect.runtime.JavaMirrors$JavaMirror.classToScala1(JavaMirrors.scala:1085) at scala.reflect.runtime.JavaMirrors$JavaMirror.$anonfun$classToScala$1(JavaMirrors.scala:1040) at scala.reflect.runtime.JavaMirrors$JavaMirror.$anonfun$toScala$1(JavaMirrors.scala:150) at scala.reflect.runtime.TwoWayCaches$TwoWayCache.toScala(TwoWayCaches.scala:50) at scala.reflect.runtime.JavaMirrors$JavaMirror.toScala(JavaMirrors.scala:148) at scala.reflect.runtime.JavaMirrors$JavaMirror.classToScala(JavaMirrors.scala:1040) at scala.reflect.runtime.JavaMirrors$JavaMirror.typeToScala(JavaMirrors.scala:1148) at scala.reflect.runtime.JavaMirrors$JavaMirror$FromJavaClassCompleter.$anonfun$completeRest$2(JavaMirrors.scala:816) at scala.reflect.runtime.JavaMirrors$JavaMirror$FromJavaClassCompleter.$anonfun$completeRest$1(JavaMirrors.scala:816) at scala.reflect.runtime.JavaMirrors$JavaMirror$FromJavaClassCompleter.completeRest(JavaMirrors.scala:810) at scala.reflect.runtime.JavaMirrors$JavaMirror$FromJavaClassCompleter.complete(JavaMirrors.scala:806) at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1575) at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1538) at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$13.scala$reflect$runtime$SynchronizedSymbols$SynchronizedSymbol$$super$info(SynchronizedSymbols.scala:221) at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol.info(SynchronizedSymbols.scala:158) at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol.info$(SynchronizedSymbols.scala:158) at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$13.info(SynchronizedSymbols.scala:221) at scala.reflect.internal.Symbols$Symbol.initialize(Symbols.scala:1733) at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol.privateWithin(SynchronizedSymbols.scala:109) at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol.privateWithin$(SynchronizedSymbols.scala:107) at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$13.privateWithin(SynchronizedSymbols.scala:221) at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$13.privateWithin(SynchronizedSymbols.scala:221) at org.apache.spark.tools.GenerateMIMAIgnore$.isPackagePrivateModule(GenerateMIMAIgnore.scala:48) at org.apache.spark.tools.GenerateMIMAIgnore$.$anonfun$privateWithin$1(GenerateMIMAIgnore.scala:67) at scala.collection.immutable.List.foreach(List.scala:334) 
at org.apache.spark.tools.GenerateMIMAIgnore$.privateWithin(GenerateMIMAIgnore.scala:61) at org.apache.spark.tools.GenerateMIMAIgnore$.main(GenerateMIMAIgnore.scala:125) at org.apache.spark.tools.GenerateMIMAIgnore.main(GenerateMIMAIgnore.scala) ``` ### Why are the changes needed? **BEFORE** ``` $ dev/mima | grep org.apache.spark.ml.util.DefaultParamsReader Using SPARK_LOCAL_IP=localhost Using SPARK_LOCAL_IP=localhost Error instrumenting class:org.apache.spark.ml.util.DefaultParamsReader$Metadata$ Error instrumenting class:org.apache.spark.ml.util.DefaultParamsReader$Metadata Using SPARK_LOCAL_IP=localhost # I checked the following before deleing `.generated-mima-class-excludes ` $ cat .generated-mima-class-excludes | grep org.apache.spark.ml.util.DefaultParamsReader org.apache.spark.ml.util.DefaultParamsReader$ org.apache.spark.ml.util.DefaultParamsReader# org.apache.spark.ml.util.DefaultParamsReader ``` **AFTER** ``` $ dev/mima | grep org.apache.spark.ml.util.DefaultParamsReader Using SPARK_LOCAL_IP=localhost Using SPARK_LOCAL_IP=localhost [WARN] Unable to detect inner functions for class:org.apache.spark.ml.util.DefaultParamsReader.Metadata [WARN] Unable to detect inner functions for class:org.apache.spark.ml.util.DefaultParamsReader.Metadata Using SPARK_LOCAL_IP=localhost # I checked the following before deleting `.generated-mima-class-excludes `. $ cat .generated-mima-class-excludes | grep org.apache.spark.ml.util.DefaultParamsReader org.apache.spark.ml.util.DefaultParamsReader$Metadata$ org.apache.spark.ml.util.DefaultParamsReader$ org.apache.spark.ml.util.DefaultParamsReader#Metadata# org.apache.spark.ml.util.DefaultParamsReader# org.apache.spark.ml.util.DefaultParamsReader$Metadata org.apache.spark.ml.util.DefaultParamsReader#Metadata org.apache.spark.ml.util.DefaultParamsReader ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45938 from dongjoon-hyun/SPARK-47770. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 08c49637795fd56ef550a509648f0890ff22a948) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f0752f2701b1b8d5fbc38912edd9cd9325693bef) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 April 2024, 04:49:13 UTC
1f66a40 [SPARK-47734][PYTHON][TESTS][3.4] Fix flaky DataFrame.writeStream doctest by stopping streaming query ### What changes were proposed in this pull request? Backport of https://github.com/apache/spark/pull/45885. This PR deflakes the `pyspark.sql.dataframe.DataFrame.writeStream` doctest. PR https://github.com/apache/spark/pull/45298 aimed to fix that test but misdiagnosed the root issue. The problem is not that concurrent tests were colliding on a temporary directory. Rather, the issue is specific to the `DataFrame.writeStream` test's logic: that test is starting a streaming query that writes files to the temporary directory, the exits the temp directory context manager without first stopping the streaming query. That creates a race condition where the context manager might be deleting the directory while the streaming query is writing new files into it, leading to the following type of error during cleanup: ``` File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line ?, in pyspark.sql.dataframe.DataFrame.writeStream Failed example: with tempfile.TemporaryDirectory() as d: # Create a table with Rate source. df.writeStream.toTable( "my_table", checkpointLocation=d) Exception raised: Traceback (most recent call last): File "/usr/lib/python3.11/doctest.py", line 1353, in __run exec(compile(example.source, filename, "single", File "<doctest pyspark.sql.dataframe.DataFrame.writeStream[3]>", line 1, in <module> with tempfile.TemporaryDirectory() as d: File "/usr/lib/python3.11/tempfile.py", line 1043, in __exit__ self.cleanup() File "/usr/lib/python3.11/tempfile.py", line 1047, in cleanup self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors) File "/usr/lib/python3.11/tempfile.py", line 1029, in _rmtree _rmtree(name, onerror=onerror) File "/usr/lib/python3.11/shutil.py", line 738, in rmtree onerror(os.rmdir, path, sys.exc_info()) File "/usr/lib/python3.11/shutil.py", line 736, in rmtree os.rmdir(path, dir_fd=dir_fd) OSError: [Errno 39] Directory not empty: '/__w/spark/spark/python/target/4f062b09-213f-4ac2-a10a-2d704990141b/tmp29irqweq' ``` In this PR, I update the doctest to properly stop the streaming query. ### Why are the changes needed? Fix flaky test. ### Does this PR introduce _any_ user-facing change? No, test-only. Small user-facing doc change, but one that is consistent with other doctest examples. ### How was this patch tested? Manually ran updated test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45908 from JoshRosen/fix-flaky-writestream-doctest-3.4. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 07 April 2024, 22:05:45 UTC
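A hedged Scala analogue of the fix above (the streaming `df`, `checkpointDir`, and `outputDir` are assumed names): stop the streaming query before the temporary checkpoint and output directories are removed, so the query cannot race with the cleanup.

```scala
// `df` is an assumed streaming DataFrame; `checkpointDir` and `outputDir` are temp paths.
val query = df.writeStream
  .format("parquet")
  .option("checkpointLocation", checkpointDir)
  .start(outputDir)
try {
  query.processAllAvailable()   // let the query make some progress
} finally {
  query.stop()                  // stop before the temp directories are deleted
}
```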
6ab31d4 [SPARK-45445][BUILD][3.4] Upgrade snappy to 1.1.10.5 ### What changes were proposed in this pull request? This is a backport of #43254. The pr aims to upgrade snappy to 1.1.10.5. ### Why are the changes needed? - Although the `1.1.10.4` version was upgraded approximately 2-3 weeks ago, the new version includes some bug fixes, eg: <img width="868" alt="image" src="https://github.com/apache/spark/assets/15246973/6c7f05f7-382f-4e82-bb68-22fc50895b94"> - Full release notes: https://github.com/xerial/snappy-java/releases ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45902 from dongjoon-hyun/SPARK-45445-3.4. Authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2024, 22:38:25 UTC
2a453b1 [SPARK-47111][SQL][TESTS][3.4] Upgrade `PostgreSQL` JDBC driver to 42.7.2 and docker image to 16.2 ### What changes were proposed in this pull request? This is a backport of #45191 . This PR aims to upgrade `PostgreSQL` JDBC driver and docker images. - JDBC Driver: `org.postgresql:postgresql` to 42.7.2 - Docker Image: `postgres` from `15.1-alpine` to `16.2-alpine` ### Why are the changes needed? To use the latest PostgreSQL combination in the following integration tests. - PostgresIntegrationSuite - PostgresKrbIntegrationSuite - v2/PostgresIntegrationSuite - v2/PostgresNamespaceSuite ### Does this PR introduce _any_ user-facing change? No. This is a pure test-environment update. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45900 from dongjoon-hyun/SPARK-47111-3.4. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2024, 19:57:33 UTC
5f8a00b [SPARK-46411][BUILD][3.4] Change to use `bcprov/bcpkix-jdk18on` for UT ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/44359 . This PR migrates the test dependency `bcprov/bcpkix` from `jdk15on` to `jdk18on`, and upgrades the version from 1.70 to 1.77, the `jdk18on` jars are compiled to work with anything from Java 1.8 up. ### Why are the changes needed? The full release notes as follows: - https://www.bouncycastle.org/releasenotes.html#r1rv77 ### Does this PR introduce _any_ user-facing change? No, just for test. ### How was this patch tested? Pass GitHub Actions. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45898 from dongjoon-hyun/SPARK-46411-3.4. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 April 2024, 18:55:07 UTC
d9a6e5d [SPARK-44441][BUILD] Upgrade `bcprov-jdk15on` and `bcpkix-jdk15on` to 1.70 This pr aims to upgrade `bcprov-jdk15on` and `bcpkix-jdk15on` from 1.60 to 1.70 The new version fixed [CVE-2020-15522](https://github.com/bcgit/bc-java/wiki/CVE-2020-15522). The release notes as follows: - https://www.bouncycastle.org/releasenotes.html#r1rv70 No, just upgrade test dependency Pass Git Hub Actions Closes #42015 from LuciferYang/SPARK-44441. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> 05 April 2024, 15:32:04 UTC
e5293c7 [SPARK-44393][BUILD] Upgrade `H2` from 2.1.214 to 2.2.220 ### What changes were proposed in this pull request? Upgrade H2 from 2.1.214 to 2.2.220 [Changelog](https://www.h2database.com/html/changelog.html) ### Why are the changes needed? [CVE-2022-45868](https://nvd.nist.gov/vuln/detail/CVE-2022-45868) The following change in the release note fixes the CVE. [581ed18](https://github.com/h2database/h2database/commit/581ed18ff9d6b3761d851620ed88a3994a351a0d) Merge pull request [#3833](https://redirect.github.com/h2database/h2database/issues/3833) from katzyn/password ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA Closes #41963 from bjornjorgensen/h2-2.2.220. Authored-by: Bjørn Jørgensen <bjornjorgensen@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 April 2024, 15:28:50 UTC
9f8eb54 [SPARK-47666][SQL][3.4] Fix NPE when reading mysql bit array as LongType ### What changes were proposed in this pull request? This PR fixes an NPE when reading a MySQL bit array as LongType. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45793 from yaooqinn/PR_TOOL_PICK_PR_45790_BRANCH-3.4. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> 02 April 2024, 12:48:15 UTC
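A hedged sketch of the affected read path (connection details and table name are illustrative): a MySQL `BIT(n>1)` column surfaces as Spark's LongType, and reading it through the JDBC data source previously hit the NPE fixed here.

```scala
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "tbl_with_bit_column")
  .option("user", "user")
  .option("password", "password")
  .load()

df.printSchema()   // the BIT(8) column should appear as bigint
df.show()          // previously this read could throw a NullPointerException
```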
b0afd04 [SPARK-47676][BUILD] Clean up the removed `VersionsSuite` references ### What changes were proposed in this pull request? This PR aims to clean up the removed `VersionsSuite` reference. ### Why are the changes needed? At Apache Spark 3.3.0, `VersionsSuite` is removed via SPARK-38036 . - https://github.com/apache/spark/pull/35335 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45800 from dongjoon-hyun/SPARK-47676. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 128f74b055d3f290003f42259ffa23861eaa69e1) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 April 2024, 23:50:02 UTC
a154415 [SPARK-47646][SQL][FOLLOWUP][3.4] Replace non-existing try_to_number function with TryToNumber ### What changes were proposed in this pull request? This patch fixes broken CI by replacing non-existing `try_to_number` function in branch-3.4. ### Why are the changes needed? #45771 backported a test to `StringFunctionsSuite` in branch-3.4 but it uses `try_to_number` which is added since Spark 3.5. So this patch fixes the broken CI: https://github.com/apache/spark/actions/runs/8494692184/job/23270175100 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #45785 from viirya/fix. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 March 2024, 22:37:01 UTC
f16dd05 [SPARK-47646][SQL] Make try_to_number return NULL for malformed input This PR proposes to add NULL check after parsing the number so the output can be safely null for `try_to_number` expression. ```scala import org.apache.spark.sql.functions._ val df = spark.createDataset(spark.sparkContext.parallelize(Seq("11"))) df.select(try_to_number($"value", lit("$99.99"))).show() ``` ``` java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.types.Decimal.toPlainString()" because "<local7>" is null at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:894) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:894) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:368) at org.apache.spark.rdd.RDD.iterator(RDD.scala:332) ``` To fix the bug, and let `try_to_number` return `NULL` for malformed input as designed. Yes, it fixes a bug. Previously, `try_to_number` failed with NPE. Unittest was added. No. Closes #45771 from HyukjinKwon/SPARK-47646. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 31 March 2024, 01:10:49 UTC
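For reference, a hedged sketch of the behavior after the fix, reusing the `df` and imports from the snippet above (the printed column header is approximate): the same malformed input now yields NULL instead of the NPE shown in the stack trace.

```scala
// Reusing `df` from the example above; "11" does not match the "$99.99" format.
df.select(try_to_number($"value", lit("$99.99"))).show()
// +----------------------------+
// |try_to_number(value, $99.99)|
// +----------------------------+
// |                        NULL|
// +----------------------------+
```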
5e7600e [SPARK-47503][SQL][3.4] Make makeDotNode escape graph node name always ### What changes were proposed in this pull request? This is a backport of #45640. To prevent corruption of the dot file, a node name should be escaped even if there are no metrics to display. ### Why are the changes needed? This PR fixes a bug in the Spark History Server, which fails to display the query for a cached JDBC relation named in quotes. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45695 from alex35736/branch-3.4. Authored-by: Alexey <alexey13> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 March 2024, 15:54:28 UTC
77fd58b [SPARK-47537][SQL][3.4] Fix error data type mapping on MySQL Connector/J ### What changes were proposed in this pull request? This PR fixes: - BIT(n>1) is wrongly mapping to boolean instead of long for MySQL Connector/J. This is because we only have a case branch for Maria Connector/J. - MySQL Docker Integration Tests were using Maria Connector/J, not MySQL Connector/J ### Why are the changes needed? Bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #45691 from yaooqinn/SPARK-47537-BB. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 March 2024, 15:51:20 UTC
585845e [SPARK-47521][CORE] Use `Utils.tryWithResource` during reading shuffle data from external storage ### What changes were proposed in this pull request? In the method FallbackStorage.open, the file open is guarded by Utils.tryWithResource to avoid file handle leakage in case of failure during read. ### Why are the changes needed? To avoid file handle leakage in case of read failure. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs ### Was this patch authored or co-authored using generative AI tooling? No Closes #45663 from maheshk114/SPARK-47521. Authored-by: maheshbehera <maheshbehera@microsoft.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 245669053a34cb1d4a84689230e5bd1d163be5c6) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 March 2024, 17:45:12 UTC
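A minimal sketch of the loan pattern that `Utils.tryWithResource` implements (the signature here is an approximation, not Spark's exact utility): the resource is always closed, even when the body throws, so the handle cannot leak.

```scala
import java.io.Closeable

def tryWithResource[R <: Closeable, T](createResource: => R)(f: R => T): T = {
  val resource = createResource
  try f(resource) finally resource.close()   // close on both success and failure
}

// Usage shape comparable to guarding the shuffle-data read in FallbackStorage.open:
// tryWithResource(fs.open(dataFile)) { in => /* read the shuffle block */ }
```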
47c698e [SPARK-47505][INFRA][3.4] Fix `Pyspark-errors` test jobs for branch-3.4 ### What changes were proposed in this pull request? The pr aims to fix `pyspark-errors` test jobs for branch-3.4. ### Why are the changes needed? Fix `pyspark-errors` test jobs for branch-3.4. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45624 from panbingkun/branch-3.4_fix_pyerrors. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 March 2024, 17:34:11 UTC
a791017 [MINOR][CORE] Fix a comment typo `slf4j-to-jul` to `jul-to-slf4j` This PR aims to fix a typo `slf4j-to-jul` to `jul-to-slf4j`. There exists only one. ``` $ git grep slf4j-to-jul common/utils/src/main/scala/org/apache/spark/internal/Logging.scala: // slf4j-to-jul bridge order to route their logs to JUL. ``` Apache Spark uses `jul-to-slf4j` which includes a `java.util.logging` (jul) handler, namely `SLF4JBridgeHandler`, which routes all incoming jul records to the SLF4j API. https://github.com/apache/spark/blob/bb3e27581887a094ead0d2f7b4a6b2a17ee84b6f/pom.xml#L735 This typo was there since Apache Spark 1.0.0. No, this is a comment fix. Manual review. No. Closes #45625 from dongjoon-hyun/jul-to-slf4j. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit bb0867f54d437f6467274e854506aea2900bceb1) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 March 2024, 05:02:36 UTC
622ab53 [SPARK-47494][DOC] Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 ### What changes were proposed in this pull request? Add migration doc for the behavior change of Parquet timestamp inference since Spark 3.3 ### Why are the changes needed? Show the behavior change to users. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It's just doc change ### Was this patch authored or co-authored using generative AI tooling? Yes, there are some doc suggestion from copilot in docs/sql-migration-guide.md Closes #45623 from gengliangwang/SPARK-47494. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 11247d804cd370aaeb88736a706c587e7f5c83b3) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 March 2024, 22:17:39 UTC
4de8000 [SPARK-47481][INFRA][3.4] Pin `matplotlib<3.3.0` to fix Python linter failure ### What changes were proposed in this pull request? The pr aims to fix `python linter issue` on branch-3.4 through pinning `matplotlib<3.3.0` ### Why are the changes needed? - Through this PR https://github.com/apache/spark/pull/45600, we found that the version of `matplotlib` in our Docker image was `3.8.2`, which clearly did not meet the original requirements for `branch-3.4`. https://github.com/panbingkun/spark/actions/runs/8354370179/job/22869580038 <img width="1072" alt="image" src="https://github.com/apache/spark/assets/15246973/dd425bfb-ce5f-4a99-a487-a462d6ebbbb9"> https://github.com/apache/spark/blob/branch-3.4/dev/requirements.txt#L12 <img width="973" alt="image" src="https://github.com/apache/spark/assets/15246973/70485648-b886-4218-bb21-c41a85d5eecf"> - Fix as follows: <img width="989" alt="image" src="https://github.com/apache/spark/assets/15246973/db31d8fb-0b6c-4925-95e1-0ca0247bb9f5"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45608 from panbingkun/branch_3.4_pin_matplotlib. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 March 2024, 14:15:32 UTC
142677b [SPARK-47455][BUILD] Fix resource leak during the initialization of `scalaStyleOnCompileConfig` in `SparkBuild.scala` ### What changes were proposed in this pull request? https://github.com/apache/spark/blob/e01ed0da22f24204fe23143032ff39be7f4b56af/project/SparkBuild.scala#L157-L173 `Source.fromFile(in)` opens a `BufferedSource` resource handle, but it does not close it, this pr fix this issue. ### Why are the changes needed? Close resource after used. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #45582 from LuciferYang/SPARK-47455. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 85bf7615f85eea3e9192a7684ef711cf44042e05) Signed-off-by: yangjie01 <yangjie01@baidu.com> 20 March 2024, 07:20:06 UTC
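A hedged sketch of the shape of this fix (not the exact `SparkBuild.scala` code; the file name is illustrative): close the `BufferedSource` returned by `Source.fromFile` once the scalastyle configuration has been read.

```scala
import scala.io.Source

val source = Source.fromFile("scalastyle-config.xml")
val configText =
  try source.mkString
  finally source.close()   // release the underlying file handle
```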
d25f49a [SPARK-47472][INFRA][3.4] Pin `numpy` to 1.23.5 in `dev/infra/Dockerfile` ### What changes were proposed in this pull request? This PR aims to pin `numpy` to 1.23.5 in `dev/infra/Dockerfile` to recover the following test failure. ### Why are the changes needed? `numpy==1.23.5` was the version of the last successful run. - https://github.com/apache/spark/actions/runs/8276453417/job/22725387782 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? Closes #45595 from dongjoon-hyun/pin-numpy. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 20 March 2024, 03:53:45 UTC
7a899e2 [SPARK-47434][WEBUI] Fix `statistics` link in `StreamingQueryPage` ### What changes were proposed in this pull request? Like SPARK-24553, this PR aims to fix redirect issues (incorrect 302) when one is using proxy settings. Change the generated link to be consistent with other links and include a trailing slash ### Why are the changes needed? When using a proxy, an invalid redirect is issued if this is not included ### Does this PR introduce _any_ user-facing change? Only that people will be able to use these links if they are using a proxy ### How was this patch tested? With a proxy installed I went to the location this link would generate and could go to the page, when it redirects with the link as it exists. Edit: Further tested by building a version of our application with this patch applied, the links work now. ### Was this patch authored or co-authored using generative AI tooling? No. Page with working link <img width="913" alt="Screenshot 2024-03-18 at 4 45 27 PM" src="https://github.com/apache/spark/assets/5205457/dbcd1ffc-b7e6-4f84-8ca7-602c41202bf3"> Goes correctly to <img width="539" alt="Screenshot 2024-03-18 at 4 45 36 PM" src="https://github.com/apache/spark/assets/5205457/89111c82-b24a-4b33-895f-9c0131e8acb5"> Before it would redirect and we'd get a 404. <img width="639" alt="image" src="https://github.com/apache/spark/assets/5205457/1adfeba1-a1f6-4c35-9c39-e077c680baef"> Closes #45527 from HuwCampbell/patch-1. Authored-by: Huw Campbell <huw.campbell@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 9b466d329c3c75e89b80109755a41c2d271b8acc) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 March 2024, 14:38:28 UTC
b4e2c67 [SPARK-47433][PYTHON][DOCS][INFRA][3.4] Update PySpark package dependency with version ranges ### What changes were proposed in this pull request? This PR aims to update `PySpark` package dependency with version ranges. ### Why are the changes needed? Like Apache Spark 3.5+, we had better clarify the range of supported versions. This is a logical backport of subset of #41997 and #45553 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45554 from dongjoon-hyun/SPARK-47433. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 March 2024, 23:45:42 UTC
be0e44e [SPARK-45141][PYTHON][INFRA][TESTS] Pin `pyarrow==12.0.1` in CI Pin `pyarrow==12.0.1` in CI to fix test failure, https://github.com/apache/spark/actions/runs/6167186123/job/16738683632 ``` ====================================================================== FAIL [0.095s]: test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_datetime_ops.DatetimeOpsTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 122, in _assert_pandas_equal assert_series_equal( File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 931, in assert_series_equal assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}") File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 415, in assert_attr_equal raise_assert_detail(obj, msg, left_attr, right_attr) File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 599, in raise_assert_detail raise AssertionError(msg) AssertionError: Attributes of Series are different Attribute "dtype" are different [left]: datetime64[ns] [right]: datetime64[us] ``` No CI and manually test No Closes #42897 from zhengruifeng/pin_pyarrow. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit e3d2dfa8b514f9358823c3cb1ad6523da8a6646b) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 8049a203b8c5f2f8045701916e66cfc786e16b57) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 17 March 2024, 21:32:34 UTC
3c41b1d [SPARK-47428][BUILD][3.4] Upgrade Jetty to 9.4.54.v20240208 ### What changes were proposed in this pull request? This PR aims to upgrade Jetty to 9.4.54.v20240208 for Apache Spark 3.4.3. ### Why are the changes needed? To bring the latest bug fixes. - https://github.com/jetty/jetty.project/releases/tag/jetty-9.4.54.v20240208 - https://github.com/jetty/jetty.project/releases/tag/jetty-9.4.53.v20231009 - https://github.com/jetty/jetty.project/releases/tag/jetty-9.4.52.v20230823 - https://github.com/jetty/jetty.project/releases/tag/jetty-9.4.51.v20230217 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45544 from dongjoon-hyun/SPARK-47428-3.4. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 March 2024, 05:42:17 UTC
210e80e [SPARK-45587][INFRA] Skip UNIDOC and MIMA in `build` GitHub Action job ### What changes were proposed in this pull request? This PR aims to skip `Unidoc` and `MIMA` phases in many general test pipelines. `mima` test is moved to `lint` job. ### Why are the changes needed? By having an independent document generation and mima checking GitHub Action job, we can skip them in the following many jobs. https://github.com/apache/spark/blob/73f9f5296e36541db78ab10c4c01a56fbc17cca8/.github/workflows/build_and_test.yml#L142-L190 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually check the GitHub action logs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43422 from dongjoon-hyun/SPARK-45587. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 8c6eeb8ab0180368cc60de8b2dbae7457bee5794) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 16 March 2024, 05:38:52 UTC
0a7fe03 [SPARK-47375][DOC][FOLLOWUP] Fix a mistake in JDBC's preferTimestampNTZ option doc ### What changes were proposed in this pull request? Fix a mistake in JDBC's preferTimestampNTZ option doc ### Why are the changes needed? Fix a mistake in doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #45510 from gengliangwang/reviseJdbcDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 63b79c1eac01fe7ec88e608008916258b088aeff) Signed-off-by: Kent Yao <yao@apache.org> 14 March 2024, 13:02:18 UTC
645a769 [SPARK-47385] Fix tuple encoders with Option inputs https://github.com/apache/spark/pull/40755 adds a null check on the input of the child deserializer in the tuple encoder. It breaks the deserializer for the `Option` type, because null should be deserialized into `None` rather than null. This PR adds a boolean parameter to `ExpressionEncoder.tuple` so that only the user that https://github.com/apache/spark/pull/40755 intended to fix has this null check. Unit test. Closes #45508 from chenhao-db/SPARK-47385. Authored-by: Chenhao Li <chenhao.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9986462811f160eacd766da8a4e14a9cbb4b8710) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 March 2024, 06:29:09 UTC
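A hedged sketch of the behavior this fix preserves (assuming a SparkSession named `spark` in a shell-like context): `Option` fields inside a tuple must deserialize back to `None`, never to `null`.

```scala
import spark.implicits._

val ds = Seq((Some(1), "a"), (None, "b")).toDS()
ds.collect()   // expected: Array((Some(1),a), (None,b)); the missing value comes back as None, not null
```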
922f5f6 [SPARK-47375][DOC][FOLLOWUP] Correct the preferTimestampNTZ option description in JDBC doc ### What changes were proposed in this pull request? Correct the preferTimestampNTZ option description in JDBC doc as per https://github.com/apache/spark/pull/45496 ### Why are the changes needed? The current doc is wrong about the jdbc option preferTimestampNTZ ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #45502 from gengliangwang/ntzJdbc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit abfbd2718159d62e3322cca8c2d4ef1c29781b21) Signed-off-by: Gengliang Wang <gengliang@apache.org> 14 March 2024, 04:00:58 UTC
60b4c0b [SPARK-47368][SQL][3.5] Remove inferTimestampNTZ config check in ParquetRo… ### What changes were proposed in this pull request? The configuration `spark.sql.parquet.inferTimestampNTZ.enabled` is not related to the Parquet row converter. This PR removes the config check `spark.sql.parquet.inferTimestampNTZ.enabled` in the ParquetRowConverter. ### Why are the changes needed? Bug fix. Otherwise, reading TimestampNTZ columns may fail when `spark.sql.parquet.inferTimestampNTZ.enabled` is disabled. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #45492 from gengliangwang/PR_TOOL_PICK_PR_45480_BRANCH-3.5. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 3018a5d8cd96a569b3bfe7e11b4b26fb4fb54f32) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 March 2024, 05:42:59 UTC
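A hedged sketch of the scenario this guards (the path and schema string are illustrative): the inference flag only controls schema inference, so reading a Parquet file whose schema already declares a TIMESTAMP_NTZ column should still work with the flag disabled.

```scala
spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "false")

// Reading with an explicit TIMESTAMP_NTZ column must not fail in ParquetRowConverter.
spark.read.schema("ts TIMESTAMP_NTZ").parquet("/tmp/ntz_data").show()
```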
982fbc5 [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files ### What changes were proposed in this pull request? Add migration doc: TimestampNTZ type inference on Parquet files ### Why are the changes needed? Update docs. The behavior change was not mentioned in the SQL migration guide ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It's just doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #45482 from gengliangwang/ntzMigrationDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 621f2c88f3e56257ee517d65e093d32fb44b783e) Signed-off-by: Gengliang Wang <gengliang@apache.org> 12 March 2024, 22:11:59 UTC
8b43164 [SPARK-47305][SQL][TESTS][FOLLOWUP][3.4] Fix the compilation error related to `PropagateEmptyRelationSuite` ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/45406 has been backported to branch-3.4, where the newly added test case in `PropagateEmptyRelationSuite` uses `DataTypeUtils`, but `DataTypeUtils` is a utility class added in Apache Spark 3.5(SPARK-44475), so this triggered a compilation failure in branch-3.4: - https://github.com/apache/spark/actions/runs/8183755511/job/22377119069 ``` [error] /home/runner/work/spark/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala:229:27: not found: value DataTypeUtils [error] val schemaForStream = DataTypeUtils.fromAttributes(outputForStream) [error] ^ [error] /home/runner/work/spark/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala:233:26: not found: value DataTypeUtils [error] val schemaForBatch = DataTypeUtils.fromAttributes(outputForBatch) [error] ^ [info] done compiling [info] compiling 1 Scala source to /home/runner/work/spark/spark/connector/connect/common/target/scala-2.12/test-classes ... [info] compiling 25 Scala sources and 1 Java source to /home/runner/work/spark/spark/connector/connect/client/jvm/target/scala-2.12/classes ... [info] done compiling [error] two errors found ``` Therefore, this PR changes to use the `StructType.fromAttributes` function to fix the compilation failure." ### Why are the changes needed? Fix the compilation failure in branch-3.4 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #45428 from LuciferYang/SPARK-47305-FOLLOWUP-34. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> 08 March 2024, 08:44:54 UTC
ddb112d [MINOR][DOCS][PYTHON] Fix documentation typo in takeSample method ### What changes were proposed in this pull request? Fixed an error in the docstring documentation for the parameter `withReplacement` of `takeSample` method in `pyspark.RDD`, should be of type `bool`, but is `list` instead. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.takeSample.html ### Why are the changes needed? They correct a mistake in the documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? \- ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45419 from kimborowicz/master. Authored-by: Michał Kimborowicz <michal.kimbor@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 7a429aa84a5ed2c4b6448d43e88c475919ea2210) Signed-off-by: Kent Yao <yao@apache.org> 07 March 2024, 11:35:01 UTC
7e5d592 [SPARK-47305][SQL] Fix PruneFilters to tag the isStreaming flag of LocalRelation correctly when the plan has both batch and streaming ### What changes were proposed in this pull request? This PR proposes to fix PruneFilters to tag the isStreaming flag of LocalRelation correctly when the plan has both batch and streaming. ### Why are the changes needed? When filter is evaluated to be always false, PruneFilters replaces the filter with empty LocalRelation, which effectively prunes filter. The logic cares about migration of the isStreaming flag, but incorrectly migrated in some case, via picking up the value of isStreaming flag from root node rather than filter (or child). isStreaming flag is true if the value of isStreaming flag from any of children is true. Flipping the coin, some children might have isStreaming flag as "false". If the filter being pruned is a descendant to such children (in other word, ancestor of streaming node), LocalRelation is incorrectly tagged as streaming where it should be batch. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT verifying the fix. The new UT fails without this PR and passes with this PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45406 from HeartSaVioR/SPARK-47305. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 8d6bd9bbd29da6023e5740b622e12c7e1f8581ce) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 07 March 2024, 06:11:52 UTC
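A hedged illustration of the optimization involved, shown on a batch-only plan (assuming a SparkSession named `spark`): an always-false filter is pruned and replaced with an empty `LocalRelation`, and that relation must carry the `isStreaming` flag of the pruned subtree, not of the root plan.

```scala
import org.apache.spark.sql.functions._

val pruned = spark.range(10).toDF("v").filter(lit(1) === lit(2))

pruned.queryExecution.optimizedPlan             // an empty LocalRelation after PruneFilters
pruned.queryExecution.optimizedPlan.isStreaming // false, matching the batch child
```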
245a33c [SPARK-47146][CORE][FOLLOWUP] Rename incorrect logger name ### What changes were proposed in this pull request? Rename incorrect logger name in `UnsafeSorterSpillReader`. ### Why are the changes needed? The logger name in UnsafeSorterSpillReader is incorrect. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No ### Was this patch authored or co-authored using generative AI tooling? No Closes #45404 from JacobZheng0927/loggerNameFix. Authored-by: JacobZheng0927 <zsh517559523@163.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5089140e2e6a43ffef584b42aed5cd9bc11268b6) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 06 March 2024, 13:00:22 UTC
6ebfadf [SPARK-47146][CORE][3.5] Possible thread leak when doing sort merge join This PR backports https://github.com/apache/spark/pull/45327 to branch-3.5. ### What changes were proposed in this pull request? Add a TaskCompletionListener to close the inputStream, to avoid thread leakage caused by an unclosed ReadAheadInputStream. ### Why are the changes needed? To fix the issue SPARK-47146 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes #45390 from JacobZheng0927/SPARK-47146-3.5. Authored-by: JacobZheng0927 <zsh517559523@163.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit e9f7d36797c4344295556463da16f891bb96d8ac) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 06 March 2024, 02:36:03 UTC
15e1502 [SPARK-47177][SQL][3.4] Cached SQL plan do not display final AQE plan in explain string This PR backports https://github.com/apache/spark/pull/45282 to branch-3.4. ### What changes were proposed in this pull request? This PR adds a lock for ExplainUtils.processPlan to avoid a tag race condition. ### Why are the changes needed? To fix the issue [SPARK-47177](https://issues.apache.org/jira/browse/SPARK-47177) ### Does this PR introduce _any_ user-facing change? Yes, it affects the plan explain output. ### How was this patch tested? Added a test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45381 from ulysses-you/SPARK-47177-3.4. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com> 05 March 2024, 08:50:00 UTC
58a4a49 [SPARK-47236][CORE] Fix `deleteRecursivelyUsingJavaIO` to skip non-existing file input ### What changes were proposed in this pull request? This PR aims to fix `deleteRecursivelyUsingJavaIO` to skip non-existing file input. ### Why are the changes needed? `deleteRecursivelyUsingJavaIO` is a fallback of `deleteRecursivelyUsingUnixNative`. We should have identical capability. Currently, it fails. ``` [info] java.nio.file.NoSuchFileException: /Users/dongjoon/APACHE/spark-merge/target/tmp/spark-e264d853-42c0-44a2-9a30-22049522b04f [info] at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) [info] at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) [info] at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) [info] at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) [info] at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:148) [info] at java.base/java.nio.file.Files.readAttributes(Files.java:1851) [info] at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:126) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is difficult to test this `private static` Java method. I tested this with #45344 . ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45346 from dongjoon-hyun/SPARK-47236. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 1cd7bab5c5c2bd8d595b131c88e6576486dbf123) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 01 March 2024, 03:08:36 UTC
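A minimal sketch of the guard described above (a simplified stand-in, not the actual `JavaUtils` code): skip the recursive delete when the input path does not exist, matching what the Unix-native fallback already tolerates.

```scala
import java.io.File
import java.nio.file.Files

def deleteRecursively(file: File): Unit = {
  if (file == null || !Files.exists(file.toPath)) return   // skip non-existing input
  val children = file.listFiles()
  if (children != null) children.foreach(deleteRecursively)
  Files.deleteIfExists(file.toPath)
}
```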
faba320 [SPARK-47187][SQL][3.4] Fix hive compress output config does not work ### What changes were proposed in this pull request? This PR fixes the issue that `setupHadoopConfForCompression` did not set isCompressed as expected, because we implicitly convert ShimFileSinkDesc to FileSinkDesc. This issue does not affect the master branch, since we removed ShimFileSinkDesc in https://github.com/apache/spark/pull/40848 ### Why are the changes needed? To make `hive.exec.compress.output` work as expected. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug. ### How was this patch tested? Manual test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45286 from ulysses-you/fix. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com> 28 February 2024, 05:35:38 UTC
8cbceab [SPARK-47063][SQL] CAST long to timestamp has different behavior for codegen vs interpreted ### What changes were proposed in this pull request? When an overflow occurs casting long to timestamp there are different behaviors between codegen and interpreted ``` scala> Seq(Long.MaxValue, Long.MinValue).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) +--------------------+-------------------+---------------+ |v |ts |unix_micros(ts)| +--------------------+-------------------+---------------+ |9223372036854775807 |1969-12-31 20:59:59|-1000000 | |-9223372036854775808|1969-12-31 21:00:00|0 | +--------------------+-------------------+---------------+ scala> spark.conf.set("spark.sql.codegen.wholeStage", false) scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN") scala> Seq(Long.MaxValue, Long.MinValue).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) +--------------------+-----------------------------+--------------------+ |v |ts |unix_micros(ts) | +--------------------+-----------------------------+--------------------+ |9223372036854775807 |+294247-01-10 01:00:54.775807|9223372036854775807 | |-9223372036854775808|-290308-12-21 15:16:20.224192|-9223372036854775808| +--------------------+-----------------------------+--------------------+ ``` To align the behavior this PR change the codegen function the be the same as interpreted (https://github.com/apache/spark/blob/f0090c95ad4eca18040104848117a7da648ffa3c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687) ### Why are the changes needed? This is necesary to be consistent in all cases ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? With unit test and manually ### Was this patch authored or co-authored using generative AI tooling? No Closes #45294 from planga82/bugfix/spark47063_cast_codegen. Authored-by: Pablo Langa <soypab@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit f18d945af7b69fbc89b38b9ca3ca79263b0881ed) Signed-off-by: Kent Yao <yao@apache.org> 28 February 2024, 03:44:41 UTC
5ce628f [SPARK-47196][CORE][BUILD][3.4] Fix `core` module to succeed SBT tests ### What changes were proposed in this pull request? This PR aims to fix `core` module to succeed SBT tests by preserving `mockito-core`'s `byte-buddy` test dependency. Currently, `Maven` respects `mockito-core`'s byte-buddy dependency while SBT doesn't. **MAVEN** ``` $ build/mvn dependency:tree -pl core | grep byte-buddy ... [INFO] | +- net.bytebuddy:byte-buddy:jar:1.12.10:test [INFO] | +- net.bytebuddy:byte-buddy-agent:jar:1.12.10:test ``` **SBT** ``` $ build/sbt "core/test:dependencyTree" | grep byte-buddy ... [info] | | | | +-net.bytebuddy:byte-buddy:1.12.10 (evicted by: 1.12.18) [info] | | | | +-net.bytebuddy:byte-buddy:1.12.18 ... ``` Note that this happens at `branch-3.4` from Apache Spark 3.4.0~3.4.2 only. branch-3.3/branch-3.5/master are okay. ### Why are the changes needed? **BEFORE** ``` $ build/sbt "core/testOnly *.DAGSchedulerSuite" [info] DAGSchedulerSuite: [info] - [SPARK-3353] parent stage should have lower stage id *** FAILED *** (439 milliseconds) [info] java.lang.IllegalStateException: Could not initialize plugin: interface org.mockito.plugins.MockMaker (alternate: null) ... [info] *** 1 SUITE ABORTED *** [info] *** 118 TESTS FAILED *** [error] Error during tests: [error] org.apache.spark.scheduler.DAGSchedulerSuite [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 48 s, completed Feb 27, 2024, 1:26:27 PM ``` **AFTER** ``` $ build/sbt "core/testOnly *.DAGSchedulerSuite" ... [info] All tests passed. [success] Total time: 22 s, completed Feb 27, 2024, 1:24:34 PM ``` ### Does this PR introduce _any_ user-facing change? No, this is a test-only fix. ### How was this patch tested? Pass the CIs and manual tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45295 from dongjoon-hyun/SPARK-47196. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 February 2024, 02:22:08 UTC
3192c8c [SPARK-47125][SQL] Return null if Univocity never triggers parsing This PR proposes to prevent `null` for `tokenizer.getContext`. This is similar with https://github.com/apache/spark/pull/28029. `getContext` seemingly via the univocity library, it can return null if `begingParsing` is not invoked (https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/AbstractParser.java#L53). This can happen when `parseLine` is not invoked at https://github.com/apache/spark/blob/e081f06ea401a2b6b8c214a36126583d35eaf55f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L300 - `parseLine` invokes `begingParsing`. To fix up a bug. Yes. In a very rare case, when `CsvToStructs` is used as a sole predicate against an empty row, it might trigger NPE. This PR fixes it. Manually tested, but test case will be done in a separate PR. We should backport this to all branches. No. Closes #45210 from HyukjinKwon/SPARK-47125. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a87015efb5cf36103bc4eb82ae8613874e2eb408) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 February 2024, 03:14:55 UTC
ef02dbd [SPARK-47085][SQL][3.4] reduce the complexity of toTRowSet from n^2 to n ### What changes were proposed in this pull request? reduce the complexity of RowSetUtils.toTRowSet from n^2 to n ### Why are the changes needed? This causes performance issues. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests + test manually on AWS EMR ### Was this patch authored or co-authored using generative AI tooling? No Closes #45164 from igreenfield/branch-3.4. Authored-by: Izek Greenfield <izek.greenfield@adenza.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 February 2024, 16:12:35 UTC
081c7a7 [SPARK-47072][SQL][3.4] Fix supported interval formats in error messages ### What changes were proposed in this pull request? In the PR, I propose to add one more field to the keys of `supportedFormat` in `IntervalUtils` because the current implementation has duplicate keys that overwrite each other. For instance, the following keys are the same: ``` (YM.YEAR, YM.MONTH) ... (DT.DAY, DT.HOUR) ``` because `YM.YEAR = DT.DAY = 0` and `YM.MONTH = DT.HOUR = 1`. This is a backport of https://github.com/apache/spark/pull/45127. ### Why are the changes needed? To fix the incorrect error message when Spark cannot parse an ANSI interval string. For example, the expected format should be some year-month format but Spark outputs a day-time one: ```sql spark-sql (default)> select interval '-\t2-2\t' year to month; Interval string does not match year-month format of `[+|-]d h`, `INTERVAL [+|-]'[+|-]d h' DAY TO HOUR` when cast to interval year to month: - 2-2 . (line 1, pos 16) == SQL == select interval '-\t2-2\t' year to month ----------------^^^ ``` ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By running the existing test suite: ``` $ build/sbt "test:testOnly *IntervalUtilsSuite" ``` and regenerating the golden files: ``` $ SPARK_GENERATE_GOLDEN_FILES=1 PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Authored-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 074fcf2807000d342831379de0fafc1e49a6bf19) Closes #45140 from MaxGekk/fix-supportedFormat-3.4. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 19 February 2024, 07:26:47 UTC
b4e28df [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch This PR fixes the regression introduced by https://github.com/apache/spark/pull/36683. ```python import pandas as pd spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 0) spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", False) spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas() spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", -1) spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas() ``` **Before** ``` /.../spark/python/pyspark/sql/pandas/conversion.py:371: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and will not continue because automatic fallback with 'spark.sql.execution.arrow.pyspark.fallback.enabled' has been set to false. range() arg 3 must not be zero warn(msg) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 1483, in createDataFrame return super(SparkSession, self).createDataFrame( # type: ignore[call-overload] File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 351, in createDataFrame return self._create_from_pandas_with_arrow(data, schema, timezone) File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 633, in _create_from_pandas_with_arrow pdf_slices = (pdf.iloc[start : start + step] for start in range(0, len(pdf), step)) ValueError: range() arg 3 must not be zero ``` ``` Empty DataFrame Columns: [a] Index: [] ``` **After** ``` a 0 123 ``` ``` a 0 123 ``` It fixes a regression in documented behaviour. It should be backported to branch-3.4 and branch-3.5. Yes, it fixes a regression as described above. A unit test was added. No. Closes #45132 from HyukjinKwon/SPARK-47068. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 3bb762dc032866cfb304019cba6db01125556c2f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 February 2024, 03:43:07 UTC
d25ef73 [SPARK-46400][CORE][SQL][3.4] When there are corrupted files in the local maven repo, skip this cache and try again ### What changes were proposed in this pull request? The PR aims to - fix a potential bug (i.e. https://github.com/apache/spark/pull/44208) and enhance the user experience. - make the code more compliant with standards Backport of the above to branch-3.4. Master branch PR: https://github.com/apache/spark/pull/44343 ### Why are the changes needed? We use the local maven repo as the first-level cache in Ivy. The original intention was to reduce the time required to parse and obtain the artifacts, but when there are corrupted files in the local maven repo, the above mechanism is interrupted outright and the error message is very unfriendly, which greatly confuses the user. In keeping with the original intention, we should skip the cache directly in similar situations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45018 from panbingkun/branch-3.4_SPARK-46400. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 15 February 2024, 17:52:53 UTC
29adf32 [SPARK-47021][BUILD][TESTS] Fix `kvstore` module to have explicit `commons-lang3` test dependency ### What changes were proposed in this pull request? This PR aims to fix `kvstore` module by adding explicit `commons-lang3` test dependency and excluding `htmlunit-driver` from `org.scalatestplus` to use Apache Spark's explicit declaration. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/pom.xml#L711-L716 ### Why are the changes needed? Since Spark 3.3.0 (SPARK-37282), `kvstore` uses `commons-lang3` test dependency like the following, but we didn't declare it explicitly so far. https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java#L33 https://github.com/apache/spark/blob/fa23d276e7e4ed94bf11d71f2e1daa22fe2238e5/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBIteratorSuite.java#L23 Previously, it was provided by some unused `htmlunit-driver`'s transitive dependency accidentally. This causes a weird situation which `kvstore` module starts to fail to compile when we upgrade `htmlunit-driver`. We need to fix this first. ``` $ mvn dependency:tree -pl common/kvstore ... [INFO] | \- org.seleniumhq.selenium:htmlunit-driver:jar:4.12.0:test ... [INFO] | +- org.apache.commons:commons-lang3:jar:3.14.0:test ``` ### Does this PR introduce _any_ user-facing change? No. This is only a test dependency fix. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45080 from dongjoon-hyun/SPARK-47021. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a926c7912a78f1a2fb71c5ffd21b5c2f723a0128) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 February 2024, 18:43:44 UTC
74eaf59 [MINOR][DOCS] Add Missing space in `docs/configuration.md` ### What changes were proposed in this pull request? Add a missing space in the documentation file `docs/configuration.md`, which might lead to some misunderstanding for newcomers. ### Why are the changes needed? To eliminate ambiguity in sentences. ### Does this PR introduce _any_ user-facing change? Yes, it changes the documentation. ### How was this patch tested? I built the docs locally and double-checked the spelling. ### Was this patch authored or co-authored using generative AI tooling? No. It is just a little typo. Closes #45021 from KKtheGhost/fix/spell-configuration. Authored-by: KKtheGhost <dev@amd.sh> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit da73c123e648460dc7df04e9eda9d90445dfedff) Signed-off-by: Kent Yao <yao@apache.org> 05 February 2024, 01:49:59 UTC
9c36cfa [SPARK-46945][K8S][3.4] Add `spark.kubernetes.legacy.useReadWriteOnceAccessMode` for old K8s clusters ### What changes were proposed in this pull request? This PR aims to introduce a legacy configuration for the K8s PVC access mode to mitigate migration issues on old K8s clusters. This is a kind of backport of - #44985 ### Why are the changes needed? - The default value of `spark.kubernetes.legacy.useReadWriteOnceAccessMode` is `true` in branch-3.4. - To help users who cannot upgrade their K8s versions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44987 from dongjoon-hyun/SPARK-46945-3.4. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Kent Yao <yao@apache.org> 02 February 2024, 02:46:07 UTC
64115d9 [SPARK-46747][SQL] Avoid scan in getTableExistsQuery for JDBC Dialects ### What changes were proposed in this pull request? [SPARK-46747](https://issues.apache.org/jira/browse/SPARK-46747) reported an issue where Postgres instances suffered from too many shared locks, caused by Spark's table-existence check query. In this PR, we replace `"SELECT 1 FROM $table LIMIT 1"` with `"SELECT 1 FROM $table WHERE 1=0"` to prevent data from being scanned. ### Why are the changes needed? Overhead reduction for JDBC datasources. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing JDBC v1/v2 datasource tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44948 from yaooqinn/SPARK-46747. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 031df8fa62666f14f54cf0a792f7fa2acc43afee) Signed-off-by: Kent Yao <yao@apache.org> 31 January 2024, 01:45:51 UTC
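For reference, `JdbcDialect.getTableExistsQuery` is the hook involved; a minimal sketch of a dialect-level override in this style (the dialect itself is made up, only the query shape follows the commit):

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Illustrative dialect: probe table existence without touching any rows.
// "WHERE 1=0" lets the database answer from plan-time checks, while the old
// "LIMIT 1" form could take shared locks on real data in Postgres.
object ExampleDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:example")
  override def getTableExistsQuery(table: String): String =
    s"SELECT 1 FROM $table WHERE 1=0"
}

JdbcDialects.registerDialect(ExampleDialect)
```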
edaa0fd [SPARK-46893][UI] Remove inline scripts from UI descriptions ### What changes were proposed in this pull request? This PR prevents malicious users from injecting inline scripts via job and stage descriptions. Spark's Web UI [already checks the security of job and stage descriptions](https://github.com/apache/spark/blob/a368280708dd3c6eb90bd3b09a36a68bdd096222/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L528-L545) before rendering them as HTML (or treating them as plain text). The UI already disallows `<script>` tags but doesn't protect against attributes with inline scripts like `onclick` or `onmouseover`. ### Why are the changes needed? On multi-user clusters, bad users can inject scripts into their job and stage descriptions. The UI already finds that [worth protecting against](https://github.com/apache/spark/blob/a368280708dd3c6eb90bd3b09a36a68bdd096222/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L533-L535). So this is extending that protection to scripts in attributes. ### Does this PR introduce _any_ user-facing change? Yes if users relied on inline scripts or attributes in their job or stage descriptions. ### How was this patch tested? Added tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44933 from rshkv/wr/spark-46893. Authored-by: Willi Raschkowski <wraschkowski@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit abd9d27e87b915612e2a89e0d2527a04c7b984e0) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 30 January 2024, 06:43:39 UTC
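For context, descriptions usually reach the UI through `SparkContext.setJobDescription`; a small spark-shell sketch of the kind of input that is now rendered as plain text instead of live HTML (the description string is made up):

```scala
// An attribute-based inline script in a job description; after this fix the UI escapes it
// rather than rendering the onmouseover handler.
sc.setJobDescription("""<span onmouseover="alert('x')">nightly ETL</span>""")
sc.parallelize(1 to 10).count()
```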
51b021f [SPARK-46888][CORE] Fix `Master` to reject `/workers/kill/` requests if decommission is disabled This PR aims to fix `Master` to reject `/workers/kill/` requests if `spark.decommission.enabled` is `false`, in order to fix the dangling worker issue. Currently, `spark.decommission.enabled` is `false` by default, so when a user asks to decommission a worker, only the Master marks it `DECOMMISSIONED` while the worker stays alive. ``` $ curl -XPOST http://localhost:8080/workers/kill/\?host\=127.0.0.1 ``` **Master UI** ![Screenshot 2024-01-27 at 6 19 18 PM](https://github.com/apache/spark/assets/9700541/443bfc32-b924-438a-8bf6-c64b9afbc4be) **Worker Log** ``` 24/01/27 18:18:06 WARN Worker: Receive decommission request, but decommission feature is disabled. ``` This makes the Master consistent with the existing `Worker` behavior, which ignores the request. https://github.com/apache/spark/blob/1787a5261e87e0214a3f803f6534c5e52a0138e6/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L859-L868 No, this is a bug fix. Pass the CI with the newly added test case. No. Closes #44915 from dongjoon-hyun/SPARK-46888. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 20b593811dc02c96c71978851e051d32bf8c3496) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 January 2024, 04:29:50 UTC
5254840 [SPARK-46862][SQL][FOLLOWUP] Fix column pruning without schema enforcing in V1 CSV datasource ### What changes were proposed in this pull request? In the PR, I propose to invoke `CSVOptions.isColumnPruningEnabled`, introduced by https://github.com/apache/spark/pull/44872, while matching the CSV header to a schema in the V1 CSV datasource. ### Why are the changes needed? To fix the failure when column pruning happens and a schema is not enforced: ```scala scala> spark.read. | option("multiLine", true). | option("header", true). | option("escape", "\""). | option("enforceSchema", false). | csv("/Users/maximgekk/tmp/es-939111-data.csv"). | count() 24/01/27 12:43:14 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema: Header length: 4, schema size: 0 CSV file: file:///Users/maximgekk/tmp/es-939111-data.csv ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *CSVv1Suite" $ build/sbt "test:testOnly *CSVv2Suite" $ build/sbt "test:testOnly *CSVLegacyTimeParserSuite" $ build/sbt "testOnly *.CsvFunctionsSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44910 from MaxGekk/check-header-column-pruning. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit bc51c9fea3645c6ae1d9e1e83b0f94f8b849be20) Signed-off-by: Max Gekk <max.gekk@gmail.com> 27 January 2024, 16:23:38 UTC
113ca51 [SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode ### What changes were proposed in this pull request? In the PR, I propose to disable the column pruning feature in the CSV datasource for the `multiLine` mode. ### Why are the changes needed? To workaround the issue in the `uniVocity` parser used by the CSV datasource: https://github.com/uniVocity/univocity-parsers/issues/529 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *CSVv1Suite" $ build/sbt "test:testOnly *CSVv2Suite" $ build/sbt "test:testOnly *CSVLegacyTimeParserSuite" $ build/sbt "testOnly *.CsvFunctionsSuite" ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44872 from MaxGekk/csv-disable-column-pruning. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 829e742df8251c6f5e965cb08ad454ac3ee1a389) Signed-off-by: Max Gekk <max.gekk@gmail.com> 26 January 2024, 08:02:46 UTC
441c33d [SPARK-46855][INFRA][3.4] Add `sketch` to the dependencies of the `catalyst` in `module.py` ### What changes were proposed in this pull request? This PR adds `sketch` to the dependencies of the `catalyst` module in `module.py`, since `sketch` is a direct dependency of the `catalyst` module. ### Why are the changes needed? To ensure that when the `sketch` module is modified, both `catalyst` and the cascading modules trigger tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44894 from LuciferYang/SPARK-46855-34. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 January 2024, 06:36:32 UTC
3130ac9 [SPARK-46861][CORE] Avoid Deadlock in DAGScheduler * The DAGScheduler could currently run into a deadlock with another thread if both access the partitions of the same RDD at the same time. * To make progress in getCacheLocs, we require both exclusive access to the RDD partitions and the location cache. We first lock on the location cache, and then on the RDD. * When accessing partitions of an RDD, the RDD first acquires exclusive access on the partitions, and then might acquire exclusive access on the location cache. * If thread 1 is able to acquire access on the RDD, while thread 2 holds the access to the location cache, we can run into a deadlock situation. * To fix this, acquire locks in the same order. Change the DAGScheduler to first acquire the lock on the RDD, and then the lock on the location cache. * This is a deadlock you can run into, which can prevent any progress on the cluster. * No * Unit test that reproduces the issue. No Closes #44882 from fred-db/fix-deadlock. Authored-by: fred-db <fredrik.klauss@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 617014cc92d933c70c9865a578fceb265883badd) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 January 2024, 16:37:40 UTC
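The fix is the classic lock-ordering discipline; a standalone Scala sketch of the pattern (names are illustrative, not the actual DAGScheduler internals):

```scala
object LockOrderingSketch {
  final class LocationCache
  final class CachedRdd

  // Deadlock-prone: one thread locks the cache then the RDD while another locks the RDD
  // then the cache. The fix is to acquire both monitors in one fixed order everywhere.
  def getCacheLocs(rdd: CachedRdd, cache: LocationCache): Unit =
    rdd.synchronized {        // 1) exclusive access to the RDD's partitions first
      cache.synchronized {    // 2) then the location cache
        // read or update cached locations here
      }
    }
}
```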
e56bd97 [SPARK-46817][CORE] Fix `spark-daemon.sh` usage by adding `decommission` command ### What changes were proposed in this pull request? This PR aims to fix `spark-daemon.sh` usage by adding `decommission` command. ### Why are the changes needed? This was missed when SPARK-20628 added `decommission` command at Apache Spark 3.1.0. The command has been used like the following. https://github.com/apache/spark/blob/0356ac00947282b1a0885ad7eaae1e25e43671fe/sbin/decommission-worker.sh#L41 ### Does this PR introduce _any_ user-facing change? No, this is only a change on usage message. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44856 from dongjoon-hyun/SPARK-46817. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 00a92d328576c39b04cfd0fdd8a30c5a9bc37e36) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 January 2024, 00:39:02 UTC
894faab [SPARK-46794][SQL] Remove subqueries from LogicalRDD constraints This PR modifies `LogicalRDD` to filter out all subqueries from its `constraints`. Fixes a correctness bug. Spark can produce incorrect results when using a checkpointed `DataFrame` with a filter containing a scalar subquery. This subquery is included in the constraints of the resulting `LogicalRDD`, and may then be propagated as a filter when joining with the checkpointed `DataFrame`. This causes the subquery to be evaluated twice: once during checkpointing and once while evaluating the query. These two subquery evaluations may return different results, e.g. when the subquery contains a limit with an underspecified sort order. No Added a test to `DataFrameSuite`. No Closes #44833 from tomvanbussel/SPARK-46794. Authored-by: Tom van Bussel <tom.vanbussel@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d26e871136e0c6e1f84a25978319733a516b7b2e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2024, 16:51:21 UTC
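A hedged, reproduction-style sketch of the query shape described above: a checkpointed DataFrame whose filter contains a scalar subquery with an under-specified ordering (view and column names are made up):

```scala
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder path
import spark.implicits._

Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "tag").createOrReplaceTempView("events")

// The scalar subquery has no total order, so two evaluations may legally disagree.
val picked = spark.sql(
  "SELECT * FROM events WHERE id = (SELECT id FROM events LIMIT 1)").checkpoint()

// Before the fix, the subquery could leak into `picked`'s constraints and be re-evaluated
// when joining, potentially disagreeing with the checkpointed result.
picked.join(spark.table("events"), "id").show()
```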
c158a7a Revert "[SPARK-46417][SQL] Do not fail when calling hive.getTable and throwException is false" This reverts commit 114754382b19eded5488d63101fa4c520e8551ca. 23 January 2024, 09:37:57 UTC
34f81d6 [SPARK-46763] Fix assertion failure in ReplaceDeduplicateWithAggregate for duplicate attributes ### What changes were proposed in this pull request? - Updated the `ReplaceDeduplicateWithAggregate` implementation to reuse aliases generated for an attribute. - Added a unit test to ensure scenarios with duplicate non-grouping keys are correctly optimized. ### Why are the changes needed? - `ReplaceDeduplicateWithAggregate` replaces `Deduplicate` with an `Aggregate` operator with grouping expressions for the deduplication keys and aggregate expressions for the non-grouping keys (to preserve the output schema and keep the non-grouping columns). - For non-grouping key `a#X`, it generates an aggregate expression of the form `first(a#X, false) AS a#Y` - In case the non-grouping keys have a repeated attribute (with the same name and exprId), the existing logic would generate two different aggregate expressions both having two different exprId. - This then leads to duplicate rewrite attributes error (in `transformUpWithNewOutput`) when transforming the remaining tree. - For example, for the query ``` Project [a#0, b#1] +- Deduplicate [b#1] +- Project [a#0, a#0, b#1] +- LocalRelation <empty>, [a#0, b#1] ``` the existing logic would transform it to ``` Project [a#3, b#1] +- Aggregate [b#1], [first(a#0, false) AS a#3, first(a#0, false) AS a#5, b#1] +- Project [a#0, a#0, b#1] +- LocalRelation <empty>, [a#0, b#1] ``` with the aggregate mapping having two entries `a#0 -> a#3, a#0 -> a#5`. The correct transformation would be ``` Project [a#3, b#1] +- Aggregate [b#1], [first(a#0, false) AS a#3, first(a#0, false) AS a#3, b#1] +- Project [a#0, a#0, b#1] +- LocalRelation <empty>, [a#0, b#1] ``` with the aggregate mapping having only one entry `a#0 -> a#3`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test in `ResolveOperatorSuite`. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44835 from nikhilsheoran-db/SPARK-46763. Authored-by: Nikhil Sheoran <125331115+nikhilsheoran-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 715b43428913d6a631f8f9043baac751b88cb5d4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 January 2024, 09:21:26 UTC
2621882 [SPARK-46801][PYTHON][TESTS] Do not treat exit code 5 as a test failure in Python testing script ### What changes were proposed in this pull request? This PR proposes to avoid treating the exit code 5 as a test failure in Python testing script. ### Why are the changes needed? ``` ... ======================================================================== Running PySpark tests ======================================================================== Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.12'] Will test the following Python modules: ['pyspark-core', 'pyspark-streaming', 'pyspark-errors'] python3.12 python_implementation is CPython python3.12 version is: Python 3.12.1 Starting test(python3.12): pyspark.streaming.tests.test_context (temp output: /__w/spark/spark/python/target/8674ed86-36bd-47d1-863b-abb0405557f6/python3.12__pyspark.streaming.tests.test_context__umu69c3v.log) Finished test(python3.12): pyspark.streaming.tests.test_context (12s) Starting test(python3.12): pyspark.streaming.tests.test_dstream (temp output: /__w/spark/spark/python/target/847eb56b-3c5f-49ab-8a83-3326bb96bc5d/python3.12__pyspark.streaming.tests.test_dstream__rorhk0lc.log) Finished test(python3.12): pyspark.streaming.tests.test_dstream (102s) Starting test(python3.12): pyspark.streaming.tests.test_kinesis (temp output: /__w/spark/spark/python/target/78f23c83-c24d-4fa1-abbd-edb90f48dff1/python3.12__pyspark.streaming.tests.test_kinesis__q5l1pv0h.log) test_kinesis_stream (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." test_kinesis_stream_api (pyspark.streaming.tests.test_kinesis.KinesisStreamTests.test_kinesis_stream_api) ... skipped "Skipping all Kinesis Python tests as environmental variable 'ENABLE_KINESIS_TESTS' was not set." ---------------------------------------------------------------------- Ran 0 tests in 0.000s NO TESTS RAN (skipped=2) Had test failures in pyspark.streaming.tests.test_kinesis with python3.12; see logs. Error: running /__w/spark/spark/python/run-tests --modules=pyspark-core,pyspark-streaming,pyspark-errors --parallelism=1 --python-executables=python3.12 ; received return code 255 Error: Process completed with exit code 19. ``` Scheduled job fails because of exit 5, see https://github.com/pytest-dev/pytest/issues/2393. This isn't a test failure. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No, Closes #44841 from HyukjinKwon/SPARK-46801. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 52b62921cadb05da5b1183f979edf7d608256f2e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 January 2024, 01:07:20 UTC
97536c6 [SPARK-46779][SQL] `InMemoryRelation` instances of the same cached plan should be semantically equivalent When canonicalizing `output` in `InMemoryRelation`, use `output` itself as the schema for determining the ordinals, rather than `cachedPlan.output`. `InMemoryRelation.output` and `InMemoryRelation.cachedPlan.output` don't necessarily use the same exprIds. E.g.: ``` +- InMemoryRelation [c1#340, c2#341], StorageLevel(disk, memory, deserialized, 1 replicas) +- LocalTableScan [c1#254, c2#255] ``` Because of this, `InMemoryRelation` will sometimes fail to fully canonicalize, resulting in cases where two semantically equivalent `InMemoryRelation` instances appear to be semantically nonequivalent. Example: ``` create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (3, 7), (4, 5); cache table data; select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all; ``` If plan change validation checking is on (i.e., `spark.sql.planChangeValidation=true`), the failure is: ``` [PLAN_VALIDATION_FAILED_RULE_EXECUTOR] The input plan of org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$2 is invalid: Aggregate: Aggregate [c1#78, scalar-subquery#77 [c1#78]], [c1#78, scalar-subquery#77 [c1#78] AS scalarsubquery(c1)#90L, count(c2#79) AS count(c2)#83L] ... is not a valid aggregate expression: [SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION] The correlated scalar subquery '"scalarsubquery(c1)"' is neither present in GROUP BY, nor in an aggregate function. ``` If plan change validation checking is off, the failure is more mysterious: ``` [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 ``` If you remove the cache command, the query succeeds. The above failures happen because the subquery in the aggregate expressions and the subquery in the grouping expressions seem semantically nonequivalent since the `InMemoryRelation` in one of the subquery plans failed to completely canonicalize. In `CacheManager#useCachedData`, two lookups for the same cached plan may create `InMemoryRelation` instances that have different exprIds in `output`. That's because the plan fragments used as lookup keys may have been deduplicated by `DeduplicateRelations`, and thus have different exprIds in their respective output schemas. When `CacheManager#useCachedData` creates an `InMemoryRelation` instance, it borrows the output schema of the plan fragment used as the lookup key. The failure to fully canonicalize has other effects. For example, this query fails to reuse the exchange: ``` create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (2, 4), (3, 7), (7, 22); cache table data; set spark.sql.autoBroadcastJoinThreshold=-1; set spark.sql.adaptive.enabled=false; select * from data l join data r on l.c1 = r.c1; ``` No. New tests. No. Closes #44806 from bersprockets/plan_validation_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit b80e8cb4552268b771fc099457b9186807081c4a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 January 2024, 19:15:21 UTC
137528a [SPARK-44495][INFRA][K8S][3.4] Use the latest minikube in K8s IT ### What changes were proposed in this pull request? This is a backport of #44813 . This PR aims to recover GitHub Action K8s IT to use the latest Minikube and to make it sure that Apache Spark K8s module are tested with all Minikubes without any issues. **BEFORE** - Minikube: v1.30.1 - K8s: v1.26.3 **AFTER** - Minikube: v1.32.0 - K8s: v1.28.3 ### Why are the changes needed? - Previously, it was pinned due to the failure. - After this PR, we will track the latest Minikube and K8s version always. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44820 from dongjoon-hyun/SPARK-44495-3.4. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 January 2024, 08:19:23 UTC
bb4a893 [SPARK-46786][K8S] Fix `MountVolumesFeatureStep` to use `ReadWriteOncePod` instead of `ReadWriteOnce` This PR aims to fix a duplicated volume mounting bug by using `ReadWriteOncePod` instead of `ReadWriteOnce`. This bug fix is based on the stable K8s feature which has been available since v1.22. - [KEP-2485: ReadWriteOncePod PersistentVolume AccessMode](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2485-read-write-once-pod-pv-access-mode/README.md) - https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes - v1.22 Alpha - v1.27 Beta - v1.29 Stable For the record, the minimum K8s version of GKE/EKS/AKS is **v1.24** as of today, and the latest v1.29 is supported like the following. - [2024.01 (GKE Regular Channel)](https://cloud.google.com/kubernetes-engine/docs/release-schedule) - [2024.02 (AKS GA)](https://learn.microsoft.com/en-us/azure/aks/supported-kubernetes-versions?tabs=azure-cli#aks-kubernetes-release-calendar) This is a bug fix. Pass the CIs with the existing PV-related tests. No. Closes #44817 from dongjoon-hyun/SPARK-46786. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 45ec74415a4a89851968941b80c490e37ee88daf) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 January 2024, 01:52:38 UTC
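For reference, a sketch of the documented PVC mount configuration this code path serves; with the fix, the on-demand claim is requested with the `ReadWriteOncePod` access mode instead of `ReadWriteOnce` (the volume name, storage class, size, and mount path below are placeholders):

```scala
import org.apache.spark.SparkConf

// Per-executor on-demand PVC; "OnDemand" tells Spark to create a fresh claim per pod.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName", "OnDemand")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass", "standard")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit", "10Gi")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path", "/data")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly", "false")
```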
37a0e46 [MINOR][DOCS] Add zstandard as a candidate to fix the desc of spark.sql.avro.compression.codec ### What changes were proposed in this pull request? Add zstandard as a candidate to fix the desc of spark.sql.avro.compression.codec ### Why are the changes needed? docfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? doc build ### Was this patch authored or co-authored using generative AI tooling? no Closes #44783 from yaooqinn/avro_minor. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit c040824fd75c955dbc8e5712bc473a0ddb9a8c0f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 18 January 2024, 16:21:42 UTC
8f89f40 [SPARK-46715][INFRA][3.4] Pin `sphinxcontrib-*` ### What changes were proposed in this pull request? Pin `sphinxcontrib-*` and other deps for doc ### Why are the changes needed? to fix CI ### Does this PR introduce _any_ user-facing change? no, test-only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44762 from zhengruifeng/infra_pin_doc_34. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 17 January 2024, 03:24:16 UTC
bac0033 [SPARK-46700][CORE] Count the last spilling for the shuffle disk spilling bytes metric ### What changes were proposed in this pull request? This PR fixes a long-standing bug in ShuffleExternalSorter about the "spilled disk bytes" metrics. When we close the sorter, we will spill the remaining data in the buffer, with a flag `isLastFile = true`. This flag means the spilling will not increase the "spilled disk bytes" metrics. This makes sense if the sorter has never spilled before, then the final spill file will be used as the final shuffle output file, and we should keep the "spilled disk bytes" metrics as 0. However, if spilling did happen before, then we simply miscount the final spill file for the "spilled disk bytes" metrics today. This PR fixes this issue, by setting that flag when closing the sorter only if this is the first spilling. ### Why are the changes needed? make metrics accurate ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #44709 from cloud-fan/shuffle. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 4ea374257c1fdb276abcd6b953ba042593e4d5a3) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 January 2024, 21:53:03 UTC
0172413 [SPARK-46704][CORE][UI] Fix `MasterPage` to sort `Running Drivers` table by `Duration` column correctly ### What changes were proposed in this pull request? This PR aims to fix `MasterPage` to sort `Running Drivers` table by `Duration` column correctly. ### Why are the changes needed? Since Apache Spark 3.0.0, `MasterPage` shows `Duration` column of `Running Drivers`. **BEFORE** <img width="111" src="https://github.com/apache/spark/assets/9700541/50276e34-01be-4474-803d-79066e06cb2c"> **AFTER** <img width="111" src="https://github.com/apache/spark/assets/9700541/a427b2e6-eab0-4d73-9114-1d8ff9d052c2"> ### Does this PR introduce _any_ user-facing change? Yes, this is a bug fix of UI. ### How was this patch tested? Manual. Run a Spark standalone cluster. ``` $ SPARK_MASTER_OPTS="-Dspark.master.rest.enabled=true -Dspark.deploy.maxDrivers=2" sbin/start-master.sh $ sbin/start-worker.sh spark://$(hostname):7077 ``` Submit multiple jobs via REST API. ``` $ curl -s -k -XPOST http://localhost:6066/v1/submissions/create \ --header "Content-Type:application/json;charset=UTF-8" \ --data '{ "appResource": "", "sparkProperties": { "spark.master": "spark://localhost:7077", "spark.app.name": "Test 1", "spark.submit.deployMode": "cluster", "spark.jars": "/Users/dongjoon/APACHE/spark-merge/examples/target/scala-2.13/jars/spark-examples_2.13-4.0.0-SNAPSHOT.jar" }, "clientSparkVersion": "", "mainClass": "org.apache.spark.examples.SparkPi", "environmentVariables": {}, "action": "CreateSubmissionRequest", "appArgs": [ "10000" ] }' ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44711 from dongjoon-hyun/SPARK-46704. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 25c680cfd4dc63aeb9d16a673ee431c57188b80d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 12 January 2024, 20:54:46 UTC
ce22af0 [SPARK-46628][INFRA] Use SPDX short identifier in `license` name ### What changes were proposed in this pull request? This PR aims to use SPDX short identifier as `license`'s `name` field. - https://spdx.org/licenses/Apache-2.0.html ### Why are the changes needed? SPDX short identifier is recommended as `name` field by `Apache Maven`. - https://maven.apache.org/pom.html#Licenses ASF pom file has been using it. This PR aims to match with ASF pom file. - https://github.com/apache/maven-apache-parent/pull/118 - https://github.com/apache/maven-apache-parent/blob/7888bdb8ee653ecc03b5fee136540a607193c240/pom.xml#L46 ``` <name>Apache-2.0</name> ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44631 from dongjoon-hyun/SPARK-46628. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d008f81a9d8d4b5e8e434469755405f6ae747e75) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 January 2024, 00:24:23 UTC
53683a8 [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column This PR fixes a long-standing bug: `OrcColumnarBatchReader` does not respect the memory mode when creating column vectors for missing columns. To not violate the memory mode requirement. No. New test. No. Closes #44598 from cloud-fan/orc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 0c1c5e93e376b97a6d2dae99e973b9385155727a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 January 2024, 20:44:10 UTC
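A minimal sketch of memory-mode-aware vector allocation using the public `OnHeapColumnVector`/`OffHeapColumnVector` classes; it illustrates the invariant the fix restores rather than the actual reader code:

```scala
import org.apache.spark.memory.MemoryMode
import org.apache.spark.sql.execution.vectorized.{OffHeapColumnVector, OnHeapColumnVector, WritableColumnVector}
import org.apache.spark.sql.types.DataType

object MissingColumnSketch {
  // A vector backing a missing column must honor the reader's memory mode; otherwise an
  // off-heap batch ends up holding an on-heap vector, which is the bug described above.
  def missingColumnVector(capacity: Int, dt: DataType, mode: MemoryMode): WritableColumnVector = {
    val vector =
      if (mode == MemoryMode.OFF_HEAP) new OffHeapColumnVector(capacity, dt)
      else new OnHeapColumnVector(capacity, dt)
    vector.putNulls(0, capacity)  // a missing column is all nulls
    vector.setIsConstant()
    vector
  }
}
```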
2eb603c [SPARK-46577][SQL] HiveMetastoreLazyInitializationSuite leaks hive's SessionState ### What changes were proposed in this pull request? The upcoming tests with the new hive configurations will have no effect due to the leaked SessionState. ``` 06:21:12.848 pool-1-thread-1 INFO ThriftServerWithSparkContextInHttpSuite: Trying to start HiveThriftServer2: mode=http, attempt=0 .... 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:OperationManager is inited. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:SessionManager is inited. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: CLIService is inited. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:ThriftBinaryCLIService is inited. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: HiveServer2 is inited. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:OperationManager is started. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:SessionManager is started. 06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: CLIService is started. 06:21:12.852 pool-1-thread-1 INFO AbstractService: Service:ThriftBinaryCLIService is started. 06:21:12.852 pool-1-thread-1 INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads 06:21:12.852 pool-1-thread-1 INFO AbstractService: Service:HiveServer2 is started. ``` As the logs above reveal, ThriftServerWithSparkContextInHttpSuite started the ThriftBinaryCLIService instead of the ThriftHttpCLIService. This is because in HiveClientImpl, the new configurations are only applied to the hive conf during initialization, not to existing ones. This causes ThriftServerWithSparkContextInHttpSuite to retry or even abort. ### Why are the changes needed? Fix flakiness in tests ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Ran tests locally with the hive-thriftserver module. ### Was this patch authored or co-authored using generative AI tooling? no Closes #44578 from yaooqinn/SPARK-46577. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 605fecd22cc18fc9b93fb26d4aa6088f5a314f92) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 January 2024, 13:55:15 UTC
b478eb4 [SPARK-46425][INFRA] Pin the bundler version in CI Currently documentation build is broken: https://github.com/apache/spark/actions/runs/7226413850/job/19691970695 ``` ... ERROR: Error installing bundler: The last version of bundler (>= 0) to support your Ruby & RubyGems was 2.4.22. Try installing it with `gem install bundler -v 2.4.22` bundler requires Ruby version >= 3.0.0. The current ruby version is 2.7.0.0. ``` This PR uses the suggestion. To recover the CI. No, dev-only. CI in this PR verify it. No. Closes #44376 from HyukjinKwon/SPARK-46425. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit d0da1172b7d87b68a8af8464c6486aa586324241) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6e8dbacf8a1402878a2a4be295bbe78e7c78327e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 January 2024, 16:37:16 UTC
ee44914 [SPARK-46514][TESTS] Fix HiveMetastoreLazyInitializationSuite ### What changes were proposed in this pull request? This PR enables the assertion in HiveMetastoreLazyInitializationSuite. ### Why are the changes needed? To restore the test's intention. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass HiveMetastoreLazyInitializationSuite ### Was this patch authored or co-authored using generative AI tooling? no Closes #44500 from yaooqinn/SPARK-46514. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit d0245d34c004935bb2c904bfd906836df3d574fa) Signed-off-by: Kent Yao <yao@apache.org> 28 December 2023, 02:54:59 UTC
f8eb533 [SPARK-46466][SQL][3.5] Vectorized parquet reader should never do rebase for timestamp ntz Backport of https://github.com/apache/spark/pull/44428 ### What changes were proposed in this pull request? This fixes a correctness bug. TIMESTAMP_NTZ is a new data type in Spark and has no legacy files that need calendar rebase. However, the vectorized parquet reader treats it the same as LTZ and may rebase it if the parquet file was written with the legacy rebase mode. This PR fixes the reader to never rebase NTZ. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? Yes, now we can correctly write and read back NTZ values even if the date is before 1582. ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? No Closes #44446 from cloud-fan/ntz2. Authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 0948e24c30f6f7a05110f6e45b6723897e095aeb) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 December 2023, 15:25:45 UTC
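A small spark-shell round-trip sketch of the scenario: an ancient TIMESTAMP_NTZ value written while the legacy rebase mode is requested, then read back through the vectorized reader (the path is a placeholder):

```scala
// Ask for the legacy rebase mode on write, mimicking how old files could have been produced.
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
spark.sql("SELECT TIMESTAMP_NTZ'1001-01-01 00:00:00' AS ts")
  .write.mode("overwrite").parquet("/tmp/ntz_demo")

// With the fix, the vectorized reader returns the NTZ value as written instead of rebasing it.
spark.read.parquet("/tmp/ntz_demo").show(false)
```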
254d634 [SPARK-46330] Loading of Spark UI blocks for a long time when HybridStore enabled ### What changes were proposed in this pull request? Move `LoadedAppUI` invalidate operation out of `FsHistoryProvider` synchronized block. ### Why are the changes needed? When closing a HybridStore of a `LoadedAppUI` with a lot of data waiting to be written to disk, loading of other Spark UIs will be blocked for a long time. See more details at https://issues.apache.org/jira/browse/SPARK-46330 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44260 from zhouyifan279/SPARK-46330. Authored-by: zhouyifan279 <zhouyifan279@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit cf54e8f9a51bf54e8fa3e1011ac370e46134b134) Signed-off-by: Kent Yao <yao@apache.org> 20 December 2023, 08:51:32 UTC
1147543 [SPARK-46417][SQL] Do not fail when calling hive.getTable and throwException is false ### What changes were proposed in this pull request? Users can set up their own HMS and let Spark connect to it. We have no control over it, and sometimes it's not even Hive but just an HMS-API-compatible service. Spark should be more fault-tolerant when calling HMS APIs. This PR fixes an issue in `hive.getTable` with `throwException = false`, to make sure we don't throw an error when we can't fetch the table. ### Why are the changes needed? Avoid query failures caused by HMS bugs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? In our production environment ### Was this patch authored or co-authored using generative AI tooling? No Closes #44364 from cloud-fan/hive. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 59488039f58b18617cd6dfd6dbe3bf014af222e7) Signed-off-by: Kent Yao <yao@apache.org> 15 December 2023, 10:55:51 UTC
b813e2e [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage` ### What changes were proposed in this pull request? This PR aims to remove `kill` hyperlink from `RELAUNCHING` drivers in `MasterPage`. ### Why are the changes needed? Since Apache Spark 1.4.0 (SPARK-5495), `RELAUNCHING` drivers have `kill` hyperlinks in the `Completed Drivers` table. ![Screenshot 2023-12-11 at 1 02 29 PM](https://github.com/apache/spark/assets/9700541/38f4bf08-efb9-47e5-8a7a-f7d127429012) However, this is a bug because the driver was already terminated by definition. Newly relaunched driver has an independent ID and there is no relationship with the previously terminated ID. https://github.com/apache/spark/blob/7db85642600b1e3b39ca11e41d4e3e0bf1c8962b/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala#L27 If we clicked the `kill` link, `Master` always complains like the following. ``` 23/12/11 21:25:50 INFO Master: Asked to kill driver 202312112113-00000 23/12/11 21:25:50 WARN Master: Driver 202312112113-00000 has already finished or does not exist ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44301 from dongjoon-hyun/SPARK-46369. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit e434c9f0d5792b7af43c87dd6145fd8a6a04d8e2) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit ac031d68a01f14cc73f05e83a790a6787aa6453d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 11 December 2023, 23:06:26 UTC
4e80b3a [SPARK-46339][SS] Directory with batch number name should not be treated as metadata log ### What changes were proposed in this pull request? This patch updates the documentation of the `CheckpointFileManager.list` method to reflect the fact that it returns both files and directories, to reduce confusion. For usages like `HDFSMetadataLog`, which assume that the file statuses returned by `list` are all files, we add a filter there to avoid a confusing error. ### Why are the changes needed? `HDFSMetadataLog` takes a metadata path as a parameter. When it retrieves all batch metadata, it calls `CheckpointFileManager.list` to get all files under the metadata path. However, currently all implementations of `CheckpointFileManager.list` return all files/directories under the given path. So if there is a directory whose name is a batch number (a long value), the directory will be returned too and cause trouble when `HDFSMetadataLog` goes to read it. Actually, the `CheckpointFileManager.list` method clearly defines that it lists the "files" in a path. That said, the current implementations don't follow the doc. We tried to make the `list` implementations return only files, but some usages (state metadata) of the `list` method already break that assumption and use directories returned by `list`. So we simply update the `list` method documentation to explicitly state that it returns both files and directories, and we add a filter in `HDFSMetadataLog` on the file statuses returned by `list` to avoid this issue. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added test ### Was this patch authored or co-authored using generative AI tooling? No Closes #44272 from viirya/fix_metadatalog. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 75805f07f5caeb01104a7352b02790d03a043ded) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 28a8b181e96d4ce71e2f9888910214d14a859b7d) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 09 December 2023, 23:22:08 UTC
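The guard amounts to keeping only plain files whose names parse as batch numbers before reading them; a standalone sketch with the Hadoop FileSystem API (paths are placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// List a metadata directory and ignore subdirectories such as one named "42",
// so only real batch files are parsed.
val metadataPath = new Path("/tmp/checkpoint/offsets")
val fs = metadataPath.getFileSystem(new Configuration())
val batchFiles: Array[FileStatus] = fs
  .listStatus(metadataPath)
  .filter(_.isFile)
  .filter(s => scala.util.Try(s.getPath.getName.toLong).isSuccess)
```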
4745138 [SPARK-46275][3.4] Protobuf: Return null in permissive mode when deserialization fails This is a cherry-pick of #44214 into the 3.4 branch. From the original PR: ### What changes were proposed in this pull request? This updates the behavior of the `from_protobuf()` built-in function when the underlying record fails to deserialize. * **Current behavior**: * By default, this would throw an error and the query fails. [This part is not changed in the PR] * When `mode` is set to 'PERMISSIVE' it returns a non-null struct with each of the inner fields set to null e.g. `{ "field_a": null, "field_b": null }` etc. * This is not very convenient for users. They don't know whether this was due to a malformed record or the input itself being null. It is very hard to check each field for null in a SQL query (imagine a SQL query with a struct that has 10 fields). * **New behavior** * When `mode` is set to 'PERMISSIVE' it simply returns `null`. ### Why are the changes needed? This makes it easier for users to detect and handle malformed records. ### Does this PR introduce _any_ user-facing change? Yes, but this does not change the contract. In fact, it clarifies it. ### How was this patch tested? - Unit tests are updated. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44265 from rangadi/protobuf-null-3.4. Authored-by: Raghu Angadi <raghu.angadi@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 08 December 2023, 22:40:03 UTC
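Usage-wise, the mode travels through the options map of `from_protobuf`; a hedged sketch where the input DataFrame `df`, the descriptor file, and the message name are placeholders:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

// With PERMISSIVE mode, a malformed binary record now yields a NULL struct for the whole
// value instead of a struct of NULL fields, so bad rows are easy to isolate.
val options = new java.util.HashMap[String, String]()
options.put("mode", "PERMISSIVE")

val parsed = df.select(
  from_protobuf(col("payload"), "ExampleEvent", "/tmp/example.desc", options).as("event"))
val malformed = parsed.where(col("event").isNull)
```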
8e40ec6 [SPARK-45580][SQL][3.4] Handle case where a nested subquery becomes an existence join ### What changes were proposed in this pull request? This is a back-port of https://github.com/apache/spark/pull/44193. In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node. For example: ``` Join LeftSemi, ((a#13 = c1#15) OR exists#19) :- Join ExistenceJoin(exists#19), (a#13 = col1#17) : :- LocalRelation [a#13] : +- LocalRelation [col1#17] +- LocalRelation [c1#15] ``` becomes ``` Project [a#13] +- Join LeftSemi, ((a#13 = c1#15) OR exists#19) :- Join ExistenceJoin(exists#19), (a#13 = col1#17) : :- LocalRelation [a#13] : +- LocalRelation [col1#17] +- LocalRelation [c1#15] ``` This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node. ### Why are the changes needed? This query returns an extraneous boolean column when run in spark-sql: ``` create or replace temp view t1(a) as values (1), (2), (3), (7); create or replace temp view t2(c1) as values (1), (2), (3); create or replace temp view t3(col1) as values (3), (9); select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ); 1 false 2 false 3 true ``` (Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization). This query fails when run in either spark-sql or using the Dataset API: ``` select ( select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ) limit 1 ) from range(1); java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis ``` ### Does this PR introduce _any_ user-facing change? No, except for the removal of the extraneous boolean flag and the fix to the error condition. ### How was this patch tested? New unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44219 from bersprockets/schema_change_br34. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 07 December 2023, 03:23:19 UTC
93fef09 [SPARK-46286][DOCS] Document `spark.io.compression.zstd.bufferPool.enabled` This PR adds `spark.io.compression.zstd.bufferPool.enabled` to the documentation. Why: the docs are missing (see https://github.com/apache/spark/pull/31502#issuecomment-774792276) and it is a potential source of regressions. No user-facing change. Tested with a doc build. No generative AI tooling was used. Closes #44207 from yaooqinn/SPARK-46286. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6b6980de451e655ef4b9f63d502b73c09a513d4c) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 06 December 2023, 18:47:57 UTC
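Since it is a core setting, it has to be supplied through SparkConf or spark-defaults rather than the runtime SQL conf; a one-line sketch:

```scala
import org.apache.spark.SparkConf

// Example: explicitly disable the zstd buffer pool (the value shown is illustrative).
val conf = new SparkConf().set("spark.io.compression.zstd.bufferPool.enabled", "false")
```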
757c3a9 [SPARK-46239][CORE] Hide `Jetty` info **What changes were proposed in this pull request?** The PR sets parameters to hide the Jetty version in Spark. **Why are the changes needed?** It prevents remote clients from obtaining the web server's version information over HTTP. **Does this PR introduce any user-facing change?** No **How was this patch tested?** Manual review **Was this patch authored or co-authored using generative AI tooling?** No Closes #44158 from chenyu-opensource/branch-SPARK-46239. Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: chenyu <119398199+chenyu-opensource@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit ff4f59341215b7f3a87e6cd8798d49e25562fcd6) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 December 2023, 22:41:44 UTC
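Not necessarily the parameter this patch sets, but as a general sketch of the Jetty-side mechanism: the standard way to stop Jetty from advertising its version is `HttpConfiguration.setSendServerVersion(false)` on the connector.

```scala
import org.eclipse.jetty.server.{HttpConfiguration, HttpConnectionFactory, Server, ServerConnector}

// Generic Jetty sketch: suppress the "Server: Jetty(x.y.z)" response header.
val server = new Server()
val httpConfig = new HttpConfiguration()
httpConfig.setSendServerVersion(false)
val connector = new ServerConnector(server, new HttpConnectionFactory(httpConfig))
connector.setPort(8080)
server.addConnector(connector)
```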
05b5c9e [SPARK-46092][SQL][3.4] Don't push down Parquet row group filters that overflow This is a cherry-pick from https://github.com/apache/spark/pull/44006 to spark 3.4 ### What changes were proposed in this pull request? This change adds a check for overflows when creating Parquet row group filters on an INT32 (byte/short/int) parquet type to avoid incorrectly skipping row groups if the predicate value doesn't fit in an INT. This can happen if the read schema is specified as LONG, e.g via `.schema("col LONG")` While the Parquet readers don't support reading INT32 into a LONG, the overflow can lead to row groups being incorrectly skipped, bypassing the reader altogether and producing incorrect results instead of failing. ### Why are the changes needed? Reading a parquet file containing INT32 values with a read schema specified as LONG can produce incorrect results today: ``` Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() ``` will return an empty result. The correct result is either: - Failing the query if the parquet reader doesn't support upcasting integers to longs (all parquet readers in Spark today) - Return result `[0]` if the parquet reader supports that upcast (no readers in Spark as of now, but I'm looking into adding this capability). ### Does this PR introduce _any_ user-facing change? The following: ``` Seq(0).toDF("a").write.parquet(path) spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect() ``` produces an (incorrect) empty result before this change. After this change, the read will fail, raising an error about the unsupported conversion from INT to LONG in the parquet reader. ### How was this patch tested? - Added tests to `ParquetFilterSuite` to ensure that no row group filter is created when the predicate value overflows or when the value type isn't compatible with the parquet type - Added test to `ParquetQuerySuite` covering the correctness issue described above. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44155 from johanl-db/SPARK-46092-row-group-skipping-overflow-3.4. Authored-by: Johan Lasperas <johan.lasperas@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 December 2023, 16:59:21 UTC
b8750d5 [SPARK-46182][CORE] Track `lastTaskFinishTime` using the exact task finished event ### What changes were proposed in this pull request? We found a race condition between lastTaskRunningTime and lastShuffleMigrationTime that could lead to a decommissioned executor exiting before all the shuffle blocks have been discovered. The issue could lead to an immediate task retry right after an executor exits, and thus longer query execution time. To fix the issue, we update lastTaskRunningTime only when a task updates its status to finished through the StatusUpdate event. This is better than the current approach (which uses a thread to check the number of running tasks every second), because this way we clearly know whether the shuffle block refresh happened after all tasks finished running, thus resolving the race condition mentioned above. ### Why are the changes needed? To fix a race condition that could lead to shuffle data loss, and thus longer query execution time. ### How was this patch tested? This is a very subtle race condition that is hard to cover with a unit test in the current test framework, and we are confident the change is low risk, so it is verified only by passing all the existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44090 from jiangxb1987/SPARK-46182. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6f112f7b1a50a2b8a59952c69f67dd5f80ab6633) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 December 2023, 06:08:57 UTC
429afdc [SPARK-46189][PS][SQL] Perform comparisons and arithmetic between same types in various Pandas aggregate functions to avoid interpreted mode errors ### What changes were proposed in this pull request? In various Pandas aggregate functions, remove each comparison or arithmetic operation between `DoubleType` and `IntegerType` in `evaluateExpression` and replace it with a comparison or arithmetic operation between `DoubleType` and `DoubleType`. Affected functions are `PandasStddev`, `PandasVariance`, `PandasSkewness`, `PandasKurtosis`, and `PandasCovar`. ### Why are the changes needed? These functions fail in interpreted mode. For example, `evaluateExpression` in `PandasKurtosis` compares a double to an integer: ``` If(n < 4, Literal.create(null, DoubleType) ... ``` This results in a boxed double and a boxed integer getting passed to `SQLOrderingUtil.compareDoubles` which expects two doubles as arguments. The Scala runtime tries to unbox the boxed integer as a double, resulting in an error. Reproduction example: ``` spark.sql("set spark.sql.codegen.wholeStage=false") spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") import numpy as np import pandas as pd import pyspark.pandas as ps pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a") psser = ps.from_pandas(pser) psser.kurt() ``` See Jira (SPARK-46189) for the other reproduction cases. This works fine in codegen mode because the integer is already unboxed and the Java runtime will implicitly cast it to a double. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44099 from bersprockets/unboxing_error. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> (cherry picked from commit 042d8546be5d160e203ad78a8aa2e12e74142338) Signed-off-by: Ruifeng Zheng <ruifengz@apache.org> 01 December 2023, 02:29:27 UTC
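The failure mode can be reproduced outside Spark with plain Scala: a boxed `java.lang.Integer` cannot be unboxed as a `Double`, which is what happens when an integer literal reaches an interpreted comparison that expects two doubles (a minimal illustration, not Spark code).
```
object UnboxingErrorSketch {
  def main(args: Array[String]): Unit = {
    val boxedInt: Any = 4        // boxed as java.lang.Integer
    val boxedDouble: Any = 4.0   // boxed as java.lang.Double

    println(boxedDouble.asInstanceOf[Double])  // fine: unboxes to 4.0

    // Throws ClassCastException ("java.lang.Integer cannot be cast to
    // java.lang.Double"), the same error the interpreted-mode comparison hits.
    try println(boxedInt.asInstanceOf[Double])
    catch { case e: ClassCastException => println(e) }
  }
}
```
Codegen never hits this because it works on unboxed primitives, where widening an int to a double is an implicit JVM conversion.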
639f836 [SPARK-46029][SQL][3.4] Escape the single quote, _ and % for DS V2 pushdown ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/43801 to branch-3.4. ### Why are the changes needed? Escape the single quote, `_` and `%` for DS V2 pushdown, so that values containing these characters are translated into correct SQL instead of being misinterpreted (e.g. as `LIKE` wildcards). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44066 from beliefer/SPARK-46029_backport. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Jiaan Geng <beliefer@163.com> 30 November 2023, 01:51:46 UTC
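For context, a hand-written sketch of the escaping rules involved (illustrative only, not the actual V2 expression SQL builder): single quotes are doubled inside string literals, and `_` / `%` are backslash-escaped before being embedded in a `LIKE` pattern.
```
object PushdownEscapeSketch {
  // Double single quotes so the value is safe inside a '...' SQL literal.
  def escapeStringLiteral(v: String): String = v.replace("'", "''")

  // Escape LIKE wildcards so a literal `_` or `%` is matched as itself.
  def escapeLikePattern(v: String): String =
    v.replace("\\", "\\\\").replace("_", "\\_").replace("%", "\\%")

  def main(args: Array[String]): Unit = {
    println(escapeStringLiteral("it's"))     // it''s
    println(escapeLikePattern("100%_done"))  // 100\%\_done
  }
}
```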
3e910fb [SPARK-46006][YARN][FOLLOWUP] YarnAllocator set target executor number to 0 to cancel pending allocate request when driver stop ### What changes were proposed in this pull request? YarnAllocator sets the target executor number to 0 to cancel pending allocation requests when the driver stops. We now handle this issue in three steps: 1. AllocationFailure should not be treated as exitCausedByApp when the driver is shutting down https://github.com/apache/spark/pull/38622 2. Avoid new allocation requests when sc.stop is stuck https://github.com/apache/spark/pull/43906 3. Cancel pending allocation requests (this PR) https://github.com/apache/spark/pull/44036 ### Why are the changes needed? Avoid unnecessary allocation requests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? MT ### Was this patch authored or co-authored using generative AI tooling? No Closes #44036 from AngersZhuuuu/SPARK-46006-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit dbc8756bdac823be42ed10bc011415f405905497) Signed-off-by: Kent Yao <yao@apache.org> 28 November 2023, 03:04:49 UTC
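A schematic of the intended shutdown behavior (all names below are hypothetical, not the real `YarnAllocator` API): when the driver stops, the target executor count is driven to zero so the next allocation cycle cancels any still-pending container requests instead of fulfilling them.
```
// Hypothetical allocator model, shown only to illustrate the intent.
class AllocatorSketch {
  @volatile private var targetNumExecutors: Int = 10
  @volatile private var pendingRequests: Int = 4

  // Called from the driver's stop path: with a target of 0, the next
  // allocation heartbeat removes all pending container requests.
  def stop(): Unit = {
    targetNumExecutors = 0
    cancelExcessPendingRequests()
  }

  private def cancelExcessPendingRequests(): Unit = {
    val toCancel = math.max(pendingRequests - targetNumExecutors, 0)
    pendingRequests -= toCancel
  }
}
```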
aff5a9f Preparing development version 3.4.3-SNAPSHOT 25 November 2023, 06:40:37 UTC
0c0e7d4 Preparing Spark release v3.4.2-rc1 25 November 2023, 06:40:32 UTC
03dac18 [SPARK-46095][DOCS] Document `REST API` for Spark Standalone Cluster This PR aims to document the `REST API` for the Spark Standalone Cluster, to help users understand Apache Spark features. No. Manual review. A `REST API` section is newly added. **AFTER** <img width="704" alt="Screenshot 2023-11-24 at 4 13 53 PM" src="https://github.com/apache/spark/assets/9700541/a4e09d94-d216-4629-8b37-9d350365a428"> No. Closes #44007 from dongjoon-hyun/SPARK-46095. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 132c1a1f08d6555c950600c102db28b9d7581350) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 November 2023, 01:41:37 UTC
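For readers unfamiliar with the feature being documented: when `spark.master.rest.enabled=true`, the standalone master exposes a REST submission server (default port 6066) with endpoints such as `/v1/submissions/status/<submissionId>`. A minimal status-query sketch, assuming a reachable master named `master-host` and an existing submission ID (both placeholders):
```
import scala.io.Source

object StandaloneRestStatusSketch {
  def main(args: Array[String]): Unit = {
    // Assumes spark.master.rest.enabled=true on the master and an existing
    // submission ID; adjust the host, port and ID for your cluster.
    val submissionId = "driver-20231124000000-0000"
    val url = s"http://master-host:6066/v1/submissions/status/$submissionId"
    val response = Source.fromURL(url).mkString
    println(response)  // JSON payload describing the driver's current state
  }
}
```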
a53c16a [SPARK-46016][DOCS][PS] Fix pandas API support list properly ### What changes were proposed in this pull request? This PR proposes to fix a critical issue in the [Supported pandas API documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html) where many essential APIs such as `DataFrame.max`, `DataFrame.min`, `DataFrame.mean`, and `DataFrame.median` were incorrectly marked as not implemented - marked as "N" - as below: <img width="291" alt="Screenshot 2023-11-24 at 12 37 49 PM" src="https://github.com/apache/spark/assets/44108233/95c5785c-711c-400c-b2ec-0db034e90fd8"> The root cause of this issue was that the script used to generate the support list excluded functions inherited from parent classes. For instance, `CategoricalIndex.max` is actually supported by inheriting the `Index` class but was not directly implemented in `CategoricalIndex`, leading to it being marked as unsupported: <img width="397" alt="Screenshot 2023-11-24 at 12 30 08 PM" src="https://github.com/apache/spark/assets/44108233/90e92996-a88a-4a20-bb0c-4909097e2688"> ### Why are the changes needed? The current documentation inaccurately represents the state of supported pandas API, which could significantly hinder user experience and adoption. By correcting these inaccuracies, we ensure that the documentation reflects the true capabilities of Pandas API on Spark, providing users with reliable and accurate information. ### Does this PR introduce _any_ user-facing change? No. This PR only updates the documentation to accurately reflect the current state of supported pandas API. ### How was this patch tested? Manually build documentation, and check if the supported pandas API list is correctly generated as below: <img width="299" alt="Screenshot 2023-11-24 at 12 36 31 PM" src="https://github.com/apache/spark/assets/44108233/a2da0f0b-0973-45cb-b22d-9582bbeb51b5"> ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43996 from itholic/fix_supported_api_gen. Authored-by: Haejoon Lee <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 132bb63a897f4f4049f34deefc065ed3eac6a90f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 November 2023, 10:39:02 UTC
e87d166 [SPARK-46062][SQL] Sync the isStreaming flag between CTE definition and reference This PR proposes to sync the flag `isStreaming` from the CTE definition to the CTE reference. The essential issue is that the CTE reference node cannot determine the flag `isStreaming` by itself, so it never has a proper value and always takes the default, as it does not have such a parameter in its constructor. The other flag `resolved` is already handled this way, and we need to do the same for `isStreaming`. Once we add the parameter to the constructor, we will also need to make sure the flag is in sync with the CTE definition. We have a rule `ResolveWithCTE` doing the sync, hence we add the logic to sync the flag `isStreaming` as well. The bug may impact some rules which behave differently depending on the isStreaming flag. It would no longer be a problem once the CTE reference is replaced with the CTE definition at some point in the optimization phase, but all rules in the analyzer and optimizer triggered before that rule takes effect may misbehave based on an incorrect isStreaming flag. No. New UT. No. Closes #43966 from HeartSaVioR/SPARK-46062. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 43046631a5d4ac7201361a00473cc87fa52ab5a7) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 23 November 2023, 14:32:48 UTC
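A self-contained toy model of the sync (the case classes below are made up for illustration and are not Catalyst's actual `CTERelationDef`/`CTERelationRef` nodes): resolving a reference against its definition copies both the `resolved` and `isStreaming` flags.
```
object CteSyncSketch {
  // Simplified stand-ins for a CTE definition and a reference to it.
  case class CteDef(id: Long, isStreaming: Boolean)
  case class CteRef(id: Long, resolved: Boolean = false, isStreaming: Boolean = false)

  // The resolution step looks up the definition and syncs both flags,
  // mirroring what the analyzer rule does for `resolved` and, after this
  // change, for `isStreaming` as well.
  def resolveRefs(defs: Map[Long, CteDef], refs: Seq[CteRef]): Seq[CteRef] =
    refs.map { ref =>
      defs.get(ref.id) match {
        case Some(d) => ref.copy(resolved = true, isStreaming = d.isStreaming)
        case None    => ref
      }
    }

  def main(args: Array[String]): Unit = {
    val defs = Map(1L -> CteDef(1L, isStreaming = true))
    println(resolveRefs(defs, Seq(CteRef(1L))))
    // List(CteRef(1,true,true)): the reference now carries the right flag
  }
}
```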