https://github.com/apache/spark

Revision Author Date Message Commit Date
c8c657b Preparing Spark release v3.3.0-rc2 16 May 2022, 05:42:28 UTC
386c756 [SPARK-39186][PYTHON] Make pandas-on-Spark's skew consistent with pandas ### What changes were proposed in this pull request? The logic for computing skewness differs between Spark SQL and pandas: Spark SQL: [`sqrt(n) * m3 / sqrt(m2 * m2 * m2)`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L304) pandas: [`(count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2**1.5)`](https://github.com/pandas-dev/pandas/blob/main/pandas/core/nanops.py#L1221) ### Why are the changes needed? To make skew consistent with pandas. ### Does this PR introduce _any_ user-facing change? Yes, the logic to compute skew was changed. ### How was this patch tested? Added a unit test. Closes #36549 from zhengruifeng/adjust_pandas_skew. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7e4519c9a8ba35958ef6d408be3ca4e97917c965) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 May 2022, 00:31:09 UTC
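To make the difference concrete, here is a standalone Scala sketch that evaluates both estimators from the summed central moments on an arbitrary sample (the sample values and variable names are illustrative only):
```scala
// m2 = sum((x - mean)^2), m3 = sum((x - mean)^3), as used by both implementations linked above
val xs = Seq(1.0, 2.0, 2.0, 3.0, 10.0)
val n = xs.size.toDouble
val mean = xs.sum / n
val m2 = xs.map(x => math.pow(x - mean, 2)).sum
val m3 = xs.map(x => math.pow(x - mean, 3)).sum

val sparkSkew  = math.sqrt(n) * m3 / math.sqrt(m2 * m2 * m2)                  // old Spark SQL estimator
val pandasSkew = (n * math.sqrt(n - 1) / (n - 2)) * (m3 / math.pow(m2, 1.5))  // pandas (bias-adjusted) estimator

// the two values differ by the finite-sample correction factor, hence the inconsistency
println(f"spark=$sparkSkew%.6f pandas=$pandasSkew%.6f")
```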
ab1d986 [SPARK-37544][SQL] Correct date arithmetic in sequences ### What changes were proposed in this pull request? Change `InternalSequenceBase` to pass a time-zone aware value to `DateTimeUtils#timestampAddInterval`, rather than a time-zone agnostic value, when performing `Date` arithmetic. ### Why are the changes needed? The following query gets the wrong answer if run in the America/Los_Angeles time zone: ``` spark-sql> select sequence(date '2021-01-01', date '2022-01-01', interval '3' month) x; [2021-01-01,2021-03-31,2021-06-30,2021-09-30,2022-01-01] Time taken: 0.664 seconds, Fetched 1 row(s) spark-sql> ``` The answer should be ``` [2021-01-01,2021-04-01,2021-07-01,2021-10-01,2022-01-01] ``` `InternalSequenceBase` converts the date to micros by multiplying days by micros per day. This converts the date into a time-zone agnostic timestamp. However, `InternalSequenceBase` uses `DateTimeUtils#timestampAddInterval` to perform the arithmetic, and that function assumes a _time-zone aware_ timestamp. One simple fix would be to call `DateTimeUtils#timestampNTZAddInterval` instead for date arithmetic. However, Spark date arithmetic is typically time-zone aware (see the comment in the test added by this PR), so this PR converts the date to a time-zone aware value before calling `DateTimeUtils#timestampAddInterval`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #36546 from bersprockets/date_sequence_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 14ee0d8f04f218ad61688196a0b984f024151468) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 May 2022, 00:26:26 UTC
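The off-by-one-day mechanics described above can be reproduced with plain `java.time`, outside Spark; this sketch only mirrors the description (the zone and dates come from the example query):
```scala
import java.time.{LocalDate, ZoneId, ZoneOffset}

val la = ZoneId.of("America/Los_Angeles")

// Multiplying days by micros-per-day effectively anchors the date at midnight UTC...
val tzAgnostic = LocalDate.of(2021, 1, 1).atStartOfDay(ZoneOffset.UTC).toInstant

// ...but the "+ interval 3 months" step is applied as if the value were a timestamp
// in the session time zone, where that instant is still 2020-12-31 16:00.
val stepped = tzAgnostic.atZone(la).plusMonths(3)

// Truncating back to whole days since the epoch (i.e. UTC) yields the wrong element.
println(stepped.toInstant.atZone(ZoneOffset.UTC).toLocalDate) // 2021-03-31, not 2021-04-01
```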
2672624 [SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors ### What changes were proposed in this pull request? Document PySpark(SQL, pandas API on Spark, and Py4J) common exceptions/errors and respective solutions. ### Why are the changes needed? Make PySpark debugging easier. There are common exceptions/errors in PySpark SQL, pandas API on Spark, and Py4J. Documenting exceptions and respective solutions help users debug PySpark. ### Does this PR introduce _any_ user-facing change? No. Document change only. ### How was this patch tested? Manual test. <img width="1019" alt="image" src="https://user-images.githubusercontent.com/47337188/165145874-b0de33b1-835a-459d-9062-94086e62e254.png"> Please refer to https://github.com/apache/spark/blob/7a1c7599a21cbbe2778707b72643cf98ac601ab1/python/docs/source/development/debugging.rst#common-exceptions--errors for the whole rendered page. Closes #36267 from xinrong-databricks/common_err. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f940d7adfd6d071bc3bdcc406e01263a7f03e955) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 May 2022, 00:25:10 UTC
2db9d78 [SPARK-39174][SQL] Catalog loading swallows the missing class name for ClassNotFoundException ### What changes were proposed in this pull request? This PR captures the actual missing class name when catalog loading hits a ClassNotFoundException. ### Why are the changes needed? ClassNotFoundException can also occur because of missing dependencies; we should not always report that the catalog class itself is missing. ### Does this PR introduce _any_ user-facing change? Yes, when loading catalogs and a ClassNotFoundException occurs, it shows the correct missing class. ### How was this patch tested? New test added. Closes #36534 from yaooqinn/SPARK-39174. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1b37f19876298e995596a30edc322c856ea1bbb4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 May 2022, 10:45:34 UTC
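The gist of the distinction can be sketched in a few lines of standalone Scala; the helper name and message wording are illustrative, not the actual Spark code. `ClassNotFoundException#getMessage` carries the name of the class that failed to load, which may be a dependency rather than the configured plugin class:
```scala
def describeCatalogLoadFailure(pluginClass: String, e: ClassNotFoundException): String = {
  val missing = e.getMessage
  if (missing == pluginClass) {
    // the catalog implementation itself is absent from the classpath
    s"Cannot find catalog plugin class: $pluginClass"
  } else {
    // the plugin class exists but one of its dependencies could not be loaded
    s"Cannot load catalog plugin class $pluginClass; missing dependency: $missing"
  }
}
```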
30834b8 [SPARK-39157][SQL] H2Dialect should override getJDBCType so as to make the data types correct ### What changes were proposed in this pull request? Currently, `H2Dialect` does not implement `getJDBCType` of `JdbcDialect`, so DS V2 push-down throws the exception shown below: ``` Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13) (jiaan-gengdembp executor driver): org.h2.jdbc.JdbcSQLNonTransientException: Unknown data type: "STRING"; SQL statement: SELECT "DEPT","NAME","SALARY","BONUS","IS_MANAGER" FROM "test"."employee" WHERE ("BONUS" IS NOT NULL) AND ("DEPT" IS NOT NULL) AND (CAST("BONUS" AS string) LIKE '%30%') AND (CAST("DEPT" AS byte) > 1) AND (CAST("DEPT" AS short) > 1) AND (CAST("BONUS" AS decimal(20,2)) > 1200.00) [50004-210] ``` H2Dialect should implement `getJDBCType` of `JdbcDialect`. ### Why are the changes needed? Make the H2 data types correct. ### Does this PR introduce _any_ user-facing change? 'Yes'. Fixes a bug in `H2Dialect`. ### How was this patch tested? New tests. Closes #36516 from beliefer/SPARK-39157. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fa3f096e02d408fbeab5f69af451ef8bc8f5b3db) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 May 2022, 14:01:27 UTC
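For orientation, overriding `getJDBCType` in a dialect looks roughly like the sketch below. The specific H2 type mapping here is an assumption for illustration; the mapping actually added by the PR may differ:
```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

object ExampleH2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")

  // Render Catalyst types as SQL types H2 understands, instead of names like "STRING"
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("CLOB", Types.CLOB))         // assumed mapping
    case ByteType   => Some(JdbcType("TINYINT", Types.TINYINT))   // assumed mapping
    case ShortType  => Some(JdbcType("SMALLINT", Types.SMALLINT)) // assumed mapping
    case _          => None // fall back to the common JDBC mappings
  }
}
```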
e743e68 [SPARK-39178][CORE] SparkFatalException should show root cause when print error stack ### What changes were proposed in this pull request? Our user meet an case when running broadcast, throw `SparkFatalException`, but in error stack, it don't show the error case. ### Why are the changes needed? Make exception more clear ### Does this PR introduce _any_ user-facing change? User can got root cause when application throw `SparkFatalException`. ### How was this patch tested? For ut ``` test("xxxx") { throw new SparkFatalException( new OutOfMemoryError("Not enough memory to build and broadcast the table to all " + "worker nodes. As a workaround, you can either disable broadcast by setting " + s"driver memory by setting ${SparkLauncher.DRIVER_MEMORY} to a higher value.") .initCause(null)) } ``` Before this pr: ``` [info] org.apache.spark.util.SparkFatalException: [info] at org.apache.spark.SparkContextSuite.$anonfun$new$1(SparkContextSuite.scala:59) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) [info] at org.scalatest.Suite.run(Suite.scala:1112) ``` After this pr: ``` [info] org.apache.spark.util.SparkFatalException: java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting driver memory by setting spark.driver.memory to a higher value. 
[info] at org.apache.spark.SparkContextSuite.$anonfun$new$1(SparkContextSuite.scala:59) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) [info] at org.scalatest.Suite.run(Suite.scala:1112) [info] at org.scalatest.Suite.run$(Suite.scala:1094) [info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:64) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:64) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:748) 
[info] Cause: java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting driver memory by setting spark.driver.memory to a higher value. [info] at org.apache.spark.SparkContextSuite.$anonfun$new$1(SparkContextSuite.scala:58) ``` Closes #36539 from AngersZhuuuu/SPARK-39178. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit d7317b03e975f8dc1a8c276dd0a931e00c478717) Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 May 2022, 13:47:21 UTC
1a49de6 [SPARK-39177][SQL] Provide query context on map key not exists error when WSCG is off ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/36525, this PR provides query context for "map key not exists" runtime error when WSCG is off. ### Why are the changes needed? Enhance the runtime error query context for "map key not exists" runtime error. After changes, it works when the whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #36538 from gengliangwang/fixMapKeyContext. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 1afddf407436c3b315ec601fab5a4a1b2028e672) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 May 2022, 13:45:20 UTC
1372f31 [SPARK-39164][SQL][3.3] Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in actions ### What changes were proposed in this pull request? In the PR, I propose to catch `java.lang.IllegalStateException` and `java.lang.AssertionError` (raised by asserts), and wrap them by Spark's exception w/ the `INTERNAL_ERROR` error class. The modification affects only actions so far. This PR affects the case of missing bucket file. After the changes, Spark throws `SparkException` w/ `INTERNAL_ERROR` instead of `IllegalStateException`. Since this is not Spark's illegal state, the exception should be replaced by another runtime exception. Created the ticket SPARK-39163 to fix this. This is a backport of https://github.com/apache/spark/pull/36500. ### Why are the changes needed? To improve user experience with Spark SQL and unify representation of internal errors by using error classes like for other errors. Usually, users shouldn't observe asserts and illegal states, but even if such situation happens, they should see errors in the same way as other errors (w/ error class `INTERNAL_ERROR`). ### Does this PR introduce _any_ user-facing change? Yes. At least, in one particular case, see the modified test suites and SPARK-39163. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *.BucketedReadWithoutHiveSupportSuite" $ build/sbt "test:testOnly *.AdaptiveQueryExecSuite" $ build/sbt "test:testOnly *.WholeStageCodegenSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit f5c3f0c228fef7808d1f927e134595ddd4d31723) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36533 from MaxGekk/class-internal-error-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 May 2022, 13:43:53 UTC
c2bd7ba [SPARK-39165][SQL][3.3] Replace `sys.error` by `IllegalStateException` ### What changes were proposed in this pull request? Replace all invokes of `sys.error()` by throwing of `IllegalStateException` in the `sql` namespace. This is a backport of https://github.com/apache/spark/pull/36524. ### Why are the changes needed? In the context of wrapping all internal errors like asserts/illegal state exceptions (see https://github.com/apache/spark/pull/36500), it is impossible to distinguish `RuntimeException` of `sys.error()` from Spark's exceptions like `SparkRuntimeException`. The last one can be propagated to the user space but `sys.error` exceptions shouldn't be visible to users in regular cases. ### Does this PR introduce _any_ user-facing change? No, shouldn't. sys.error shouldn't propagate exception to user space in regular cases. ### How was this patch tested? By running the existing test suites. Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 95c7efd7571464d8adfb76fb22e47a5816cf73fb) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36532 from MaxGekk/sys_error-internal-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 May 2022, 09:47:53 UTC
27c03e5 [SPARK-39175][SQL] Provide runtime error query context for Cast when WSCG is off ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/36525, this PR provides runtime error query context for the Cast expression when WSCG is off. ### Why are the changes needed? Enhance the runtime error query context of Cast expression. After changes, it works when the whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #36535 from gengliangwang/fixCastContext. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit cdd33e83c3919c4475e2e1ef387acb604bea81b9) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 May 2022, 09:46:45 UTC
63f20c5 [SPARK-39166][SQL] Provide runtime error query context for binary arithmetic when WSCG is off ### What changes were proposed in this pull request? Currently, for most of the cases, the project https://issues.apache.org/jira/browse/SPARK-38615 is able to show where the runtime errors happen within the original query. However, after trying on production, I found that the following queries won't show where the divide by 0 error happens ``` create table aggTest(i int, j int, k int, d date) using parquet insert into aggTest values(1, 2, 0, date'2022-01-01') select sum(j)/sum(k),percentile(i, 0.9) from aggTest group by d ``` With `percentile` function in the query, the plan can't execute with whole stage codegen. Thus the child plan of `Project` is serialized to executors for execution, from ProjectExec: ``` protected override def doExecute(): RDD[InternalRow] = { child.execute().mapPartitionsWithIndexInternal { (index, iter) => val project = UnsafeProjection.create(projectList, child.output) project.initialize(index) iter.map(project) } } ``` Note that the `TreeNode.origin` is not serialized to executors since `TreeNode` doesn't extend the trait `Serializable`, which results in an empty query context on errors. For more details, please read https://issues.apache.org/jira/browse/SPARK-39140 A dummy fix is to make `TreeNode` extend the trait `Serializable`. However, it can be performance regression if the query text is long (every `TreeNode` carries it for serialization). A better fix is to introduce a new trait `SupportQueryContext` and materialize the truncated query context for special expressions. This PR targets on binary arithmetic expressions only. I will create follow-ups for the remaining expressions which support runtime error query context. ### Why are the changes needed? Improve the error context framework and make sure it works when whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #36525 from gengliangwang/serializeContext. Lead-authored-by: Gengliang Wang <gengliang@apache.org> Co-authored-by: Gengliang Wang <ltnwgl@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit e336567c8a9704b500efecd276abaf5bd3988679) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 May 2022, 05:50:40 UTC
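A very rough sketch of the `SupportQueryContext` approach described above, assuming the trait simply materializes the truncated context string as an ordinary serializable field at construction time (member names are illustrative and may not match the trait that was actually added):
```scala
import org.apache.spark.sql.catalyst.expressions.Expression

trait SupportQueryContext extends Expression with Serializable {
  // computed once on the driver, where TreeNode.origin is available, and then
  // shipped to executors as a plain field of the expression
  protected var queryContext: String = initQueryContext()

  protected def initQueryContext(): String
}
```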
3cc47a1 Revert "[SPARK-36837][BUILD] Upgrade Kafka to 3.1.0" ### What changes were proposed in this pull request? This PR aims to revert commit 973ea0f06e72ab64574cbf00e095922a3415f864 from `branch-3.3` to exclude it from Apache Spark 3.3 scope. ### Why are the changes needed? SPARK-36837 tried to use Apache Kafka 3.1.0 at Apache Spark 3.3.0 and initially wanted to upgrade to Apache Kafka 3.3.1 before the official release. However, we can use the stable Apache Kafka 2.8.1 at Spark 3.3.0 and wait for more proven versions, Apache Kafka 3.2.x or 3.3.x. Apache Kafka 3.2.0 vote is already passed and will arrive. - https://lists.apache.org/thread/9k5sysvchg98lchv2rvvvq6xhpgk99cc Apache Kafka 3.3.0 release discussion is started too. - https://lists.apache.org/thread/cmol5bcf011s1xl91rt4ylb1dgz2vb1r ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #36517 from dongjoon-hyun/SPARK-36837-REVERT. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 May 2022, 17:55:21 UTC
0baa5c7 [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject This PR fixes a perf regression in Spark 3.3 caused by https://github.com/apache/spark/pull/33958 In `CollapseProject`, we want to treat `CreateStruct` and its friends as cheap expressions if they are only referenced by `ExtractValue`, but the check is too conservative, which causes a perf regression. This PR fixes this check. Now "extract-only" means: the attribute only appears as a child of `ExtractValue`, but the consumer expression can be in any shape. Fixes perf regression No new tests Closes #36510 from cloud-fan/bug. Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 547f032d04bd2cf06c54b5a4a2f984f5166beb7d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 May 2022, 04:59:36 UTC
70becf2 [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion ### What changes were proposed in this pull request? Access to JVM through passed-in GatewayClient during type conversion. ### Why are the changes needed? In customized type converters, we may utilize the passed-in GatewayClient to access JVM, rather than rely on the `SparkContext._jvm`. That's [how](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_collections.py#L508) Py4J explicit converters access JVM. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #36504 from xinrong-databricks/gateway_client_jvm. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 92fcf214c107358c1a70566b644cec2d35c096c0) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2022, 03:22:19 UTC
ebe4252 [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode ### What changes were proposed in this pull request? This is a bug in the command legacy mode, as it does not fully restore the legacy behavior. The legacy v1 SHOW DATABASES command does not quote the database names. This PR fixes it. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No change by default; if legacy mode is turned on, the SHOW DATABASES command won't quote the database names. ### How was this patch tested? New tests Closes #36508 from cloud-fan/regression. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3094e495095635f6c9e83f4646d3321c2a9311f4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2022, 03:18:35 UTC
65dd727 [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index ### What changes were proposed in this pull request? Remove outdated statements on distributed-sequence default index. ### Why are the changes needed? Since distributed-sequence default index is updated to be enforced only while execution, there are stale statements in documents to be removed. ### Does this PR introduce _any_ user-facing change? No. Doc change only. ### How was this patch tested? Manual tests. Closes #36513 from xinrong-databricks/defaultIndexDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f37150a5549a8f3cb4c1877bcfd2d1459fc73cac) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2022, 00:13:34 UTC
e4bb341 Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index" This reverts commit f75c00da3cf01e63d93cedbe480198413af41455. 12 May 2022, 00:12:13 UTC
f75c00d [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index Remove outdated statements on distributed-sequence default index. Since distributed-sequence default index is updated to be enforced only while execution, there are stale statements in documents to be removed. No. Doc change only. Manual tests. Closes #36513 from xinrong-databricks/defaultIndexDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit cec1e7b4e68deac321f409d424a3acdcd4cb91be) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2022, 00:07:49 UTC
8608baa [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/35204 . https://github.com/apache/spark/pull/35204 introduced a potential regression: it removes the "location" table property from `V1Table` if the table is not external. The intention was to avoid putting the LOCATION clause for managed tables in `ShowCreateTableExec`. However, if we use the v2 DESCRIBE TABLE command by default in the future, this will bring a behavior change and v2 DESCRIBE TABLE command won't print the table location for managed tables. This PR fixes this regression by using a different idea to fix the SHOW CREATE TABLE issue: 1. introduce a new reserved table property `is_managed_location`, to indicate that the location is managed by the catalog, not user given. 2. `ShowCreateTableExec` only generates the LOCATION clause if the "location" property is present and is not managed. ### Why are the changes needed? avoid a potential regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests. We can add a test when we use v2 DESCRIBE TABLE command by default. Closes #36498 from cloud-fan/regression. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fa2bda5c4eabb23d5f5b3e14ccd055a2453f579f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 May 2022, 06:50:11 UTC
7d57577 [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost ### What changes were proposed in this pull request? Add a new leaf-like node `LeafNodeWithoutStats` and apply it to the following: - ResolvedDBObjectName - ResolvedNamespace - ResolvedTable - ResolvedView - ResolvedNonPersistentFunc - ResolvedPersistentFunc ### Why are the changes needed? We enable v2 commands by default in branch 3.3.0 (`spark.sql.legacy.useV1Command`). However, there is a behavior change between v1 and v2 commands. - v1 command: we resolve the logical plan to a command in the analyzer phase via `ResolveSessionCatalog` - v2 command: we resolve the logical plan to a v2 command in the physical phase via `DataSourceV2Strategy` For cost explain mode, we call `LogicalPlanStats.stats` on the optimized plan, so there is a gap between v1 and v2 commands. Unfortunately, the logical plan of a v2 command contains a `LeafNode` which does not override `computeStats`. As a result, there is an error running SQL such as: ```sql set spark.sql.ui.explainMode=cost; show tables; ``` ``` java.lang.UnsupportedOperationException: at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats$(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.analysis.ResolvedNamespace.computeStats(v2ResolutionPlans.scala:155) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:49) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30) ``` ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Added a test. Closes #36488 from ulysses-you/SPARK-39112. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 06fd340daefd67a3e96393539401c9bf4b3cbde9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 May 2022, 06:32:21 UTC
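A sketch of what such a stats-carrying leaf trait could look like; the placeholder statistics value is an assumption for illustration (these resolved-only nodes never reach execution, so any dummy size works), and the real trait may differ:
```scala
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}

trait LeafNodeWithoutStats extends LeafNode {
  // override LeafNode's default, which throws UnsupportedOperationException,
  // so that cost explain mode can visit these resolved-only nodes safely
  override def computeStats(): Statistics = Statistics(sizeInBytes = 1)
}
```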
1738f48 [MINOR][DOCS][PYTHON] Fixes pandas import statement in code example ### What changes were proposed in this pull request? In 'Applying a Function' section, example code import statement `as pd` was added ### Why are the changes needed? In 'Applying a Function' section, example code import statement needs to have a `as pd` because in function definitions we're using `pd`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Documentation change only. Continues to be markdown compliant. Closes #36502 from snifhex/patch-1. Authored-by: Sachin Tripathi <sachintripathis84@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 60edc5758e82e76a37ce5a5f98e870fac587b656) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 11 May 2022, 00:13:30 UTC
a4b420c [SPARK-39106][SQL] Correct conditional expression constant folding - add a try-catch when we fold children inside a `ConditionalExpression` if it is not foldable - mark `CaseWhen` and `If` as foldable if their children are foldable For a conditional expression, we should add a try-catch to partially fold the constants inside its children, because some branches may not be evaluated at runtime. For example, if c1 or c2 is not null, the last branch is never hit: ```sql SELECT COALESCE(c1, c2, 1/0); ``` Besides, for CaseWhen and If, we should mark them as foldable if their children are foldable. This is safe since both the non-codegen and codegen code paths already respect the evaluation order. Yes, bug fix. Added more tests in a SQL file. Closes #36468 from ulysses-you/SPARK-39106. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 08a4ade8ba881589da0741b3ffacd3304dc1e9b5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2022, 12:43:24 UTC
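The "fold only if evaluation succeeds" idea can be sketched as below; the helper name is illustrative and this is not the optimizer rule itself:
```scala
import scala.util.control.NonFatal
import org.apache.spark.sql.catalyst.expressions.{EmptyRow, Expression, Literal}

// Fold a foldable child to a literal only when evaluating it does not throw; a
// branch like 1/0 in COALESCE(c1, c2, 1/0) may never be taken at runtime, so a
// failure here must not fail the whole query at optimization time.
def tryFoldChild(child: Expression): Expression = {
  if (!child.foldable) {
    child
  } else {
    try Literal.create(child.eval(EmptyRow), child.dataType)
    catch { case NonFatal(_) => child } // leave it as-is; it may never be evaluated
  }
}
```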
c6584af [SPARK-39135][SQL] DS V2 aggregate partial push-down should support group by without aggregate functions ### What changes were proposed in this pull request? Currently, the SQL shown below is not supported by DS V2 aggregate partial push-down: `select key from tab group by key` ### Why are the changes needed? Make DS V2 aggregate partial push-down support group by without aggregate functions. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New tests Closes #36492 from beliefer/SPARK-39135. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit decd393e23406d82b47aa75c4d24db04c7d1efd6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2022, 09:37:39 UTC
7838a14 [MINOR][INFRA][3.3] Add ANTLR generated files to .gitignore ### What changes were proposed in this pull request? Add git ignore entries for files created by ANTLR. This is a backport of #35838. ### Why are the changes needed? To avoid developers from accidentally adding those files when working on parser/lexer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By making sure those files are ignored by git status when they exist. Closes #36489 from yutoacts/minor_gitignore_3.3. Authored-by: Yuto Akutsu <rhythmy.dev@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> 10 May 2022, 02:50:44 UTC
c759151 [SPARK-39107][SQL] Account for empty string input in regex replace ### What changes were proposed in this pull request? When trying to perform a regex replace, account for the possibility of having empty strings as input. ### Why are the changes needed? https://github.com/apache/spark/pull/29891 was merged to address https://issues.apache.org/jira/browse/SPARK-30796 and introduced a bug that would not allow regex matching on empty strings, as it would account for position within substring but not consider the case where input string has length 0 (empty string) From https://issues.apache.org/jira/browse/SPARK-39107 there is a change in behavior between spark versions. 3.0.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | <empty>| +---+--------+ ``` 3.1.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | | +---+--------+ ``` The 3.0.2 outcome is the expected and correct one ### Does this PR introduce _any_ user-facing change? Yes compared to spark 3.2.1, as it brings back the correct behavior when trying to regex match empty strings, as shown in the example above. ### How was this patch tested? Added special casing test in `RegexpExpressionsSuite.RegexReplace` with empty string replacement. Closes #36457 from LorenzoMartini/lmartini/fix-empty-string-replace. Authored-by: Lorenzo Martini <lmartini@palantir.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 731aa2cdf8a78835621fbf3de2d3492b27711d1a) Signed-off-by: Sean Owen <srowen@gmail.com> 10 May 2022, 00:44:26 UTC
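The core of the regression is easy to see with plain Java regex outside Spark: a zero-length input is still a legitimate candidate for matching, so the replace logic must not skip it. A standalone illustration:
```scala
import java.util.regex.Pattern

val m = Pattern.compile("^$").matcher("")
assert(m.find())                         // the empty string does match the anchor-only pattern
println("".replaceAll("^$", "<empty>"))  // prints: <empty>
```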
cf13262 [SPARK-38939][SQL][FOLLOWUP] Replace named parameter with comment in ReplaceColumns ### What changes were proposed in this pull request? This PR aims to replace named parameter with comment in `ReplaceColumns`. ### Why are the changes needed? #36252 changed signature of deleteColumn#**TableChange.java**, but this PR breaks sbt compilation in k8s integration test. ```shell > build/sbt -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=r -Dspark.kubernetes.test.imageRepo=kubespark "kubernetes-integration-tests/test" [error] /Users/IdeaProjects/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala:147:45: not found: value ifExists [error] TableChange.deleteColumn(Array(name), ifExists = false) [error] ^ [error] /Users/IdeaProjects/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala:159:19: value ++ is not a member of Array[Nothing] [error] deleteChanges ++ addChanges [error] ^ [error] two errors found [error] (catalyst / Compile / compileIncremental) Compilation failed ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the GA and k8s integration test. Closes #36487 from dcoliversun/SPARK-38939. Authored-by: Qian.Sun <qian.sun2020@gmail.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit 16b5124d75dc974c37f2fd87c78d231f8a3bf772) Signed-off-by: huaxingao <huaxin_gao@apple.com> 09 May 2022, 17:00:44 UTC
a52a245 [SPARK-38786][SQL][TEST] Bug in StatisticsSuite 'change stats after add/drop partition command' ### What changes were proposed in this pull request? https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979 It should be `partDir2` instead of `partDir1`. Looks like it is a copy paste bug. ### Why are the changes needed? Due to this test bug, the drop command was dropping a wrong (`partDir1`) underlying file in the test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added extra underlying file location check. Closes #36075 from kazuyukitanimura/SPARK-38786. Authored-by: Kazuyuki Tanimura <ktanimura@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> (cherry picked from commit a6b04f007c07fe00637aa8be33a56f247a494110) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 May 2022, 00:21:59 UTC
e9ee2c8 [SPARK-37618][CORE][FOLLOWUP] Support cleaning up shuffle blocks from external shuffle service ### What changes were proposed in this pull request? Fix test failure in build. Depending on the umask of the process running tests (which is typically inherited from the user's default umask), the group writable bit for the files/directories could be set or unset. The test was assuming that by default the umask will be restrictive (and so files/directories wont be group writable). Since this is not a valid assumption, we use jnr to change the umask of the process to be more restrictive - so that the test can validate the behavior change - and reset it back once the test is done. ### Why are the changes needed? Fix test failure in build ### Does this PR introduce _any_ user-facing change? No Adds jnr as a test scoped dependency, which does not bring in any other new dependency (asm is already a dep in spark). ``` [INFO] +- com.github.jnr:jnr-posix:jar:3.0.9:test [INFO] | +- com.github.jnr:jnr-ffi:jar:2.0.1:test [INFO] | | +- com.github.jnr:jffi:jar:1.2.7:test [INFO] | | +- com.github.jnr:jffi:jar:native:1.2.7:test [INFO] | | +- org.ow2.asm:asm:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-commons:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-analysis:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-tree:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-util:jar:5.0.3:test [INFO] | | \- com.github.jnr:jnr-x86asm:jar:1.0.2:test [INFO] | \- com.github.jnr:jnr-constants:jar:0.8.6:test ``` ### How was this patch tested? Modification to existing test. Tested on Linux, skips test when native posix env is not found. Closes #36473 from mridulm/fix-SPARK-37618-test. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 317407171cb36439c371153cfd45c1482bf5e425) Signed-off-by: Sean Owen <srowen@gmail.com> 08 May 2022, 13:11:31 UTC
feba335 [SPARK-39083][CORE] Fix race condition between update and clean app data ### What changes were proposed in this pull request? make `cleanAppData` atomic to prevent race condition between update and clean app data. When the race condition happens, it could lead to a scenario when `cleanAppData` delete the entry of ApplicationInfoWrapper for an application right after it has been updated by `mergeApplicationListing`. So there will be cases when the HS Web UI displays `Application not found` for applications whose logs does exist. #### Error message ``` 22/04/29 17:16:21 DEBUG FsHistoryProvider: New/updated attempts found: 1 ArrayBuffer(viewfs://iu/log/spark3/application_1651119726430_138107_1) 22/04/29 17:16:21 INFO FsHistoryProvider: Parsing viewfs://iu/log/spark3/application_1651119726430_138107_1 for listing data... 22/04/29 17:16:21 INFO FsHistoryProvider: Looking for end event; skipping 10805037 bytes from viewfs://iu/log/spark3/application_1651119726430_138107_1... 22/04/29 17:16:21 INFO FsHistoryProvider: Finished parsing viewfs://iu/log/spark3/application_1651119726430_138107_1 22/04/29 17:16:21 ERROR Utils: Uncaught exception in thread log-replay-executor-7 java.util.NoSuchElementException at org.apache.spark.util.kvstore.InMemoryStore.read(InMemoryStore.java:85) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkAndCleanLog$3(FsHistoryProvider.scala:927) at scala.Option.foreach(Option.scala:407) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkAndCleanLog$1(FsHistoryProvider.scala:926) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:2032) at org.apache.spark.deploy.history.FsHistoryProvider.checkAndCleanLog(FsHistoryProvider.scala:916) at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:712) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$15(FsHistoryProvider.scala:576) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` #### Background Currently, the HS runs the `checkForLogs` to build the application list based on the current contents of the log directory for every 10 seconds by default. - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L296-L299 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L472 In each turn of execution, this method scans the specified logDir and parse the log files to update its KVStore: - detect new updated/added files to process : https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L574-L578 - detect stale data to remove: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600 These 2 operations are executed in different threads as `submitLogProcessTask` uses `replayExecutor` to submit tasks. https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1389-L1401 ### When does the bug happen? 
`Application not found` error happens in the following scenario: In the first run of `checkForLogs`, it detected a newly-added log `viewfs://iu/log/spark3/AAA_1.inprogress` (log of an in-progress application named AAA). So it will add 2 entries to the KVStore - one entry of key-value as the key is the logPath (`viewfs://iu/log/spark3/AAA_1.inprogress`) and the value is an instance of LogInfo represented the log - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L495-L505 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L545-L552 - one entry of key-value as the key is the applicationId (`AAA`) and the value is an instance of ApplicationInfoWrapper holding the information of the application. - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L825 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172 In the next run of `checkForLogs`, now the AAA application has finished, the log `viewfs://iu/log/spark3/AAA_1.inprogress` has been deleted and a new log `viewfs://iu/log/spark3/AAA_1` is created. So `checkForLogs` will do the following 2 things in 2 different threads: - Thread 1: parsing the new log `viewfs://iu/log/spark3/AAA_1` and update data in its KVStore - add a new entry of key: `viewfs://iu/log/spark3/AAA_1` and value: an instance of LogInfo represented the log - updated the entry with key=applicationId (`AAA`) with new value of an instance of ApplicationInfoWrapper (for example: the isCompleted flag now change from false to true) - Thread 2: data related to `viewfs://iu/log/spark3/AAA_1.inprogress` is now considered as stale and it must be deleted. - clean App data for `viewfs://iu/log/spark3/AAA_1.inprogress` https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600 - Inside `cleanAppData`, first it loads the latest information of `ApplicationInfoWrapper` from the KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L632 For most of the time, when this line is executed, Thread 1 already finished `updating the entry with key=applicationId (AAA) with new value of an instance of ApplicationInfoWrapper` so this condition https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L637 will be evaluated as false, so `isStale` will be false. However, in some rare cases, when Thread1 does not finish the update yet, the old data of ApplicationInfoWrapper will be load, so `isStale` will be true and it leads to deleting the entry of ApplicationInfoWrapper in KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L656-L662 and the worst thing is it delete the entry right after when Thread 1 has finished updating the entry with key=applicationId (`AAA`) with new value of an instance of ApplicationInfoWrapper. So the entry for the ApplicationInfoWrapper of applicationId= `AAA` is removed forever then when users access the Web UI for this application, and `Application not found` is shown up while the log for the app does exist. 
So here we make the `cleanAppData` method atomic just like the `addListing` method https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172 so that - If Thread 1 gets the lock on the `listing` before Thread 2, it will update the entry for the application, so in Thread 2 `isStale` will be false and the entry for the application will not be removed from the KVStore - If Thread 2 gets the lock on the `listing` before Thread 1, then `isStale` will be true and the entry for the application will be removed from the KVStore, but after that it will be added again by Thread 1. In both cases, the entry for the application will not be deleted unexpectedly from the KVStore. ### Why are the changes needed? Fix the bug causing the HS Web UI to display `Application not found` for applications whose logs do exist. ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Manual test. Deployed in our Spark HS and the `java.util.NoSuchElementException` exception does not happen anymore. The `Application not found` error does not happen anymore. Closes #36424 from tanvn/SPARK-39083. Authored-by: tan.vu <tan.vu@linecorp.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 29643265a9f5e8142d20add5350c614a55161451) Signed-off-by: Sean Owen <srowen@gmail.com> 08 May 2022, 13:09:32 UTC
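The race and the fix can be distilled into a generic check-then-delete example (plain Scala, not the actual FsHistoryProvider code); the point is that the reader/deleter must hold the same lock as the writer:
```scala
import java.util.concurrent.ConcurrentHashMap

final class Listing[K, V] {
  private val store = new ConcurrentHashMap[K, V]()

  // writer path, analogous to addListing: always runs under the object lock
  def upsert(key: K, value: V): Unit = synchronized { store.put(key, value) }

  // reader/deleter path, analogous to cleanAppData: without `synchronized`, the
  // staleness decision could be made on a value a concurrent upsert has already
  // replaced, and the remove would silently drop the freshly written entry
  def removeIfStale(key: K)(isStale: V => Boolean): Unit = synchronized {
    Option(store.get(key)).filter(isStale).foreach(_ => store.remove(key))
  }
}
```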
3241f20 [SPARK-39093][SQL][FOLLOWUP] Fix Period test ### What changes were proposed in this pull request? Change the Period test to use months rather than days. ### Why are the changes needed? While the Period test as-is confirms that the compilation error is fixed, it doesn't confirm that the newly generated code is correct. Spark ignores any unit less than months in Periods. As a result, we are always testing `0/(num + 3)`, so the test doesn't verify that the code generated for the right-hand operand is correct. ### Does this PR introduce _any_ user-facing change? No, only changes a test. ### How was this patch tested? Unit test. Closes #36481 from bersprockets/SPARK-39093_followup. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 670710e91dfb2ed27c24b139c50a6bcf03132024) Signed-off-by: Gengliang Wang <gengliang@apache.org> 08 May 2022, 10:14:58 UTC
6378365 [SPARK-39121][K8S][DOCS] Fix format error on running-on-kubernetes doc ### What changes were proposed in this pull request? Fix format error on running-on-kubernetes doc ### Why are the changes needed? Fix format syntax error ### Does this PR introduce _any_ user-facing change? No, unreleased doc only ### How was this patch tested? - `SKIP_API=1 bundle exec jekyll serve --watch` - CI passed Closes #36476 from Yikun/SPARK-39121. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 2349f74866ae1b365b5e4e0ec8a58c4f7f06885c) Signed-off-by: Max Gekk <max.gekk@gmail.com> 07 May 2022, 07:20:02 UTC
74ffaec [SPARK-39012][SQL] SparkSQL cast partition value does not support all data types ### What changes were proposed in this pull request? Add binary type support for schema inference in SparkSQL. ### Why are the changes needed? Spark does not support binary and boolean types when casting partition value to a target type. This PR adds the support for binary and boolean. ### Does this PR introduce _any_ user-facing change? No. This is more like fixing a bug. ### How was this patch tested? UT Closes #36344 from amaliujia/inferbinary. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 08c07a17717691f931d4d3206dd0385073f5bd08) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 May 2022, 09:39:13 UTC
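A hypothetical standalone helper mirroring the change (this is not the actual Spark code, whose casting lives in the partition-inference utilities): partition values arrive as directory-name strings and need a per-type conversion, and the PR adds the binary and boolean branches:
```scala
import java.nio.charset.StandardCharsets
import org.apache.spark.sql.types._

def castPartitionValue(raw: String, dt: DataType): Any = dt match {
  case IntegerType => raw.toInt
  case LongType    => raw.toLong
  case BooleanType => raw.toBoolean                         // newly supported
  case BinaryType  => raw.getBytes(StandardCharsets.UTF_8)  // newly supported
  case _           => raw                                   // leave other types to the existing paths
}
```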
de3673e [SPARK-39105][SQL] Add ConditionalExpression trait ### What changes were proposed in this pull request? Add a `ConditionalExpression` trait. ### Why are the changes needed? For developers, if a custom conditional-like expression contains common subexpressions, the evaluation order may be changed, since Spark pulls out and evaluates the common subexpressions first during execution. Adding a ConditionalExpression trait is friendly for developers. ### Does this PR introduce _any_ user-facing change? No, it adds a new trait. ### How was this patch tested? Passes existing tests Closes #36455 from ulysses-you/SPARK-39105. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fa86a078bb7d57d7dbd48095fb06059a9bdd6c2e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 May 2022, 07:42:10 UTC
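A rough sketch of what such a marker trait can expose (the method name is illustrative and may not match the trait actually added): enough information for subexpression elimination to know which inputs are evaluated unconditionally and which only live inside branches:
```scala
import org.apache.spark.sql.catalyst.expressions.Expression

trait ConditionalExpression extends Expression {
  /** Children that are evaluated on every invocation, regardless of which branch is taken. */
  def alwaysEvaluatedInputs: Seq[Expression]
}
```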
6a61f95 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases ### What changes were proposed in this pull request? Add missed dependencies to `dev/create-release/spark-rm/Dockerfile`. ### Why are the changes needed? To be able to build Spark releases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By building the Spark 3.3 release via: ``` $ dev/create-release/do-release-docker.sh -d /home/ubuntu/max/spark-3.3-rc1 ``` Closes #36449 from MaxGekk/deps-Dockerfile. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 4b1c2fb7a27757ebf470416c8ec02bb5c1f7fa49) Signed-off-by: Max Gekk <max.gekk@gmail.com> 05 May 2022, 17:10:17 UTC
0f2e3ec [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/33436, that adds a legacy configuration. It's found that it can break a valid usacase (https://github.com/apache/spark/pull/33436/files#r863271189): ```scala import org.apache.spark.sql.types._ val ds = Seq("a,", "a,b").toDS spark.read.schema( StructType( StructField("f1", StringType, nullable = false) :: StructField("f2", StringType, nullable = false) :: Nil) ).option("mode", "DROPMALFORMED").csv(ds).show() ``` **Before:** ``` +---+---+ | f1| f2| +---+---+ | a| b| +---+---+ ``` **After:** ``` +---+----+ | f1| f2| +---+----+ | a|null| | a| b| +---+----+ ``` This PR adds a configuration to restore **Before** behaviour. ### Why are the changes needed? To avoid breakage of valid usecases. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) to respect the nullability in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)` when the user-specified schema is provided. ### How was this patch tested? Unittests were added. Closes #36435 from HyukjinKwon/SPARK-35912. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6689b97ec76abe5bab27f02869f8f16b32530d1a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 May 2022, 07:23:43 UTC
1fa3171 [SPARK-39060][SQL][3.3] Typo in error messages of decimal overflow ### What changes were proposed in this pull request? This PR removes extra curly bracket from debug string for Decimal type in SQL. This is a backport from master branch. Commit: 165ce4eb7d6d75201beb1bff879efa99fde24f94 ### Why are the changes needed? Typo in error messages of decimal overflow. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running tests: ``` $ build/sbt "sql/testOnly" ``` Closes #36450 from vli-databricks/SPARK-39060-3.3. Authored-by: Vitalii Li <vitalii.li@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 05 May 2022, 06:08:38 UTC
94d3d6b [SPARK-38891][SQL] Skipping allocating vector for repetition & definition levels when possible ### What changes were proposed in this pull request? This PR adds two optimization on the vectorized Parquet reader for complex types: - avoid allocating vectors for repetition and definition levels whenever can be applied - avoid reading definition levels whenever can be applied ### Why are the changes needed? At the moment, Spark will allocate vectors for repetition and definition levels, and also read definition levels even if it's not necessary, for instance, when reading primitive types. This may add extra memory footprint especially when reading wide tables. Therefore, we should avoid them if possible. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #36202 from sunchao/SPARK-38891. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> 04 May 2022, 16:37:55 UTC
fd998c8 [SPARK-39093][SQL] Avoid codegen compilation error when dividing year-month intervals or day-time intervals by an integral ### What changes were proposed in this pull request? In `DivideYMInterval#doGenCode` and `DivideDTInterval#doGenCode`, rely on the operand variable names provided by `nullSafeCodeGen` rather than calling `genCode` on the operands twice. ### Why are the changes needed? `DivideYMInterval#doGenCode` and `DivideDTInterval#doGenCode` call `genCode` on the operands twice (once directly, and once indirectly via `nullSafeCodeGen`). However, if you call `genCode` on an operand twice, you might not get back the same variable name for both calls (e.g., when the operand is not a `BoundReference` or if whole-stage codegen is turned off). When that happens, `nullSafeCodeGen` generates initialization code for one set of variables, but the divide expression generates usage code for another set of variables, resulting in compilation errors like this: ``` spark-sql> create or replace temp view v1 as > select * FROM VALUES > (interval '10' months, interval '10' day, 2) > as v1(period, duration, num); Time taken: 2.81 seconds spark-sql> cache table v1; Time taken: 2.184 seconds spark-sql> select period/(num + 3) from v1; 22/05/03 08:56:37 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 40, Column 44: Expression "project_value_2" is not an rvalue ... 22/05/03 08:56:37 WARN UnsafeProjection: Expr codegen error and falling back to interpreter mode ... 0-2 Time taken: 0.149 seconds, Fetched 1 row(s) spark-sql> select duration/(num + 3) from v1; 22/05/03 08:57:29 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 40, Column 54: Expression "project_value_2" is not an rvalue ... 22/05/03 08:57:29 WARN UnsafeProjection: Expr codegen error and falling back to interpreter mode ... 2 00:00:00.000000000 Time taken: 0.089 seconds, Fetched 1 row(s) ``` The error is not fatal (unless you have `spark.sql.codegen.fallback` set to `false`), but it muddies the log and can slow the query (since the expression is interpreted). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests (unit tests run with `spark.sql.codegen.fallback` set to `false`, so the new tests fail without the fix). Closes #36442 from bersprockets/interval_div_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit ca87bead23ca32a05c6a404a91cea47178f63e70) Signed-off-by: Gengliang Wang <gengliang@apache.org> 04 May 2022, 09:22:24 UTC
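A simplified sketch of the shape of the fix, written as the body of `doGenCode` inside such a binary interval-division expression (the actual division logic the PR generates is more involved): `nullSafeCodeGen` evaluates each operand exactly once and hands the callback the generated variable names, so the callback must use those names rather than calling `genCode` on the operands again:
```scala
// inside e.g. a DivideYMInterval-like BinaryExpression
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
  nullSafeCodeGen(ctx, ev, (interval, num) =>
    // `interval` and `num` are the operand variables nullSafeCodeGen just initialized
    s"${ev.value} = $interval / $num;")
```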
d3aadb4 [SPARK-39087][SQL][3.3] Improve messages of error classes ### What changes were proposed in this pull request? In the PR, I propose to modify error messages of the following error classes: - INVALID_JSON_SCHEMA_MAP_TYPE - INCOMPARABLE_PIVOT_COLUMN - INVALID_ARRAY_INDEX_IN_ELEMENT_AT - INVALID_ARRAY_INDEX - DIVIDE_BY_ZERO This is a backport of https://github.com/apache/spark/pull/36428. ### Why are the changes needed? To improve readability of error messages. ### Does this PR introduce _any_ user-facing change? Yes. It changes user-facing error messages. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "sql/testOnly *QueryCompilationErrorsSuite*" $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite*" $ build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite" $ build/sbt "test:testOnly *SparkThrowableSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 040526391a45ad610422a48c05aa69ba5133f922) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36439 from MaxGekk/error-class-improve-msg-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 04 May 2022, 05:45:03 UTC
0515536 Preparing development version 3.3.1-SNAPSHOT 03 May 2022, 18:15:51 UTC
482b7d5 Preparing Spark release v3.3.0-rc1 03 May 2022, 18:15:45 UTC
4177626 [SPARK-35320][SQL][FOLLOWUP] Remove duplicated test ### What changes were proposed in this pull request? Follow-up for https://github.com/apache/spark/pull/33525 to remove duplicated test. ### Why are the changes needed? We don't need to do the same test twice. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This patch remove the duplicated test, so the existing test should pass. Closes #36436 from itholic/SPARK-35320. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a1ac5c57c7b79fb70656638d284b77dfc4261d35) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 May 2022, 10:28:13 UTC
bd6fd7e [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion ### What changes were proposed in this pull request? This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. ### Why are the changes needed? Fixes the JVM crash when checking isEmpty() on a dataset. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 May 2022, 23:30:14 UTC
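A minimal reproduction-style sketch of the call pattern this fix covers; the data below is synthetic, not the dataset from the JIRA ticket.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df.rdd.isEmpty() converts the DataFrame to an RDD of Rows and only probes the
# first non-empty partition; before this fix the Python-side iterator could keep
# consuming rows after the Java task had already completed, which on some
# datasets crashed the JVM or failed the executor.
df = spark.range(1_000_000).selectExpr("id", "id % 7 AS bucket").filter("bucket = 99")
print(df.rdd.isEmpty())  # True: no id has bucket 99
```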
1804f5c [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS ### What changes were proposed in this pull request? Currently SparkR documentation fails because of the usage `grep -oP `. Mac OS does not have this. This PR fixes it via using the existing way used in the current scripts at: https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/R/check-cran.sh#L52 ### Why are the changes needed? To make the dev easier. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested via: ```bash cd R ./create-docs.sh ``` Closes #36423 from HyukjinKwon/SPARK-37474. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6479455b8db40d584045cdb13e6c3cdfda7a2c0b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 May 2022, 08:50:09 UTC
4c09616 [SPARK-38085][SQL][FOLLOWUP] Do not fail too early for DeleteFromTable ### What changes were proposed in this pull request? `DeleteFromTable` has been in Spark for a long time and there are existing Spark extensions to compile `DeleteFromTable` to physical plans. However, the new analyzer rule `RewriteDeleteFromTable` fails very early if the v2 table does not support delete. This breaks certain Spark extensions which can still execute `DeleteFromTable` for certain v2 tables. This PR simply removes the error throwing in `RewriteDeleteFromTable`. It's a safe change because: 1. the new delete-related rules only match v2 tables with `SupportsRowLevelOperations`, so they won't be affected by this change 2. the planner rule will fail eventually if the v2 table doesn't support deletion. Spark eagerly executes commands, so Spark users can still see this error immediately. ### Why are the changes needed? To not break existing Spark extensions. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #36402 from cloud-fan/follow. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5630f700768432396a948376f5b46b00d4186e1b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 May 2022, 02:28:51 UTC
c87180a [SPARK-39040][SQL] Respect NaNvl in EquivalentExpressions for expression elimination ### What changes were proposed in this pull request? Respect NaNvl in EquivalentExpressions for expression elimination. ### Why are the changes needed? For example the query will fail: ```sql set spark.sql.ansi.enabled=true; set spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding; SELECT nanvl(1, 1/0 + 1/0); ``` ```sql org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4) (10.221.98.68 executor driver): org.apache.spark.SparkArithmeticException: divide by zero. To return NULL instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error. == SQL(line 1, position 17) == select nanvl(1 , 1/0 + 1/0) ^^^ at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:151) ``` We should respect the ordering of conditional expression that always evaluate the predicate branch first, so the query above should not fail. ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? add test Closes #36376 from ulysses-you/SPARK-39040. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f6b43f0384f9681b963f52a53759c521f6ac11d5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 April 2022, 04:36:05 UTC
e7caea5 [SPARK-39047][SQL][3.3] Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE ### What changes were proposed in this pull request? In the PR, I propose to remove the `ILLEGAL_SUBSTRING` error class, and use `INVALID_PARAMETER_VALUE` in the case when the `strfmt` parameter of the `format_string()` function contains `%0$`. The last value is handled differently by JDKs: _"... Java 8 and Java 11 uses it as "%1$", and Java 17 throws IllegalFormatArgumentIndexException(Illegal format argument index = 0)"_. This is a backport of https://github.com/apache/spark/pull/36380. ### Why are the changes needed? To improve code maintenance and user experience with Spark SQL by reducing the number of user-facing error classes. ### Does this PR introduce _any_ user-facing change? Yes, it changes user-facing error message. Before: ```sql spark-sql> select format_string('%0$s', 'Hello'); Error in query: [ILLEGAL_SUBSTRING] The argument_index of string format cannot contain position 0$.; line 1 pos 7 ``` After: ```sql spark-sql> select format_string('%0$s', 'Hello'); Error in query: [INVALID_PARAMETER_VALUE] The value of parameter(s) 'strfmt' in `format_string` is invalid: expects %1$, %2$ and so on, but got %0$.; line 1 pos 7 ``` ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *SparkThrowableSuite" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z text.sql" $ build/sbt "test:testOnly *QueryCompilationErrorsSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 9dcc24c36f6fcdf43bf66fe50415be575f7b2918) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36390 from MaxGekk/error-class-ILLEGAL_SUBSTRING-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 28 April 2022, 11:11:17 UTC
50f5e9c [SPARK-39055][DOC] Fix documentation 404 page ### What changes were proposed in this pull request? replace unused `docs/_layouts/404.html` with `docs/404.md` ### Why are the changes needed? make the custom 404 page work <img width="638" alt="image" src="https://user-images.githubusercontent.com/8326978/165706963-6cc96cf5-299e-4b60-809f-79dd771f3b5d.png"> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` bundle exec jekyll serve ``` and visit a non-existing page `http://localhost:4000/abcd` Closes #36392 from yaooqinn/SPARK-39055. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ec2bfa566ed3c796e91987f7a158e8b60fbd5c42) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 April 2022, 10:57:51 UTC
606a99f [SPARK-39046][SQL] Return an empty context string if TreeNode.origin is wrongly set ### What changes were proposed in this pull request? For the query context `TreeNode.origin.context`, this PR proposes to return an empty context string if * the query text, the start index, or the stop index is missing * the start index is less than 0 * the stop index is larger than the length of query text * the start index is larger than the stop index ### Why are the changes needed? There are downstream projects that depend on Spark. There is no guarantee for the correctness of TreeNode.origin. Developers may create a plan/expression with an Origin containing wrong startIndex/stopIndex/sqlText. Thus, to avoid errors in calling `String.substring` or showing misleading debug information, I suggest returning an empty context string if TreeNode.origin is wrongly set. The query context is just for better error messages and we should handle it cautiously. ### Does this PR introduce _any_ user-facing change? No, the context framework is not released yet. ### How was this patch tested? UT Closes #36379 from gengliangwang/safeContext. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 7fe2759e9f81ec267e92e1c6f8a48f42042db791) Signed-off-by: Gengliang Wang <gengliang@apache.org> 28 April 2022, 01:59:34 UTC
96d66b0 [SPARK-38988][PYTHON] Suppress PerformanceWarnings of `DataFrame.info` ### What changes were proposed in this pull request? Suppress PerformanceWarnings of DataFrame.info ### Why are the changes needed? To improve usability. ### Does this PR introduce _any_ user-facing change? No. Only PerformanceWarnings of DataFrame.info are suppressed. ### How was this patch tested? Manual tests. Closes #36367 from xinrong-databricks/frame.info. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 594337fad131280f62107326062fb554f0566d43) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 April 2022, 00:25:51 UTC
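A small pandas-on-Spark usage sketch for the API in question; after this change the call below no longer surfaces pandas PerformanceWarnings. The toy frame is only illustrative.

```python
import pyspark.pandas as ps

# DataFrame.info() prints the index, columns, dtypes and non-null counts;
# previously it could emit PerformanceWarning noise from pandas internals.
psdf = ps.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
psdf.info()
```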
c9b6b50 [SPARK-39049][PYTHON][CORE][ML] Remove unneeded `pass` ### What changes were proposed in this pull request? Remove unneeded `pass` ### Why are the changes needed? The Estimator, Transformer, and Evaluator classes are abstract classes that already define functions, and the `def run()` that raises ValueError already has a body, so the trailing `pass` statements are redundant. By removing `pass`, the code becomes easier to read, understand, and reuse. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests passed. Closes #36383 from bjornjorgensen/remove-unneeded-pass. Lead-authored-by: Bjørn Jørgensen <bjornjorgensen@gmail.com> Co-authored-by: bjornjorgensen <bjornjorgensen@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 0e875875059c1cbf36de49205a4ce8dbc483d9d1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 April 2022, 00:21:07 UTC
84addc5 [SPARK-39051][PYTHON] Minor refactoring of `python/pyspark/sql/pandas/conversion.py` Minor refactoring of `python/pyspark/sql/pandas/conversion.py`, which includes: - doc change - renaming To improve code readability and maintainability. No. Existing tests. Closes #36384 from xinrong-databricks/conversion.py. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c19fadabde3ef3f9c7e4fa9bf74632a4f8e1f3e2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 April 2022, 00:20:18 UTC
4a4e35a [SPARK-38997][SQL] DS V2 aggregate push-down supports group by expressions ### What changes were proposed in this pull request? Currently, Spark DS V2 aggregate push-down only supports group by column. But the SQL show below is very useful and common. ``` SELECT CASE WHEN 'SALARY' > 8000.00 AND 'SALARY' < 10000.00 THEN 'SALARY' ELSE 0.00 END AS key, SUM('SALARY') FROM "test"."employee" GROUP BY key ``` ### Why are the changes needed? Let DS V2 aggregate push-down supports group by expressions ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New tests Closes #36325 from beliefer/SPARK-38997. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ee6ea3c68694e35c36ad006a7762297800d1e463) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 April 2022, 16:44:18 UTC
b25276f [SPARK-39015][SQL][3.3] Remove the usage of toSQLValue(v) without an explicit type ### What changes were proposed in this pull request? This PR is a backport of https://github.com/apache/spark/pull/36351 This PR proposes to remove the the usage of `toSQLValue(v)` without an explicit type. `Literal(v)` is intended to be used from end-users so it cannot handle our internal types such as `UTF8String` and `ArrayBasedMapData`. Using this method can lead to unexpected error messages such as: ``` Caused by: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE] The feature is not supported: literal for 'hair' of class org.apache.spark.unsafe.types.UTF8String. at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:241) at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99) at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:45) ... ``` Since It is impossible to have the corresponding data type from the internal types as one type can map to multiple external types (e.g., `Long` for `Timestamp`, `TimestampNTZ`, and `LongType`), the removal approach was taken. ### Why are the changes needed? To provide the error messages as intended. ### Does this PR introduce _any_ user-facing change? Yes. ```scala import org.apache.spark.sql.Row import org.apache.spark.sql.types.StructType import org.apache.spark.sql.types.StringType import org.apache.spark.sql.types.DataTypes val arrayStructureData = Seq( Row(Map("hair"->"black", "eye"->"brown")), Row(Map("hair"->"blond", "eye"->"blue")), Row(Map())) val mapType = DataTypes.createMapType(StringType, StringType) val arrayStructureSchema = new StructType().add("properties", mapType) val mapTypeDF = spark.createDataFrame( spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema) spark.conf.set("spark.sql.ansi.enabled", true) mapTypeDF.selectExpr("element_at(properties, 'hair')").show ``` Before: ``` Caused by: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE] The feature is not supported: literal for 'hair' of class org.apache.spark.unsafe.types.UTF8String. at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:241) at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99) at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:45) ... ``` After: ``` Caused by: org.apache.spark.SparkNoSuchElementException: [MAP_KEY_DOES_NOT_EXIST] Key 'hair' does not exist. To return NULL instead, use 'try_element_at'. If necessary set spark.sql.ansi.enabled to false to bypass this error. == SQL(line 1, position 0) == element_at(properties, 'hair') ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ``` ### How was this patch tested? Unittest was added. Otherwise, existing test cases should cover. Closes #36375 from HyukjinKwon/SPARK-39015-3.3. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 27 April 2022, 13:17:51 UTC
b3ecff3 [SPARK-34079][SQL][FOLLOW-UP] Revert some changes in InjectRuntimeFilterSuite To remove unnecessary changes from `InjectRuntimeFilterSuite` after https://github.com/apache/spark/pull/32298. These are not needed after https://github.com/apache/spark/pull/34929 as the final optimized plan doesn't contain any `WithCTE` nodes. No need for those changes. No. Added new test. Closes #36361 from peter-toth/SPARK-34079-multi-column-scalar-subquery-follow-up-2. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d05e01d54024e3844f1e48e03bad3fd814b8f6b9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 April 2022, 08:18:06 UTC
d59f118 [SPARK-39032][PYTHON][DOCS] Examples' tag for pyspark.sql.functions.when() ### What changes were proposed in this pull request? Fix missing keyword for `pyspark.sql.functions.when()` documentation. ### Why are the changes needed? [Documentation](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.when.html) is not formatted correctly ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? All tests passed. Closes #36369 from vadim/SPARK-39032. Authored-by: vadim <86705+vadim@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5b53bdfa83061c160652e07b999f996fc8bd2ece) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 April 2022, 07:56:30 UTC
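For reference, a short usage sketch of the function whose documentation is being fixed; the DataFrame and column alias are assumed for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(5)
# when(condition, value).otherwise(default) builds a CASE WHEN expression.
df.select(F.when(df.id > 2, "big").otherwise("small").alias("size")).show()
```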
793ba60 [SPARK-38918][SQL] Nested column pruning should filter out attributes that do not belong to the current relation ### What changes were proposed in this pull request? This PR updates `ProjectionOverSchema` to use the outputs of the data source relation to filter the attributes in the nested schema pruning. This is needed because the attributes in the schema do not necessarily belong to the current data source relation. For example, if a filter contains a correlated subquery, then the subquery's children can contain attributes from both the inner query and the outer query. Since the `RewriteSubquery` batch happens after early scan pushdown rules, nested schema pruning can wrongly use the inner query's attributes to prune the outer query data schema, thus causing wrong results and unexpected exceptions. ### Why are the changes needed? To fix a bug in `SchemaPruning`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #36216 from allisonwang-db/spark-38918-nested-column-pruning. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 150434b5d7909dcf8248ffa5ec3d937ea3da09fd) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 27 April 2022, 05:39:53 UTC
3df3421 [SPARK-39027][SQL][3.3] Output SQL statements in error messages in upper case and w/o double quotes ### What changes were proposed in this pull request? In the PR, I propose to output any SQL statement in error messages in upper case, and apply new implementation of `QueryErrorsBase.toSQLStmt()` to all exceptions in `Query.*Errors` w/ error classes. Also this PR modifies all affected tests, see the list in the section "How was this patch tested?". This is a backport of https://github.com/apache/spark/pull/36359. ### Why are the changes needed? To improve user experience with Spark SQL by highlighting SQL statements in error massage and make them more visible to users. Also this PR makes SQL statements in error messages consistent to the docs where such elements are showed in upper case w/ any quotes. ### Does this PR introduce _any_ user-facing change? Yes. The changes might influence on error messages: Before: ```sql The operation "DESC PARTITION" is not allowed ``` After: ```sql The operation DESC PARTITION is not allowed ``` ### How was this patch tested? By running affected test suites: ``` $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite" $ build/sbt "sql/testOnly *QueryParsingErrorsSuite" $ build/sbt "sql/testOnly *QueryCompilationErrorsSuite" $ build/sbt "test:testOnly *QueryCompilationErrorsDSv2Suite" $ build/sbt "test:testOnly *ExtractPythonUDFFromJoinConditionSuite" $ build/sbt "testOnly *PlanParserSuite" $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z transform.sql" $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z join-lateral.sql" $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z describe.sql" $ build/sbt "testOnly *DDLParserSuite" ``` Closes #36363 from MaxGekk/error-class-toSQLStmt-no-quotes-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 April 2022, 04:15:04 UTC
80efaa2 [SPARK-34863][SQL][FOLLOW-UP] Handle IsAllNull in OffHeapColumnVector ### What changes were proposed in this pull request? This PR fixes an issue of reading null columns with the vectorised Parquet reader when the entire column is null or does not exist. This is especially noticeable when performing a merge or schema evolution in Parquet. The issue is only exposed with the `OffHeapColumnVector` which does not handle `isAllNull` flag - `OnHeapColumnVector` already handles `isAllNull` so everything works fine there. ### Why are the changes needed? The change is needed to correctly read null columns using the vectorised reader in the off-heap mode. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I updated the existing unit tests to ensure we cover off-heap mode. I confirmed that the tests pass with the fix and fail without. Closes #36366 from sadikovi/fix-off-heap-cv. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Chao Sun <sunchao@apple.com> 27 April 2022, 03:45:23 UTC
052ae96 [SPARK-39030][PYTHON] Rename sum to avoid shading the builtin Python function ### What changes were proposed in this pull request? Rename sum to something else. ### Why are the changes needed? `sum` is a built-in function in Python. [sum() at python docs](https://docs.python.org/3/library/functions.html#sum) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Use existing tests. Closes #36364 from bjornjorgensen/rename-sum. Authored-by: bjornjorgensen <bjornjorgensen@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 3821d807a599a2d243465b4e443f1eb68251d432) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 April 2022, 01:10:20 UTC
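A generic Python illustration (not Spark code) of the shadowing hazard the rename avoids.

```python
# Binding a local name `sum` hides the builtin for the rest of the scope.
sum = 10
try:
    sum([1, 2, 3])
except TypeError as e:
    print(e)  # 'int' object is not callable

del sum  # remove the shadowing binding so lookups fall back to builtins.sum
print(sum([1, 2, 3]))  # 6
```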
41d44d1 Revert "[SPARK-38354][SQL] Add hash probes metric for shuffled hash join" This reverts commit 158436655f30141bbd5afa8d95aec66282a5c4b4, as the original PR caused performance regression reported in https://github.com/apache/spark/pull/35686#issuecomment-1107807027 . Closes #36338 from c21/revert-metrics. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6b5a1f9df28262fa90d28dc15af67e8a37a9efcf) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 April 2022, 23:12:45 UTC
e2f82fd [SPARK-39019][TESTS] Use `withTempPath` to clean up temporary data directory after `SPARK-37463: read/write Timestamp ntz to Orc with different time zone` ### What changes were proposed in this pull request? `SPARK-37463: read/write Timestamp ntz to Orc with different time zone` use the absolute path to save the test data, and does not clean up the test data after the test. This pr change to use `withTempPath` to ensure the data directory is cleaned up after testing. ### Why are the changes needed? Clean up the temporary data directory after test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA - Manual test: Run ``` mvn clean install -pl sql/core -am -DskipTests mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.orc.OrcV1QuerySuite git status ``` **Before** ``` sql/core/ts_ntz_orc/ ls sql/core/ts_ntz_orc _SUCCESS part-00000-9523e257-5024-4980-8bb3-12070222b0bd-c000.snappy.orc ``` **After** No residual `sql/core/ts_ntz_orc/` Closes #36352 from LuciferYang/SPARK-39019. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 97449d23a3b2232e14e63c6645919c5d93e4491c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 April 2022, 22:52:53 UTC
dd192b7 [SPARK-34079][SQL][FOLLOW-UP] Remove debug logging ### What changes were proposed in this pull request? To remove debug logging accidentally left in code after https://github.com/apache/spark/pull/32298. ### Why are the changes needed? No need for that logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #36354 from peter-toth/SPARK-34079-multi-column-scalar-subquery-follow-up. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f24c9b0d135ce7ef4f219ab661a6b665663039f0) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 26 April 2022, 10:19:58 UTC
ff9b163 [SPARK-39001][SQL][DOCS][FOLLOW-UP] Revert the doc changes for dropFieldIfAllNull, prefersDecimal and primitivesAsString (schema_of_json) This PR is a followup of https://github.com/apache/spark/pull/36339. Actually `schema_of_json` expression supports `dropFieldIfAllNull`, `prefersDecimal` an `primitivesAsString`: ```scala scala> spark.range(1).selectExpr("""schema_of_json("{'a': null}", map('dropFieldIfAllNull', 'true'))""").show() +---------------------------+ |schema_of_json({'a': null})| +---------------------------+ | STRUCT<>| +---------------------------+ scala> spark.range(1).selectExpr("""schema_of_json("{'b': 1.0}", map('prefersDecimal', 'true'))""").show() +--------------------------+ |schema_of_json({'b': 1.0})| +--------------------------+ | STRUCT<b: DECIMAL...| +--------------------------+ scala> spark.range(1).selectExpr("""schema_of_json("{'b': 1.0}", map('primitivesAsString', 'true'))""").show() +--------------------------+ |schema_of_json({'b': 1.0})| +--------------------------+ | STRUCT<b: STRING>| +--------------------------+ ``` For correct documentation. To end users, no because it's a partial revert of the docs unreleased yet. Partial logical revert so I did not add a test also since this is just a doc change. Closes #36346 from HyukjinKwon/SPARK-39001-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5056c6cc333982d39546f2acf9a889d102cc4ab3) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 26 April 2022, 01:57:58 UTC
cb40bee [SPARK-39008][BUILD] Change ASF as a single author in Spark distribution ### What changes were proposed in this pull request? This PR proposes to have a single author as ASF for Apache Spark project since the project is being maintained under this organization. For `pom.xml`, I referred to Hadoop https://github.com/apache/hadoop/blob/56cfd6061770872ce35cf815544b0c0f49613209/pom.xml#L76-L79 ### Why are the changes needed? We mention several original developers in authors in `pom.xml` or `R/pkg/DESRIPTION` while the project is being maintained under ASF organization. In addition, seems like these people (at least Matei) here get arbitrary spam emails. ### Does this PR introduce _any_ user-facing change? Yes, the authors in the distributions will remain as ASF alone in Apache Spark distributions. ### How was this patch tested? No, existing build and CRAN check should validate it. Closes #36337 from HyukjinKwon/SPARK-39008. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ce957e26e72d022b8fd9664bd19c431536302c36) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 April 2022, 23:52:06 UTC
49830c6 [SPARK-38939][SQL] Support DROP COLUMN [IF EXISTS] syntax ### What changes were proposed in this pull request? This PR introduces the following: - Parser changes to have an `IF EXISTS` clause for `DROP COLUMN`. - Logic to silence the errors within parser and analyzer when encountering missing columns while using `IF EXISTS` - Ensure only resolving and dropping existing columns inside table schema ### Why are the changes needed? Currently `ALTER TABLE ... DROP COLUMN(s) ...` syntax will always throw error if the column doesn't exist. This PR would like to provide an (IF EXISTS) syntax to provide better user experience for downstream handlers (such as Delta with incoming column dropping support) that support it, and make consistent with some other DMLs such as `DROP TABLE IF EXISTS`. ### Does this PR introduce _any_ user-facing change? User may now specify `ALTER TABLE xxx DROP COLUMN(S) IF EXISTS a, a.b, c.d`. ### How was this patch tested? Modified existing unit tests and new unit tests. cloud-fan gengliangwang MaxGekk Closes #36252 from jackierwzhang/SPARK-38939. Authored-by: jackierwzhang <ruowang.zhang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 April 2022, 18:49:28 UTC
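A syntax sketch of the new clause run through SparkSession; the catalog, table, and column names are hypothetical, and the command requires a configured v2 catalog to actually execute.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Without IF EXISTS this fails if any listed column is missing; with IF EXISTS,
# missing columns are skipped and only the existing ones are dropped.
spark.sql("ALTER TABLE testcat.db.events DROP COLUMNS IF EXISTS device_id, meta.debug")
```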
c6698cc [SPARK-39001][SQL][DOCS] Document which options are unsupported in CSV and JSON functions This PR proposes to document which options do not work and are explicitly unsupported in CSV and JSON functions. To avoid users to misunderstand the options. Yes, it documents which options don't work in CSV/JSON expressions. I manually built the docs and checked the HTML output. Closes #36339 from HyukjinKwon/SPARK-39001. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 10a643c8af368cce131ef217f6ef610bf84f8b9c) Signed-off-by: Max Gekk <max.gekk@gmail.com> 25 April 2022, 17:28:32 UTC
f056435 [SPARK-39007][SQL][3.3] Use double quotes for SQL configs in error messages ### What changes were proposed in this pull request? Wrap SQL configs in error messages by double quotes. Added the `toSQLConf()` method to `QueryErrorsBase` to invoke it from `Query.*Errors`. This is a backport of https://github.com/apache/spark/pull/36335. ### Why are the changes needed? 1. To highlight types and make them more visible for users. 2. To be able to easily parse types from error text. 3. To be consistent to other outputs of identifiers, sql statement and etc. where Spark uses quotes or ticks. ### Does this PR introduce _any_ user-facing change? Yes, it changes user-facing error messages. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "testOnly *QueryCompilationErrorsSuite" $ build/sbt "testOnly *QueryExecutionAnsiErrorsSuite" $ build/sbt "testOnly *QueryExecutionErrorsSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit f01bff971e36870e101b2f76195e0d380db64e0c) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36340 from MaxGekk/output-conf-error-class-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 25 April 2022, 17:16:08 UTC
ffa8152 [SPARK-38868][SQL] Don't propagate exceptions from filter predicate when optimizing outer joins ### What changes were proposed in this pull request? Change `EliminateOuterJoin#canFilterOutNull` to return `false` when a `where` condition throws an exception. ### Why are the changes needed? Consider this query: ``` select * from (select id, id as b from range(0, 10)) l left outer join (select id, id + 1 as c from range(0, 10)) r on l.id = r.id where assert_true(c > 0) is null; ``` The query should succeed, but instead fails with ``` java.lang.RuntimeException: '(c#1L > cast(0 as bigint))' is not true! ``` This happens even though there is no row where `c > 0` is false. The `EliminateOuterJoin` rule checks if it can convert the outer join to a inner join based on the expression in the where clause, which in this case is ``` assert_true(c > 0) is null ``` `EliminateOuterJoin#canFilterOutNull` evaluates that expression with `c` set to `null` to see if the result is `null` or `false`. That rule doesn't expect the result to be a `RuntimeException`, but in this case it always is. That is, the assertion is failing during optimization, not at run time. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #36230 from bersprockets/outer_join_eval_assert_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e2930b8dc087e5a284b451c4cac6c1a2459b456d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 April 2022, 05:49:41 UTC
5dfa24d [SPARK-38996][SQL][3.3] Use double quotes for types in error messages ### What changes were proposed in this pull request? This PR is a backport of https://github.com/apache/spark/pull/36324 In the PR, I propose to modify the method `QueryErrorsBase.toSQLType()` to use double quotes for types in error messages. ### Why are the changes needed? 1. To highlight types and make them more visible for users. 2. To be able to easily parse types from error text. 3. To be consistent to other outputs of identifiers, sql statement and etc. where Spark uses quotes or ticks. ### Does this PR introduce _any_ user-facing change? Yes, the PR changes user-facing errors. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "test:testOnly *QueryParsingErrorsSuite" $ build/sbt "test:testOnly *QueryCompilationErrorsSuite" $ build/sbt "test:testOnly *QueryExecutionErrorsSuite" $ build/sbt "testOnly *CastSuite" $ build/sbt "testOnly *AnsiCastSuiteWithAnsiModeOn" $ build/sbt "testOnly *EncoderResolutionSuite" $ build/sbt "test:testOnly *DatasetSuite" $ build/sbt "test:testOnly *InsertSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 5e494d3de70c6e46f33addd751a227e6f9d5703f) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36329 from MaxGekk/wrap-types-in-error-classes-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 April 2022, 02:53:27 UTC
9c5f38d [SPARK-38977][SQL] Fix schema pruning with correlated subqueries ### What changes were proposed in this pull request? This PR fixes schema pruning for queries with multiple correlated subqueries. Previously, Spark would throw an exception trying to determine root fields in `SchemaPruning$identifyRootFields`. That was happening because expressions in predicates that referenced attributes in subqueries were not ignored. That's why attributes from multiple subqueries could conflict with each other (e.g. incompatible types) even though they should be ignored. For instance, the following query would throw a runtime exception. ``` SELECT name FROM contacts c WHERE EXISTS (SELECT 1 FROM ids i WHERE i.value = c.id) AND EXISTS (SELECT 1 FROM first_names n WHERE c.name.first = n.value) ``` ``` [info] org.apache.spark.SparkException: Failed to merge fields 'value' and 'value'. Failed to merge incompatible data types int and string [info] at org.apache.spark.sql.errors.QueryExecutionErrors$.failedMergingFieldsError(QueryExecutionErrors.scala:936) ``` ### Why are the changes needed? These changes are needed to avoid exceptions for some queries with multiple correlated subqueries. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR comes with tests. Closes #36303 from aokolnychyi/spark-38977. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 0c9947dabcb71de414c97c0e60a1067e468f2642) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 22 April 2022, 21:12:01 UTC
02780b9 [SPARK-38973][SHUFFLE] Mark stage as merge finalized irrespective of its state ### What changes were proposed in this pull request? This change fixes the scenarios where a stage re-attempt doesn't complete successfully, even though all the tasks complete when push-based shuffle is enabled. With Adaptive Merge Finalization, a stage may be triggered for finalization when it is in the below state: - The stage is not running (not in the running set of the DAGScheduler) - it had failed, been canceled, or is waiting, and - The stage has no pending partitions (all the tasks completed at least once) For such a stage, when the finalization completes, the stage will still not be marked as mergeFinalized. The state of the stage will be: - `stage.shuffleDependency.mergeFinalized = false` - `stage.shuffleDependency.getFinalizeTask != Nil` - Merged statuses of the stage are unregistered When the stage is resubmitted, the newer attempt of the stage will never complete even though its tasks may be completed. This is because the newer attempt of the stage will have `shuffleMergeEnabled = true`, since with the previous attempt the stage was never marked as mergedFinalized, and the finalizeTask is present (from the finalization attempt for the previous stage attempt). So, when all the tasks of the newer attempt complete, then these conditions will be true: - stage will be running - There will be no pending partitions since all the tasks completed - `stage.shuffleDependency.shuffleMergeEnabled = true` - `stage.shuffleDependency.shuffleMergeFinalized = false` - `stage.shuffleDependency.getFinalizeTask` is `not empty` This leads the DAGScheduler to try scheduling finalization and not trigger the completion of the Stage. However, because of the last condition it never even schedules the finalization and the stage never completes. In addition, for determinate stages, which have completed merge finalization, we don't need to unregister merge results - since the stage retry, or any other stage computing the same shuffle id, can use it. ### Why are the changes needed? The change fixes the above issue where the application gets stalled as some stages don't complete successfully. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I have just modified the existing UT. A stage will be marked finalized irrespective of its state and for determinate stages we don't want to unregister merge results. Closes #36293 from otterc/SPARK-38973. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit f4a81ae6e631af27fc5eef81097b842d4e0e2e51) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 22 April 2022, 19:39:10 UTC
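For context, a hedged sketch of turning on push-based shuffle, the feature whose merge-finalization path this commit fixes; the config names are the standard ones from the Spark configuration docs, and the feature additionally requires YARN with the external shuffle service.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Push-based shuffle: map outputs are pushed to remote shuffle services and
    # merged per reduce partition; "merge finalization" marks a stage's merged
    # output as complete so reducers can read it.
    .config("spark.shuffle.push.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```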
ca9138e [SPARK-34960][SQL][DOCS][FOLLOWUP] Improve doc for DSv2 aggregate push down ### What changes were proposed in this pull request? This is a followup per comment in https://issues.apache.org/jira/browse/SPARK-34960, to improve the documentation for data source v2 aggregate push down of Parquet and ORC. * Unify SQL config docs between Parquet and ORC, and add the note that if statistics is missing from any file footer, exception would be thrown. * Also adding the same note for exception in Parquet and ORC methods to aggregate from statistics. Though in future Spark release, we may improve the behavior to fallback to aggregate from real data of file, in case any statistics are missing. We'd better to make a clear documentation for current behavior now. ### Why are the changes needed? Give users & developers a better idea of when aggregate push down would throw exception. Have a better documentation for current behavior. ### Does this PR introduce _any_ user-facing change? Yes, the documentation change in SQL configs. ### How was this patch tested? Existing tests as this is just documentation change. Closes #36311 from c21/agg-doc. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit 86b8757c2c4bab6a0f7a700cf2c690cdd7f31eba) Signed-off-by: huaxingao <huaxin_gao@apple.com> 22 April 2022, 17:14:37 UTC
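A hedged sketch of the configs this documentation change covers; the names are the Spark 3.3 ones and the path is hypothetical. With these flags on, MIN/MAX/COUNT can be answered from file-footer statistics, and per the documented behavior an exception is thrown if any footer lacks statistics.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: these are the Parquet/ORC aggregate push-down flags being documented.
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")
spark.conf.set("spark.sql.orc.aggregatePushdown", "true")

# explain() shows whether the aggregates were pushed into the scan.
spark.sql("SELECT MIN(id), MAX(id), COUNT(*) FROM parquet.`/tmp/data`").explain()
```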
9cc2ae7 [SPARK-38992][CORE] Avoid using bash -c in ShellBasedGroupsMappingProvider ### What changes were proposed in this pull request? This PR proposes to avoid using `bash -c` in `ShellBasedGroupsMappingProvider`. This could allow users a command injection. ### Why are the changes needed? For a security purpose. ### Does this PR introduce _any_ user-facing change? Virtually no. ### How was this patch tested? Manually tested. Closes #36315 from HyukjinKwon/SPARK-38992. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c83618e4e5fc092829a1f2a726f12fb832e802cc) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 April 2022, 10:01:26 UTC
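For context, a hedged sketch of where this provider is wired in; the config key and class name are taken from the Spark security documentation.

```python
from pyspark.sql import SparkSession

# The provider resolves a user's groups by shelling out to `id -Gn`; this commit
# hardens that call so the username is no longer interpolated through `bash -c`.
spark = (
    SparkSession.builder
    .config("spark.user.groups.mapping",
            "org.apache.spark.security.ShellBasedGroupsMappingProvider")
    .getOrCreate()
)
```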
0fdb675 [SPARK-38813][3.3][SQL][FOLLOWUP] Improve the analysis check for TimestampNTZ output ### What changes were proposed in this pull request? In https://github.com/apache/spark/pull/36094, a check for failing TimestampNTZ output is added. However, if there is an unresolved attribute in the plan, even if it is note related to TimestampNTZ, the error message becomes confusing ``` scala> val df = spark.range(2) df: org.apache.spark.sql.Dataset[Long] = [id: bigint] scala> df.select("i") org.apache.spark.sql.AnalysisException: Invalid call to dataType on unresolved object; 'Project ['i] +- Range (0, 2, step=1, splits=Some(16)) at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:137) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$4(CheckAnalysis.scala:164) ... ``` Before changes it was ``` org.apache.spark.sql.AnalysisException: Column 'i' does not exist. Did you mean one of the following? [id]; ``` This PR is the improve the check for TimestampNTZ and restore the error message for unresolved attributes. ### Why are the changes needed? Fix a regression in analysis error message. ### Does this PR introduce _any_ user-facing change? No, it is not released yet. ### How was this patch tested? Manual test Closes #36316 from gengliangwang/bugFix. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 22 April 2022, 05:35:15 UTC
5f39653 [SPARK-38990][SQL] Avoid `NullPointerException` when evaluating date_trunc/trunc format as a bound reference ### What changes were proposed in this pull request? Change `TruncInstant.evalHelper` to pass the input row to `format.eval` when `format` is a not a literal (and therefore might be a bound reference). ### Why are the changes needed? This query fails with a `java.lang.NullPointerException`: ``` select date_trunc(col1, col2) from values ('week', timestamp'2012-01-01') as data(col1, col2); ``` This only happens if the data comes from an inline table. When the source is an inline table, `ConvertToLocalRelation` attempts to evaluate the function against the data in interpreted mode. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Update to unit tests. Closes #36312 from bersprockets/date_trunc_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 2e4f4abf553cedec1fa8611b9494a01d24e6238a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 April 2022, 03:30:53 UTC
6306410 [SPARK-38974][SQL] Filter registered functions with a given database name in list functions ### What changes were proposed in this pull request? This PR fixes a bug in list functions to filter out registered functions that do not belong to the specified database. ### Why are the changes needed? To fix a bug for `SHOW FUNCTIONS IN [db]`. Listed functions should only include all temporary functions and persistent functions in the specified database, instead of all registered functions in the function registry. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #36291 from allisonwang-db/spark-38974-list-functions. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit cbfa0513421d5e9e9b7410d7f86b8e25df4ae548) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 April 2022, 03:24:52 UTC
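A short sketch of the command affected; `mydb` is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS mydb")

# After the fix, registered functions that do not belong to `mydb` are filtered
# out: the listing covers temporary functions plus persistent functions in `mydb`,
# not every function in the registry.
spark.sql("SHOW FUNCTIONS IN mydb").show(truncate=False)
```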
4cb2ae2 [SPARK-38666][SQL] Add missing aggregate filter checks ### What changes were proposed in this pull request? Add checks in `ResolveFunctions#validateFunction` to ensure the following about each aggregate filter: - has a datatype of boolean - doesn't contain an aggregate expression - doesn't contain a window expression `ExtractGenerator` already handles the case of a generator in an aggregate filter. ### Why are the changes needed? There are three cases where a query with an aggregate filter produces non-helpful error messages. 1) Window expression in aggregate filter ``` select sum(a) filter (where nth_value(a, 2) over (order by b) > 1) from (select 1 a, '2' b); ``` The above query should produce an analysis error, but instead produces a stack overflow: ``` java.lang.StackOverflowError: null at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) ~[scala-library.jar:?] at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) ~[scala-library.jar:?] at scala.collection.immutable.VectorBuilder.$plus$plus$eq(Vector.scala:668) ~[scala-library.jar:?] at scala.collection.immutable.VectorBuilder.$plus$plus$eq(Vector.scala:645) ~[scala-library.jar:?] at scala.collection.generic.GenericCompanion.apply(GenericCompanion.scala:56) ~[scala-library.jar:?] at org.apache.spark.sql.catalyst.trees.UnaryLike.children(TreeNode.scala:1172) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.trees.UnaryLike.children$(TreeNode.scala:1172) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.UnaryExpression.children$lzycompute(Expression.scala:494) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.UnaryExpression.children(Expression.scala:494) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:223) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.Alias.resolved$lzycompute(namedExpressions.scala:155) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.Alias.resolved(namedExpressions.scala:155) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] ``` With this PR, the query will instead produce ``` org.apache.spark.sql.AnalysisException: FILTER expression contains window function. It cannot be used in an aggregate function; line 1 pos 7 ``` 2) Non-boolean filter expression ``` select sum(a) filter (where a) from (select 1 a, '2' b); ``` This query should produce an analysis error, but instead causes a projection compilation error or whole-stage codegen error (depending on the datatype of the expression): ```` org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 50, Column 6: Not a boolean expression at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) ~[janino-3.0.16.jar:?] at org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4049) ~[janino-3.0.16.jar:?] at org.codehaus.janino.UnitCompiler.access$6300(UnitCompiler.java:226) ~[janino-3.0.16.jar:?] at org.codehaus.janino.UnitCompiler$14.visitIntegerLiteral(UnitCompiler.java:4016) ~[janino-3.0.16.jar:?] at org.codehaus.janino.UnitCompiler$14.visitIntegerLiteral(UnitCompiler.java:3986) ~[janino-3.0.16.jar:?] ... at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) ~[guava-14.0.1.jar:?] 
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) ~[guava-14.0.1.jar:?] ... 37 more NULL Time taken: 6.132 seconds, Fetched 1 row(s) ```` After the compilation error, _the query returns a result as if `a` was a boolean `false`_. With this PR, the query will instead produce ``` org.apache.spark.sql.AnalysisException: FILTER expression is not of type boolean. It cannot be used in an aggregate function; line 1 pos 7 ``` 3) Aggregate expression in filter expression ``` select max(b) filter (where max(a) > 1) from (select 1 a, '2' b); ``` The above query should produce an analysis error, but instead causes a projection compilation error or whole-stage codegen error (depending on the datatype of the expression being aggregated): ``` org.apache.spark.SparkUnsupportedOperationException: Cannot generate code for expression: max(1) at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:84) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:347) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:346) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] ``` With this PR, the query will instead produce ``` org.apache.spark.sql.AnalysisException: FILTER expression contains aggregate. It cannot be used in an aggregate function; line 1 pos 7 ``` ### Does this PR introduce _any_ user-facing change? No, except in error conditions. ### How was this patch tested? New unit tests. Closes #36072 from bersprockets/aggregate_in_aggregate_filter_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 49d2f3c2458863eefd63c8ce38064757874ab4ad) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 April 2022, 03:13:06 UTC
d63e42d [SPARK-38955][SQL] Disable lineSep option in 'from_csv' and 'schema_of_csv' ### What changes were proposed in this pull request? This PR proposes to disable `lineSep` option in `from_csv` and `schema_of_csv` expression by setting Noncharacters according to [unicode specification](https://www.unicode.org/charts/PDF/UFFF0.pdf), `\UFFFF`. This can be used for the internal purpose in a program according to the specification. The Univocity parser does not allow to omit the line separator (from my code reading) so this approach was proposed. This specific code path is not affected by our `encoding` or `charset` option because Unicovity parser parses them as unicodes as are internally. ### Why are the changes needed? Currently, this option is weirdly effective. See the example of `from_csv` as below: ```scala import org.apache.spark.sql.types._ import org.apache.spark.sql.functions._ Seq[String]("1,\n2,3,4,5").toDF.select( col("value"), from_csv( col("value"), StructType(Seq(StructField("a", LongType), StructField("b", StringType) )), Map[String,String]())).show() ``` ``` +-----------+---------------+ | value|from_csv(value)| +-----------+---------------+ |1,\n2,3,4,5| {1, null}| +-----------+---------------+ ``` `{1, null}` has to be `{1, \n2}`. The CSV expressions cannot easily make it supported because this option is plan-wise option that can change the number of returned rows; however, the expressions are designed to emit one row only whereas this option is easily effective in the scan plan with CSV data source. Therefore, we should disable this option. ### Does this PR introduce _any_ user-facing change? Yes, now the `lineSep` can be located in the output from `from_csv` and `schema_of_csv`. ### How was this patch tested? Manually tested, and unit test was added. Closes #36294 from HyukjinKwon/SPARK-38955. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f3cc2814d4bc585dad92c9eca9a593d1617d27e9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 April 2022, 02:43:54 UTC
ec771f1 [SPARK-38581][PYTHON][DOCS][3.3] List of supported pandas APIs for pandas-on-Spark docs ### What changes were proposed in this pull request? This PR proposes to add a new page named "Supported pandas APIs" for the pandas-on-Spark documents. This is cherry-picked from https://github.com/apache/spark/commit/f43c68cb38cb0556f2058be6d3a016083ef5152d to `branch-3.3`. ### Why are the changes needed? To let users more easily find out whether a specific pandas API and its parameters are supported or not from a single document page. ### Does this PR introduce _any_ user-facing change? Yes, the "Supported pandas APIs" page is added to the user guide for pandas API on Spark documents. ### How was this patch tested? Manually checked the links in the documents & the existing doc build should pass. Closes #36308 from itholic/SPARK-38581-3.3. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 April 2022, 02:30:33 UTC
176dc61 [MINOR][DOCS] Also remove Google Analytics from Spark release docs, per ASF policy ### What changes were proposed in this pull request? Remove Google Analytics from Spark release docs. See also https://github.com/apache/spark-website/pull/384 ### Why are the changes needed? New ASF privacy policy requirement ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #36310 from srowen/PrivacyPolicy. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7a58670e2e68ee4950cf62c2be236e00eb8fc44b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 April 2022, 02:26:34 UTC
17552d5 [SPARK-38950][SQL][FOLLOWUP] Fix java doc ### What changes were proposed in this pull request? `{link #pushFilters(Predicate[])}` -> `{link #pushFilters(Seq[Expression])}` ### Why are the changes needed? Fixed java doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Closes #36302 from huaxingao/fix. Authored-by: huaxingao <huaxin_gao@apple.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit 0b543e7480b6e414b23e02e6c805a33abc535c89) Signed-off-by: huaxingao <huaxin_gao@apple.com> 21 April 2022, 18:52:18 UTC
24d588c [SPARK-38950][SQL] Return Array of Predicate for SupportsPushDownCatalystFilters.pushedFilters ### What changes were proposed in this pull request? in `SupportsPushDownCatalystFilters`, change ``` def pushedFilters: Array[Filter] ``` to ``` def pushedFilters: Array[Predicate] ``` ### Why are the changes needed? use v2Filter in DS V2 ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? existing tests Closes #36264 from huaxingao/V2Filter. Authored-by: huaxingao <huaxin_gao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 7221d754075656ce41edacb0fccc1cf89a62fc77) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 April 2022, 15:16:40 UTC
197c975 [SPARK-38972][SQL] Support <param> in error-class messages Use symbolic names for parameters in error messages which are substituted with %s before formatting the string. Increase readability of error message docs (TBD) No SQL Project. Closes #36289 from srielau/symbolic-error-arg-names. Authored-by: Serge Rielau <serge.rielau@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 43e610333fb78834a09cd82f3da32bad262564f3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 April 2022, 13:55:10 UTC
bb1a523 [SPARK-38913][SQL][3.3] Output identifiers in error messages in SQL style ### What changes were proposed in this pull request? In the PR, I propose to use backticks to wrap SQL identifiers in error messages. I added new util functions `toSQLId()` to the trait `QueryErrorsBase`, and applied it in `Query.*Errors` (also modified tests in `Query.*ErrorsSuite`). For example: Before: ```sql Invalid SQL syntax: The definition of window win is repetitive. ``` After: ``` Invalid SQL syntax: The definition of window `win` is repetitive. ``` ### Why are the changes needed? To improve user experience with Spark SQL. The changes highlight SQL identifiers in error massages and make them more visible for users. ### Does this PR introduce _any_ user-facing change? No since error classes haven't been released yet. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *QueryParsingErrorsSuite" $ build/sbt "test:testOnly *QueryCompilationErrorsSuite" $ build/sbt "test:testOnly *QueryCompilationErrorsDSv2Suite" $ build/sbt "test:testOnly *QueryExecutionErrorsSuite" $ build/sbt "testOnly *PlanParserSuite" $ build/sbt "testOnly *DDLParserSuite" $ build/sbt -Phive-2.3 "testOnly *HiveSQLInsertTestSuite" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z window.sql" $ build/sbt "testOnly *DSV2SQLInsertTestSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 2ff6914e6bac053231825c083fd508726a11a349) Closes #36288 from MaxGekk/error-class-toSQLId-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 21 April 2022, 08:50:01 UTC
76fe1bf [SPARK-38916][CORE] Tasks not killed caused by race conditions between killTask() and launchTask() ### What changes were proposed in this pull request? This PR fixes the race conditions between the killTask() call and the launchTask() call that sometimes cause tasks not to be killed properly. If killTask() probes the map of pendingTasksLaunches before launchTask() has had a chance to put the corresponding task into that map, the kill flag will be lost and the subsequent launchTask() call will just proceed and run that task without knowing it should have been killed instead. The fix adds a kill mark during the killTask() call so that a subsequent launchTask() can pick up the kill mark and call kill() on the TaskLauncher. If killTask() happens after the task has already finished, making the kill mark useless, the stale mark is cleaned up by a background thread. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UTs. Closes #36238 from maryannxue/spark-38916. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit bb5092b9af60afdceeccb239d14be660f77ae0ea) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 April 2022, 08:31:17 UTC
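A heavily simplified, hypothetical sketch of the kill-mark idea described above; this is not Spark's actual executor code, and the class and method names are invented for illustration:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical, simplified bookkeeping for the kill-mark pattern.
final case class KillMark(interruptThread: Boolean, reason: String)

class TaskRunnerRegistry {
  private val killMarks = new ConcurrentHashMap[Long, KillMark]()

  // killTask(): leave a mark behind in case the task has not been launched yet.
  def killTask(taskId: Long, interruptThread: Boolean, reason: String): Unit = {
    killMarks.put(taskId, KillMark(interruptThread, reason))
    // ... also kill the task directly if it is already running ...
  }

  // launchTask(): check for a pending kill mark before running the task.
  def launchTask(taskId: Long): Unit = {
    val mark = killMarks.remove(taskId)
    if (mark != null) {
      // The kill arrived before the launch: skip running the task entirely.
      println(s"Task $taskId killed before launch: ${mark.reason}")
    } else {
      // ... launch and run the task as usual ...
    }
  }

  // Stale marks (killTask() arriving after the task already finished) would be
  // cleaned up periodically by a background thread, as the description notes.
}
```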
4ad0b2c [SPARK-38957][SQL] Use multipartIdentifier for parsing table-valued functions ### What changes were proposed in this pull request? This PR uses multipart identifiers when parsing table-valued functions. ### Why are the changes needed? To make table-valued function error messages consistent for 2-part names and n-part names. For example, before this PR: ``` select * from a.b.c(1) org.apache.spark.sql.catalyst.parser.ParseException: Invalid SQL syntax: Unsupported function name `a`.`b`.`c` (line 1, pos 14) == SQL == select * from a.b.c(1) --------------^^^ ``` After this PR: ``` Invalid SQL syntax: table valued function cannot specify database name (line 1, pos 14) == SQL == SELECT * FROM a.b.c(1) --------------^^^ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #36272 from allisonwang-db/spark-38957-parse-table-func. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 8fe5bca1773521d967b82a920c6881f081155bc3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 April 2022, 08:21:00 UTC
c2fa3b8 [SPARK-38432][SQL][FOLLOWUP] Fix problems in And/Or/Not to V2 Filter ### What changes were proposed in this pull request? Instead of having ``` override def toV2: Predicate = new Predicate("AND", Seq(left, right).map(_.toV2).toArray) ``` I think we should construct a V2 `And` directly. ``` override def toV2: Predicate = new org.apache.spark.sql.connector.expressions.filter.And(left.toV2, right.toV2) ``` same for `Or` and `Not`. ### Why are the changes needed? bug fixing ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New tests Closes #36290 from huaxingao/toV1. Authored-by: huaxingao <huaxin_gao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 36fc8bd185da99b64954ca0dd393b452fb788226) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 April 2022, 08:17:04 UTC
5707811 [SPARK-38936][SQL] Script transform feed thread should have name ### What changes were proposed in this pull request? Re-add the thread name (`Thread-ScriptTransformation-Feed`). ### Why are the changes needed? The feed thread name was lost after the [SPARK-32105](https://issues.apache.org/jira/browse/SPARK-32105) refactoring. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #36245 from cxzl25/SPARK-38936. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4dc12eb54544a12ff7ddf078ca8bcec9471212c3) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 April 2022, 02:34:46 UTC
2e102b8 [SPARK-38949][SQL][3.3] Wrap SQL statements by double quotes in error messages ### What changes were proposed in this pull request? In the PR, I propose to wrap any SQL statement in error messages in double quotes "", and apply the new implementation of `QueryErrorsBase.toSQLStmt()` to all exceptions in `Query.*Errors` w/ error classes. Also this PR modifies all affected tests, see the list in the section "How was this patch tested?". ### Why are the changes needed? To improve user experience with Spark SQL by highlighting SQL statements in error messages and making them more visible to users. ### Does this PR introduce _any_ user-facing change? Yes. The changes might affect error messages that are visible to users. Before: ```sql The operation DESC PARTITION is not allowed ``` After: ```sql The operation "DESC PARTITION" is not allowed ``` ### How was this patch tested? By running affected test suites: ``` $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite" $ build/sbt "sql/testOnly *QueryParsingErrorsSuite" $ build/sbt "sql/testOnly *QueryCompilationErrorsSuite" $ build/sbt "test:testOnly *QueryCompilationErrorsDSv2Suite" $ build/sbt "test:testOnly *ExtractPythonUDFFromJoinConditionSuite" $ build/sbt "testOnly *PlanParserSuite" $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z transform.sql" $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z join-lateral.sql" $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z describe.sql" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 5aba2b38beae6e1baf6f0c6f9eb3b65cf607fe77) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36286 from MaxGekk/error-class-apply-toSQLStmt-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 20 April 2022, 19:52:04 UTC
44e90f3 [SPARK-38967][SQL] Turn "spark.sql.ansi.strictIndexOperator" into an internal configuration ### What changes were proposed in this pull request? Currently, most of the ANSI error messages show the hint "If necessary set spark.sql.ansi.enabled to false to bypass this error." There is only one special case: the "Map key not exist" and "array index out of bound" errors from the `[]` operator, which show the config spark.sql.ansi.strictIndexOperator instead. This one special case can confuse users. To make it simple: - Turn "spark.sql.ansi.strictIndexOperator" into an internal configuration - Show the configuration `spark.sql.ansi.enabled` in error messages instead - If it is a "map key not exist" error, show the hint for using `try_element_at`; otherwise, don't show it. For arrays, the `[]` operator uses a 0-based index while `try_element_at` uses a 1-based index. ### Why are the changes needed? Make the hints in ANSI runtime error messages simple and consistent. ### Does this PR introduce _any_ user-facing change? No, the new configuration is not released yet. ### How was this patch tested? Existing UT Closes #36282 from gengliangwang/updateErrorMsg. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 276bdbafe83a5c0b8425a20eb8101a630be8b752) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 April 2022, 13:42:15 UTC
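A quick spark-shell illustration of the hint mentioned above: `try_element_at` returns NULL on a missing map key, and for arrays it uses a 1-based index while `[]` is 0-based. This is a sketch run against literal values only:

```scala
// With ANSI mode on, the [] operator raises an error on a missing map key.
spark.conf.set("spark.sql.ansi.enabled", "true")

// Map access: try_element_at returns NULL instead of failing.
spark.sql("SELECT try_element_at(map('a', 1), 'b')").show()      // NULL
// spark.sql("SELECT map('a', 1)['b']").show()                   // would raise an error

// Array access: [] is 0-based, try_element_at is 1-based.
spark.sql("SELECT array(10, 20, 30)[0]").show()                  // 10
spark.sql("SELECT try_element_at(array(10, 20, 30), 1)").show()  // 10
```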
1e7cdda [SPARK-34079][SQL] Merge non-correlated scalar subqueries ### What changes were proposed in this pull request? This PR adds a new optimizer rule `MergeScalarSubqueries` to merge multiple non-correlated `ScalarSubquery`s to compute multiple scalar values once. E.g. the following query: ``` SELECT (SELECT avg(a) FROM t), (SELECT sum(b) FROM t) ``` is optimized from: ``` == Optimized Logical Plan == Project [scalar-subquery#242 [] AS scalarsubquery()#253, scalar-subquery#243 [] AS scalarsubquery()#254L] : :- Aggregate [avg(a#244) AS avg(a)#247] : : +- Project [a#244] : : +- Relation default.t[a#244,b#245] parquet : +- Aggregate [sum(a#251) AS sum(a)#250L] : +- Project [a#251] : +- Relation default.t[a#251,b#252] parquet +- OneRowRelation ``` to: ``` == Optimized Logical Plan == Project [scalar-subquery#242 [].avg(a) AS scalarsubquery()#253, scalar-subquery#243 [].sum(a) AS scalarsubquery()#254L] : :- Project [named_struct(avg(a), avg(a)#247, sum(a), sum(a)#250L) AS mergedValue#260] : : +- Aggregate [avg(a#244) AS avg(a)#247, sum(a#244) AS sum(a)#250L] : : +- Project [a#244] : : +- Relation default.t[a#244,b#245] parquet : +- Project [named_struct(avg(a), avg(a)#247, sum(a), sum(a)#250L) AS mergedValue#260] : +- Aggregate [avg(a#244) AS avg(a)#247, sum(a#244) AS sum(a)#250L] : +- Project [a#244] : +- Relation default.t[a#244,b#245] parquet +- OneRowRelation ``` and in the physical plan subqueries are reused: ``` == Physical Plan == AdaptiveSparkPlan isFinalPlan=true +- == Final Plan == *(1) Project [Subquery subquery#242, [id=#113].avg(a) AS scalarsubquery()#253, ReusedSubquery Subquery subquery#242, [id=#113].sum(a) AS scalarsubquery()#254L] : :- Subquery subquery#242, [id=#113] : : +- AdaptiveSparkPlan isFinalPlan=true +- == Final Plan == *(2) Project [named_struct(avg(a), avg(a)#247, sum(a), sum(a)#250L) AS mergedValue#260] +- *(2) HashAggregate(keys=[], functions=[avg(a#244), sum(a#244)], output=[avg(a)#247, sum(a)#250L]) +- ShuffleQueryStage 0 +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#158] +- *(1) HashAggregate(keys=[], functions=[partial_avg(a#244), partial_sum(a#244)], output=[sum#262, count#263L, sum#264L]) +- *(1) ColumnarToRow +- FileScan parquet default.t[a#244] Batched: true, DataFilters: [], Format: Parquet, Location: ..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int> +- == Initial Plan == Project [named_struct(avg(a), avg(a)#247, sum(a), sum(a)#250L) AS mergedValue#260] +- HashAggregate(keys=[], functions=[avg(a#244), sum(a#244)], output=[avg(a)#247, sum(a)#250L]) +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#110] +- HashAggregate(keys=[], functions=[partial_avg(a#244), partial_sum(a#244)], output=[sum#262, count#263L, sum#264L]) +- FileScan parquet default.t[a#244] Batched: true, DataFilters: [], Format: Parquet, Location: ..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int> : +- ReusedSubquery Subquery subquery#242, [id=#113] +- *(1) Scan OneRowRelation[] +- == Initial Plan == ... ``` Please note that the above simple example could be easily optimized into a common select expression without reuse node, but this PR can handle more complex queries as well. ### Why are the changes needed? Performance improvement. 
``` [info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] q9 - MergeScalarSubqueries off 50798 52521 1423 0.0 Infinity 1.0X [info] q9 - MergeScalarSubqueries on 19484 19675 226 0.0 Infinity 2.6X [info] TPCDS Snappy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] q9b - MergeScalarSubqueries off 15430 17803 NaN 0.0 Infinity 1.0X [info] q9b - MergeScalarSubqueries on 3862 4002 196 0.0 Infinity 4.0X ``` Please find `q9b` in the description of SPARK-34079. It is a variant of [q9.sql](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q9.sql) using CTE. The performance improvement in case of `q9` comes from merging 15 subqueries into 5 and in case of `q9b` it comes from merging 5 subqueries into 1. ### Does this PR introduce _any_ user-facing change? No. But this optimization can be disabled with `spark.sql.optimizer.excludedRules` config. ### How was this patch tested? Existing and new UTs. Closes #32298 from peter-toth/SPARK-34079-multi-column-scalar-subquery. Lead-authored-by: Peter Toth <peter.toth@gmail.com> Co-authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e00b81ee9b37067ce8e8242907b26d3ae200f401) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 April 2022, 13:37:24 UTC
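As the last note above mentions, the new rule can be excluded if needed. A minimal spark-shell sketch, assuming the rule's fully qualified name follows the usual catalyst optimizer package and that the table `t` from the description exists:

```scala
// Exclude the merging rule from the optimizer for a given session if needed.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.MergeScalarSubqueries")

// The query from the description; with the rule enabled (the default), both
// scalar subqueries would be served by a single, shared aggregate plan.
spark.sql("SELECT (SELECT avg(a) FROM t), (SELECT sum(b) FROM t)").explain()
```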
5c5a68c [SPARK-38219][SPARK-37691][3.3] Support ANSI Aggregation Function: percentile_cont and percentile_disc ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/35531 and https://github.com/apache/spark/pull/35041 to branch-3.3. ### Why are the changes needed? To include `percentile_cont` and `percentile_disc` in the Spark 3.3 release. ### Does this PR introduce _any_ user-facing change? No. New feature. ### How was this patch tested? New tests. Closes #36277 from beliefer/SPARK-38219_SPARK-37691_backport_3.3. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 April 2022, 13:18:11 UTC
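A small usage sketch of the two aggregates in spark-shell; the `employees` table and `salary` column are made up for illustration:

```scala
// percentile_cont interpolates between values; percentile_disc picks an actual value.
spark.sql("""
  SELECT
    percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median_cont,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY salary) AS median_disc
  FROM employees
""").show()
```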
9d0650a [SPARK-38929][SQL][3.3] Improve error messages for cast failures in ANSI ### What changes were proposed in this pull request? Improve the error messages for cast failures in ANSI. As mentioned in https://issues.apache.org/jira/browse/SPARK-38929, this PR targets two cast-to types: numeric types and date types. * For numeric(`int`, `smallint`, `double`, `float`, `decimal` ..) types, it embeds the cast-to types in the error message. For example, ``` Invalid input value for type INT: '1.0'. To return NULL instead, use 'try_cast'. If necessary set %s to false to bypass this error. ``` It uses the `toSQLType` and `toSQLValue` to wrap the corresponding types and literals. * For date types, it does similarly as above. For example, ``` Invalid input value for type TIMESTAMP: 'a'. To return NULL instead, use 'try_cast'. If necessary set spark.sql.ansi.enabled to false to bypass this error. ``` ### Why are the changes needed? To improve the error message in general. ### Does this PR introduce _any_ user-facing change? It changes the error messages. ### How was this patch tested? The related unit tests are updated. Authored-by: Xinyi Yu <xinyi.yudatabricks.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit f76b3e766f79b4c2d4f1ecffaad25aeb962336b7) Closes #36275 from anchovYu/ansi-error-improve-3.3. Authored-by: Xinyi Yu <xinyi.yu@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 20 April 2022, 10:08:38 UTC
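A spark-shell sketch of the `try_cast` alternative that the message points to; the input values are illustrative:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")

// Under ANSI mode this cast fails with the improved error message above:
// spark.sql("SELECT CAST('1.0' AS INT)").show()

// try_cast returns NULL for the malformed input instead of failing.
spark.sql("SELECT try_cast('1.0' AS INT)").show()     // NULL
spark.sql("SELECT try_cast('a' AS TIMESTAMP)").show() // NULL
```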
83a365e [SPARK-38922][CORE] TaskLocation.apply throw NullPointerException ### What changes were proposed in this pull request? TaskLocation.apply without a NULL check may throw an NPE and fail job scheduling: ``` Caused by: java.lang.NullPointerException at scala.collection.immutable.StringLike$class.stripPrefix(StringLike.scala:155) at scala.collection.immutable.StringOps.stripPrefix(StringOps.scala:29) at org.apache.spark.scheduler.TaskLocation$.apply(TaskLocation.scala:71) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal ``` For instance, `org.apache.spark.rdd.HadoopRDD#convertSplitLocationInfo` might generate unexpected `Some(null)` elements, which should be replaced by `Option.apply`. ### Why are the changes needed? Fix the NPE. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New tests Closes #36222 from yaooqinn/SPARK-38922. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 33e07f3cd926105c6d28986eb6218f237505549e) Signed-off-by: Kent Yao <yao@apache.org> 20 April 2022, 06:38:44 UTC
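The root cause is a classic Scala pitfall; a minimal illustration of the `Some(null)` versus `Option.apply` difference referred to above (the prefix string is illustrative):

```scala
val loc: String = null

// Option.apply collapses null to None, so downstream transformations are skipped safely.
Option(loc).map(_.stripPrefix("hdfs_cache_"))   // None

// Some(null) keeps the null inside, so the same transformation blows up:
// Some(loc).map(_.stripPrefix("hdfs_cache_"))  // throws NullPointerException
```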
27c75ea [SPARK-37575][SQL][FOLLOWUP] Update the migration guide for added legacy flag for the breaking change of write null value in csv to unquoted empty string ### What changes were proposed in this pull request? This is a follow-up of updating the migration guide for https://github.com/apache/spark/pull/36110 which adds a legacy flag to restore the pre-change behavior. It also fixes a typo in the previous flag description. ### Why are the changes needed? The flag needs to be documented. ### Does this PR introduce _any_ user-facing change? It changes the migration doc for users. ### How was this patch tested? No tests Closes #36268 from anchovYu/flags-null-to-csv-migration-guide. Authored-by: Xinyi Yu <xinyi.yu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a67acbaa29d1ab9071910cac09323c2544d65303) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 April 2022, 02:48:15 UTC
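To make the behavior difference concrete, a hypothetical spark-shell sketch; the column names and output path are made up, and the exact legacy flag name is the one documented in the migration guide rather than repeated here:

```scala
import spark.implicits._

// A row whose `name` column is null.
val df = Seq((1, null.asInstanceOf[String])).toDF("id", "name")

// Spark 3.3 default: the null is written as an unquoted empty field    -> 1,
// Legacy (pre-change) behavior: it is written as a quoted empty string -> 1,""
df.write.option("header", "false").csv("/tmp/csv-null-demo")
```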
2b3df38 [SPARK-37613][SQL][FOLLOWUP] Supplement docs for regr_count ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/34880 supported the ANSI aggregate function regr_count, but the docs for regr_count are not detailed enough. ### Why are the changes needed? Make the docs of regr_count more detailed. ### Does this PR introduce _any_ user-facing change? No. New feature. ### How was this patch tested? N/A Closes #36258 from beliefer/SPARK-37613_followup. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1b106ea32d567dd32ac697ed0d6cfd40ea7e6e08) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 April 2022, 02:03:07 UTC
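For reference, a hedged usage sketch of `regr_count`, which counts the rows where both arguments are non-null; the `measurements` table and its columns are made up:

```scala
// Number of (y, x) pairs in which neither value is null.
spark.sql("SELECT regr_count(y, x) AS n_pairs FROM measurements").show()
```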
fb58c3e [SPARK-38828][PYTHON] Remove TimestampNTZ type Python support in Spark 3.3 ### What changes were proposed in this pull request? This PR proposes to remove `TimestampNTZ` type Python support in Spark 3.3 from the documentation and the `pyspark.sql.types` module. ### Why are the changes needed? The purpose of this PR is just to hide the `TimestampNTZ` type from end-users, because the `TimestampNTZ` project is not finished yet: - Lacks Hive metastore support - Lacks JDBC support - Need to spend time scanning the codebase to find out any missing support The current code usages of TimestampType are larger than those of TimestampNTZType. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The existing tests should cover this. Closes #36255 from itholic/SPARK-38828. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 581000de24377ca373df7fa94b214baa7e9b0462) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 April 2022, 01:54:08 UTC
8811e8c [SPARK-38931][SS] Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint ### What changes were proposed in this pull request? Create root dfs directory for RocksDBFileManager with unknown number of keys on 1st checkpoint. ### Why are the changes needed? If this fix is not introduced, we might meet exception below: ~~~java File /private/var/folders/rk/wyr101_562ngn8lp7tbqt7_00000gp/T/spark-ce4a0607-b1d8-43b8-becd-638c6b030019/state/1/1 does not exist java.io.FileNotFoundException: File /private/var/folders/rk/wyr101_562ngn8lp7tbqt7_00000gp/T/spark-ce4a0607-b1d8-43b8-becd-638c6b030019/state/1/1 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:93) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:353) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:400) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:626) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:701) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:697) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:703) at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createTempFile(CheckpointFileManager.scala:327) at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:140) at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:143) at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createAtomic(CheckpointFileManager.scala:333) at org.apache.spark.sql.execution.streaming.state.RocksDBFileManager.zipToDfsFile(RocksDBFileManager.scala:438) at org.apache.spark.sql.execution.streaming.state.RocksDBFileManager.saveCheckpointToDfs(RocksDBFileManager.scala:174) at org.apache.spark.sql.execution.streaming.state.RocksDBSuite.saveCheckpointFiles(RocksDBSuite.scala:566) at org.apache.spark.sql.execution.streaming.state.RocksDBSuite.$anonfun$new$35(RocksDBSuite.scala:179) ........ ~~~ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested via RocksDBSuite. Closes #36242 from Myasuka/SPARK-38931. Authored-by: Yun Tang <myasuka@live.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit abb1df9d190e35a17b693f2b013b092af4f2528a) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 19 April 2022, 11:31:17 UTC
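For context, the RocksDB state store whose file manager is fixed here is opted into via the provider-class setting; a minimal spark-shell sketch, assuming the standard provider class name:

```scala
// Use the RocksDB-backed state store provider for streaming stateful operators.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
```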