https://github.com/apache/spark

6c4e07d [SPARK-39255][SQL][3.3] Improve error messages ### What changes were proposed in this pull request? In the PR, I propose to improve errors of the following error classes: 1. NON_PARTITION_COLUMN - `a non-partition column name` -> `the non-partition column` 2. UNSUPPORTED_SAVE_MODE - `a not existent path` -> `a non existent path`. 3. INVALID_FIELD_NAME. Quote ids to follow the rules https://github.com/apache/spark/pull/36621. 4. FAILED_SET_ORIGINAL_PERMISSION_BACK. It is renamed to RESET_PERMISSION_TO_ORIGINAL. 5. NON_LITERAL_PIVOT_VALUES - Wrap error's expression by double quotes. The PR adds new helper method `toSQLExpr()` for that. 6. CAST_INVALID_INPUT - Add the recommendation: `... Correct the syntax for the value before casting it, or change the type to one appropriate for the value.` This is a backport of https://github.com/apache/spark/pull/36635. ### Why are the changes needed? To improve user experience with Spark SQL by making error message more clear. ### Does this PR introduce _any_ user-facing change? Yes, it changes user-facing error messages. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite" $ build/sbt "sql/testOnly *QueryCompilationErrorsDSv2Suite" $ build/sbt "sql/testOnly *QueryCompilationErrorsSuite" $ build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite" $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite" $ build/sbt "sql/testOnly *QueryParsingErrorsSuite*" ``` Lead-authored-by: Max Gekk <max.gekkgmail.com> Co-authored-by: Maxim Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 625afb4e1aefda59191d79b31f8c94941aedde1e) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36655 from MaxGekk/error-class-improve-msg-3-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 25 May 2022, 16:36:55 UTC
37a2416 [SPARK-39252][PYSPARK][TESTS] Remove flaky test_df_is_empty ### What changes were proposed in this pull request? ### Why are the changes needed? This PR removes flaky `test_df_is_empty` as reported in https://issues.apache.org/jira/browse/SPARK-39252. I will open a follow-up PR to reintroduce the test and fix the flakiness (or see if it was a regression). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #36656 from sadikovi/SPARK-39252. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9823bb385cd6dca7c4fb5a6315721420ad42f80a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 May 2022, 02:40:02 UTC
a33d697 [SPARK-39273][PS][TESTS] Make PandasOnSparkTestCase inherit ReusedSQLTestCase ### What changes were proposed in this pull request? This PR proposes to make `PandasOnSparkTestCase` inherit `ReusedSQLTestCase`. ### Why are the changes needed? We don't need this: ```python classmethod def tearDownClass(cls): # We don't stop Spark session to reuse across all tests. # The Spark session will be started and stopped at PyTest session level. # Please see pyspark/pandas/conftest.py. pass ``` anymore in Apache Spark. This has existed to speed up the tests when the codes are in Koalas repository where the tests run sequentially in single process. In Apache Spark, we run in multiple processes, and we don't need this anymore. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Existing CI should test it out. Closes #36652 from HyukjinKwon/SPARK-39273. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a6dd6076d708713d11585bf7f3401d522ea48822) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 May 2022, 00:56:42 UTC
d491e39 Preparing development version 3.3.1-SNAPSHOT 24 May 2022, 10:15:35 UTC
a725927 Preparing Spark release v3.3.0-rc3 24 May 2022, 10:15:29 UTC
459c4b0 [SPARK-39144][SQL] Nested subquery expressions deduplicate relations should be done bottom up ### What changes were proposed in this pull request? When we have nested subquery expressions, there is a chance that deduplicate relations could replace an attribute with a wrong one. This is because the attribute replacement is done top down rather than bottom up. This could happen if the subplan gets deduplicate relations first (thus two copies of the same relation with different attribute IDs), then a more complex plan built on top of the subplan (e.g. a UNION of queries with nested subquery expressions) can trigger this wrong attribute replacement error. For a concrete example, please see the added unit test. ### Why are the changes needed? This is a bug that we can fix. Without this PR, we could hit an error where an outer attribute reference does not exist in the outer relation in certain scenarios. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #36503 from amaliujia/testnestedsubqueryexpression. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d9fd36eb76fcfec95763cc4dc594eb7856b0fad2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 May 2022, 05:06:21 UTC
505248d [SPARK-39258][TESTS] Fix `Hide credentials in show create table` ### What changes were proposed in this pull request? [SPARK-35378-FOLLOWUP](https://github.com/apache/spark/pull/36632) changes the return value of `CommandResultExec.executeCollect()` from `InternalRow` to `UnsafeRow`, this change causes the result of `r.tostring` in the following code: https://github.com/apache/spark/blob/de73753bb2e5fd947f237e731ff05aa9f2711677/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L1143-L1148 change from ``` [CREATE TABLE tab1 ( NAME STRING, THEID INT) USING org.apache.spark.sql.jdbc OPTIONS ( 'dbtable' = 'TEST.PEOPLE', 'password' = '*********(redacted)', 'url' = '*********(redacted)', 'user' = 'testUser') ] ``` to ``` [0,10000000d5,5420455441455243,62617420454c4241,414e20200a282031,4e4952545320454d,45485420200a2c47,a29544e49204449,726f20474e495355,6568636170612e67,732e6b726170732e,a6362646a2e6c71,20534e4f4954504f,7462642720200a28,203d2027656c6261,45502e5453455427,200a2c27454c504f,6f77737361702720,2a27203d20276472,2a2a2a2a2a2a2a2a,6574636164657228,2720200a2c272964,27203d20276c7275,2a2a2a2a2a2a2a2a,746361646572282a,20200a2c27296465,3d20277265737527,7355747365742720,a29277265] ``` and the UT `JDBCSuite$Hide credentials in show create table` failed in master branch. This pr is change to use `executeCollectPublic()` instead of `executeCollect()` to fix this UT. ### Why are the changes needed? Fix UT failed in mater branch after [SPARK-35378-FOLLOWUP](https://github.com/apache/spark/pull/36632) ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? - GitHub Action pass - Manual test Run `mvn clean install -DskipTests -pl sql/core -am -Dtest=none -DwildcardSuites=org.apache.spark.sql.jdbc.JDBCSuite` **Before** ``` - Hide credentials in show create table *** FAILED *** "[0,10000000d5,5420455441455243,62617420454c4241,414e20200a282031,4e4952545320454d,45485420200a2c47,a29544e49204449,726f20474e495355,6568636170612e67,732e6b726170732e,a6362646a2e6c71,20534e4f4954504f,7462642720200a28,203d2027656c6261,45502e5453455427,200a2c27454c504f,6f77737361702720,2a27203d20276472,2a2a2a2a2a2a2a2a,6574636164657228,2720200a2c272964,27203d20276c7275,2a2a2a2a2a2a2a2a,746361646572282a,20200a2c27296465,3d20277265737527,7355747365742720,a29277265]" did not contain "TEST.PEOPLE" (JDBCSuite.scala:1146) ``` **After** ``` Run completed in 24 seconds, 868 milliseconds. Total number of tests run: 93 Suites: completed 2, aborted 0 Tests: succeeded 93, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #36637 from LuciferYang/SPARK-39258. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6eb15d12ae6bd77412dbfbf46eb8dbeec1eab466) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2022, 21:28:19 UTC
2a31bf5 [MINOR][ML][DOCS] Fix sql data types link in the ml-pipeline page ### What changes were proposed in this pull request? <img width="939" alt="image" src="https://user-images.githubusercontent.com/8326978/169767919-6c48554c-87ff-4d40-a47d-ec4da0c993f7.png"> [Spark SQL datatype reference](https://spark.apache.org/docs/latest/sql-reference.html#data-types) - `https://spark.apache.org/docs/latest/sql-reference.html#data-types` is invalid and it shall be [Spark SQL datatype reference](https://spark.apache.org/docs/latest/sql-ref-datatypes.html) - `https://spark.apache.org/docs/latest/sql-ref-datatypes.html` https://spark.apache.org/docs/latest/ml-pipeline.html#dataframe ### Why are the changes needed? doc fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? `bundle exec jekyll serve` Closes #36633 from yaooqinn/minor. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit de73753bb2e5fd947f237e731ff05aa9f2711677) Signed-off-by: huaxingao <huaxin_gao@apple.com> 23 May 2022, 14:46:24 UTC
0f13606 [SPARK-35378][SQL][FOLLOW-UP] Fix incorrect return type in CommandResultExec.executeCollect() ### What changes were proposed in this pull request? This PR is a follow-up for https://github.com/apache/spark/pull/32513 and fixes an issue introduced by that patch. CommandResultExec is supposed to return `UnsafeRow` records in all of the `executeXYZ` methods but `executeCollect` was left out which causes issues like this one: ``` Error in SQL statement: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow ``` We need to return `unsafeRows` instead of `rows` in `executeCollect` similar to other methods in the class. ### Why are the changes needed? Fixes a bug in CommandResultExec. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a unit test to check the return type of all commands. Closes #36632 from sadikovi/fix-command-exec. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a0decfc7db68c464e3ba2c2fb0b79a8b0c464684) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 May 2022, 08:58:56 UTC
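A fragment-level sketch of the fix as described in the entry above (Spark-internal code, shown only to make the one-line change concrete; `rows` and `unsafeRows` are the members named in the description):

```scala
// Inside CommandResultExec: return the UnsafeRow-backed array, as the other executeXYZ
// methods already do, instead of the GenericInternalRow-backed `rows`.
override def executeCollect(): Array[InternalRow] = unsafeRows
```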
047c108 [SPARK-39250][BUILD] Upgrade Jackson to 2.13.3 ### What changes were proposed in this pull request? This PR aims to upgrade Jackson to 2.13.3. ### Why are the changes needed? Although Spark is not affected, Jackson 2.13.0~2.13.2 has the following regression which affects the user apps. - https://github.com/FasterXML/jackson-databind/issues/3446 Here is a full release note. - https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.13.3 ### Does this PR introduce _any_ user-facing change? No. The previous version is not released yet. ### How was this patch tested? Pass the CIs. Closes #36627 from dongjoon-hyun/SPARK-39250. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 73438c048fc646f944415ba2e99cb08cc57d856b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 May 2022, 21:13:30 UTC
fa400c6 [SPARK-39243][SQL][DOCS] Rules of quoting elements in error messages ### What changes were proposed in this pull request? In the PR, I propose to describe the rules of quoting elements in error messages introduced by the PRs: - https://github.com/apache/spark/pull/36210 - https://github.com/apache/spark/pull/36233 - https://github.com/apache/spark/pull/36259 - https://github.com/apache/spark/pull/36324 - https://github.com/apache/spark/pull/36335 - https://github.com/apache/spark/pull/36359 - https://github.com/apache/spark/pull/36579 ### Why are the changes needed? To improve code maintenance, and the process of code review. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By existing GAs. Closes #36621 from MaxGekk/update-error-class-guide. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 2a4d8a4ea709339175257027e31a75bdeed5daec) Signed-off-by: Max Gekk <max.gekk@gmail.com> 22 May 2022, 15:58:36 UTC
3f77be2 [SPARK-39240][INFRA][BUILD] Source and binary releases using different tool to generate hashes for integrity ### What changes were proposed in this pull request? Unify the hash generator for release files. ### Why are the changes needed? Currently, we use `shasum` for source but `gpg` for binary, since https://github.com/apache/spark/pull/30123. This is confusing when validating the integrity of the Spark 3.3.0 RC https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested the script manually. Closes #36619 from yaooqinn/SPARK-39240. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 3e783375097d14f1c28eb9b0e08075f1f8daa4a2) Signed-off-by: Sean Owen <srowen@gmail.com> 20 May 2022, 15:54:59 UTC
ab057c7 [SPARK-39237][DOCS] Update the ANSI SQL mode documentation ### What changes were proposed in this pull request? 1. Remove the Experimental notation in the ANSI SQL compliance doc 2. Update the description of `spark.sql.ansi.enabled`, since the ANSI reserved keywords are disabled by default now ### Why are the changes needed? 1. The ANSI SQL dialect was GAed in the Spark 3.2 release: https://spark.apache.org/releases/spark-release-3-2-0.html We should not mark it as "Experimental" in the doc. 2. The ANSI reserved keywords are disabled by default now ### Does this PR introduce _any_ user-facing change? No, just a doc change ### How was this patch tested? Doc preview: <img width="700" alt="image" src="https://user-images.githubusercontent.com/1097932/169444094-de9c33c2-1b01-4fc3-b583-b752c71e16d8.png"> <img width="1435" alt="image" src="https://user-images.githubusercontent.com/1097932/169472239-1edf218f-1f7b-48ec-bf2a-5d043600f1bc.png"> Closes #36614 from gengliangwang/updateAnsiDoc. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 86a351c13d62644d596cc5249fc1c45d318a0bbf) Signed-off-by: Gengliang Wang <gengliang@apache.org> 20 May 2022, 08:58:36 UTC
e8e330f [SPARK-39218][SS][PYTHON] Make foreachBatch streaming query stop gracefully ### What changes were proposed in this pull request? This PR proposes to make the `foreachBatch` streaming query stop gracefully by handling the interrupted exceptions at `StreamExecution.isInterruptionException`. Because there is no straightforward way to access to the original JVM exception, here we rely on string pattern match for now (see also "Why are the changes needed?" below). There is only one place from Py4J https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/protocol.py#L326-L328 so the approach would work at least. ### Why are the changes needed? In `foreachBatch`, the Python user-defined function in the microbatch runs till the end even when `StreamingQuery.stop` is invoked. However, when any Py4J access is attempted within the user-defined function: - With the pinned thread mode disabled, the interrupt exception is not blocked, and the Python function is executed till the end in a different thread. - With the pinned thread mode enabled, the interrupt exception is raised in the same thread, and the Python thread raises a Py4J exception in the same thread. The latter case is a problem because the interrupt exception is first thrown from JVM side (`java.lang. InterruptedException`) -> Python callback server (`py4j.protocol.Py4JJavaError`) -> JVM (`py4j.Py4JException`), and `py4j.Py4JException` is not listed in `StreamExecution.isInterruptionException` which doesn't gracefully stop the query. Therefore, we should handle this exception at `StreamExecution.isInterruptionException`. ### Does this PR introduce _any_ user-facing change? Yes, it will make the query gracefully stop. ### How was this patch tested? Manually tested with: ```python import time def func(batch_df, batch_id): time.sleep(10) print(batch_df.count()) q = spark.readStream.format("rate").load().writeStream.foreachBatch(func).start() time.sleep(5) q.stop() ``` Closes #36589 from HyukjinKwon/SPARK-39218. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 499de87b77944157828a6d905d9b9df37b7c9a67) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 May 2022, 04:02:37 UTC
c3a171d [SPARK-38681][SQL] Support nested generic case classes ### What changes were proposed in this pull request? Master and branch-3.3 will fail to derive a schema for case classes with generic parameters if the parameter was not used directly as a field, but instead passed on as a generic parameter to another type, e.g. ``` case class NestedGeneric[T]( generic: GenericData[T]) ``` This is a regression from the latest release of 3.2.1 where this works as expected. ### Why are the changes needed? Support more general case classes that users might have. ### Does this PR introduce _any_ user-facing change? Better support for generic case classes. ### How was this patch tested? New specs in ScalaReflectionSuite and ExpressionEncoderSuite. All the new test cases that do not use value classes pass if added to the 3.2 branch. Closes #36004 from eejbyfeldt/SPARK-38681-nested-generic. Authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 49c68020e702f9258f3c693f446669bea66b12f4) Signed-off-by: Sean Owen <srowen@gmail.com> 20 May 2022, 00:12:47 UTC
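A small, hedged illustration of the regression scenario from the entry above, runnable in spark-shell against a build containing this fix; the field name of `GenericData` is an assumption for this sketch (the real definition lives in Spark's test sources), only `NestedGeneric` is quoted verbatim from the description:

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Assumed shape of the helper type used in the example above.
case class GenericData[T](genericField: T)
case class NestedGeneric[T](generic: GenericData[T])

// Deriving an encoder for a concrete instantiation, which failed on master/branch-3.3
// before this change and works again afterwards.
val enc: Encoder[NestedGeneric[Int]] = Encoders.product[NestedGeneric[Int]]
println(enc.schema.treeString)
```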
2977791 [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators ### What changes were proposed in this pull request? 1. Explicitly return in GeneratorNestedColumnAliasing when the generator is not Explode. 2. Add extensive comment to GeneratorNestedColumnAliasing. 3. An off-hand code refactor to make the code clearer. ### Why are the changes needed? GeneratorNestedColumnAliasing does not handle other generators correctly. We only try to rewrite the generator for Explode but try to rewrite all ExtractValue expressions. This can cause inconsistency for non-Explode generators. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests. Closes #35850 from minyyy/gnca_non_explode. Authored-by: minyyy <min.yang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 026102489b8edce827a05a1dba3b0ef8687f134f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 May 2022, 14:52:30 UTC
24a3fa9 [SPARK-39233][SQL] Remove the check for TimestampNTZ output in Analyzer ### What changes were proposed in this pull request? In [#36094](https://github.com/apache/spark/pull/36094), a check for failing TimestampNTZ type output(since we are disabling TimestampNTZ in 3.3) is added: ``` case operator: LogicalPlan if !Utils.isTesting && operator.output.exists(attr => attr.resolved && attr.dataType.isInstanceOf[TimestampNTZType]) => operator.failAnalysis("TimestampNTZ type is not supported in Spark 3.3.") ``` However, the check can cause misleading error messages. In 3.3: ``` > sql( "select date '2018-11-17' > 1").show() org.apache.spark.sql.AnalysisException: Invalid call to toAttribute on unresolved object; 'Project [unresolvedalias((2018-11-17 > 1), None)] +- OneRowRelation at org.apache.spark.sql.catalyst.analysis.UnresolvedAlias.toAttribute(unresolved.scala:510) at org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:70) ``` In master or 3.2 ``` > sql( "select date '2018-11-17' > 1").show() org.apache.spark.sql.AnalysisException: cannot resolve '(DATE '2018-11-17' > 1)' due to data type mismatch: differing types in '(DATE '2018-11-17' > 1)' (date and int).; line 1 pos 7; 'Project [unresolvedalias((2018-11-17 > 1), None)] +- OneRowRelation at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ``` We should just remove the check to avoid such regression. It's not necessary for disabling TimestampNTZ anyway. ### Why are the changes needed? Fix regression in the error output of analysis check. ### Does this PR introduce _any_ user-facing change? No, it is not released yet. ### How was this patch tested? Build and try on `spark-shell` Closes #36609 from gengliangwang/fixRegression. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 19 May 2022, 12:34:50 UTC
88c076d [SPARK-39229][SQL][3.3] Separate query contexts from error-classes.json ### What changes were proposed in this pull request? Separate query contexts for runtime errors from error-classes.json. ### Why are the changes needed? The messages in the JSON file should only contain parameters that are explicitly thrown. It is more elegant to separate query contexts from error-classes.json. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Closes #36607 from gengliangwang/SPARK-39229-3.3. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 19 May 2022, 12:02:44 UTC
e088c82 [SPARK-39212][SQL][3.3] Use double quotes for values of SQL configs/DS options in error messages ### What changes were proposed in this pull request? Wrap values of SQL configs and datasource options in error messages by double quotes. Added the `toDSOption()` method to `QueryErrorsBase` to quote DS options. This is a backport of https://github.com/apache/spark/pull/36579. ### Why are the changes needed? 1. To highlight SQL config/DS option values and make them more visible for users. 2. To be able to easily parse values from error text. 3. To be consistent to other outputs of identifiers, sql statement and etc. where Spark uses quotes or ticks. ### Does this PR introduce _any_ user-facing change? Yes, it changes user-facing error messages. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "testOnly *QueryCompilationErrorsSuite" $ build/sbt "testOnly *QueryExecutionAnsiErrorsSuite" $ build/sbt "testOnly *QueryExecutionErrorsSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 96f4b7dbc1facd1a38be296263606aa312861c95) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36600 from MaxGekk/move-ise-from-query-errors-3.3-2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 19 May 2022, 08:51:20 UTC
669fc1b [SPARK-39216][SQL] Do not collapse projects in CombineUnions if it hasCorrelatedSubquery ### What changes were proposed in this pull request? Makes `CombineUnions` not collapse projects if they contain a correlated subquery. For example: ```sql SELECT (SELECT IF(x, 1, 0)) AS a FROM (SELECT true) t(x) UNION SELECT 1 AS a ``` It will throw an exception: ``` java.lang.IllegalStateException: Couldn't find x#4 in [] ``` ### Why are the changes needed? Fixes a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #36595 from wangyum/SPARK-39216. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 85bb7bf008d0346feaedc2aab55857d8f1b19908) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 May 2022, 06:37:33 UTC
77b1313 [SPARK-39226][SQL] Fix the precision of the return type of round-like functions ### What changes were proposed in this pull request? Currently, the precision of the return type of round-like functions (round, bround, ceil, floor) with decimal inputs has a few problems: 1. It does not reserve one more digit in the integral part, in the case of rounding. As a result, `CEIL(CAST(99 AS DECIMAL(2, 0)), -1)` fails. 2. It should return a more accurate precision, to account for the scale loss. For example, `CEIL(1.23456, 1)` does not need to keep the precision as 7 in the result. 3. `round`/`bround` with a negative scale fails if the input is a decimal type. This is not a bug but a little weird. This PR fixes these issues by correcting the formula for calculating the returned decimal type. ### Why are the changes needed? Fix bugs. ### Does this PR introduce _any_ user-facing change? Yes, the new functions in 3.3, `ceil`/`floor` with a scale parameter, report a more accurate precision in the result type and can run certain queries which failed before. The old functions `round` and `bround` now support a negative scale parameter with decimal inputs. ### How was this patch tested? New tests. Closes #36598 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit ee0aecca05af9b0cb256fd81a78430958a09d19f) Signed-off-by: Gengliang Wang <gengliang@apache.org> 19 May 2022, 05:40:33 UTC
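To make the three items above concrete, here are the quoted queries as a runnable spark-shell snippet (assuming a Spark build that includes this fix; the master URL and app name are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("round-precision").getOrCreate()

// 1. Rounding now reserves one more integral digit, so this no longer overflows.
spark.sql("SELECT CEIL(CAST(99 AS DECIMAL(2, 0)), -1) AS c").show()
// 2. The result type no longer keeps an unnecessarily wide precision.
spark.sql("SELECT CEIL(1.23456, 1) AS c").printSchema()
// 3. round/bround with a negative scale is now accepted for decimal inputs.
spark.sql("SELECT ROUND(CAST(1.2345 AS DECIMAL(5, 4)), -1) AS r").show()
```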
69b7e1a [SPARK-39219][DOC] Promote Structured Streaming over DStream ### What changes were proposed in this pull request? This PR proposes to add NOTE section for DStream guide doc to promote Structured Streaming. Screenshot: <img width="992" alt="screenshot-spark-streaming-programming-guide-change" src="https://user-images.githubusercontent.com/1317309/168977732-4c32db9a-0fb1-4a82-a542-bf385e5f3683.png"> ### Why are the changes needed? We see efforts of community are more focused on Structured Streaming (based on Spark SQL) than Spark Streaming (DStream). We would like to encourage end users to use Structured Streaming than Spark Streaming whenever possible for their workloads. ### Does this PR introduce _any_ user-facing change? Yes, doc change. ### How was this patch tested? N/A Closes #36590 from HeartSaVioR/SPARK-39219. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 7d153392db2f61104da0af1cb175f4ee7c7fbc38) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 19 May 2022, 02:50:17 UTC
47c47b6 [SPARK-39214][SQL][3.3] Improve errors related to CAST ### What changes were proposed in this pull request? In the PR, I propose to rename the error classes: 1. INVALID_SYNTAX_FOR_CAST -> CAST_INVALID_INPUT 2. CAST_CAUSES_OVERFLOW -> CAST_OVERFLOW and change error messages: CAST_INVALID_INPUT: `The value <value> of the type <sourceType> cannot be cast to <targetType> because it is malformed. ...` CAST_OVERFLOW: `The value <value> of the type <sourceType> cannot be cast to <targetType> due to an overflow....` Also quote the SQL config `"spark.sql.ansi.enabled"` and a function name. This is a backport of https://github.com/apache/spark/pull/36553. ### Why are the changes needed? To improve user experience with Spark SQL by making errors/error classes related to CAST more clear and **unified**. ### Does this PR introduce _any_ user-facing change? Yes, the PR changes user-facing error messages. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "testOnly *CastSuite" $ build/sbt "test:testOnly *DateFormatterSuite" $ build/sbt "test:testOnly *TimestampFormatterSuite" $ build/sbt "testOnly *DSV2SQLInsertTestSuite" $ build/sbt "test:testOnly *InsertSuite" $ build/sbt "testOnly *AnsiCastSuiteWithAnsiModeOff" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 66648e96e4283223003c80a1a997325bbd27f940) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36591 from MaxGekk/error-class-improve-msg-2-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 18 May 2022, 18:39:53 UTC
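A brief, hedged illustration of the renamed CAST_INVALID_INPUT error (runnable in spark-shell against a build containing this change; the exact exception subclass that surfaces may vary):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("cast-errors").getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

// Under ANSI mode, casting a malformed string fails at runtime; with this change the message
// follows the "The value ... cannot be cast to ... because it is malformed" template above.
try spark.sql("SELECT CAST('not-a-number' AS INT)").collect()
catch { case e: Exception => println(e.getMessage) }
```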
b5ce32f [SPARK-39162][SQL][3.3] Jdbc dialect should decide which function could be pushed down ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/36521 to 3.3. ### Why are the changes needed? Makes function push-down more flexible. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? Existing tests. Closes #36556 from beliefer/SPARK-39162_3.3. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 May 2022, 12:29:01 UTC
ec6fc74 [SPARK-39210][SQL] Provide query context of Decimal overflow in AVG when WSCG is off ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/36525, this PR provides runtime error query context for the Average expression when WSCG is off. ### Why are the changes needed? Enhance the runtime error query context of Average function. After changes, it works when the whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT Closes #36582 from gengliangwang/fixAvgContext. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 8b5b3e95f8761af97255cbcba35c3d836a419dba) Signed-off-by: Gengliang Wang <gengliang@apache.org> 18 May 2022, 10:52:27 UTC
72eb58a [SPARK-39215][PYTHON] Reduce Py4J calls in pyspark.sql.utils.is_timestamp_ntz_preferred ### What changes were proposed in this pull request? This PR proposes to reduce the number of Py4J calls at `pyspark.sql.utils.is_timestamp_ntz_preferred` by having a single method to check. ### Why are the changes needed? For better performance, and simplicity in the code. ### Does this PR introduce _any_ user-facing change? Yes, the number of Py4J calls will be reduced, and the driver side access will become faster. ### How was this patch tested? Existing tests should cover. Closes #36587 from HyukjinKwon/SPARK-39215. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 28e7764bbe6949b2a68ef1466e210ca6418a3018) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 May 2022, 09:21:24 UTC
0e998d3 [SPARK-39193][SQL] Fasten Timestamp type inference of JSON/CSV data sources ### What changes were proposed in this pull request? When reading JSON/CSV files with inferring timestamp types (`.option("inferTimestamp", true)`), the Timestamp conversion will throw and catch exceptions. As we are putting decent error messages in the exception: ``` def cannotCastToDateTimeError( value: Any, from: DataType, to: DataType, errorContext: String): Throwable = { val valueString = toSQLValue(value, from) new SparkDateTimeException("INVALID_SYNTAX_FOR_CAST", Array(toSQLType(to), valueString, SQLConf.ANSI_ENABLED.key, errorContext)) } ``` Throwing and catching the timestamp parsing exceptions is actually not cheap. It consumes more than 90% of the type inference time. This PR improves the default timestamp parsing by returning optional results instead of throwing/catching the exceptions. With this PR, the schema inference time is reduced by 90% in a local benchmark. Note this PR is only for the default timestamp parser. It doesn't cover the scenarios where users provide a customized timestamp format via an option or enable the legacy timestamp formatter; we can have follow-ups for that. ### Why are the changes needed? Performance improvement ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT Also manually tested the runtime of inferring a 624MB JSON file with timestamp inference enabled: ``` spark.read.option("inferTimestamp", true).json(file) ``` Before the change, it takes 166 seconds; after the change, it takes only 16 seconds. Closes #36562 from gengliangwang/improveInferTS. Lead-authored-by: Gengliang Wang <gengliang@apache.org> Co-authored-by: Gengliang Wang <ltnwgl@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 888bf1b2ef44a27c3d4be716a72175bbaa8c6453) Signed-off-by: Gengliang Wang <gengliang@apache.org> 18 May 2022, 03:00:07 UTC
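A hedged sketch of the "return an optional result instead of throwing" idea using `java.time` directly; the actual formatter changes in Spark may differ in detail:

```scala
import java.text.ParsePosition
import java.time.format.DateTimeFormatter
import java.time.temporal.TemporalAccessor

// DateTimeFormatter.parseUnresolved reports failure through the ParsePosition (and a null
// result) rather than by throwing, so malformed values seen during schema inference do not
// pay the cost of constructing and catching an exception per row.
def parseOptional(formatter: DateTimeFormatter, s: String): Option[TemporalAccessor] = {
  val pos = new ParsePosition(0)
  val parsed = formatter.parseUnresolved(s, pos)
  if (parsed != null && pos.getErrorIndex < 0 && pos.getIndex == s.length) Some(parsed) else None
}
```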
e52b048 [SPARK-39104][SQL] InMemoryRelation#isCachedColumnBuffersLoaded should be thread-safe ### What changes were proposed in this pull request? Add `synchronized` on method `isCachedColumnBuffersLoaded` ### Why are the changes needed? `isCachedColumnBuffersLoaded` should has `synchronized` wrapped, otherwise may cause NPE when modify `_cachedColumnBuffers` concurrently. ``` def isCachedColumnBuffersLoaded: Boolean = { _cachedColumnBuffers != null && isCachedRDDLoaded } def isCachedRDDLoaded: Boolean = { _cachedColumnBuffersAreLoaded || { val bmMaster = SparkEnv.get.blockManager.master val rddLoaded = _cachedColumnBuffers.partitions.forall { partition => bmMaster.getBlockStatus(RDDBlockId(_cachedColumnBuffers.id, partition.index), false) .exists { case(_, blockStatus) => blockStatus.isCached } } if (rddLoaded) { _cachedColumnBuffersAreLoaded = rddLoaded } rddLoaded } } ``` ``` java.lang.NullPointerException at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedRDDLoaded(InMemoryRelation.scala:247) at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.isCachedColumnBuffersLoaded(InMemoryRelation.scala:241) at org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8(CacheManager.scala:189) at org.apache.spark.sql.execution.CacheManager.$anonfun$uncacheQuery$8$adapted(CacheManager.scala:176) at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303) at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297) at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) at scala.collection.TraversableLike.filter(TraversableLike.scala:395) at scala.collection.TraversableLike.filter$(TraversableLike.scala:395) at scala.collection.AbstractTraversable.filter(Traversable.scala:108) at org.apache.spark.sql.execution.CacheManager.recacheByCondition(CacheManager.scala:219) at org.apache.spark.sql.execution.CacheManager.uncacheQuery(CacheManager.scala:176) at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3220) at org.apache.spark.sql.Dataset.unpersist(Dataset.scala:3231) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UT. Closes #36496 from pan3793/SPARK-39104. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 3c8d8d7a864281fbe080316ad8de9b8eac80fa71) Signed-off-by: Sean Owen <srowen@gmail.com> 17 May 2022, 23:27:03 UTC
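Based on the description above ("Add `synchronized` on method `isCachedColumnBuffersLoaded`"), a fragment-level sketch of the fix applied to the quoted method of `CachedRDDBuilder` (Spark-internal code, not standalone):

```scala
// Guard both the null check and the subsequent use of _cachedColumnBuffers with the object
// lock, so a concurrent clearing of the field cannot slip in between and trigger the NPE.
def isCachedColumnBuffersLoaded: Boolean = synchronized {
  _cachedColumnBuffers != null && isCachedRDDLoaded
}
```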
4fb7fe2 [SPARK-39208][SQL] Fix query context bugs in decimal overflow under codegen mode ### What changes were proposed in this pull request? 1. Fix logical bugs in adding query contexts as references under codegen mode. https://github.com/apache/spark/pull/36040/files#diff-4a70d2f3a4b99f58796b87192143f9838f4c4cf469f3313eb30af79c4e07153aR145 The code ``` val errorContextCode = if (nullOnOverflow) { ctx.addReferenceObj("errCtx", queryContext) } else { "\"\"" } ``` should be ``` val errorContextCode = if (nullOnOverflow) { "\"\"" } else { ctx.addReferenceObj("errCtx", queryContext) } ``` 2. Similar to https://github.com/apache/spark/pull/36557, make `CheckOverflowInSum` support query context when WSCG is not available. ### Why are the changes needed? Bugfix and enhancement in the query context of decimal expressions. ### Does this PR introduce _any_ user-facing change? No, the query context is not released yet. ### How was this patch tested? New UT Closes #36577 from gengliangwang/fixDecimalSumOverflow. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 191e535b975e5813719d3143797c9fcf86321368) Signed-off-by: Gengliang Wang <gengliang@apache.org> 17 May 2022, 14:31:50 UTC
c07f65c [SPARK-32268][SQL][TESTS][FOLLOW-UP] Use function registry in the SparkSession ### What changes were proposed in this pull request? This PR proposes: 1. Use the function registry in the Spark Session being used 2. Move function registration into `beforeAll` ### Why are the changes needed? Registration of the function without `beforeAll` at `builtin` can affect other tests. See also https://lists.apache.org/thread/jp0ccqv10ht716g9xldm2ohdv3mpmmz1. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Unittests fixed. Closes #36576 from HyukjinKwon/SPARK-32268-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c5351f85dec628a5c806893aa66777cbd77a4d65) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 May 2022, 14:05:56 UTC
c25624b [SPARK-36718][SQL][FOLLOWUP] Improve the extract-only check in CollapseProject ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/36510 , to fix a corner case: if the `CreateStruct` is only referenced once in non-extract expressions, we should still allow collapsing the projects. ### Why are the changes needed? completely fix the perf regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? a new test Closes #36572 from cloud-fan/regression. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 98fad57221d4dffc6f1fe28d9aca1093172ecf72) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 May 2022, 07:57:05 UTC
30bb19e [SPARK-39183][BUILD] Upgrade Apache Xerces Java to 2.12.2 ### What changes were proposed in this pull request? Upgrade Apache Xerces Java to 2.12.2 [Release notes](https://xerces.apache.org/xerces2-j/releases.html) ### Why are the changes needed? [Infinite Loop in Apache Xerces Java](https://github.com/advisories/GHSA-h65f-jvqw-m9fj) There's a vulnerability within the Apache Xerces Java (XercesJ) XML parser when handling specially crafted XML document payloads. This causes, the XercesJ XML parser to wait in an infinite loop, which may sometimes consume system resources for prolonged duration. This vulnerability is present within XercesJ version 2.12.1 and the previous versions. References https://nvd.nist.gov/vuln/detail/CVE-2022-23437 https://lists.apache.org/thread/6pjwm10bb69kq955fzr1n0nflnjd27dl http://www.openwall.com/lists/oss-security/2022/01/24/3 https://www.oracle.com/security-alerts/cpuapr2022.html ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #36544 from bjornjorgensen/Upgrade-xerces-to-2.12.2. Authored-by: bjornjorgensen <bjornjorgensen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 181436bd990d3bdf178a33fa6489ad416f3e7f94) Signed-off-by: Sean Owen <srowen@gmail.com> 16 May 2022, 23:10:17 UTC
e2ce088 [SPARK-39190][SQL] Provide query context for decimal precision overflow error when WSCG is off ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/36525, this PR provides query context for decimal precision overflow error when WSCG is off ### Why are the changes needed? Enhance the runtime error query context of checking decimal overflow. After changes, it works when the whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT Closes #36557 from gengliangwang/decimalContextWSCG. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 17b85ff97569a43d7fd33863d17bfdaf62d539e0) Signed-off-by: Gengliang Wang <gengliang@apache.org> 16 May 2022, 09:45:07 UTC
1853eb1 [SPARK-39187][SQL][3.3] Remove `SparkIllegalStateException` ### What changes were proposed in this pull request? Remove `SparkIllegalStateException` and replace it by `IllegalStateException` where it was used. This is a backport of https://github.com/apache/spark/pull/36550. ### Why are the changes needed? To improve code maintenance and be consistent to other places where `IllegalStateException` is used in illegal states (for instance, see https://github.com/apache/spark/pull/36524). After the PR https://github.com/apache/spark/pull/36500, the exception is substituted by `SparkException` w/ the `INTERNAL_ERROR` error class. ### Does this PR introduce _any_ user-facing change? No. Users shouldn't face to the exception in regular cases. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "sql/test:testOnly *QueryExecutionErrorsSuite*" $ build/sbt "test:testOnly *ArrowUtilsSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 1a90512f605c490255f7b38215c207e64621475b) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36558 from MaxGekk/remove-SparkIllegalStateException-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 16 May 2022, 08:39:37 UTC
af38fce Preparing development version 3.3.1-SNAPSHOT 16 May 2022, 05:42:35 UTC
c8c657b Preparing Spark release v3.3.0-rc2 16 May 2022, 05:42:28 UTC
386c756 [SPARK-39186][PYTHON] Make pandas-on-Spark's skew consistent with pandas ### What changes were proposed in this pull request? The logic for computing skewness differs between Spark SQL and pandas: Spark SQL: [`sqrt(n) * m3 / sqrt(m2 * m2 * m2)`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala#L304) pandas: [`(count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2**1.5)`](https://github.com/pandas-dev/pandas/blob/main/pandas/core/nanops.py#L1221) ### Why are the changes needed? To make skew consistent with pandas. ### Does this PR introduce _any_ user-facing change? Yes, the logic to compute skew was changed. ### How was this patch tested? Added UT. Closes #36549 from zhengruifeng/adjust_pandas_skew. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7e4519c9a8ba35958ef6d408be3ca4e97917c965) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 May 2022, 00:31:09 UTC
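The two formulas quoted above, evaluated side by side in a small self-contained snippet; here m2 and m3 are taken as the sums of squared and cubed deviations from the mean (an assumption made for this sketch):

```scala
// Compare Spark SQL's and pandas' skewness formulas on a tiny sample.
val xs = Seq(1.0, 2.0, 2.0, 3.0, 10.0)
val n = xs.size.toDouble
val mean = xs.sum / n
val m2 = xs.map(x => math.pow(x - mean, 2)).sum
val m3 = xs.map(x => math.pow(x - mean, 3)).sum

val sparkSqlSkew = math.sqrt(n) * m3 / math.sqrt(m2 * m2 * m2)                 // sqrt(n) * m3 / sqrt(m2^3)
val pandasSkew = (n * math.sqrt(n - 1) / (n - 2)) * (m3 / math.pow(m2, 1.5))   // (n * (n-1)^0.5 / (n-2)) * (m3 / m2^1.5)
println(s"Spark SQL formula: $sparkSqlSkew, pandas formula: $pandasSkew")
```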
ab1d986 [SPARK-37544][SQL] Correct date arithmetic in sequences ### What changes were proposed in this pull request? Change `InternalSequenceBase` to pass a time-zone aware value to `DateTimeUtils#timestampAddInterval`, rather than a time-zone agnostic value, when performing `Date` arithmetic. ### Why are the changes needed? The following query gets the wrong answer if run in the America/Los_Angeles time zone: ``` spark-sql> select sequence(date '2021-01-01', date '2022-01-01', interval '3' month) x; [2021-01-01,2021-03-31,2021-06-30,2021-09-30,2022-01-01] Time taken: 0.664 seconds, Fetched 1 row(s) spark-sql> ``` The answer should be ``` [2021-01-01,2021-04-01,2021-07-01,2021-10-01,2022-01-01] ``` `InternalSequenceBase` converts the date to micros by multiplying days by micros per day. This converts the date into a time-zone agnostic timestamp. However, `InternalSequenceBase` uses `DateTimeUtils#timestampAddInterval` to perform the arithmetic, and that function assumes a _time-zone aware_ timestamp. One simple fix would be to call `DateTimeUtils#timestampNTZAddInterval` instead for date arithmetic. However, Spark date arithmetic is typically time-zone aware (see the comment in the test added by this PR), so this PR converts the date to a time-zone aware value before calling `DateTimeUtils#timestampAddInterval`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #36546 from bersprockets/date_sequence_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 14ee0d8f04f218ad61688196a0b984f024151468) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 May 2022, 00:26:26 UTC
2672624 [SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors ### What changes were proposed in this pull request? Document PySpark(SQL, pandas API on Spark, and Py4J) common exceptions/errors and respective solutions. ### Why are the changes needed? Make PySpark debugging easier. There are common exceptions/errors in PySpark SQL, pandas API on Spark, and Py4J. Documenting exceptions and respective solutions help users debug PySpark. ### Does this PR introduce _any_ user-facing change? No. Document change only. ### How was this patch tested? Manual test. <img width="1019" alt="image" src="https://user-images.githubusercontent.com/47337188/165145874-b0de33b1-835a-459d-9062-94086e62e254.png"> Please refer to https://github.com/apache/spark/blob/7a1c7599a21cbbe2778707b72643cf98ac601ab1/python/docs/source/development/debugging.rst#common-exceptions--errors for the whole rendered page. Closes #36267 from xinrong-databricks/common_err. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f940d7adfd6d071bc3bdcc406e01263a7f03e955) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 May 2022, 00:25:10 UTC
2db9d78 [SPARK-39174][SQL] Catalogs loading swallows missing classname for ClassNotFoundException ### What changes were proposed in this pull request? This PR captures the actual missing class name when catalog loading hits a ClassNotFoundException. ### Why are the changes needed? A ClassNotFoundException can occur when dependencies are missing; we should not always report that the catalog class itself is missing. ### Does this PR introduce _any_ user-facing change? Yes; when loading catalogs and a ClassNotFoundException occurs, it shows the correct missing class. ### How was this patch tested? New test added. Closes #36534 from yaooqinn/SPARK-39174. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1b37f19876298e995596a30edc322c856ea1bbb4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 14 May 2022, 10:45:34 UTC
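A hypothetical, self-contained sketch of the idea; the names and error type are illustrative, not Spark's actual catalog-loading API:

```scala
// When loading the catalog plugin class fails, keep the original ClassNotFoundException --
// whose message names the class that is actually missing (possibly a dependency rather than
// the plugin itself) -- attached to the reported error instead of discarding it.
def loadCatalogClass(pluginClassName: String): Class[_] =
  try Class.forName(pluginClassName)
  catch {
    case e: ClassNotFoundException =>
      throw new IllegalArgumentException(
        s"Cannot load catalog plugin class $pluginClassName (missing: ${e.getMessage})", e)
  }
```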
30834b8 [SPARK-39157][SQL] H2Dialect should override getJDBCType so as make the data type is correct ### What changes were proposed in this pull request? Currently, `H2Dialect` does not implement `getJDBCType` of `JdbcDialect`, so DS V2 push-down throws the exception shown below: ``` Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13) (jiaan-gengdembp executor driver): org.h2.jdbc.JdbcSQLNonTransientException: Unknown data type: "STRING"; SQL statement: SELECT "DEPT","NAME","SALARY","BONUS","IS_MANAGER" FROM "test"."employee" WHERE ("BONUS" IS NOT NULL) AND ("DEPT" IS NOT NULL) AND (CAST("BONUS" AS string) LIKE '%30%') AND (CAST("DEPT" AS byte) > 1) AND (CAST("DEPT" AS short) > 1) AND (CAST("BONUS" AS decimal(20,2)) > 1200.00) [50004-210] ``` H2Dialect should implement `getJDBCType` of `JdbcDialect`. ### Why are the changes needed? Make the H2 data type mapping correct. ### Does this PR introduce _any_ user-facing change? 'Yes'. Fixes a bug in `H2Dialect`. ### How was this patch tested? New tests. Closes #36516 from beliefer/SPARK-39157. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fa3f096e02d408fbeab5f69af451ef8bc8f5b3db) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 May 2022, 14:01:27 UTC
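A hedged sketch of what overriding `getJDBCType` in a `JdbcDialect` looks like; the concrete H2 type names below are illustrative assumptions, not necessarily the mapping this PR adds:

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

object SketchH2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")

  // Map Catalyst types to database-specific type names so generated SQL (e.g. CASTs pushed
  // down by DS V2) does not reference types like "STRING" that H2 does not understand.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("CLOB", Types.CLOB))
    case ByteType   => Some(JdbcType("TINYINT", Types.TINYINT))
    case ShortType  => Some(JdbcType("SMALLINT", Types.SMALLINT))
    case _          => None // fall back to the default JDBC mappings
  }
}
```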
e743e68 [SPARK-39178][CORE] SparkFatalException should show root cause when print error stack ### What changes were proposed in this pull request? Our user meet an case when running broadcast, throw `SparkFatalException`, but in error stack, it don't show the error case. ### Why are the changes needed? Make exception more clear ### Does this PR introduce _any_ user-facing change? User can got root cause when application throw `SparkFatalException`. ### How was this patch tested? For ut ``` test("xxxx") { throw new SparkFatalException( new OutOfMemoryError("Not enough memory to build and broadcast the table to all " + "worker nodes. As a workaround, you can either disable broadcast by setting " + s"driver memory by setting ${SparkLauncher.DRIVER_MEMORY} to a higher value.") .initCause(null)) } ``` Before this pr: ``` [info] org.apache.spark.util.SparkFatalException: [info] at org.apache.spark.SparkContextSuite.$anonfun$new$1(SparkContextSuite.scala:59) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) [info] at org.scalatest.Suite.run(Suite.scala:1112) ``` After this pr: ``` [info] org.apache.spark.util.SparkFatalException: java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting driver memory by setting spark.driver.memory to a higher value. 
[info] at org.apache.spark.SparkContextSuite.$anonfun$new$1(SparkContextSuite.scala:59) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:203) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) [info] at org.scalatest.Suite.run(Suite.scala:1112) [info] at org.scalatest.Suite.run$(Suite.scala:1094) [info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:64) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:64) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:748) 
[info] Cause: java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting driver memory by setting spark.driver.memory to a higher value. [info] at org.apache.spark.SparkContextSuite.$anonfun$new$1(SparkContextSuite.scala:58) ``` Closes #36539 from AngersZhuuuu/SPARK-39178. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit d7317b03e975f8dc1a8c276dd0a931e00c478717) Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 May 2022, 13:47:21 UTC
1a49de6 [SPARK-39177][SQL] Provide query context on map key not exists error when WSCG is off ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/36525, this PR provides query context for "map key not exists" runtime error when WSCG is off. ### Why are the changes needed? Enhance the runtime error query context for "map key not exists" runtime error. After changes, it works when the whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #36538 from gengliangwang/fixMapKeyContext. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 1afddf407436c3b315ec601fab5a4a1b2028e672) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 May 2022, 13:45:20 UTC
1372f31 [SPARK-39164][SQL][3.3] Wrap asserts/illegal state exceptions by the INTERNAL_ERROR exception in actions ### What changes were proposed in this pull request? In the PR, I propose to catch `java.lang.IllegalStateException` and `java.lang.AssertionError` (raised by asserts), and wrap them by Spark's exception w/ the `INTERNAL_ERROR` error class. The modification affects only actions so far. This PR affects the case of missing bucket file. After the changes, Spark throws `SparkException` w/ `INTERNAL_ERROR` instead of `IllegalStateException`. Since this is not Spark's illegal state, the exception should be replaced by another runtime exception. Created the ticket SPARK-39163 to fix this. This is a backport of https://github.com/apache/spark/pull/36500. ### Why are the changes needed? To improve user experience with Spark SQL and unify representation of internal errors by using error classes like for other errors. Usually, users shouldn't observe asserts and illegal states, but even if such situation happens, they should see errors in the same way as other errors (w/ error class `INTERNAL_ERROR`). ### Does this PR introduce _any_ user-facing change? Yes. At least, in one particular case, see the modified test suites and SPARK-39163. ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *.BucketedReadWithoutHiveSupportSuite" $ build/sbt "test:testOnly *.AdaptiveQueryExecSuite" $ build/sbt "test:testOnly *.WholeStageCodegenSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit f5c3f0c228fef7808d1f927e134595ddd4d31723) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36533 from MaxGekk/class-internal-error-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 May 2022, 13:43:53 UTC
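A hedged, simplified sketch of the wrapping described above; the real change goes through Spark's error-class framework, which is abbreviated here to a plain message prefix:

```scala
import org.apache.spark.SparkException

// Wrap asserts and illegal-state failures raised while running an action so they surface
// as a Spark internal error instead of a bare JVM exception.
def withInternalErrorWrapping[T](body: => T): T =
  try body
  catch {
    case e @ (_: IllegalStateException | _: AssertionError) =>
      throw new SparkException(s"INTERNAL_ERROR: ${e.getMessage}", e)
  }
```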
c2bd7ba [SPARK-39165][SQL][3.3] Replace `sys.error` by `IllegalStateException` ### What changes were proposed in this pull request? Replace all invokes of `sys.error()` by throwing of `IllegalStateException` in the `sql` namespace. This is a backport of https://github.com/apache/spark/pull/36524. ### Why are the changes needed? In the context of wrapping all internal errors like asserts/illegal state exceptions (see https://github.com/apache/spark/pull/36500), it is impossible to distinguish `RuntimeException` of `sys.error()` from Spark's exceptions like `SparkRuntimeException`. The last one can be propagated to the user space but `sys.error` exceptions shouldn't be visible to users in regular cases. ### Does this PR introduce _any_ user-facing change? No, shouldn't. sys.error shouldn't propagate exception to user space in regular cases. ### How was this patch tested? By running the existing test suites. Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 95c7efd7571464d8adfb76fb22e47a5816cf73fb) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36532 from MaxGekk/sys_error-internal-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 13 May 2022, 09:47:53 UTC
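A schematic before/after of the kind of replacement described above (the message text is made up for illustration):

```scala
// Before: sys.error throws a bare RuntimeException that cannot be told apart
// from user-facing Spark runtime exceptions.
def unreachableBefore(plan: String): Nothing =
  sys.error(s"Unexpected plan: $plan")

// After: an IllegalStateException can be recognized internally and wrapped as INTERNAL_ERROR.
def unreachableAfter(plan: String): Nothing =
  throw new IllegalStateException(s"Unexpected plan: $plan")
```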
27c03e5 [SPARK-39175][SQL] Provide runtime error query context for Cast when WSCG is off ### What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/36525, this PR provides runtime error query context for the Cast expression when WSCG is off. ### Why are the changes needed? Enhance the runtime error query context of Cast expression. After changes, it works when the whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #36535 from gengliangwang/fixCastContext. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit cdd33e83c3919c4475e2e1ef387acb604bea81b9) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 May 2022, 09:46:45 UTC
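An illustrative reproduction of the scenario above, assuming ANSI mode is on and whole-stage codegen is turned off (the exact error text and context formatting are not taken from the PR):

```scala
// With whole-stage codegen off, a failing ANSI cast should still report the
// query fragment (the CAST) it originated from.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.sql("SELECT CAST('abc' AS INT)").show()  // expected to fail with a query context
```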
63f20c5 [SPARK-39166][SQL] Provide runtime error query context for binary arithmetic when WSCG is off ### What changes were proposed in this pull request? Currently, for most of the cases, the project https://issues.apache.org/jira/browse/SPARK-38615 is able to show where the runtime errors happen within the original query. However, after trying on production, I found that the following queries won't show where the divide by 0 error happens ``` create table aggTest(i int, j int, k int, d date) using parquet insert into aggTest values(1, 2, 0, date'2022-01-01') select sum(j)/sum(k),percentile(i, 0.9) from aggTest group by d ``` With `percentile` function in the query, the plan can't execute with whole stage codegen. Thus the child plan of `Project` is serialized to executors for execution, from ProjectExec: ``` protected override def doExecute(): RDD[InternalRow] = { child.execute().mapPartitionsWithIndexInternal { (index, iter) => val project = UnsafeProjection.create(projectList, child.output) project.initialize(index) iter.map(project) } } ``` Note that the `TreeNode.origin` is not serialized to executors since `TreeNode` doesn't extend the trait `Serializable`, which results in an empty query context on errors. For more details, please read https://issues.apache.org/jira/browse/SPARK-39140 A dummy fix is to make `TreeNode` extend the trait `Serializable`. However, it can be performance regression if the query text is long (every `TreeNode` carries it for serialization). A better fix is to introduce a new trait `SupportQueryContext` and materialize the truncated query context for special expressions. This PR targets on binary arithmetic expressions only. I will create follow-ups for the remaining expressions which support runtime error query context. ### Why are the changes needed? Improve the error context framework and make sure it works when whole stage codegen is not available. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #36525 from gengliangwang/serializeContext. Lead-authored-by: Gengliang Wang <gengliang@apache.org> Co-authored-by: Gengliang Wang <ltnwgl@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit e336567c8a9704b500efecd276abaf5bd3988679) Signed-off-by: Gengliang Wang <gengliang@apache.org> 13 May 2022, 05:50:40 UTC
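A self-contained sketch of the idea behind the new trait (the names and members below are simplified stand-ins, not the actual `SupportQueryContext` API): the truncated context string is computed on the driver and serialized together with the expression, so executors can report it even though `TreeNode.origin` never reaches them.

```scala
trait SupportsQueryContextSketch extends Serializable {
  // materialized on the driver and shipped with the serialized expression
  def queryContext: String
}

case class DivideWithContext(sqlText: String, start: Int, stop: Int)
    extends SupportsQueryContextSketch {
  val queryContext: String = sqlText.slice(start, stop + 1)

  def eval(left: Long, right: Long): Long =
    if (right == 0) {
      throw new ArithmeticException(s"divide by zero\n== SQL ==\n$queryContext")
    } else left / right
}
```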
3cc47a1 Revert "[SPARK-36837][BUILD] Upgrade Kafka to 3.1.0" ### What changes were proposed in this pull request? This PR aims to revert commit 973ea0f06e72ab64574cbf00e095922a3415f864 from `branch-3.3` to exclude it from the Apache Spark 3.3 scope. ### Why are the changes needed? SPARK-36837 tried to use Apache Kafka 3.1.0 in Apache Spark 3.3.0 and initially wanted to upgrade to Apache Kafka 3.3.1 before the official release. However, we can use the stable Apache Kafka 2.8.1 in Spark 3.3.0 and wait for more proven versions, Apache Kafka 3.2.x or 3.3.x. The Apache Kafka 3.2.0 vote has already passed and the release will arrive soon. - https://lists.apache.org/thread/9k5sysvchg98lchv2rvvvq6xhpgk99cc The Apache Kafka 3.3.0 release discussion has started too. - https://lists.apache.org/thread/cmol5bcf011s1xl91rt4ylb1dgz2vb1r ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #36517 from dongjoon-hyun/SPARK-36837-REVERT. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 May 2022, 17:55:21 UTC
0baa5c7 [SPARK-36718][SQL][FOLLOWUP] Fix the `isExtractOnly` check in CollapseProject This PR fixes a perf regression in Spark 3.3 caused by https://github.com/apache/spark/pull/33958 In `CollapseProject`, we want to treat `CreateStruct` and its friends as cheap expressions if they are only referenced by `ExtractValue`, but the check is too conservative, which causes a perf regression. This PR fixes this check. Now "extract-only" means: the attribute only appears as a child of `ExtractValue`, but the consumer expression can be in any shape. Fixes perf regression No new tests Closes #36510 from cloud-fan/bug. Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 547f032d04bd2cf06c54b5a4a2f984f5166beb7d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 May 2022, 04:59:36 UTC
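An illustrative query shape for the "extract-only" case described above (DataFrame and column names are made up): the struct built by the inner projection only ever appears under field extraction, so collapsing the two projections should treat it as cheap.

```scala
import org.apache.spark.sql.functions._

val df = spark.range(5)
  .select(struct(col("id").as("a"), (col("id") * 2).as("b")).as("s"))
  // `s` is consumed only through field extraction, in arbitrarily shaped expressions
  .select((col("s").getField("a") + 1).as("x"), (col("s").getField("b") - 1).as("y"))

df.explain(true)  // the two projections are expected to collapse into one
```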
70becf2 [SPARK-39155][PYTHON] Access to JVM through passed-in GatewayClient during type conversion ### What changes were proposed in this pull request? Access to JVM through passed-in GatewayClient during type conversion. ### Why are the changes needed? In customized type converters, we may utilize the passed-in GatewayClient to access JVM, rather than rely on the `SparkContext._jvm`. That's [how](https://github.com/py4j/py4j/blob/master/py4j-python/src/py4j/java_collections.py#L508) Py4J explicit converters access JVM. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #36504 from xinrong-databricks/gateway_client_jvm. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 92fcf214c107358c1a70566b644cec2d35c096c0) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2022, 03:22:19 UTC
ebe4252 [SPARK-39149][SQL] SHOW DATABASES command should not quote database names under legacy mode ### What changes were proposed in this pull request? This is a bug in the command legacy mode, as it does not fully restore the legacy behavior. The legacy v1 SHOW DATABASES command does not quote the database names. This PR fixes it. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? no change by default, unless people turn on legacy mode, in which case the SHOW DATABASES command won't quote the database names. ### How was this patch tested? new tests Closes #36508 from cloud-fan/regression. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3094e495095635f6c9e83f4646d3321c2a9311f4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 May 2022, 03:18:35 UTC
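A sketch of the behavior difference under the legacy flag (the outputs in the comments are illustrative, not captured from a real run):

```scala
spark.conf.set("spark.sql.legacy.useV1Command", "true")
spark.sql("SHOW DATABASES").show()
// fixed legacy v1 behavior:  default     (names not quoted)
// before this fix:           `default`   (names wrongly quoted under legacy mode)
```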
65dd727 [SPARK-39154][PYTHON][DOCS] Remove outdated statements on distributed-sequence default index ### What changes were proposed in this pull request? Remove outdated statements on the distributed-sequence default index. ### Why are the changes needed? Since the distributed-sequence default index is now enforced only during execution, there are stale statements in the documents to be removed. ### Does this PR introduce _any_ user-facing change? No. Doc change only. ### How was this patch tested? Manual tests. Closes #36513 from xinrong-databricks/defaultIndexDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f37150a5549a8f3cb4c1877bcfd2d1459fc73cac) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2022, 00:13:34 UTC
e4bb341 Revert "[SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index" This reverts commit f75c00da3cf01e63d93cedbe480198413af41455. 12 May 2022, 00:12:13 UTC
f75c00d [SPARK-34827][PYTHON][DOC] Remove outdated statements on distributed-sequence default index Remove outdated statements on distributed-sequence default index. Since distributed-sequence default index is updated to be enforced only while execution, there are stale statements in documents to be removed. No. Doc change only. Manual tests. Closes #36513 from xinrong-databricks/defaultIndexDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit cec1e7b4e68deac321f409d424a3acdcd4cb91be) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 May 2022, 00:07:49 UTC
8608baa [SPARK-37878][SQL][FOLLOWUP] V1Table should always carry the "location" property ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/35204 . https://github.com/apache/spark/pull/35204 introduced a potential regression: it removes the "location" table property from `V1Table` if the table is not external. The intention was to avoid putting the LOCATION clause for managed tables in `ShowCreateTableExec`. However, if we use the v2 DESCRIBE TABLE command by default in the future, this will bring a behavior change and v2 DESCRIBE TABLE command won't print the table location for managed tables. This PR fixes this regression by using a different idea to fix the SHOW CREATE TABLE issue: 1. introduce a new reserved table property `is_managed_location`, to indicate that the location is managed by the catalog, not user given. 2. `ShowCreateTableExec` only generates the LOCATION clause if the "location" property is present and is not managed. ### Why are the changes needed? avoid a potential regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests. We can add a test when we use v2 DESCRIBE TABLE command by default. Closes #36498 from cloud-fan/regression. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fa2bda5c4eabb23d5f5b3e14ccd055a2453f579f) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 May 2022, 06:50:11 UTC
7d57577 [SPARK-39112][SQL] UnsupportedOperationException if spark.sql.ui.explainMode is set to cost ### What changes were proposed in this pull request? Add a new leaf-like node `LeafNodeWithoutStats` and apply it to the list: - ResolvedDBObjectName - ResolvedNamespace - ResolvedTable - ResolvedView - ResolvedNonPersistentFunc - ResolvedPersistentFunc ### Why are the changes needed? We enable v2 commands by default in the 3.3.0 branch (`spark.sql.legacy.useV1Command`). However, there is a behavior change between v1 and v2 commands. - v1 command: We resolve the logical plan to a command at the analyzer phase via `ResolveSessionCatalog` - v2 command: We resolve the logical plan to a v2 command at the physical phase via `DataSourceV2Strategy` For the cost explain mode, we will call `LogicalPlanStats.stats` using the optimized plan, so there is a gap between v1 and v2 commands. Unfortunately, the logical plan of a v2 command contains `LeafNode`s which do not override `computeStats`. As a result, there is an error when running SQL such as: ```sql set spark.sql.ui.explainMode=cost; show tables; ``` ``` java.lang.UnsupportedOperationException: at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats$(LogicalPlan.scala:171) at org.apache.spark.sql.catalyst.analysis.ResolvedNamespace.computeStats(v2ResolutionPlans.scala:155) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:49) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:27) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30) ``` ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? add test Closes #36488 from ulysses-you/SPARK-39112. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 06fd340daefd67a3e96393539401c9bf4b3cbde9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 May 2022, 06:32:21 UTC
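A minimal sketch of the fix described above, assuming the trait simply reports a placeholder size (the real trait's location and default statistics may differ):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}

// Leaf nodes used by v2 command plans mix this in so that EXPLAIN COST does not
// hit the UnsupportedOperationException thrown by LeafNode.computeStats.
trait LeafNodeWithoutStats extends LeafNode {
  override def computeStats(): Statistics = Statistics(sizeInBytes = 1)
}
```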
1738f48 [MINOR][DOCS][PYTHON] Fixes pandas import statement in code example ### What changes were proposed in this pull request? In the 'Applying a Function' section, `as pd` was added to the import statement in the example code. ### Why are the changes needed? In the 'Applying a Function' section, the example code's import statement needs an `as pd` because the function definitions use `pd`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Documentation change only. Continues to be markdown compliant. Closes #36502 from snifhex/patch-1. Authored-by: Sachin Tripathi <sachintripathis84@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 60edc5758e82e76a37ce5a5f98e870fac587b656) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 11 May 2022, 00:13:30 UTC
a4b420c [SPARK-39106][SQL] Correct conditional expression constant folding - add a try-catch when we fold children inside a `ConditionalExpression` if it is not foldable - mark `CaseWhen` and `If` as foldable if their children are foldable For a conditional expression, we should add a try-catch to partially fold the constants inside its children because some branches may not be evaluated at runtime. For example, if c1 or c2 is not null, the last branch should never be hit: ```sql SELECT COALESCE(c1, c2, 1/0); ``` Besides, for CaseWhen and If, we should mark them as foldable if their children are foldable. This is safe since both the non-codegen and codegen code paths already respect the evaluation order. yes, bug fix add more tests in sql file Closes #36468 from ulysses-you/SPARK-39106. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 08a4ade8ba881589da0741b3ffacd3304dc1e9b5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2022, 12:43:24 UTC
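A schematic of the guarded folding described above (an assumed helper, not the optimizer's actual code): each child of a conditional expression is folded independently, and a child whose evaluation throws is left as-is because its branch may never be reached at runtime.

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

// Assumed helper name, not the actual optimizer code.
def tryFoldChild(child: Expression): Expression = {
  if (!child.foldable) {
    child
  } else {
    try {
      Literal.create(child.eval(), child.dataType)
    } catch {
      case _: Exception => child // e.g. 1/0 under ANSI mode stays unfolded
    }
  }
}
```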
c6584af [SPARK-39135][SQL] DS V2 aggregate partial push-down should support group by without aggregate functions ### What changes were proposed in this pull request? Currently, the SQL shown below is not supported by DS V2 aggregate partial push-down. `select key from tab group by key` ### Why are the changes needed? Make DS V2 aggregate partial push-down support group by without aggregate functions. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New tests Closes #36492 from beliefer/SPARK-39135. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit decd393e23406d82b47aa75c4d24db04c7d1efd6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 May 2022, 09:37:39 UTC
7838a14 [MINOR][INFRA][3.3] Add ANTLR generated files to .gitignore ### What changes were proposed in this pull request? Add git ignore entries for files created by ANTLR. This is a backport of #35838. ### Why are the changes needed? To prevent developers from accidentally adding those files when working on the parser/lexer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By making sure those files are ignored by git status when they exist. Closes #36489 from yutoacts/minor_gitignore_3.3. Authored-by: Yuto Akutsu <rhythmy.dev@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> 10 May 2022, 02:50:44 UTC
c759151 [SPARK-39107][SQL] Account for empty string input in regex replace ### What changes were proposed in this pull request? When trying to perform a regex replace, account for the possibility of having empty strings as input. ### Why are the changes needed? https://github.com/apache/spark/pull/29891 was merged to address https://issues.apache.org/jira/browse/SPARK-30796 and introduced a bug that would not allow regex matching on empty strings, as it would account for position within substring but not consider the case where input string has length 0 (empty string) From https://issues.apache.org/jira/browse/SPARK-39107 there is a change in behavior between spark versions. 3.0.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | <empty>| +---+--------+ ``` 3.1.2 ``` scala> val df = spark.sql("SELECT '' AS col") df: org.apache.spark.sql.DataFrame = [col: string] scala> df.withColumn("replaced", regexp_replace(col("col"), "^$", "<empty>")).show +---+--------+ |col|replaced| +---+--------+ | | | +---+--------+ ``` The 3.0.2 outcome is the expected and correct one ### Does this PR introduce _any_ user-facing change? Yes compared to spark 3.2.1, as it brings back the correct behavior when trying to regex match empty strings, as shown in the example above. ### How was this patch tested? Added special casing test in `RegexpExpressionsSuite.RegexReplace` with empty string replacement. Closes #36457 from LorenzoMartini/lmartini/fix-empty-string-replace. Authored-by: Lorenzo Martini <lmartini@palantir.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 731aa2cdf8a78835621fbf3de2d3492b27711d1a) Signed-off-by: Sean Owen <srowen@gmail.com> 10 May 2022, 00:44:26 UTC
cf13262 [SPARK-38939][SQL][FOLLOWUP] Replace named parameter with comment in ReplaceColumns ### What changes were proposed in this pull request? This PR aims to replace a named parameter with a comment in `ReplaceColumns`. ### Why are the changes needed? #36252 changed the signature of `deleteColumn` in **TableChange.java**, which broke sbt compilation in the k8s integration test. ```shell > build/sbt -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=r -Dspark.kubernetes.test.imageRepo=kubespark "kubernetes-integration-tests/test" [error] /Users/IdeaProjects/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala:147:45: not found: value ifExists [error] TableChange.deleteColumn(Array(name), ifExists = false) [error] ^ [error] /Users/IdeaProjects/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala:159:19: value ++ is not a member of Array[Nothing] [error] deleteChanges ++ addChanges [error] ^ [error] two errors found [error] (catalyst / Compile / compileIncremental) Compilation failed ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the GA and k8s integration test. Closes #36487 from dcoliversun/SPARK-38939. Authored-by: Qian.Sun <qian.sun2020@gmail.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit 16b5124d75dc974c37f2fd87c78d231f8a3bf772) Signed-off-by: huaxingao <huaxin_gao@apple.com> 09 May 2022, 17:00:44 UTC
a52a245 [SPARK-38786][SQL][TEST] Bug in StatisticsSuite 'change stats after add/drop partition command' ### What changes were proposed in this pull request? https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979 It should be `partDir2` instead of `partDir1`. Looks like it is a copy paste bug. ### Why are the changes needed? Due to this test bug, the drop command was dropping a wrong (`partDir1`) underlying file in the test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added extra underlying file location check. Closes #36075 from kazuyukitanimura/SPARK-38786. Authored-by: Kazuyuki Tanimura <ktanimura@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> (cherry picked from commit a6b04f007c07fe00637aa8be33a56f247a494110) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 May 2022, 00:21:59 UTC
e9ee2c8 [SPARK-37618][CORE][FOLLOWUP] Support cleaning up shuffle blocks from external shuffle service ### What changes were proposed in this pull request? Fix test failure in build. Depending on the umask of the process running tests (which is typically inherited from the user's default umask), the group writable bit for the files/directories could be set or unset. The test was assuming that by default the umask will be restrictive (and so files/directories wont be group writable). Since this is not a valid assumption, we use jnr to change the umask of the process to be more restrictive - so that the test can validate the behavior change - and reset it back once the test is done. ### Why are the changes needed? Fix test failure in build ### Does this PR introduce _any_ user-facing change? No Adds jnr as a test scoped dependency, which does not bring in any other new dependency (asm is already a dep in spark). ``` [INFO] +- com.github.jnr:jnr-posix:jar:3.0.9:test [INFO] | +- com.github.jnr:jnr-ffi:jar:2.0.1:test [INFO] | | +- com.github.jnr:jffi:jar:1.2.7:test [INFO] | | +- com.github.jnr:jffi:jar:native:1.2.7:test [INFO] | | +- org.ow2.asm:asm:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-commons:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-analysis:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-tree:jar:5.0.3:test [INFO] | | +- org.ow2.asm:asm-util:jar:5.0.3:test [INFO] | | \- com.github.jnr:jnr-x86asm:jar:1.0.2:test [INFO] | \- com.github.jnr:jnr-constants:jar:0.8.6:test ``` ### How was this patch tested? Modification to existing test. Tested on Linux, skips test when native posix env is not found. Closes #36473 from mridulm/fix-SPARK-37618-test. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 317407171cb36439c371153cfd45c1482bf5e425) Signed-off-by: Sean Owen <srowen@gmail.com> 08 May 2022, 13:11:31 UTC
feba335 [SPARK-39083][CORE] Fix race condition between update and clean app data ### What changes were proposed in this pull request? make `cleanAppData` atomic to prevent race condition between update and clean app data. When the race condition happens, it could lead to a scenario when `cleanAppData` delete the entry of ApplicationInfoWrapper for an application right after it has been updated by `mergeApplicationListing`. So there will be cases when the HS Web UI displays `Application not found` for applications whose logs does exist. #### Error message ``` 22/04/29 17:16:21 DEBUG FsHistoryProvider: New/updated attempts found: 1 ArrayBuffer(viewfs://iu/log/spark3/application_1651119726430_138107_1) 22/04/29 17:16:21 INFO FsHistoryProvider: Parsing viewfs://iu/log/spark3/application_1651119726430_138107_1 for listing data... 22/04/29 17:16:21 INFO FsHistoryProvider: Looking for end event; skipping 10805037 bytes from viewfs://iu/log/spark3/application_1651119726430_138107_1... 22/04/29 17:16:21 INFO FsHistoryProvider: Finished parsing viewfs://iu/log/spark3/application_1651119726430_138107_1 22/04/29 17:16:21 ERROR Utils: Uncaught exception in thread log-replay-executor-7 java.util.NoSuchElementException at org.apache.spark.util.kvstore.InMemoryStore.read(InMemoryStore.java:85) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkAndCleanLog$3(FsHistoryProvider.scala:927) at scala.Option.foreach(Option.scala:407) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkAndCleanLog$1(FsHistoryProvider.scala:926) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:2032) at org.apache.spark.deploy.history.FsHistoryProvider.checkAndCleanLog(FsHistoryProvider.scala:916) at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:712) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$15(FsHistoryProvider.scala:576) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` #### Background Currently, the HS runs the `checkForLogs` to build the application list based on the current contents of the log directory for every 10 seconds by default. - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L296-L299 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L472 In each turn of execution, this method scans the specified logDir and parse the log files to update its KVStore: - detect new updated/added files to process : https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L574-L578 - detect stale data to remove: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600 These 2 operations are executed in different threads as `submitLogProcessTask` uses `replayExecutor` to submit tasks. https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1389-L1401 ### When does the bug happen? 
`Application not found` error happens in the following scenario: In the first run of `checkForLogs`, it detected a newly-added log `viewfs://iu/log/spark3/AAA_1.inprogress` (log of an in-progress application named AAA). So it will add 2 entries to the KVStore - one entry of key-value as the key is the logPath (`viewfs://iu/log/spark3/AAA_1.inprogress`) and the value is an instance of LogInfo represented the log - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L495-L505 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L545-L552 - one entry of key-value as the key is the applicationId (`AAA`) and the value is an instance of ApplicationInfoWrapper holding the information of the application. - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L825 - https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172 In the next run of `checkForLogs`, now the AAA application has finished, the log `viewfs://iu/log/spark3/AAA_1.inprogress` has been deleted and a new log `viewfs://iu/log/spark3/AAA_1` is created. So `checkForLogs` will do the following 2 things in 2 different threads: - Thread 1: parsing the new log `viewfs://iu/log/spark3/AAA_1` and update data in its KVStore - add a new entry of key: `viewfs://iu/log/spark3/AAA_1` and value: an instance of LogInfo represented the log - updated the entry with key=applicationId (`AAA`) with new value of an instance of ApplicationInfoWrapper (for example: the isCompleted flag now change from false to true) - Thread 2: data related to `viewfs://iu/log/spark3/AAA_1.inprogress` is now considered as stale and it must be deleted. - clean App data for `viewfs://iu/log/spark3/AAA_1.inprogress` https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L586-L600 - Inside `cleanAppData`, first it loads the latest information of `ApplicationInfoWrapper` from the KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L632 For most of the time, when this line is executed, Thread 1 already finished `updating the entry with key=applicationId (AAA) with new value of an instance of ApplicationInfoWrapper` so this condition https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L637 will be evaluated as false, so `isStale` will be false. However, in some rare cases, when Thread1 does not finish the update yet, the old data of ApplicationInfoWrapper will be load, so `isStale` will be true and it leads to deleting the entry of ApplicationInfoWrapper in KVStore: https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L656-L662 and the worst thing is it delete the entry right after when Thread 1 has finished updating the entry with key=applicationId (`AAA`) with new value of an instance of ApplicationInfoWrapper. So the entry for the ApplicationInfoWrapper of applicationId= `AAA` is removed forever then when users access the Web UI for this application, and `Application not found` is shown up while the log for the app does exist. 
So here we make the `cleanAppData` method atomic just like the `addListing` method https://github.com/apache/spark/blob/v3.2.1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L1172 so that - If Thread 1 gets the lock on the `listing` before Thread 2, it will update the entry for the application, so in Thread 2 `isStale` will be false and the entry for the application will not be removed from the KVStore - If Thread 2 gets the lock on the `listing` before Thread 1, then `isStale` will be true and the entry for the application will be removed from the KVStore, but after that it will be added again by Thread 1. In both cases, the entry for the application will not be deleted unexpectedly from the KVStore. ### Why are the changes needed? Fix the bug causing the HS Web UI to display `Application not found` for applications whose logs do exist. ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Manual test. Deployed in our Spark HS and the `java.util.NoSuchElementException` exception does not happen anymore. The `Application not found` error does not happen anymore. Closes #36424 from tanvn/SPARK-39083. Authored-by: tan.vu <tan.vu@linecorp.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 29643265a9f5e8142d20add5350c614a55161451) Signed-off-by: Sean Owen <srowen@gmail.com> 08 May 2022, 13:09:32 UTC
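A generic, self-contained sketch of the locking pattern applied by the fix (the names and data model below are simplified stand-ins for FsHistoryProvider and its KVStore, not the real code):

```scala
class ListingStore {
  // appId -> isCompleted, standing in for the ApplicationInfoWrapper entries
  private val apps = scala.collection.mutable.Map[String, Boolean]()

  // Thread 1: update the listing after parsing a log file.
  def addListing(appId: String, completed: Boolean): Unit = synchronized {
    apps(appId) = completed
  }

  // Thread 2: clean stale data. The staleness check and the delete now happen
  // atomically with respect to addListing, so a freshly updated entry cannot be
  // removed by a racing cleaner.
  def cleanAppData(appId: String): Unit = synchronized {
    val isStale = apps.get(appId).exists(completed => !completed)
    if (isStale) apps.remove(appId)
  }
}
```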
3241f20 [SPARK-39093][SQL][FOLLOWUP] Fix Period test ### What changes were proposed in this pull request? Change the Period test to use months rather than days. ### Why are the changes needed? While the Period test as-is confirms that the compilation error is fixed, it doesn't confirm that the newly generated code is correct. Spark ignores any unit less than months in Periods. As a result, we are always testing `0/(num + 3)`, so the test doesn't verify that the code generated for the right-hand operand is correct. ### Does this PR introduce _any_ user-facing change? No, only changes a test. ### How was this patch tested? Unit test. Closes #36481 from bersprockets/SPARK-39093_followup. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 670710e91dfb2ed27c24b139c50a6bcf03132024) Signed-off-by: Gengliang Wang <gengliang@apache.org> 08 May 2022, 10:14:58 UTC
6378365 [SPARK-39121][K8S][DOCS] Fix format error on running-on-kubernetes doc ### What changes were proposed in this pull request? Fix format error on running-on-kubernetes doc ### Why are the changes needed? Fix format syntax error ### Does this PR introduce _any_ user-facing change? No, unreleased doc only ### How was this patch tested? - `SKIP_API=1 bundle exec jekyll serve --watch` - CI passed Closes #36476 from Yikun/SPARK-39121. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 2349f74866ae1b365b5e4e0ec8a58c4f7f06885c) Signed-off-by: Max Gekk <max.gekk@gmail.com> 07 May 2022, 07:20:02 UTC
74ffaec [SPARK-39012][SQL] SparkSQL cast partition value does not support all data types ### What changes were proposed in this pull request? Add binary type support for schema inference in SparkSQL. ### Why are the changes needed? Spark does not support binary and boolean types when casting partition value to a target type. This PR adds the support for binary and boolean. ### Does this PR introduce _any_ user-facing change? No. This is more like fixing a bug. ### How was this patch tested? UT Closes #36344 from amaliujia/inferbinary. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 08c07a17717691f931d4d3206dd0385073f5bd08) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 May 2022, 09:39:13 UTC
de3673e [SPARK-39105][SQL] Add ConditionalExpression trait ### What changes were proposed in this pull request? Add a `ConditionalExpression` trait. ### Why are the changes needed? For developers, if a custom conditional-like expression contains a common sub-expression, the evaluation order may change because Spark pulls out and evaluates the common sub-expressions first during execution. Adding a ConditionalExpression trait is friendly to developers. ### Does this PR introduce _any_ user-facing change? no, adds a new trait ### How was this patch tested? Pass existing tests Closes #36455 from ulysses-you/SPARK-39105. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fa86a078bb7d57d7dbd48095fb06059a9bdd6c2e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 May 2022, 07:42:10 UTC
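A minimal sketch of what such a marker trait conveys (the member shown is an assumption about its shape, not necessarily the exact Spark API): it tells common-subexpression elimination which inputs are always evaluated, so nothing gets hoisted out of branches that might never run.

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Name suffixed with "Like" to make clear this is a sketch, not Spark's trait.
trait ConditionalExpressionLike { self: Expression =>
  // children that are unconditionally evaluated and therefore safe candidates
  // for common-subexpression elimination
  def alwaysEvaluatedInputs: Seq[Expression]
}
```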
6a61f95 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases ### What changes were proposed in this pull request? Add missed dependencies to `dev/create-release/spark-rm/Dockerfile`. ### Why are the changes needed? To be able to build Spark releases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By building the Spark 3.3 release via: ``` $ dev/create-release/do-release-docker.sh -d /home/ubuntu/max/spark-3.3-rc1 ``` Closes #36449 from MaxGekk/deps-Dockerfile. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 4b1c2fb7a27757ebf470416c8ec02bb5c1f7fa49) Signed-off-by: Max Gekk <max.gekk@gmail.com> 05 May 2022, 17:10:17 UTC
0f2e3ec [SPARK-35912][SQL][FOLLOW-UP] Add a legacy configuration for respecting nullability in DataFrame.schema.csv/json(ds) ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/33436, that adds a legacy configuration. It's found that it can break a valid usacase (https://github.com/apache/spark/pull/33436/files#r863271189): ```scala import org.apache.spark.sql.types._ val ds = Seq("a,", "a,b").toDS spark.read.schema( StructType( StructField("f1", StringType, nullable = false) :: StructField("f2", StringType, nullable = false) :: Nil) ).option("mode", "DROPMALFORMED").csv(ds).show() ``` **Before:** ``` +---+---+ | f1| f2| +---+---+ | a| b| +---+---+ ``` **After:** ``` +---+----+ | f1| f2| +---+----+ | a|null| | a| b| +---+----+ ``` This PR adds a configuration to restore **Before** behaviour. ### Why are the changes needed? To avoid breakage of valid usecases. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new configuration `spark.sql.legacy.respectNullabilityInTextDatasetConversion` (`false` by default) to respect the nullability in `DataFrameReader.schema(schema).csv(dataset)` and `DataFrameReader.schema(schema).json(dataset)` when the user-specified schema is provided. ### How was this patch tested? Unittests were added. Closes #36435 from HyukjinKwon/SPARK-35912. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6689b97ec76abe5bab27f02869f8f16b32530d1a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 May 2022, 07:23:43 UTC
1fa3171 [SPARK-39060][SQL][3.3] Typo in error messages of decimal overflow ### What changes were proposed in this pull request? This PR removes extra curly bracket from debug string for Decimal type in SQL. This is a backport from master branch. Commit: 165ce4eb7d6d75201beb1bff879efa99fde24f94 ### Why are the changes needed? Typo in error messages of decimal overflow. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running tests: ``` $ build/sbt "sql/testOnly" ``` Closes #36450 from vli-databricks/SPARK-39060-3.3. Authored-by: Vitalii Li <vitalii.li@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 05 May 2022, 06:08:38 UTC
94d3d6b [SPARK-38891][SQL] Skipping allocating vector for repetition & definition levels when possible ### What changes were proposed in this pull request? This PR adds two optimizations to the vectorized Parquet reader for complex types: - avoid allocating vectors for repetition and definition levels whenever possible - avoid reading definition levels whenever possible ### Why are the changes needed? At the moment, Spark will allocate vectors for repetition and definition levels, and also read definition levels even when it's not necessary, for instance when reading primitive types. This may add extra memory footprint, especially when reading wide tables. Therefore, we should avoid this if possible. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #36202 from sunchao/SPARK-38891. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> 04 May 2022, 16:37:55 UTC
fd998c8 [SPARK-39093][SQL] Avoid codegen compilation error when dividing year-month intervals or day-time intervals by an integral ### What changes were proposed in this pull request? In `DivideYMInterval#doGenCode` and `DivideDTInterval#doGenCode`, rely on the operand variable names provided by `nullSafeCodeGen` rather than calling `genCode` on the operands twice. ### Why are the changes needed? `DivideYMInterval#doGenCode` and `DivideDTInterval#doGenCode` call `genCode` on the operands twice (once directly, and once indirectly via `nullSafeCodeGen`). However, if you call `genCode` on an operand twice, you might not get back the same variable name for both calls (e.g., when the operand is not a `BoundReference` or if whole-stage codegen is turned off). When that happens, `nullSafeCodeGen` generates initialization code for one set of variables, but the divide expression generates usage code for another set of variables, resulting in compilation errors like this: ``` spark-sql> create or replace temp view v1 as > select * FROM VALUES > (interval '10' months, interval '10' day, 2) > as v1(period, duration, num); Time taken: 2.81 seconds spark-sql> cache table v1; Time taken: 2.184 seconds spark-sql> select period/(num + 3) from v1; 22/05/03 08:56:37 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 40, Column 44: Expression "project_value_2" is not an rvalue ... 22/05/03 08:56:37 WARN UnsafeProjection: Expr codegen error and falling back to interpreter mode ... 0-2 Time taken: 0.149 seconds, Fetched 1 row(s) spark-sql> select duration/(num + 3) from v1; 22/05/03 08:57:29 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 40, Column 54: Expression "project_value_2" is not an rvalue ... 22/05/03 08:57:29 WARN UnsafeProjection: Expr codegen error and falling back to interpreter mode ... 2 00:00:00.000000000 Time taken: 0.089 seconds, Fetched 1 row(s) ``` The error is not fatal (unless you have `spark.sql.codegen.fallback` set to `false`), but it muddies the log and can slow the query (since the expression is interpreted). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests (unit tests run with `spark.sql.codegen.fallback` set to `false`, so the new tests fail without the fix). Closes #36442 from bersprockets/interval_div_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit ca87bead23ca32a05c6a404a91cea47178f63e70) Signed-off-by: Gengliang Wang <gengliang@apache.org> 04 May 2022, 09:22:24 UTC
d3aadb4 [SPARK-39087][SQL][3.3] Improve messages of error classes ### What changes were proposed in this pull request? In the PR, I propose to modify error messages of the following error classes: - INVALID_JSON_SCHEMA_MAP_TYPE - INCOMPARABLE_PIVOT_COLUMN - INVALID_ARRAY_INDEX_IN_ELEMENT_AT - INVALID_ARRAY_INDEX - DIVIDE_BY_ZERO This is a backport of https://github.com/apache/spark/pull/36428. ### Why are the changes needed? To improve readability of error messages. ### Does this PR introduce _any_ user-facing change? Yes. It changes user-facing error messages. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt "sql/testOnly *QueryCompilationErrorsSuite*" $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite*" $ build/sbt "sql/testOnly *QueryExecutionAnsiErrorsSuite" $ build/sbt "test:testOnly *SparkThrowableSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 040526391a45ad610422a48c05aa69ba5133f922) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36439 from MaxGekk/error-class-improve-msg-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 04 May 2022, 05:45:03 UTC
0515536 Preparing development version 3.3.1-SNAPSHOT 03 May 2022, 18:15:51 UTC
482b7d5 Preparing Spark release v3.3.0-rc1 03 May 2022, 18:15:45 UTC
4177626 [SPARK-35320][SQL][FOLLOWUP] Remove duplicated test ### What changes were proposed in this pull request? Follow-up for https://github.com/apache/spark/pull/33525 to remove duplicated test. ### Why are the changes needed? We don't need to do the same test twice. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This patch remove the duplicated test, so the existing test should pass. Closes #36436 from itholic/SPARK-35320. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a1ac5c57c7b79fb70656638d284b77dfc4261d35) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 May 2022, 10:28:13 UTC
bd6fd7e [SPARK-39084][PYSPARK] Fix df.rdd.isEmpty() by using TaskContext to stop iterator on task completion ### What changes were proposed in this pull request? This PR fixes the issue described in https://issues.apache.org/jira/browse/SPARK-39084 where calling `df.rdd.isEmpty()` on a particular dataset could result in a JVM crash and/or executor failure. The issue was due to Python iterator not being synchronised with Java iterator so when the task is complete, the Python iterator continues to process data. We have introduced ContextAwareIterator as part of https://issues.apache.org/jira/browse/SPARK-33277 but we did not fix all of the places where this should be used. ### Why are the changes needed? Fixes the JVM crash when checking isEmpty() on a dataset. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a test case that reproduces the issue 100%. I confirmed that the test fails without the fix and passes with the fix. Closes #36425 from sadikovi/fix-pyspark-iter-2. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9305cc744d27daa6a746d3eb30e7639c63329072) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 May 2022, 23:30:14 UTC
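The guard this fix relies on, sketched in simplified form (the real ContextAwareIterator lives in Spark core; this version only shows the idea):

```scala
import org.apache.spark.TaskContext

// Stop serving rows once the owning task has completed or been interrupted, so a
// downstream consumer (e.g. the Python worker) cannot keep reading after task cleanup.
class TaskScopedIterator[T](context: TaskContext, delegate: Iterator[T]) extends Iterator[T] {
  override def hasNext: Boolean =
    !context.isCompleted() && !context.isInterrupted() && delegate.hasNext
  override def next(): T = delegate.next()
}
```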
1804f5c [SPARK-37474][R][DOCS][FOLLOW-UP] Make SparkR documentation able to build on Mac OS ### What changes were proposed in this pull request? Currently SparkR documentation fails because of the usage `grep -oP `. Mac OS does not have this. This PR fixes it via using the existing way used in the current scripts at: https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/R/check-cran.sh#L52 ### Why are the changes needed? To make the dev easier. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested via: ```bash cd R ./create-docs.sh ``` Closes #36423 from HyukjinKwon/SPARK-37474. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6479455b8db40d584045cdb13e6c3cdfda7a2c0b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 May 2022, 08:50:09 UTC
4c09616 [SPARK-38085][SQL][FOLLOWUP] Do not fail too early for DeleteFromTable ### What changes were proposed in this pull request? `DeleteFromTable` has been in Spark for a long time and there are existing Spark extensions to compile `DeleteFromTable` to physical plans. However, the new analyzer rule `RewriteDeleteFromTable` fails very early if the v2 table does not support delete. This breaks certain Spark extensions which can still execute `DeleteFromTable` for certain v2 tables. This PR simply removes the error throwing in `RewriteDeleteFromTable`. It's a safe change because: 1. the new delete-related rules only match v2 tables with `SupportsRowLevelOperations`, so they won't be affected by this change 2. the planner rule will fail eventually if the v2 table doesn't support deletion. Spark eagerly executes commands, so Spark users can still see this error immediately. ### Why are the changes needed? To not break existing Spark extensions. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #36402 from cloud-fan/follow. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5630f700768432396a948376f5b46b00d4186e1b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 May 2022, 02:28:51 UTC
c87180a [SPARK-39040][SQL] Respect NaNvl in EquivalentExpressions for expression elimination ### What changes were proposed in this pull request? Respect NaNvl in EquivalentExpressions for expression elimination. ### Why are the changes needed? For example the query will fail: ```sql set spark.sql.ansi.enabled=true; set spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding; SELECT nanvl(1, 1/0 + 1/0); ``` ```sql org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4) (10.221.98.68 executor driver): org.apache.spark.SparkArithmeticException: divide by zero. To return NULL instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error. == SQL(line 1, position 17) == select nanvl(1 , 1/0 + 1/0) ^^^ at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:151) ``` We should respect the ordering of conditional expression that always evaluate the predicate branch first, so the query above should not fail. ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? add test Closes #36376 from ulysses-you/SPARK-39040. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f6b43f0384f9681b963f52a53759c521f6ac11d5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 April 2022, 04:36:05 UTC
e7caea5 [SPARK-39047][SQL][3.3] Replace the error class ILLEGAL_SUBSTRING by INVALID_PARAMETER_VALUE ### What changes were proposed in this pull request? In the PR, I propose to remove the `ILLEGAL_SUBSTRING` error class, and use `INVALID_PARAMETER_VALUE` in the case when the `strfmt` parameter of the `format_string()` function contains `%0$`. The last value is handled differently by JDKs: _"... Java 8 and Java 11 uses it as "%1$", and Java 17 throws IllegalFormatArgumentIndexException(Illegal format argument index = 0)"_. This is a backport of https://github.com/apache/spark/pull/36380. ### Why are the changes needed? To improve code maintenance and user experience with Spark SQL by reducing the number of user-facing error classes. ### Does this PR introduce _any_ user-facing change? Yes, it changes user-facing error message. Before: ```sql spark-sql> select format_string('%0$s', 'Hello'); Error in query: [ILLEGAL_SUBSTRING] The argument_index of string format cannot contain position 0$.; line 1 pos 7 ``` After: ```sql spark-sql> select format_string('%0$s', 'Hello'); Error in query: [INVALID_PARAMETER_VALUE] The value of parameter(s) 'strfmt' in `format_string` is invalid: expects %1$, %2$ and so on, but got %0$.; line 1 pos 7 ``` ### How was this patch tested? By running the affected test suites: ``` $ build/sbt "test:testOnly *SparkThrowableSuite" $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z text.sql" $ build/sbt "test:testOnly *QueryCompilationErrorsSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 9dcc24c36f6fcdf43bf66fe50415be575f7b2918) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #36390 from MaxGekk/error-class-ILLEGAL_SUBSTRING-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 28 April 2022, 11:11:17 UTC
50f5e9c [SPARK-39055][DOC] Fix documentation 404 page ### What changes were proposed in this pull request? replace unused `docs/_layouts/404.html` with `docs/404.md` ### Why are the changes needed? make the custom 404 page work <img width="638" alt="image" src="https://user-images.githubusercontent.com/8326978/165706963-6cc96cf5-299e-4b60-809f-79dd771f3b5d.png"> ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ``` bundle exec jekyll serve ``` and visit a non-existing page `http://localhost:4000/abcd` Closes #36392 from yaooqinn/SPARK-39055. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ec2bfa566ed3c796e91987f7a158e8b60fbd5c42) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 April 2022, 10:57:51 UTC
606a99f [SPARK-39046][SQL] Return an empty context string if TreeNode.origin is wrongly set ### What changes were proposed in this pull request? For the query context `TreeNode.origin.context`, this PR proposes to return an empty context string if * the query text, the start index, or the stop index is missing * the start index is less than 0 * the stop index is larger than the length of the query text * the start index is larger than the stop index ### Why are the changes needed? There are downstream projects that depend on Spark. There is no guarantee for the correctness of TreeNode.origin. Developers may create a plan/expression with an Origin containing a wrong startIndex/stopIndex/sqlText. Thus, to avoid errors in calling `String.substring` or showing misleading debug information, I suggest returning an empty context string if TreeNode.origin is wrongly set. The query context is just for better error messages and we should handle it cautiously. ### Does this PR introduce _any_ user-facing change? No, the context framework is not released yet. ### How was this patch tested? UT Closes #36379 from gengliangwang/safeContext. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 7fe2759e9f81ec267e92e1c6f8a48f42042db791) Signed-off-by: Gengliang Wang <gengliang@apache.org> 28 April 2022, 01:59:34 UTC
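A standalone sketch of the validation listed above (the helper name and signature are assumptions): a context is produced only when the [start, stop] range is fully usable inside the query text; otherwise an empty string is returned.

```scala
def safeContext(sqlText: Option[String], start: Option[Int], stop: Option[Int]): String =
  (sqlText, start, stop) match {
    case (Some(text), Some(s), Some(e)) if s >= 0 && e < text.length && s <= e =>
      text.substring(s, e + 1)
    case _ => "" // missing pieces or an out-of-range/inverted index: no context
  }
```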
96d66b0 [SPARK-38988][PYTHON] Suppress PerformanceWarnings of `DataFrame.info` ### What changes were proposed in this pull request? Suppress PerformanceWarnings of DataFrame.info ### Why are the changes needed? To improve usability. ### Does this PR introduce _any_ user-facing change? No. Only PerformanceWarnings of DataFrame.info are suppressed. ### How was this patch tested? Manual tests. Closes #36367 from xinrong-databricks/frame.info. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 594337fad131280f62107326062fb554f0566d43) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 April 2022, 00:25:51 UTC
c9b6b50 [SPARK-39049][PYTHON][CORE][ML] Remove unneeded `pass` ### What changes were proposed in this pull request? Remove unneeded `pass` statements. ### Why are the changes needed? The classes Estimator, Transformer and Evaluator are abstract classes that already define functions, and the ValueError branch in `def run()` already has code, so the trailing `pass` statements are unneeded. Removing them makes the code easier to read, understand and reuse. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests passed. Closes #36383 from bjornjorgensen/remove-unneeded-pass. Lead-authored-by: Bjørn Jørgensen <bjornjorgensen@gmail.com> Co-authored-by: bjornjorgensen <bjornjorgensen@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 0e875875059c1cbf36de49205a4ce8dbc483d9d1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 April 2022, 00:21:07 UTC
84addc5 [SPARK-39051][PYTHON] Minor refactoring of `python/pyspark/sql/pandas/conversion.py` Minor refactoring of `python/pyspark/sql/pandas/conversion.py`, which includes: - doc change - renaming To improve code readability and maintainability. No. Existing tests. Closes #36384 from xinrong-databricks/conversion.py. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c19fadabde3ef3f9c7e4fa9bf74632a4f8e1f3e2) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 April 2022, 00:20:18 UTC
4a4e35a [SPARK-38997][SQL] DS V2 aggregate push-down supports group by expressions ### What changes were proposed in this pull request? Currently, Spark DS V2 aggregate push-down only supports grouping by columns. But the SQL shown below is very useful and common. ``` SELECT CASE WHEN 'SALARY' > 8000.00 AND 'SALARY' < 10000.00 THEN 'SALARY' ELSE 0.00 END AS key, SUM('SALARY') FROM "test"."employee" GROUP BY key ``` ### Why are the changes needed? Let DS V2 aggregate push-down support group by expressions. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New tests Closes #36325 from beliefer/SPARK-38997. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ee6ea3c68694e35c36ad006a7762297800d1e463) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 April 2022, 16:44:18 UTC
b25276f [SPARK-39015][SQL][3.3] Remove the usage of toSQLValue(v) without an explicit type ### What changes were proposed in this pull request? This PR is a backport of https://github.com/apache/spark/pull/36351 This PR proposes to remove the the usage of `toSQLValue(v)` without an explicit type. `Literal(v)` is intended to be used from end-users so it cannot handle our internal types such as `UTF8String` and `ArrayBasedMapData`. Using this method can lead to unexpected error messages such as: ``` Caused by: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE] The feature is not supported: literal for 'hair' of class org.apache.spark.unsafe.types.UTF8String. at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:241) at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99) at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:45) ... ``` Since It is impossible to have the corresponding data type from the internal types as one type can map to multiple external types (e.g., `Long` for `Timestamp`, `TimestampNTZ`, and `LongType`), the removal approach was taken. ### Why are the changes needed? To provide the error messages as intended. ### Does this PR introduce _any_ user-facing change? Yes. ```scala import org.apache.spark.sql.Row import org.apache.spark.sql.types.StructType import org.apache.spark.sql.types.StringType import org.apache.spark.sql.types.DataTypes val arrayStructureData = Seq( Row(Map("hair"->"black", "eye"->"brown")), Row(Map("hair"->"blond", "eye"->"blue")), Row(Map())) val mapType = DataTypes.createMapType(StringType, StringType) val arrayStructureSchema = new StructType().add("properties", mapType) val mapTypeDF = spark.createDataFrame( spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema) spark.conf.set("spark.sql.ansi.enabled", true) mapTypeDF.selectExpr("element_at(properties, 'hair')").show ``` Before: ``` Caused by: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE] The feature is not supported: literal for 'hair' of class org.apache.spark.unsafe.types.UTF8String. at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:241) at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:99) at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:45) ... ``` After: ``` Caused by: org.apache.spark.SparkNoSuchElementException: [MAP_KEY_DOES_NOT_EXIST] Key 'hair' does not exist. To return NULL instead, use 'try_element_at'. If necessary set spark.sql.ansi.enabled to false to bypass this error. == SQL(line 1, position 0) == element_at(properties, 'hair') ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ``` ### How was this patch tested? Unittest was added. Otherwise, existing test cases should cover. Closes #36375 from HyukjinKwon/SPARK-39015-3.3. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 27 April 2022, 13:17:51 UTC
b3ecff3 [SPARK-34079][SQL][FOLLOW-UP] Revert some changes in InjectRuntimeFilterSuite ### What changes were proposed in this pull request? Remove unnecessary changes from `InjectRuntimeFilterSuite` after https://github.com/apache/spark/pull/32298. ### Why are the changes needed? These changes are not needed after https://github.com/apache/spark/pull/34929 as the final optimized plan doesn't contain any `WithCTE` nodes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a new test. Closes #36361 from peter-toth/SPARK-34079-multi-column-scalar-subquery-follow-up-2. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d05e01d54024e3844f1e48e03bad3fd814b8f6b9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 April 2022, 08:18:06 UTC
d59f118 [SPARK-39032][PYTHON][DOCS] Examples' tag for pyspark.sql.functions.when() ### What changes were proposed in this pull request? Fix the missing keyword in the `pyspark.sql.functions.when()` documentation. ### Why are the changes needed? The [documentation](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.when.html) is not formatted correctly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? All tests passed. Closes #36369 from vadim/SPARK-39032. Authored-by: vadim <86705+vadim@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5b53bdfa83061c160652e07b999f996fc8bd2ece) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 April 2022, 07:56:30 UTC
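For context, the `when`/`otherwise` construct being documented behaves the same across the language APIs; a small illustrative example using the equivalent Scala functions (the column values and labels are made up for the example):

```scala
import org.apache.spark.sql.functions.{col, when}

val df = spark.range(5).toDF("id")
// Label each row based on a condition; rows matching no branch fall back to otherwise().
df.select(
  col("id"),
  when(col("id") > 2, "big").otherwise("small").alias("size")
).show()
```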
793ba60 [SPARK-38918][SQL] Nested column pruning should filter out attributes that do not belong to the current relation ### What changes were proposed in this pull request? This PR updates `ProjectionOverSchema` to use the outputs of the data source relation to filter the attributes in the nested schema pruning. This is needed because the attributes in the schema do not necessarily belong to the current data source relation. For example, if a filter contains a correlated subquery, then the subquery's children can contain attributes from both the inner query and the outer query. Since the `RewriteSubquery` batch happens after early scan pushdown rules, nested schema pruning can wrongly use the inner query's attributes to prune the outer query data schema, thus causing wrong results and unexpected exceptions. ### Why are the changes needed? To fix a bug in `SchemaPruning`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #36216 from allisonwang-db/spark-38918-nested-column-pruning. Authored-by: allisonwang-db <allison.wang@databricks.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 150434b5d7909dcf8248ffa5ec3d937ea3da09fd) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 27 April 2022, 05:39:53 UTC
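A hedged sketch of the query shape the fix targets: a Parquet table with a nested struct column, filtered by a correlated subquery, where only part of the struct is selected. The paths, table names, and columns below are hypothetical.

```scala
// Hypothetical data: `contacts` has a nested struct column, `employees` does not.
spark.range(3)
  .selectExpr("id", "named_struct('first', concat('n', id), 'last', concat('l', id)) AS name")
  .write.mode("overwrite").parquet("/tmp/contacts")
spark.range(3).selectExpr("id", "id * 1000 AS salary")
  .write.mode("overwrite").parquet("/tmp/employees")

spark.read.parquet("/tmp/contacts").createOrReplaceTempView("contacts")
spark.read.parquet("/tmp/employees").createOrReplaceTempView("employees")

// The correlated EXISTS subquery mixes attributes from the inner and outer
// plans; before the fix, nested schema pruning could wrongly apply inner
// attributes to the outer relation's data schema.
spark.sql(
  """SELECT name.first FROM contacts c
    |WHERE EXISTS (SELECT 1 FROM employees e WHERE e.id = c.id AND e.salary > 1000)
    |""".stripMargin).show()
```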
3df3421 [SPARK-39027][SQL][3.3] Output SQL statements in error messages in upper case and w/o double quotes ### What changes were proposed in this pull request? In the PR, I propose to output any SQL statement in error messages in upper case, and to apply the new implementation of `QueryErrorsBase.toSQLStmt()` to all exceptions in `Query.*Errors` w/ error classes. Also this PR modifies all affected tests, see the list in the section "How was this patch tested?". This is a backport of https://github.com/apache/spark/pull/36359. ### Why are the changes needed? To improve user experience with Spark SQL by highlighting SQL statements in error messages and making them more visible to users. Also this PR makes SQL statements in error messages consistent with the docs, where such elements are shown in upper case and without quotes. ### Does this PR introduce _any_ user-facing change? Yes. The changes might affect error messages: Before: ```sql The operation "DESC PARTITION" is not allowed ``` After: ```sql The operation DESC PARTITION is not allowed ``` ### How was this patch tested? By running affected test suites: ``` $ build/sbt "sql/testOnly *QueryExecutionErrorsSuite" $ build/sbt "sql/testOnly *QueryParsingErrorsSuite" $ build/sbt "sql/testOnly *QueryCompilationErrorsSuite" $ build/sbt "test:testOnly *QueryCompilationErrorsDSv2Suite" $ build/sbt "test:testOnly *ExtractPythonUDFFromJoinConditionSuite" $ build/sbt "testOnly *PlanParserSuite" $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z transform.sql" $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z join-lateral.sql" $ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z describe.sql" $ build/sbt "testOnly *DDLParserSuite" ``` Closes #36363 from MaxGekk/error-class-toSQLStmt-no-quotes-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 April 2022, 04:15:04 UTC
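A hedged illustration of one place this message can surface: running DESC PARTITION against a temporary view, which has no partition metadata. The exact error class and message shape in the comment below are approximate, not quoted from the PR.

```scala
// A temporary view cannot be described per-partition, so the statement fails.
spark.range(3).selectExpr("id", "id AS p").createOrReplaceTempView("v")

spark.sql("DESC TABLE v PARTITION (p = 1)").show()
// org.apache.spark.sql.AnalysisException:
//   The operation DESC PARTITION is not allowed on the temporary view: `v`
```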
80efaa2 [SPARK-34863][SQL][FOLLOW-UP] Handle IsAllNull in OffHeapColumnVector ### What changes were proposed in this pull request? This PR fixes an issue of reading null columns with the vectorised Parquet reader when the entire column is null or does not exist. This is especially noticeable when performing a merge or schema evolution in Parquet. The issue is only exposed with `OffHeapColumnVector`, which does not handle the `isAllNull` flag - `OnHeapColumnVector` already handles `isAllNull`, so everything works fine there. ### Why are the changes needed? The change is needed to correctly read null columns using the vectorised reader in the off-heap mode. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I updated the existing unit tests to ensure we cover off-heap mode. I confirmed that the tests pass with the fix and fail without it. Closes #36366 from sadikovi/fix-off-heap-cv. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Chao Sun <sunchao@apple.com> 27 April 2022, 03:45:23 UTC
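A hedged sketch of the affected scenario: reading a Parquet file with a user-specified schema that contains a column absent from the file, while column vectors are allocated off heap. The path and column names are placeholders.

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Switch the vectorized reader to off-heap column vector allocations.
spark.conf.set("spark.sql.columnVector.offheap.enabled", "true")

spark.range(10).selectExpr("id").write.mode("overwrite").parquet("/tmp/offheap_null")

// `extra` does not exist in the file, so the entire column is null (isAllNull).
val schema = new StructType().add("id", LongType).add("extra", StringType)
spark.read.schema(schema).parquet("/tmp/offheap_null").show()
```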
052ae96 [SPARK-39030][PYTHON] Rename sum to avoid shadowing the builtin Python function ### What changes were proposed in this pull request? Rename `sum` to something else so it no longer shadows the builtin. ### Why are the changes needed? `sum` is a built-in function in Python; see [sum() in the Python docs](https://docs.python.org/3/library/functions.html#sum). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Use existing tests. Closes #36364 from bjornjorgensen/rename-sum. Authored-by: bjornjorgensen <bjornjorgensen@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 3821d807a599a2d243465b4e443f1eb68251d432) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 April 2022, 01:10:20 UTC
41d44d1 Revert "[SPARK-38354][SQL] Add hash probes metric for shuffled hash join" This reverts commit 158436655f30141bbd5afa8d95aec66282a5c4b4, as the original PR caused a performance regression reported in https://github.com/apache/spark/pull/35686#issuecomment-1107807027. Closes #36338 from c21/revert-metrics. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6b5a1f9df28262fa90d28dc15af67e8a37a9efcf) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 April 2022, 23:12:45 UTC
e2f82fd [SPARK-39019][TESTS] Use `withTempPath` to clean up temporary data directory after `SPARK-37463: read/write Timestamp ntz to Orc with different time zone` ### What changes were proposed in this pull request? The test `SPARK-37463: read/write Timestamp ntz to Orc with different time zone` uses an absolute path to save the test data and does not clean the data up after the test. This PR changes the test to use `withTempPath` to ensure the data directory is cleaned up after testing. ### Why are the changes needed? To clean up the temporary data directory after the test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA - Manual test: Run ``` mvn clean install -pl sql/core -am -DskipTests mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.orc.OrcV1QuerySuite git status ``` **Before** ``` sql/core/ts_ntz_orc/ ls sql/core/ts_ntz_orc _SUCCESS part-00000-9523e257-5024-4980-8bb3-12070222b0bd-c000.snappy.orc ``` **After** No residual `sql/core/ts_ntz_orc/` directory. Closes #36352 from LuciferYang/SPARK-39019. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 97449d23a3b2232e14e63c6645919c5d93e4491c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 April 2022, 22:52:53 UTC
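For context, a minimal sketch of the `withTempPath` pattern this PR switches to (simplified from the helper in Spark's SQL test utilities; the recursive-delete logic here is a stand-in for Spark's internal utility):

```scala
import java.io.File
import java.util.UUID

// Hands the body a fresh, non-existent path and removes it afterwards, so no
// directory such as sql/core/ts_ntz_orc is left behind after the test.
def withTempPath(f: File => Unit): Unit = {
  val path = new File(System.getProperty("java.io.tmpdir"), UUID.randomUUID().toString)
  def deleteRecursively(file: File): Unit = {
    if (file.isDirectory) Option(file.listFiles()).toSeq.flatten.foreach(deleteRecursively)
    file.delete()
  }
  try f(path) finally deleteRecursively(path)
}

withTempPath { dir =>
  spark.range(3).write.orc(dir.getCanonicalPath)
  spark.read.orc(dir.getCanonicalPath).show()
}
```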
dd192b7 [SPARK-34079][SQL][FOLLOW-UP] Remove debug logging ### What changes were proposed in this pull request? To remove debug logging accidentally left in code after https://github.com/apache/spark/pull/32298. ### Why are the changes needed? No need for that logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #36354 from peter-toth/SPARK-34079-multi-column-scalar-subquery-follow-up. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f24c9b0d135ce7ef4f219ab661a6b665663039f0) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 26 April 2022, 10:19:58 UTC
ff9b163 [SPARK-39001][SQL][DOCS][FOLLOW-UP] Revert the doc changes for dropFieldIfAllNull, prefersDecimal and primitivesAsString (schema_of_json) ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/36339. The `schema_of_json` expression actually supports `dropFieldIfAllNull`, `prefersDecimal` and `primitivesAsString`: ```scala scala> spark.range(1).selectExpr("""schema_of_json("{'a': null}", map('dropFieldIfAllNull', 'true'))""").show() +---------------------------+ |schema_of_json({'a': null})| +---------------------------+ | STRUCT<>| +---------------------------+ scala> spark.range(1).selectExpr("""schema_of_json("{'b': 1.0}", map('prefersDecimal', 'true'))""").show() +--------------------------+ |schema_of_json({'b': 1.0})| +--------------------------+ | STRUCT<b: DECIMAL...| +--------------------------+ scala> spark.range(1).selectExpr("""schema_of_json("{'b': 1.0}", map('primitivesAsString', 'true'))""").show() +--------------------------+ |schema_of_json({'b': 1.0})| +--------------------------+ | STRUCT<b: STRING>| +--------------------------+ ``` ### Why are the changes needed? For correct documentation. ### Does this PR introduce _any_ user-facing change? To end users, no, because it is a partial revert of documentation that has not been released yet. ### How was this patch tested? This is a partial logical revert, so I did not add a test; it is just a documentation change. Closes #36346 from HyukjinKwon/SPARK-39001-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5056c6cc333982d39546f2acf9a889d102cc4ab3) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 26 April 2022, 01:57:58 UTC