https://github.com/apache/spark

ea1a426 Preparing Spark release v3.3.1-rc1 14 September 2022, 00:20:48 UTC
d70f156 [SPARK-40417][K8S][DOCS] Use YuniKorn v1.1+ ### What changes were proposed in this pull request? This PR aims to update K8s document to declare the support of YuniKorn v1.1.+ ### Why are the changes needed? YuniKorn 1.1.0 has 87 JIRAs and is the first version to support multi-arch officially. - https://yunikorn.apache.org/release-announce/1.1.0 ``` $ docker inspect apache/yunikorn:scheduler-1.0.0 | grep Architecture "Architecture": "amd64", $ docker inspect apache/yunikorn:scheduler-1.1.0 | grep Architecture "Architecture": "arm64", ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested with Apache YuniKorn v1.1.0+. ``` $ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests \ -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/test" \ -Dtest.exclude.tags=minikube,local,decom \ -Dtest.default.exclude.tags= ... [info] KubernetesSuite: [info] - Run SparkPi with no resources (11 seconds, 238 milliseconds) [info] - Run SparkPi with no resources & statefulset allocation (11 seconds, 58 milliseconds) [info] - Run SparkPi with a very long application name. (9 seconds, 948 milliseconds) [info] - Use SparkLauncher.NO_RESOURCE (9 seconds, 884 milliseconds) [info] - Run SparkPi with a master URL without a scheme. (9 seconds, 834 milliseconds) [info] - Run SparkPi with an argument. (9 seconds, 870 milliseconds) [info] - Run SparkPi with custom labels, annotations, and environment variables. (9 seconds, 887 milliseconds) [info] - All pods have the same service account by default (9 seconds, 891 milliseconds) [info] - Run extraJVMOptions check on driver (5 seconds, 888 milliseconds) [info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (10 seconds, 261 milliseconds) [info] - Run SparkPi with env and mount secrets. (18 seconds, 702 milliseconds) [info] - Run PySpark on simple pi.py example (10 seconds, 944 milliseconds) [info] - Run PySpark to test a pyfiles example (13 seconds, 934 milliseconds) [info] - Run PySpark with memory customization (10 seconds, 853 milliseconds) [info] - Run in client mode. (11 seconds, 301 milliseconds) [info] - Start pod creation from template (9 seconds, 853 milliseconds) [info] - SPARK-38398: Schedule pod creation from template (9 seconds, 923 milliseconds) [info] - Run SparkR on simple dataframe.R example (13 seconds, 929 milliseconds) [info] YuniKornSuite: [info] - Run SparkPi with no resources (9 seconds, 769 milliseconds) [info] - Run SparkPi with no resources & statefulset allocation (9 seconds, 776 milliseconds) [info] - Run SparkPi with a very long application name. (9 seconds, 856 milliseconds) [info] - Use SparkLauncher.NO_RESOURCE (9 seconds, 803 milliseconds) [info] - Run SparkPi with a master URL without a scheme. (10 seconds, 783 milliseconds) [info] - Run SparkPi with an argument. (10 seconds, 771 milliseconds) [info] - Run SparkPi with custom labels, annotations, and environment variables. (9 seconds, 868 milliseconds) [info] - All pods have the same service account by default (10 seconds, 811 milliseconds) [info] - Run extraJVMOptions check on driver (6 seconds, 858 milliseconds) [info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (11 seconds, 171 milliseconds) [info] - Run SparkPi with env and mount secrets. 
(18 seconds, 221 milliseconds) [info] - Run PySpark on simple pi.py example (11 seconds, 970 milliseconds) [info] - Run PySpark to test a pyfiles example (13 seconds, 990 milliseconds) [info] - Run PySpark with memory customization (11 seconds, 992 milliseconds) [info] - Run in client mode. (11 seconds, 294 milliseconds) [info] - Start pod creation from template (11 seconds, 10 milliseconds) [info] - SPARK-38398: Schedule pod creation from template (9 seconds, 956 milliseconds) [info] - Run SparkR on simple dataframe.R example (12 seconds, 992 milliseconds) [info] Run completed in 10 minutes, 15 seconds. [info] Total number of tests run: 36 [info] Suites: completed 2, aborted 0 [info] Tests: succeeded 36, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 751 s (12:31), completed Sep 13, 2022, 11:47:24 AM ``` Closes #37872 from dongjoon-hyun/SPARK-40417. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a934cabd24afa5c8f6e8e1d2341829166129a5c8) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 September 2022, 19:52:33 UTC
1741aab [SPARK-40362][SQL][3.3] Fix BinaryComparison canonicalization ### What changes were proposed in this pull request? Change canonicalization to a one pass process and move logic from `Canonicalize.reorderCommutativeOperators` to the respective commutative operators' `canonicalize`. ### Why are the changes needed? https://github.com/apache/spark/pull/34883 improved expression canonicalization performance but introduced regression when a commutative operator is under a `BinaryComparison`. This is because children reorder by their hashcode can't happen in `preCanonicalized` phase when children are not yet "final". ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new UT. Closes #37866 from peter-toth/SPARK-40362-fix-binarycomparison-canonicalization-3.3. Lead-authored-by: Peter Toth <ptoth@cloudera.com> Co-authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 September 2022, 19:38:02 UTC
0a180c0 [SPARK-40292][SQL] Fix column names in "arrays_zip" function when arrays are referenced from nested structs ### What changes were proposed in this pull request? This PR fixes an issue in `arrays_zip` function where a field index was used as a column name in the resulting schema which was a regression from Spark 3.1. With this change, the original behaviour is restored: a corresponding struct field name will be used instead of a field index. Example: ```sql with q as ( select named_struct( 'my_array', array(1, 2, 3), 'my_array2', array(4, 5, 6) ) as my_struct ) select arrays_zip(my_struct.my_array, my_struct.my_array2) from q ``` would return schema: ``` root |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false) | |-- element: struct (containsNull = false) | | |-- 0: integer (nullable = true) | | |-- 1: integer (nullable = true) ``` which is somewhat inaccurate. PR adds handling of `GetStructField` expression to return the struct field names like this: ``` root |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false) | |-- element: struct (containsNull = false) | | |-- my_array: integer (nullable = true) | | |-- my_array2: integer (nullable = true) ``` ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? Yes, `arrays_zip` function returns struct field names now as in Spark 3.1 instead of field indices. Some users might have worked around this issue so this patch would affect them by bringing back the original behaviour. ### How was this patch tested? Existing unit tests. I also added a test case that reproduces the problem. Closes #37833 from sadikovi/SPARK-40292. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 443eea97578c41870c343cdb88cf69bfdf27033a) Signed-off-by: Max Gekk <max.gekk@gmail.com> 12 September 2022, 04:33:50 UTC
052d60c [SPARK-40228][SQL][3.3] Do not simplify multiLike if child is not a cheap expression This PR backport https://github.com/apache/spark/pull/37672 to branch-3.3. The original PR's description: ### What changes were proposed in this pull request? Do not simplify multiLike if child is not a cheap expression. ### Why are the changes needed? 1. Simplifying multiLike in this cases can not benefit the query because it cannot be pushed down. 2. Reduce the number of evaluations for these expressions. For example: ```sql select * from t1 where substr(name, 1, 5) like any('%a', 'b%', '%c%'); ``` ``` == Physical Plan == *(1) Filter ((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) OR Contains(substr(name#0, 1, 5), c)) +- *(1) ColumnarToRow +- FileScan parquet default.t1[name#0] Batched: true, DataFilters: [((EndsWith(substr(name#0, 1, 5), a) OR StartsWith(substr(name#0, 1, 5), b)) OR Contains(substr(n..., Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:string> ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #37813 from wangyum/SPARK-40228-branch-3.3. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 September 2022, 23:48:34 UTC
18fc8e8 [SPARK-39915][SQL][3.3] Dataset.repartition(N) may not create N partitions Non-AQE part ### What changes were proposed in this pull request? backport https://github.com/apache/spark/pull/37706 for branch-3.3 Skip optimize the root user-specified repartition in `PropagateEmptyRelation`. ### Why are the changes needed? Spark should preserve the final repatition which can affect the final output partition which is user-specified. For example: ```scala spark.sql("select * from values(1) where 1 < rand()").repartition(1) // before: == Optimized Logical Plan == LocalTableScan <empty>, [col1#0] // after: == Optimized Logical Plan == Repartition 1, true +- LocalRelation <empty>, [col1#0] ``` ### Does this PR introduce _any_ user-facing change? yes, the empty plan may change ### How was this patch tested? add test Closes #37730 from ulysses-you/empty-3.3. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 September 2022, 21:43:19 UTC
4f69c98 [SPARK-39830][SQL][TESTS][3.3] Add a test case to read ORC table that requires type promotion ### What changes were proposed in this pull request? Increase ORC test coverage. [ORC-1205](https://issues.apache.org/jira/browse/ORC-1205) Size of batches in some ConvertTreeReaders should be ensured before using ### Why are the changes needed? When spark reads an orc with type promotion, an `ArrayIndexOutOfBoundsException` may be thrown, which has been fixed in version 1.7.6 and 1.8.0. ```java java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) at org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) at org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? add UT Closes #37808 from cxzl25/SPARK-39830-3.3. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 September 2022, 21:36:39 UTC
aaa8292 [SPARK-40389][SQL][FOLLOWUP][3.3] Fix a test failure in SQLQuerySuite ### What changes were proposed in this pull request? Fix a test failure in SQLQuerySuite on branch-3.3. It's from the backport of https://github.com/apache/spark/pull/37832 since there is no error class "CAST_OVERFLOW" on branch-3.3 ### Why are the changes needed? Fix test failure ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes #37848 from gengliangwang/fixTest. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 September 2022, 18:57:00 UTC
b18d582 [SPARK-40280][SQL][FOLLOWUP][3.3] Fix 'ParquetFilterSuite' issue ### What changes were proposed in this pull request? ### Why are the changes needed? Fix 'ParquetFilterSuite' issue after merging #37747 : The `org.apache.parquet.filter2.predicate.Operators.In` was added in the parquet 1.12.3, but spark branch-3.3 uses the parquet 1.12.2. Use `Operators.And` instead of `Operators.In`. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #37847 from zzcclp/SPARK-40280-hotfix-3.3. Authored-by: Zhichao Zhang <zhangzc@apache.org> Signed-off-by: huaxingao <huaxin_gao@apple.com> 09 September 2022, 18:31:16 UTC
cd9f564 [SPARK-40389][SQL] Decimals can't upcast as integral types if the cast can overflow ### What changes were proposed in this pull request? In Spark SQL, the method `canUpCast` returns true iff we can safely up-cast the `from` type to `to` type without truncating or precision loss or possible runtime failures. Meanwhile, DecimalType(10, 0) is considered as `canUpCast` to Integer type. This is wrong since casting `9000000000BD` as Integer type will overflow. As a result: * The optimizer rule `SimplifyCasts` replies on the method `canUpCast` and it will mistakenly convert `cast(cast(9000000000BD as int) as long)` as `cast(9000000000BD as long)` * The STRICT store assignment policy relies on this method too. With the policy enabled, inserting `9000000000BD` into integer columns will pass compiling time check and insert an unexpected value `410065408`. * etc... ### Why are the changes needed? Bug fix on the method `Cast.canUpCast` ### Does this PR introduce _any_ user-facing change? Yes, fix a bug on the checking condition of whether a decimal can safely cast as integral types. ### How was this patch tested? New UT Closes #37832 from gengliangwang/SPARK-40389. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 17982519a749bd4ca2aa7eca12fba00ccc1520aa) Signed-off-by: Gengliang Wang <gengliang@apache.org> 08 September 2022, 20:23:52 UTC
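A minimal spark-shell repro of the `SimplifyCasts` symptom described in this commit, assuming the default (non-ANSI) cast behavior; the query shape mirrors the expression quoted in the message rather than the PR's own test.
```scala
// Before the fix, canUpCast treated decimal(10,0) -> int as safe, so the optimizer could
// rewrite cast(cast(9000000000BD as int) as long) into cast(9000000000BD as long),
// turning the overflowed value 410065408 back into 9000000000.
spark.sql("SELECT CAST(CAST(9000000000BD AS INT) AS BIGINT) AS v").show()
```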
0cdb081 [SPARK-40280][SQL] Add support for parquet push down for annotated int and long ### What changes were proposed in this pull request? This fixes SPARK-40280 by normalizing a parquet int/long that has optional metadata with it to look like the expected version that does not have the extra metadata. ## Why are the changes needed? This allows predicate push down in parquet to work when reading files that are complaint with the parquet specification, but different from what Spark writes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I added unit tests that cover this use case. I also did some manual testing on some queries to verify that less data is actually read after this change. Closes #37747 from revans2/normalize_int_long_parquet_push. Authored-by: Robert (Bobby) Evans <bobby@apache.org> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit 24b3baf0177fc1446bf59bb34987296aefd4b318) Signed-off-by: Thomas Graves <tgraves@apache.org> 08 September 2022, 13:55:48 UTC
10cd3ac [SPARK-40380][SQL] Fix constant-folding of InvokeLike to avoid non-serializable literal embedded in the plan ### What changes were proposed in this pull request? Block `InvokeLike` expressions with `ObjectType` result from constant-folding, to ensure constant-folded results are trusted to be serializable. This is a conservative fix for ease of backport to Spark 3.3. A separate future change can relax the restriction and support constant-folding to serializable `ObjectType` as well. ### Why are the changes needed? This fixes a regression introduced by https://github.com/apache/spark/pull/35207 . It enabled the constant-folding logic to aggressively fold `InvokeLike` expressions (e.g. `Invoke`, `StaticInvoke`), when all arguments are foldable and the expression itself is deterministic. But it could go overly aggressive and constant-fold to non-serializable results, which would be problematic when that result needs to be serialized and sent over the wire. In the wild, users of sparksql-scalapb have hit this issue. The constant folding logic would fold a chain of `Invoke` / `StaticInvoke` expressions from only holding onto a serializable literal to holding onto a non-serializable literal: ``` Literal(com.example.protos.demo.Person$...).scalaDescriptor.findFieldByNumber.get ``` this expression works fine before constant-folding because the literal that gets sent to the executors is serializable, but when aggressive constant-folding kicks in it ends up as a `Literal(scalapb.descriptors.FieldDescriptor...)` which isn't serializable. The following minimal repro demonstrates this issue: ``` import org.apache.spark.sql.Column import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute import org.apache.spark.sql.catalyst.expressions.Literal import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke} import org.apache.spark.sql.types.{LongType, ObjectType} class NotSerializableBoxedLong(longVal: Long) { def add(other: Long): Long = longVal + other } case class SerializableBoxedLong(longVal: Long) { def toNotSerializable(): NotSerializableBoxedLong = new NotSerializableBoxedLong(longVal) } val litExpr = Literal.fromObject(SerializableBoxedLong(42L), ObjectType(classOf[SerializableBoxedLong])) val toNotSerializableExpr = Invoke(litExpr, "toNotSerializable", ObjectType(classOf[NotSerializableBoxedLong])) val addExpr = Invoke(toNotSerializableExpr, "add", LongType, Seq(UnresolvedAttribute.quotedString("id"))) val df = spark.range(2).select(new Column(addExpr)) df.collect ``` would result in an error if aggressive constant-folding kicked in: ``` ... 
Caused by: java.io.NotSerializableException: NotSerializableBoxedLong Serialization stack: - object not serializable (class: NotSerializableBoxedLong, value: NotSerializableBoxedLong71231636) - element of array (index: 1) - array (class [Ljava.lang.Object;, size 2) - element of array (index: 1) - array (class [Ljava.lang.Object;, size 3) - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;) - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=3]) - writeReplace data (class: java.lang.invoke.SerializedLambda) - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$3123/1641694389185db22c) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:49) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:115) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:441) ``` ### Does this PR introduce _any_ user-facing change? Yes, a regression in ObjectType expression starting from Spark 3.3.0 is fixed. ### How was this patch tested? The existing test cases in `ConstantFoldingSuite` continues to pass; added a new test case to demonstrate the regression issue. Closes #37823 from rednaxelafx/fix-invokelike-constantfold. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5b96e82ad6a4f5d5e4034d9d7112077159cf5044) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 September 2022, 13:21:15 UTC
81cb08b add back a mistakenly removed test case 07 September 2022, 11:29:39 UTC
433469f [SPARK-40149][SQL] Propagate metadata columns through Project This PR fixes a regression caused by https://github.com/apache/spark/pull/32017 . In https://github.com/apache/spark/pull/32017 , we tried to be more conservative and decided to not propagate metadata columns in certain operators, including `Project`. However, the decision was made only considering SQL API, not DataFrame API. In fact, it's very common to chain `Project` operators in DataFrame, e.g. `df.withColumn(...).withColumn(...)...`, and it's very inconvenient if metadata columns are not propagated through `Project`. This PR makes 2 changes: 1. Project should propagate metadata columns 2. SubqueryAlias should only propagate metadata columns if the child is a leaf node or also a SubqueryAlias The second change is needed to still forbid weird queries like `SELECT m from (SELECT a from t)`, which is the main motivation of https://github.com/apache/spark/pull/32017 . After propagating metadata columns, a problem from https://github.com/apache/spark/pull/31666 is exposed: the natural join metadata columns may confuse the analyzer and lead to wrong analyzed plan. For example, `SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY key`, how shall we resolve `ORDER BY key`? It should be resolved to `t1.key` via the rule `ResolveMissingReferences`, which is in the output of the left join. However, if `Project` can propagate metadata columns, `ORDER BY key` will be resolved to `t2.key`. To solve this problem, this PR only allows qualified access for metadata columns of natural join. This has no breaking change, as people can only do qualified access for natural join metadata columns before, in the `Project` right after `Join`. This actually enables more use cases, as people can now access natural join metadata columns in ORDER BY. I've added a test for it. fix a regression For SQL API, there is no change, as a `SubqueryAlias` always comes with a `Project` or `Aggregate`, so we still don't propagate metadata columns through a SELECT group. For DataFrame API, the behavior becomes more lenient. The only breaking case is an operator that can propagate metadata columns then follows a `SubqueryAlias`, e.g. `df.filter(...).as("t").select("t.metadata_col")`. But this is a weird use case and I don't think we should support it at the first place. new tests Closes #37758 from cloud-fan/metadata. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 99ae1d9a897909990881f14c5ea70a0d1a0bf456) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 September 2022, 11:03:52 UTC
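A short DataFrame-API sketch of the pattern this change enables; the parquet path is hypothetical and the built-in `_metadata.file_path` field is used only as a convenient example of a metadata column.
```scala
import org.apache.spark.sql.functions.{col, lit}

// Chained Projects (withColumn/select) should no longer drop metadata columns.
spark.read.parquet("/tmp/events")                 // hypothetical input path
  .withColumn("year", lit(2022))
  .withColumn("flag", lit(true))
  .select(col("year"), col("_metadata.file_path"))
  .show(truncate = false)
```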
84c0918 [SPARK-40315][SQL] Add hashCode() for Literal of ArrayBasedMapData ### What changes were proposed in this pull request? There is no explicit `hashCode()` function override for `ArrayBasedMapData`. As a result, there is a non-deterministic error where the `hashCode()` computed for `Literal`s of `ArrayBasedMapData` can be different for two equal objects (`Literal`s of `ArrayBasedMapData` with equal keys and values). In this PR, we add a `hashCode` function so that it works exactly as we expect. ### Why are the changes needed? This is a bug fix for a non-deterministic error. It is also more consistent with the rest of Spark if we implement the `hashCode` method instead of relying on defaults. We can't add the `hashCode` directly to `ArrayBasedMapData` because of SPARK-9415. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? A simple unit test was added. Closes #37807 from c27kwan/SPARK-40315-lit. Authored-by: Carmen Kwan <carmen.kwan@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e85a4ffbdfa063c8da91b23dfbde77e2f9ed62e9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 September 2022, 13:11:59 UTC
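A tiny sketch of the property this commit restores, using internal Catalyst APIs purely for illustration (not a public interface):
```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.{IntegerType, MapType}

val m1 = Literal.create(Map(1 -> 2), MapType(IntegerType, IntegerType))
val m2 = Literal.create(Map(1 -> 2), MapType(IntegerType, IntegerType))
// Before the fix these hash codes could differ non-deterministically; after it they agree.
println(m1.hashCode == m2.hashCode)
```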
1324f7d [SPARK-40297][SQL] CTE outer reference nested in CTE main body cannot be resolved This PR fixes a bug where a CTE reference cannot be resolved if this reference occurs in an inner CTE definition nested in the outer CTE's main body FROM clause. E.g., ``` WITH cte_outer AS ( SELECT 1 ) SELECT * FROM ( WITH cte_inner AS ( SELECT * FROM cte_outer ) SELECT * FROM cte_inner ) ``` This fix is to change the `CTESubstitution`'s traverse order from `resolveOperatorsUpWithPruning` to `resolveOperatorsDownWithPruning` and also to recursively call `traverseAndSubstituteCTE` for CTE main body. Bug fix. Without the fix an `AnalysisException` would be thrown for CTE queries mentioned above. No. Added UTs. Closes #37751 from maryannxue/spark-40297. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 September 2022, 04:41:37 UTC
9473840 [SPARK-38404][SQL][3.3] Improve CTE resolution when a nested CTE references an outer CTE ### What changes were proposed in this pull request? Please note that the bug in the [SPARK-38404](https://issues.apache.org/jira/browse/SPARK-38404) is fixed already with https://github.com/apache/spark/pull/34929. This PR is a minor improvement to the current implementation by collecting already resolved outer CTEs to avoid re-substituting already collected CTE definitions. ### Why are the changes needed? Small improvement + additional tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new test case. Closes #37760 from peter-toth/SPARK-38404-nested-cte-references-outer-cte-3.3. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 September 2022, 04:34:39 UTC
b066561 [SPARK-40326][BUILD] Upgrade `fasterxml.jackson.version` to 2.13.4 upgrade `com.fasterxml.jackson.dataformat:jackson-dataformat-yaml` and `fasterxml.jackson.databind.version` from 2.13.3 to 2.13.4 [CVE-2022-25857](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857) [SNYK-JAVA-ORGYAML](https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360) No. Pass GA Closes #37796 from bjornjorgensen/upgrade-fasterxml.jackson-to-2.13.4. Authored-by: Bjørn <bjornjorgensen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit a82a006df80ac3aa6900d8688eb5bf77b804785d) Signed-off-by: Sean Owen <srowen@gmail.com> 06 September 2022, 00:55:21 UTC
284954a Revert "[SPARK-33861][SQL] Simplify conditional in predicate" This reverts commit 32d4a2b and 3aa4e11. Closes #37729 from wangyum/SPARK-33861. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 43cbdc6ec9dbcf9ebe0b48e14852cec4af18b4ec) Signed-off-by: Yuming Wang <yumwang@ebay.com> 03 September 2022, 08:14:43 UTC
399c397 [SPARK-40304][K8S][TESTS] Add `decomTestTag` to K8s Integration Test ### What changes were proposed in this pull request? This PR aims to add a new test tag, `decomTestTag`, to K8s Integration Test. ### Why are the changes needed? Decommission-related tests took over 6 minutes (`363s`). It would be helpful we can run them selectively. ``` [info] - Test basic decommissioning (44 seconds, 51 milliseconds) [info] - Test basic decommissioning with shuffle cleanup (44 seconds, 450 milliseconds) [info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 43 seconds) [info] - Test decommissioning timeouts (44 seconds, 389 milliseconds) [info] - SPARK-37576: Rolling decommissioning (1 minute, 8 seconds) ``` ### Does this PR introduce _any_ user-facing change? No, this is a test-only change. ### How was this patch tested? Pass the CIs and test manually. ``` $ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests \ -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/test" \ -Dtest.exclude.tags=minikube,local,decom ... [info] KubernetesSuite: [info] - Run SparkPi with no resources (12 seconds, 441 milliseconds) [info] - Run SparkPi with no resources & statefulset allocation (11 seconds, 949 milliseconds) [info] - Run SparkPi with a very long application name. (11 seconds, 999 milliseconds) [info] - Use SparkLauncher.NO_RESOURCE (11 seconds, 846 milliseconds) [info] - Run SparkPi with a master URL without a scheme. (11 seconds, 176 milliseconds) [info] - Run SparkPi with an argument. (11 seconds, 868 milliseconds) [info] - Run SparkPi with custom labels, annotations, and environment variables. (11 seconds, 858 milliseconds) [info] - All pods have the same service account by default (11 seconds, 5 milliseconds) [info] - Run extraJVMOptions check on driver (5 seconds, 757 milliseconds) [info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (12 seconds, 467 milliseconds) [info] - Run SparkPi with env and mount secrets. (21 seconds, 119 milliseconds) [info] - Run PySpark on simple pi.py example (13 seconds, 129 milliseconds) [info] - Run PySpark to test a pyfiles example (14 seconds, 937 milliseconds) [info] - Run PySpark with memory customization (12 seconds, 195 milliseconds) [info] - Run in client mode. (11 seconds, 343 milliseconds) [info] - Start pod creation from template (11 seconds, 975 milliseconds) [info] - SPARK-38398: Schedule pod creation from template (11 seconds, 901 milliseconds) [info] - Run SparkR on simple dataframe.R example (14 seconds, 305 milliseconds) ... ``` Closes #37755 from dongjoon-hyun/SPARK-40304. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit fd0498f81df72c196f19a5b26053660f6f3f4d70) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 September 2022, 16:35:06 UTC
7c19df6 [SPARK-40302][K8S][TESTS] Add `YuniKornSuite` This PR aims the followings. 1. Add `YuniKornSuite` integration test suite which extends `KubernetesSuite` on Apache YuniKorn scheduler. 2. Support `--default-exclude-tags` command to override `test.default.exclude.tags`. To improve test coverage. No. This is a test suite addition. Since this requires `Apache YuniKorn` installation, the test suite is disabled by default. So, CI K8s integration test should pass without running this suite. In order to run the tests, we need to override `test.default.exclude.tags` like the following. **SBT** ``` $ build/sbt -Psparkr -Pkubernetes -Pkubernetes-integration-tests \ -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/test" \ -Dtest.exclude.tags=minikube,local \ -Dtest.default.exclude.tags= ``` **MAVEN** ``` $ dev/dev-run-integration-tests.sh --deploy-mode docker-desktop \ --exclude-tag minikube,local \ --default-exclude-tags '' ``` Closes #37753 from dongjoon-hyun/SPARK-40302. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b2e38e16bfc547a62957e0a67085985b3c65d525) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 September 2022, 16:27:33 UTC
03556ca [SPARK-40187][DOCS] Add `Apache YuniKorn` scheduler docs ### What changes were proposed in this pull request? Add a section under [customized-kubernetes-schedulers-for-spark-on-kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#customized-kubernetes-schedulers-for-spark-on-kubernetes) to explain how to run Spark with Apache YuniKorn. This is based on the review comments from #35663. ### Why are the changes needed? Explain how to run Spark with Apache YuniKorn ### Does this PR introduce _any_ user-facing change? No Closes #37622 from yangwwei/SPARK-40187. Authored-by: Weiwei Yang <wwei@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 4b1877398410fb23a285ed0d2c6b34711f52fc43) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 September 2022, 05:26:53 UTC
c04aa36 [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed ### Why are the changes needed? This PR aims to fix the case ```scala sql("create table t1(a decimal(3, 0)) using parquet") sql("insert into t1 values(100), (10), (1)") sql("select * from t1 where a in(100000, 1.00)").show ``` ``` java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken. at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325) ``` 1. the rule `UnwrapCastInBinaryComparison` transforms the expression `In` to Equals ``` CAST(a as decimal(12,2)) IN (100000.00,1.00) OR( CAST(a as decimal(12,2)) = 100000.00, CAST(a as decimal(12,2)) = 1.00 ) ``` 2. using `UnwrapCastInBinaryComparison.unwrapCast()` to optimize each `EqualTo` ``` // Expression1 CAST(a as decimal(12,2)) = 100000.00 => CAST(a as decimal(12,2)) = 100000.00 // Expression2 CAST(a as decimal(12,2)) = 1.00 => a = 1 ``` 3. return the new unwrapped cast expression `In` ``` a IN (100000.00, 1.00) ``` Before this PR: the method `UnwrapCastInBinaryComparison.unwrapCast()` returns the original expression when downcasting to a decimal type fails (the `Expression1`),returns the original expression if the downcast to the decimal type succeeds (the `Expression2`), the two expressions have different data type which would break the structural integrity ``` a IN (100000.00, 1.00) | | decimal(12, 2) | decimal(3, 0) ``` After this PR: the PR transform the downcasting failed expression to `falseIfNotNull(fromExp)` ``` ((isnull(a) AND null) OR a IN (1.00) ``` ### Does this PR introduce _any_ user-facing change? No, only bug fix. ### How was this patch tested? Unit test. Closes #37439 from cfmcgrady/SPARK-39896. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6e62b93f3d1ef7e2d6be0a3bb729ab9b2d55a36d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 August 2022, 05:32:34 UTC
e46d2e2 [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style This PR make `compute.max_rows` option as `None` working in `DataFrame.style`, as expected instead of throwing an exception., by collecting it all to a pandas DataFrame. To make the configuration working as expected. Yes. ```python import pyspark.pandas as ps ps.set_option("compute.max_rows", None) ps.get_option("compute.max_rows") ps.range(1).style ``` **Before:** ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style pdf = self.head(max_results + 1)._to_internal_pandas() TypeError: unsupported operand type(s) for +: 'NoneType' and 'int' ``` **After:** ``` <pandas.io.formats.style.Styler object at 0x7fdf78250430> ``` Manually tested, and unittest was added. Closes #37718 from HyukjinKwon/SPARK-40270. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 0f0e8cc26b6c80cc179368e3009d4d6c88103a64) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 30 August 2022, 07:26:24 UTC
c9710c5 [SPARK-40152][SQL][TESTS][FOLLOW-UP][3.3] Capture a different error message in 3.3 ### What changes were proposed in this pull request? This PR fixes the error message in branch-3.3. Different error message is thrown at the test added in https://github.com/apache/spark/commit/4b0c3bab1ab082565a051990bf45774f15962deb. ### Why are the changes needed? `branch-3.3` is broken because of the different error message being thrown (https://github.com/apache/spark/runs/8065373173?check_suite_focus=true). ``` [info] - elementAt *** FAILED *** (996 milliseconds) [info] (non-codegen mode) Expected error message is `The index 0 is invalid`, but `SQL array indices start at 1` found (ExpressionEvalHelper.scala:176) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563) [info] at org.scalatest.Assertions.fail(Assertions.scala:933) [info] at org.scalatest.Assertions.fail$(Assertions.scala:929) [info] at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1563) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.$anonfun$checkExceptionInExpression$1(ExpressionEvalHelper.scala:176) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) [info] at org.scalatest.Assertions.withClue(Assertions.scala:1065) [info] at org.scalatest.Assertions.withClue$(Assertions.scala:1052) [info] at org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1563) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkException$1(ExpressionEvalHelper.scala:163) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression(ExpressionEvalHelper.scala:183) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression$(ExpressionEvalHelper.scala:156) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkExceptionInExpression(CollectionExpressionsSuite.scala:39) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression(ExpressionEvalHelper.scala:153) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkExceptionInExpression$(ExpressionEvalHelper.scala:150) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkExceptionInExpression(CollectionExpressionsSuite.scala:39) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.$anonfun$new$365(CollectionExpressionsSuite.scala:1555) [info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) [info] at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) ``` ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? CI in this PR should test it out. Closes #37708 from HyukjinKwon/SPARK-40152-3.3. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Yuming Wang <yumwang@ebay.com> 29 August 2022, 14:39:29 UTC
60bd91f [SPARK-40247][SQL] Fix BitSet equality check ### What changes were proposed in this pull request? Spark's `BitSet` doesn't implement `equals()` and `hashCode()` but it is used in `FileSourceScanExec` for bucket pruning. ### Why are the changes needed? Without proper equality check reuse issues can occur. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new UT. Closes #37696 from peter-toth/SPARK-40247-fix-bitset-equals. Authored-by: Peter Toth <ptoth@cloudera.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 527ddece8fdbe703dcd239401c97ddb2c6122182) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 August 2022, 07:26:17 UTC
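A minimal sketch of the equality property this commit adds, using Spark's internal `org.apache.spark.util.collection.BitSet` for illustration only:
```scala
import org.apache.spark.util.collection.BitSet

val b1 = new BitSet(64); b1.set(3)
val b2 = new BitSet(64); b2.set(3)
// Without equals()/hashCode(), these compared by reference even though their contents match,
// which could defeat reuse of plans keyed on bucket pruning.
println(b1 == b2)   // true once BitSet defines structural equality
```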
e3f6b6d [SPARK-40212][SQL] SparkSQL castPartValue does not properly handle byte, short, or float The `castPartValueToDesiredType` function now returns byte for ByteType and short for ShortType, rather than ints; also floats for FloatType rather than double. Previously, attempting to read back in a file partitioned on one of these column types would result in a ClassCastException at runtime (for Byte, `java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Byte`). I can't think this is anything but a bug, as returning the correct data type prevents the crash. Yes: it changes the observed behavior when reading in a byte/short/float-partitioned file. Added unit test. Without the `castPartValueToDesiredType` updates, the test fails with the stated exception. === I'll note that I'm not familiar enough with the spark repo to know if this will have ripple effects elsewhere, but tests pass on my fork and since the very similar https://github.com/apache/spark/pull/36344/files only needed to touch these two files I expect this change is self-contained as well. Closes #37659 from BrennanStein/spark40212. Authored-by: Brennan Stein <brennan.stein@ekata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 146f187342140635b83bfe775b6c327755edfbe1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 August 2022, 01:57:31 UTC
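A hypothetical repro of the partition-value issue (path, data, and schema are made up, not taken from the PR):
```scala
import org.apache.spark.sql.functions.col

// Write a table partitioned on a ByteType column...
spark.range(3)
  .select(col("id").cast("byte").as("b"), col("id").as("v"))
  .write.partitionBy("b").mode("overwrite").parquet("/tmp/byte_partitioned")

// ...then read it back declaring the partition column as BYTE. Before the fix this could throw
// "java.lang.Integer cannot be cast to java.lang.Byte" when the partition value was materialized.
spark.read.schema("b BYTE, v BIGINT").parquet("/tmp/byte_partitioned").show()
```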
694e4e7 [SPARK-40241][DOCS] Correct the link of GenericUDTF ### What changes were proposed in this pull request? Correct the link ### Why are the changes needed? existing link was wrong ### Does this PR introduce _any_ user-facing change? yes, a link was updated ### How was this patch tested? Manually check Closes #37685 from zhengruifeng/doc_fix_udtf. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 8ffcecb68fafd0466e839281588aab50cd046b49) Signed-off-by: Yuming Wang <yumwang@ebay.com> 27 August 2022, 10:39:43 UTC
2de364a [SPARK-40152][SQL][TESTS][FOLLOW-UP] Disable ANSI for out of bound test at ElementAt This PR proposes to fix the test to pass with ANSI mode on. Currently `elementAt` test fails when ANSI mode is on: ``` [info] - elementAt *** FAILED *** (309 milliseconds) [info] Exception evaluating element_at(stringsplitsql(11.12.13, .), 10, Some(), true) (ExpressionEvalHelper.scala:205) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563) [info] at org.scalatest.Assertions.fail(Assertions.scala:949) [info] at org.scalatest.Assertions.fail$(Assertions.scala:945) [info] at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1563) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen(ExpressionEvalHelper.scala:205) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluationWithoutCodegen$(ExpressionEvalHelper.scala:199) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkEvaluationWithoutCodegen(CollectionExpressionsSuite.scala:39) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation(ExpressionEvalHelper.scala:87) [info] at org.apache.spark.sql.catalyst.expressions.ExpressionEvalHelper.checkEvaluation$(ExpressionEvalHelper.scala:82) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.checkEvaluation(CollectionExpressionsSuite.scala:39) [info] at org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite.$anonfun$new$333(CollectionExpressionsSuite.scala:1546) ``` https://github.com/apache/spark/runs/8046961366?check_suite_focus=true To recover the build with ANSI mode. No, test-only. Unittest fixed. Closes #37684 from HyukjinKwon/SPARK-40152. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4b0c3bab1ab082565a051990bf45774f15962deb) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 August 2022, 06:55:07 UTC
167f3ff [SPARK-40152][SQL][TESTS] Move tests from SplitPart to elementAt Move tests from SplitPart to elementAt in CollectionExpressionsSuite. Simplify test. No. N/A. Closes #37637 from wangyum/SPARK-40152-3. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 06997d6eb73f271aede5b159d86d1db80a73b89f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 August 2022, 06:55:02 UTC
784cb0f [SPARK-40218][SQL] GROUPING SETS should preserve the grouping columns ### What changes were proposed in this pull request? This PR fixes a bug caused by https://github.com/apache/spark/pull/32022 . Although we deprecate `GROUP BY ... GROUPING SETS ...`, it should still work if it worked before. https://github.com/apache/spark/pull/32022 made a mistake that it didn't preserve the order of user-specified group by columns. Usually it's not a problem, as `GROUP BY a, b` is no different from `GROUP BY b, a`. However, the `grouping_id(...)` function requires the input to be exactly the same with the group by columns. This PR fixes the problem by preserve the order of user-specified group by columns. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? Yes, now a query that worked before 3.2 can work again. ### How was this patch tested? new test Closes #37655 from cloud-fan/grouping. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1ed592ef28abdb14aa1d8c8a129d6ac3084ffb0c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 August 2022, 07:24:32 UTC
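An illustrative query of the shape this commit fixes (not taken from the PR's tests): `grouping_id()` only resolves when the user-specified GROUP BY column order is preserved under GROUPING SETS.
```scala
spark.sql("""
  SELECT a, b, grouping_id(a, b) AS gid
  FROM VALUES (1, 2), (1, 3) AS t(a, b)
  GROUP BY a, b GROUPING SETS ((a), (a, b))
""").show()
```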
d9dc280 [SPARK-40213][SQL] Support ASCII value conversion for Latin-1 characters ### What changes were proposed in this pull request? This PR proposes to support ASCII value conversion for Latin-1 Supplement characters. ### Why are the changes needed? `ascii()` should be the inverse of `chr()`. But for latin-1 char, we get incorrect ascii value. For example: ```sql select ascii('§') -- output: -62, expect: 167 select chr(167) -- output: '§' ``` ### Does this PR introduce _any_ user-facing change? Yes, fixes the incorrect ASCII conversion for Latin-1 Supplement characters ### How was this patch tested? UT Closes #37651 from linhongliu-db/SPARK-40213. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c07852380471f02955d6d17cddb3150231daa71f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 August 2022, 04:19:00 UTC
d725d9c [SPARK-40124][SQL][TEST][3.3] Update TPCDS v1.4 q32 for Plan Stability tests ### What changes were proposed in this pull request? This is port of SPARK-40124 to Spark 3.3. Fix query 32 for TPCDS v1.4 ### Why are the changes needed? Current q32.sql seems to be wrong. It is just selection `1`. Reference for query template: https://github.com/databricks/tpcds-kit/blob/eff5de2c30337b71cc0dc1976147742d2c65d378/query_templates/query32.tpl#L41 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test change only Closes #37615 from mskapilks/change-q32-3.3. Authored-by: Kapil Kumar Singh <kapilsingh@microsoft.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 August 2022, 00:02:36 UTC
6572c66 [SPARK-40172][ML][TESTS] Temporarily disable flaky test cases in ImageFileFormatSuite ### What changes were proposed in this pull request? 3 test cases in ImageFileFormatSuite become flaky in the GitHub action tests: https://github.com/apache/spark/runs/7941765326?check_suite_focus=true https://github.com/gengliangwang/spark/runs/7928658069 Before they are fixed(https://issues.apache.org/jira/browse/SPARK-40171), I suggest disabling them in OSS. ### Why are the changes needed? Disable flaky tests before they are fixed. The test cases keep failing from time to time, while they always pass on local env. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing CI Closes #37605 from gengliangwang/disableFlakyTest. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 50f2f506327b7d51af9fb0ae1316135905d2f87d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 August 2022, 17:55:17 UTC
008b3a3 [SPARK-40152][SQL][TESTS] Add tests for SplitPart ### What changes were proposed in this pull request? Add tests for `SplitPart`. ### Why are the changes needed? Improve test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A. Closes #37626 from wangyum/SPARK-40152-2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 4f525eed7d5d461498aee68c4d3e57941f9aae2c) Signed-off-by: Sean Owen <srowen@gmail.com> 23 August 2022, 13:55:34 UTC
db121c3 [SPARK-39184][SQL][FOLLOWUP] Make interpreted and codegen paths for date/timestamp sequences the same ### What changes were proposed in this pull request? Change how the length of the new result array is calculated in `InternalSequenceBase.eval` to match how the same is calculated in the generated code. ### Why are the changes needed? This change brings the interpreted mode code in line with the generated code. Although I am not aware of any case where the current interpreted mode code fails, the generated code is more correct (it handles the case where the result array must grow more than once, whereas the current interpreted mode code does not). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #37542 from bersprockets/date_sequence_array_size_issue_follow_up. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit d718867a16754c62cb8c30a750485f4856481efc) Signed-off-by: Max Gekk <max.gekk@gmail.com> 22 August 2022, 16:17:39 UTC
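An arbitrary spark-shell query exercising the code path discussed above: a date sequence whose backing array may need to grow while being filled (dates and step chosen for illustration).
```scala
spark.sql("SELECT sequence(DATE'2021-01-01', DATE'2022-01-01', INTERVAL 1 MONTH) AS dates")
  .show(truncate = false)
```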
e16467c [SPARK-40089][SQL] Fix sorting for some Decimal types ### What changes were proposed in this pull request? This fixes https://issues.apache.org/jira/browse/SPARK-40089 where the prefix can overflow in some cases and the code assumes that the overflow is always on the negative side, not the positive side. ### Why are the changes needed? This adds a check when the overflow does happen to know what is the proper prefix to return. ### Does this PR introduce _any_ user-facing change? No, unless you consider getting the sort order correct a user facing change. ### How was this patch tested? I tested manually with the file in the JIRA and I added a small unit test. Closes #37540 from revans2/fix_dec_sort. Authored-by: Robert (Bobby) Evans <bobby@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 8dfd3dfc115d6e249f00a9a434b866d28e2eae45) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 August 2022, 08:34:01 UTC
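A hypothetical sketch of the kind of query affected, an ORDER BY over a wide decimal column; the concrete type and values that overflow the sort prefix are in the JIRA reproducer, and the ones below are made up:
```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

val df = Seq("999999999999999999.99", "-999999999999999999.99", "0.01").toDF("s")
  .select(col("s").cast("decimal(20,2)").as("d"))
// For some decimal inputs the 8-byte sort prefix could overflow, and the code assumed overflow
// was always negative, so rows could come back out of order before the fix.
df.orderBy("d").show(truncate = false)
```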
233a54d [SPARK-40152][SQL] Fix split_part codegen compilation issue ### What changes were proposed in this pull request? Fix `split_part` codegen compilation issue: ```sql SELECT split_part(str, delimiter, partNum) FROM VALUES ('11.12.13', '.', 3) AS v1(str, delimiter, partNum); ``` ``` org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 1: Expression "project_isNull_0 = false" is not a type ``` ### Why are the changes needed? Fix bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #37589 from wangyum/SPARK-40152. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit cf1a80eeae8bf815270fb39568b1846c2bd8d437) Signed-off-by: Sean Owen <srowen@gmail.com> 21 August 2022, 19:30:17 UTC
7c69614 [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns ### What changes were proposed in this pull request? This PR fixes a correctness issue in Parquet DSv1 FileFormat when projection does not contain columns referenced in pushed filters. This typically happens when partition columns and data columns overlap. This could result in empty result when in fact there were records matching predicate as can be seen in the provided fields. The problem is especially visible with `count()` and `show()` reporting different results, for example, show() would return 1+ records where the count() would return 0. In Parquet, when the predicate is provided and column index is enabled, we would try to filter row ranges to figure out what the count should be. Unfortunately, there is an issue that if the projection is empty or is not in the set of filter columns, any checks on columns would fail and 0 rows are returned (`RowRanges.EMPTY`) even though there is data matching the filter. Note that this is rather a mitigation, a quick fix. The actual fix needs to go into Parquet-MR: https://issues.apache.org/jira/browse/PARQUET-2170. The fix is not required in DSv2 where the overlapping columns are removed in `FileScanBuilder::readDataSchema()`. ### Why are the changes needed? Fixes a correctness issue when projection columns are not referenced by columns in pushed down filters or the schema is empty in Parquet DSv1. Downsides: Parquet column filter would be disabled if it had not been explicitly enabled which could affect performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a unit test that reproduces this behaviour. The test fails without the fix and passes with the fix. Closes #37419 from sadikovi/SPARK-39833. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit cde71aaf173aadd14dd6393b09e9851b5caad903) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 August 2022, 10:00:00 UTC
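A sketch of the symptom described above; the path, partition column, and predicate are hypothetical:
```scala
// Before the mitigation, count() and show() could disagree when the pushed filter referenced a
// column missing from the projection (e.g. an overlapping partition/data column).
val filtered = spark.read.parquet("/tmp/overlapping_table").where("part = 'a'")
println(filtered.count())   // could incorrectly report 0
filtered.show()             // while matching rows were still displayed here
```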
88f8ac6 [SPARK-40065][K8S] Mount ConfigMap on executors with non-default profile as well ### What changes were proposed in this pull request? This fixes a bug where ConfigMap is not mounted on executors if they are under a non-default resource profile. ### Why are the changes needed? When `spark.kubernetes.executor.disableConfigMap` is `false`, expected behavior is that the ConfigMap is mounted regardless of executor's resource profile. However, it is not mounted if the resource profile is non-default. ### Does this PR introduce _any_ user-facing change? Executors with non-default resource profile will have the ConfigMap mounted that was missing before if `spark.kubernetes.executor.disableConfigMap` is `false` or default. If certain users need to keep that behavior for some reason, they would need to explicitly set `spark.kubernetes.executor.disableConfigMap` to `true`. ### How was this patch tested? A new test case is added just below the existing ConfigMap test case. Closes #37504 from nsuke/SPARK-40065. Authored-by: Aki Sukegawa <nsuke@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 41ca6299eff4155aa3ac28656fe96501a7573fb0) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 August 2022, 19:28:57 UTC
87f957d [SPARK-35542][ML] Fix: Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it Signed-off-by: Weichen Xu <weichen.xudatabricks.com> ### What changes were proposed in this pull request? Fix: Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols can not be loaded after saving it ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #37568 from WeichenXu123/SPARK-35542. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> (cherry picked from commit 876ce6a5df118095de51c3c4789d6db6da95eb23) Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 19 August 2022, 04:26:59 UTC
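A minimal sketch of the round trip this commit fixes (the save path is illustrative):
```scala
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setSplitsArray(Array(
    Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity),
    Array(Double.NegativeInfinity, 10.0, Double.PositiveInfinity)))
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("b1", "b2"))

bucketizer.write.overwrite().save("/tmp/bucketizer_multi_col")
val restored = Bucketizer.load("/tmp/bucketizer_multi_col")   // threw before the fix
```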
bd79706 [SPARK-40134][BUILD] Update ORC to 1.7.6 This PR aims to update ORC to 1.7.6. This will bring the latest changes and bug fixes. https://github.com/apache/orc/releases/tag/v1.7.6 - ORC-1204: ORC MapReduce writer to flush when long arrays - ORC-1205: `nextVector` should invoke `ensureSize` when reusing vectors - ORC-1215: Remove a wrong `NotNull` annotation on `value` of `setAttribute` - ORC-1222: Upgrade `tools.hadoop.version` to 2.10.2 - ORC-1227: Use `Constructor.newInstance` instead of `Class.newInstance` - ORC-1228: Fix `setAttribute` to handle null value No. Pass the CIs. Closes #37563 from williamhyun/ORC-176. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a1a049f01986c15e50a2f76d1fa8538ca3b6307e) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 August 2022, 20:01:03 UTC
e1c5f90 [SPARK-40132][ML] Restore rawPredictionCol to MultilayerPerceptronClassifier.setParams ### What changes were proposed in this pull request? Restore rawPredictionCol to MultilayerPerceptronClassifier.setParams ### Why are the changes needed? This param was inadvertently removed in the refactoring in https://github.com/apache/spark/commit/40cdb6d51c2befcfeac8fb5cf5faf178d1a5ee7b#r81473316 Without it, using this param in the constructor fails. ### Does this PR introduce _any_ user-facing change? Not aside from the bug fix. ### How was this patch tested? Existing tests. Closes #37561 from srowen/SPARK-40132. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 6768d9cc38a320f7e1c6781afcd170577c5c7d0f) Signed-off-by: Sean Owen <srowen@gmail.com> 18 August 2022, 05:24:00 UTC
1a01a49 [SPARK-40121][PYTHON][SQL] Initialize projection used for Python UDF ### What changes were proposed in this pull request? This PR proposes to initialize the projection so non-deterministic expressions can be evaluated with Python UDFs. ### Why are the changes needed? To make the Python UDF working with non-deterministic expressions. ### Does this PR introduce _any_ user-facing change? Yes. ```python from pyspark.sql.functions import udf, rand spark.range(10).select(udf(lambda x: x, "double")(rand())).show() ``` **Before** ``` java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source) at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161) at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176) at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213) ``` **After** ``` +----------------------------------+ |<lambda>rand(-2507211707257730645)| +----------------------------------+ | 0.7691724424045242| | 0.09602244075319044| | 0.3006471278112862| | 0.4182649571961977| | 0.29349096650900974| | 0.7987097908937618| | 0.5324802583101007| | 0.72460930912789| | 0.1367749768412846| | 0.17277322931919348| +----------------------------------+ ``` ### How was this patch tested? Manually tested, and unittest was added. Closes #37552 from HyukjinKwon/SPARK-40121. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 336c9bc535895530cc3983b24e7507229fa9570d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 August 2022, 03:23:14 UTC
9601be9 [SPARK-39887][SQL][FOLLOW-UP] Do not exclude Union's first child attributes when traversing other children in RemoveRedundantAliases ### What changes were proposed in this pull request? Do not exclude `Union`'s first child attributes when traversing other children in `RemoveRedundantAliases`. ### Why are the changes needed? We don't need to exclude those attributes that `Union` inherits from its first child. See discussion here: https://github.com/apache/spark/pull/37496#discussion_r944509115 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #37534 from peter-toth/SPARK-39887-keep-attributes-of-unions-first-child-follow-up. Authored-by: Peter Toth <ptoth@cloudera.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e732232dac420826af269d8cf5efacb52933f59a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 August 2022, 06:57:53 UTC
0db7842 [SPARK-40117][PYTHON][SQL] Convert condition to java in DataFrameWriterV2.overwrite ### What changes were proposed in this pull request? Fix DataFrameWriterV2.overwrite() fails to convert the condition parameter to java. This prevents the function from being called. It is caused by the following commit that deleted the `_to_java_column` call instead of fixing it: https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9 ### Why are the changes needed? DataFrameWriterV2.overwrite() cannot be called. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually checked whether the arguments are sent to JVM or not. Closes #37547 from looi/fix-overwrite. Authored-by: Wenli Looi <wlooi@ucalgary.ca> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 46379863ab0dd2ee8fcf1e31e76476ff18397f60) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 August 2022, 06:29:04 UTC
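A sketch of the call path that was broken; the catalog/table name and the condition are placeholders, and actually running it requires a configured V2 catalog:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2022-08-01", 1), ("2022-08-02", 2)], ["day", "value"])

# DataFrameWriterV2.overwrite(condition): the Python Column condition must be
# converted to a Java column before it is handed to the JVM writer.
df.writeTo("my_catalog.db.events").overwrite(col("day") == "2022-08-01")
```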
21acaae [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences ### What changes were proposed in this pull request? Add code to defensively check if the pre-allocated result array is big enough to handle the next element in a date or timestamp sequence. ### Why are the changes needed? `InternalSequenceBase.getSequenceLength` is a fast method for estimating the size of the result array. It uses an estimated step size in micros which is not always entirely accurate for the date/time/time-zone combination. As a result, `getSequenceLength` occasionally overestimates the size of the result array and also occasionally underestimates the size of the result array. `getSequenceLength` sometimes overestimates the size of the result array when the step size is in months (because `InternalSequenceBase` assumes 28 days per month). This case is handled: `InternalSequenceBase` will slice the array, if needed. `getSequenceLength` sometimes underestimates the size of the result array when the sequence crosses a DST "spring forward" without a corresponding "fall backward". This case is not handled (thus, this PR). For example: ``` select sequence( timestamp'2022-03-13 00:00:00', timestamp'2022-03-14 00:00:00', interval 1 day) as x; ``` In the America/Los_Angeles time zone, this results in the following error: ``` java.lang.ArrayIndexOutOfBoundsException: 1 at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:77) ``` This happens because `InternalSequenceBase` calculates an estimated step size of 24 hours. If you add 24 hours to 2022-03-13 00:00:00 in the America/Los_Angeles time zone, you get 2022-03-14 01:00:00 (because 2022-03-13 has only 23 hours due to "spring forward"). Since 2022-03-14 01:00:00 is later than the specified stop value, `getSequenceLength` assumes the stop value is not included in the result. Therefore, `getSequenceLength` estimates an array size of 1. However, when actually creating the sequence, `InternalSequenceBase` does not use a step of 24 hours, but of 1 day. When you add 1 day to 2022-03-13 00:00:00, you get 2022-03-14 00:00:00. Now the stop value *is* included, and we overrun the end of the result array. The new unit test includes examples of problematic date sequences. This PR adds code to to handle the underestimation case: it checks if we're about to overrun the array, and if so, gets a new array that's larger by 1. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #37513 from bersprockets/date_sequence_array_size_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 3a1136aa05dd5e16de81c7ec804416b3498ca967) Signed-off-by: Max Gekk <max.gekk@gmail.com> 16 August 2022, 08:53:51 UTC
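The failure is time-zone dependent; a hedged PySpark reproduction of the same query with the session time zone pinned to America/Los_Angeles:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# The bug only shows up when the sequence crosses a DST "spring forward".
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

spark.sql("""
    SELECT sequence(
        timestamp'2022-03-13 00:00:00',
        timestamp'2022-03-14 00:00:00',
        interval 1 day) AS x
""").show(truncate=False)
# Before the fix, this raised java.lang.ArrayIndexOutOfBoundsException in that time zone.
```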
2ee196d [SPARK-40079] Add Imputer inputCols validation for empty input case Signed-off-by: Weichen Xu <weichen.xu@databricks.com> ### What changes were proposed in this pull request? Add Imputer inputCols validation for the empty input case ### Why are the changes needed? If Imputer inputCols is empty, the `fit` works fine, but an error is raised when saving the model: > AnalysisException: Datasource does not support writing empty or nested empty schemas. Please make sure the data schema has at least one or more column(s). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #37518 from WeichenXu123/imputer-param-validation. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> (cherry picked from commit 87094f89655b7df09cdecb47c653461ae855b0ac) Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 15 August 2022, 10:04:14 UTC
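For illustration, a hedged sketch of the behavior; with this change the empty `inputCols` case is expected to be rejected by the new validation instead of surfacing only at model save time:

```python
from pyspark.ml.feature import Imputer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, float("nan")), (float("nan"), 3.0)], ["a", "b"])

# Normal usage with non-empty inputCols/outputCols keeps working as before.
imputer = Imputer(inputCols=["a", "b"], outputCols=["a_filled", "b_filled"])
imputer.fit(df).transform(df).show()

# Imputer(inputCols=[], outputCols=[]) previously fit "successfully" and only
# failed when the model was saved; the added validation is meant to fail fast instead.
```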
a6957d3 Revert "[SPARK-40047][TEST] Exclude unused `xalan` transitive dependency from `htmlunit`" ### What changes were proposed in this pull request? This pr revert SPARK-40047 due to mvn test still need this dependency. ### Why are the changes needed? mvn test still need `xalan` dependency although GA passed before this pr. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual test: ``` mvn clean install -DskipTests -pl core -am build/mvn clean test -pl core -Dtest=noen -DwildcardSuites=org.apache.spark.ui.UISeleniumSuite ``` **Before** ``` UISeleniumSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/apache/xml/utils/PrefixResolver at java.lang.Class.getDeclaredFields0(Native Method) at java.lang.Class.privateGetDeclaredFields(Class.java:2583) at java.lang.Class.getField0(Class.java:2975) at java.lang.Class.getField(Class.java:1701) at com.gargoylesoftware.htmlunit.svg.SvgElementFactory.<clinit>(SvgElementFactory.java:64) at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.<clinit>(HtmlUnitNekoHtmlParser.java:77) at com.gargoylesoftware.htmlunit.DefaultPageCreator.<clinit>(DefaultPageCreator.java:93) at com.gargoylesoftware.htmlunit.WebClient.<init>(WebClient.java:191) at com.gargoylesoftware.htmlunit.WebClient.<init>(WebClient.java:273) at com.gargoylesoftware.htmlunit.WebClient.<init>(WebClient.java:263) ... Cause: java.lang.ClassNotFoundException: org.apache.xml.utils.PrefixResolver at java.net.URLClassLoader.findClass(URLClassLoader.java:387) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) at java.lang.Class.getDeclaredFields0(Native Method) at java.lang.Class.privateGetDeclaredFields(Class.java:2583) at java.lang.Class.getField0(Class.java:2975) at java.lang.Class.getField(Class.java:1701) at com.gargoylesoftware.htmlunit.svg.SvgElementFactory.<clinit>(SvgElementFactory.java:64) at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.<clinit>(HtmlUnitNekoHtmlParser.java:77) ... ``` **After** ``` UISeleniumSuite: - all jobs page should be rendered even though we configure the scheduling mode to fair - effects of unpersist() / persist() should be reflected - failed stages should not appear to be active - spark.ui.killEnabled should properly control kill button display - jobs page should not display job group name unless some job was submitted in a job group - job progress bars should handle stage / task failures - job details page should display useful information for stages that haven't started - job progress bars / cells reflect skipped stages / tasks - stages that aren't run appear as 'skipped stages' after a job finishes - jobs with stages that are skipped should show correct link descriptions on all jobs page - attaching and detaching a new tab - kill stage POST/GET response is correct - kill job POST/GET response is correct - stage & job retention - live UI json application list - job stages should have expected dotfile under DAG visualization - stages page should show skipped stages - Staleness of Spark UI should not last minutes or hours - description for empty jobs - Support disable event timeline Run completed in 17 seconds, 986 milliseconds. Total number of tests run: 20 Suites: completed 2, aborted 0 Tests: succeeded 20, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #37508 from LuciferYang/revert-40047. 
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit afd7098c7fb6c95aece39acc32cdad764c984cd2) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 August 2022, 18:42:11 UTC
d7af1d2 [SPARK-39976][SQL] ArrayIntersect should handle null in left expression correctly ### What changes were proposed in this pull request? `ArrayInterscet` miss judge if null contains in right expression's hash set. ``` >>> a = [1, 2, 3] >>> b = [3, None, 5] >>> df = spark.sparkContext.parallelize(data).toDF(["a","b"]) >>> df.show() +---------+------------+ | a| b| +---------+------------+ |[1, 2, 3]|[3, null, 5]| +---------+------------+ >>> df.selectExpr("array_intersect(a,b)").show() +---------------------+ |array_intersect(a, b)| +---------------------+ | [3]| +---------------------+ >>> df.selectExpr("array_intersect(b,a)").show() +---------------------+ |array_intersect(b, a)| +---------------------+ | [3, null]| +---------------------+ ``` In origin code gen's code path, when handle `ArrayIntersect`'s array1, it use the below code ``` def withArray1NullAssignment(body: String) = if (left.dataType.asInstanceOf[ArrayType].containsNull) { if (right.dataType.asInstanceOf[ArrayType].containsNull) { s""" |if ($array1.isNullAt($i)) { | if ($foundNullElement) { | $nullElementIndex = $size; | $foundNullElement = false; | $size++; | $builder.$$plus$$eq($nullValueHolder); | } |} else { | $body |} """.stripMargin } else { s""" |if (!$array1.isNullAt($i)) { | $body |} """.stripMargin } } else { body } ``` We have a flag `foundNullElement` to indicate if array2 really contains a null value. But when implement https://issues.apache.org/jira/browse/SPARK-36829, misunderstand the meaning of `ArrayType.containsNull`, so when implement `SQLOpenHashSet.withNullCheckCode()` ``` def withNullCheckCode( arrayContainsNull: Boolean, setContainsNull: Boolean, array: String, index: String, hashSet: String, handleNotNull: (String, String) => String, handleNull: String): String = { if (arrayContainsNull) { if (setContainsNull) { s""" |if ($array.isNullAt($index)) { | if (!$hashSet.containsNull()) { | $hashSet.addNull(); | $handleNull | } |} else { | ${handleNotNull(array, index)} |} """.stripMargin } else { s""" |if (!$array.isNullAt($index)) { | ${handleNotNull(array, index)} |} """.stripMargin } } else { handleNotNull(array, index) } } ``` The code path of ` if (arrayContainsNull && setContainsNull) ` is misinterpreted that array's openHashSet really have a null value. In this pr we add a new parameter `additionalCondition ` to complements the previous implementation of `foundNullElement`. Also refactor the method's parameter name. ### Why are the changes needed? Fix data correct issue ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #37436 from AngersZhuuuu/SPARK-39776-FOLLOW_UP. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit dff5c2f2e9ce233e270e0e5cde0a40f682ba9534) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 August 2022, 02:54:09 UTC
21d9db3 [SPARK-39887][SQL][3.3] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique ### What changes were proposed in this pull request? Keep the output attributes of a `Union` node's first child in the `RemoveRedundantAliases` rule to avoid correctness issues. ### Why are the changes needed? To fix the result of the following query: ``` SELECT a, b AS a FROM ( SELECT a, a AS b FROM (SELECT a FROM VALUES (1) AS t(a)) UNION ALL SELECT a, b FROM (SELECT a, b FROM VALUES (1, 2) AS t(a, b)) ) ``` Before this PR the query returns the incorrect result: ``` +---+---+ | a| a| +---+---+ | 1| 1| | 2| 2| +---+---+ ``` After this PR it returns the expected result: ``` +---+---+ | a| a| +---+---+ | 1| 1| | 1| 2| +---+---+ ``` ### Does this PR introduce _any_ user-facing change? Yes, fixes a correctness issue. ### How was this patch tested? Added new UTs. Closes #37472 from peter-toth/SPARK-39887-keep-attributes-of-unions-first-child-3.3. Authored-by: Peter Toth <ptoth@cloudera.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 August 2022, 02:41:37 UTC
221fee8 [SPARK-40047][TEST] Exclude unused `xalan` transitive dependency from `htmlunit` ### What changes were proposed in this pull request? This pr exclude `xalan` from `htmlunit` to clean warning of CVE-2022-34169: ``` Provides transitive vulnerable dependency xalan:xalan:2.7.2 CVE-2022-34169 7.5 Integer Coercion Error vulnerability with medium severity found Results powered by Checkmarx(c) ``` `xalan:xalan:2.7.2` is the latest version, the code base has not been updated for 5 years, so can't solve by upgrading `xalan`. ### Why are the changes needed? The vulnerability is described is [CVE-2022-34169](https://github.com/advisories/GHSA-9339-86wc-4qgf), better to exclude it although it's just test dependency for Spark. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GitHub Actions - Manual test: run `mvn dependency:tree -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive | grep xalan` to check that `xalan` is not matched after this pr Closes #37481 from LuciferYang/exclude-xalan. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7f3baa77acbf7747963a95d0f24e3b8868c7b16a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 August 2022, 22:10:50 UTC
248e8b4 [SPARK-40043][PYTHON][SS][DOCS] Document DataStreamWriter.toTable and DataStreamReader.table ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/30835 that adds `DataStreamWriter.toTable` and `DataStreamReader.table` into PySpark documentation. ### Why are the changes needed? To document both features. ### Does this PR introduce _any_ user-facing change? Yes, both API will be shown in PySpark reference documentation. ### How was this patch tested? Manually built the documentation and checked. Closes #37477 from HyukjinKwon/SPARK-40043. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 447003324d2cf9f2bfa799ef3a1e744a5bc9277d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 11 August 2022, 06:01:25 UTC
42b30ee [SPARK-40022][YARN][TESTS] Ignore pyspark suites in `YarnClusterSuite` when python3 is unavailable ### What changes were proposed in this pull request? This pr adds `assume(isPythonAvailable)` to `testPySpark` method in `YarnClusterSuite` to make `YarnClusterSuite` test succeeded in an environment without Python 3 configured. ### Why are the changes needed? `YarnClusterSuite` should not `ABORTED` when `python3` is not configured. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GitHub Actions - Manual test Run ``` mvn clean test -pl resource-managers/yarn -am -Pyarn -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -Dtest=none ``` in an environment without Python 3 configured: **Before** ``` YarnClusterSuite: org.apache.spark.deploy.yarn.YarnClusterSuite *** ABORTED *** java.lang.RuntimeException: Unable to load a Suite class that was discovered in the runpath: org.apache.spark.deploy.yarn.YarnClusterSuite at org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:81) at org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.map(TraversableLike.scala:286) ... Run completed in 833 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 *** 1 SUITE ABORTED *** ``` **After** ``` YarnClusterSuite: - run Spark in yarn-client mode - run Spark in yarn-cluster mode - run Spark in yarn-client mode with unmanaged am - run Spark in yarn-client mode with different configurations, ensuring redaction - run Spark in yarn-cluster mode with different configurations, ensuring redaction - yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) - SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'local' - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' - SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'local' and gateway-replacement path - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' and gateway-replacement path - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' and gateway-replacement path containing an environment variable - SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'file' - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'file' - run Spark in yarn-cluster mode unsuccessfully - run Spark in yarn-cluster mode failure after sc initialized - run Python application in yarn-client mode !!! CANCELED !!! YarnClusterSuite.this.isPythonAvailable was false (YarnClusterSuite.scala:376) - run Python application in yarn-cluster mode !!! CANCELED !!! YarnClusterSuite.this.isPythonAvailable was false (YarnClusterSuite.scala:376) - run Python application in yarn-cluster mode using spark.yarn.appMasterEnv to override local envvar !!! CANCELED !!! 
YarnClusterSuite.this.isPythonAvailable was false (YarnClusterSuite.scala:376) - user class path first in client mode - user class path first in cluster mode - monitor app using launcher library - running Spark in yarn-cluster mode displays driver log links - timeout to get SparkContext in cluster mode triggers failure - executor env overwrite AM env in client mode - executor env overwrite AM env in cluster mode - SPARK-34472: ivySettings file with no scheme or file:// scheme should be localized on driver in cluster mode - SPARK-34472: ivySettings file with no scheme or file:// scheme should retain user provided path in client mode - SPARK-34472: ivySettings file with non-file:// schemes should throw an error Run completed in 7 minutes, 2 seconds. Total number of tests run: 25 Suites: completed 2, aborted 0 Tests: succeeded 25, failed 0, canceled 3, ignored 0, pending 0 All tests passed. ``` Closes #37454 from LuciferYang/yarnclustersuite. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8e472443081342a0e0dc37aa154e30a0a6df39b7) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 10 August 2022, 00:59:49 UTC
9353437 [SPARK-40002][SQL] Don't push down limit through window using ntile ### What changes were proposed in this pull request? Change `LimitPushDownThroughWindow` so that it no longer supports pushing down a limit through a window using ntile. ### Why are the changes needed? In an unpartitioned window, the ntile function is currently applied to the result of the limit. This behavior produces results that conflict with Spark 3.1.3, Hive 2.3.9 and Prestodb 0.268 #### Example Assume this data: ``` create table t1 stored as parquet as select * from range(101); ``` Also assume this query: ``` select id, ntile(10) over (order by id) as nt from t1 limit 10; ``` With Spark 3.2.2, Spark 3.3.0, and master, the limit is applied before the ntile function: ``` +---+---+ |id |nt | +---+---+ |0 |1 | |1 |2 | |2 |3 | |3 |4 | |4 |5 | |5 |6 | |6 |7 | |7 |8 | |8 |9 | |9 |10 | +---+---+ ``` With Spark 3.1.3, and Hive 2.3.9, and Prestodb 0.268, the limit is applied _after_ ntile. Spark 3.1.3: ``` +---+---+ |id |nt | +---+---+ |0 |1 | |1 |1 | |2 |1 | |3 |1 | |4 |1 | |5 |1 | |6 |1 | |7 |1 | |8 |1 | |9 |1 | +---+---+ ``` Hive 2.3.9: ``` +-----+-----+ | id | nt | +-----+-----+ | 0 | 1 | | 1 | 1 | | 2 | 1 | | 3 | 1 | | 4 | 1 | | 5 | 1 | | 6 | 1 | | 7 | 1 | | 8 | 1 | | 9 | 1 | +-----+-----+ 10 rows selected (1.72 seconds) ``` Prestodb 0.268: ``` id | nt ----+---- 0 | 1 1 | 1 2 | 1 3 | 1 4 | 1 5 | 1 6 | 1 7 | 1 8 | 1 9 | 1 (10 rows) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Two new unit tests. Closes #37443 from bersprockets/pushdown_ntile. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c9156e5a3b9cb290c7cdda8db298c9875e67aa5e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 August 2022, 02:40:08 UTC
66faaa5 [SPARK-39965][K8S] Skip PVC cleanup when driver doesn't own PVCs ### What changes were proposed in this pull request? This PR aims to skip PVC cleanup logic when `spark.kubernetes.driver.ownPersistentVolumeClaim=false`. ### Why are the changes needed? To simplify Spark termination log by removing unnecessary log containing Exception message when Spark jobs have no PVC permission and at the same time `spark.kubernetes.driver.ownPersistentVolumeClaim` is `false`. ### Does this PR introduce _any_ user-facing change? Only in the termination logs of Spark jobs that has no PVC permission. ### How was this patch tested? Manually. Closes #37433 from dongjoon-hyun/SPARK-39965. Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: pralabhkumar <pralabhkumar@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 87b312a9c9273535e22168c3da73834c22e1fbbb) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 August 2022, 16:58:25 UTC
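A sketch of the configuration involved; the master URL is a placeholder, and with ownership disabled the driver is expected to skip the PVC cleanup attempt (and its noisy exception) at shutdown:

```python
from pyspark.sql import SparkSession

# Illustrative Kubernetes session; only the PVC-ownership flag matters here.
spark = (SparkSession.builder
         .master("k8s://https://example.invalid:6443")   # placeholder master URL
         .config("spark.kubernetes.driver.ownPersistentVolumeClaim", "false")
         .getOrCreate())
```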
369b014 [SPARK-38034][SQL] Optimize TransposeWindow rule ### What changes were proposed in this pull request? Optimize the TransposeWindow rule to extend applicable cases and optimize time complexity. TransposeWindow rule will try to eliminate unnecessary shuffle: but the function compatiblePartitions will only take the first n elements of the window2 partition sequence, for some cases, this will not take effect, like the case below:  val df = spark.range(10).selectExpr("id AS a", "id AS b", "id AS c", "id AS d") df.selectExpr( "sum(`d`) OVER(PARTITION BY `b`,`a`) as e", "sum(`c`) OVER(PARTITION BY `a`) as f" ).explain Current plan == Physical Plan == *(5) Project [e#10L, f#11L] +- Window [sum(c#4L) windowspecdefinition(a#2L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS f#11L], [a#2L]    +- *(4) Sort [a#2L ASC NULLS FIRST], false, 0       +- Exchange hashpartitioning(a#2L, 200), true, [id=#41]          +- *(3) Project [a#2L, c#4L, e#10L]             +- Window [sum(d#5L) windowspecdefinition(b#3L, a#2L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS e#10L], [b#3L, a#2L]                +- *(2) Sort [b#3L ASC NULLS FIRST, a#2L ASC NULLS FIRST], false, 0                   +- Exchange hashpartitioning(b#3L, a#2L, 200), true, [id=#33]                      +- *(1) Project [id#0L AS d#5L, id#0L AS b#3L, id#0L AS a#2L, id#0L AS c#4L]                         +- *(1) Range (0, 10, step=1, splits=10) Expected plan: == Physical Plan == *(4) Project [e#924L, f#925L] +- Window [sum(d#43L) windowspecdefinition(b#41L, a#40L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS e#924L], [b#41L, a#40L]  +- *(3) Sort [b#41L ASC NULLS FIRST, a#40L ASC NULLS FIRST], false, 0       +- *(3) Project [d#43L, b#41L, a#40L, f#925L]          +- Window [sum(c#42L) windowspecdefinition(a#40L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS f#925L], [a#40L]             +- *(2) Sort [a#40L ASC NULLS FIRST], false, 0                +- Exchange hashpartitioning(a#40L, 200), true, [id=#282]                   +- *(1) Project [id#38L AS d#43L, id#38L AS b#41L, id#38L AS a#40L, id#38L AS c#42L]                      +- *(1) Range (0, 10, step=1, splits=10) Also the permutations method has a O(n!) time complexity, which is very expensive when there are many partition columns, we could try to optimize it. ### Why are the changes needed? We could apply the rule for more cases, which could improve the execution performance by eliminate unnecessary shuffle, and by reducing the time complexity from O(n!) to O(n2), the performance for the rule itself could improve ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? UT Closes #35334 from constzhou/SPARK-38034_optimize_transpose_window_rule. Authored-by: xzhou <15210830305@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 0cc331dc7e51e53000063052b0c8ace417eb281b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2022, 10:43:16 UTC
26f0d50 [SPARK-39981][SQL] Throw the exception QueryExecutionErrors.castingCauseOverflowErrorInTableInsert in Cast This PR is a followup of https://github.com/apache/spark/pull/37283, which missed the `throw` keyword in the interpreted path. The change throws the exception as intended instead of returning the exception object. Yes, it will now throw an exception as expected in the interpreted path. Not tested separately because the change is straightforward. Closes #37414 from HyukjinKwon/SPARK-39981. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e6b9c6166a08ad4dca2550bbbb151fa575b730a8) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 August 2022, 09:32:36 UTC
c358ee6 [SPARK-39775][CORE][AVRO] Disable validate default values when parsing Avro schemas ### What changes were proposed in this pull request? This PR disables validate default values when parsing Avro schemas. ### Why are the changes needed? Spark will throw exception if upgrade to Spark 3.2. We have fixed the Hive serde tables before: SPARK-34512. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #37191 from wangyum/SPARK-39775. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5c1b99f441ec5e178290637a9a9e7902aaa116e1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2022, 03:26:14 UTC
15ebd56 [SPARK-39952][SQL] SaveIntoDataSourceCommand should recache result relation ### What changes were proposed in this pull request? Call `recacheByPlan` on the result relation inside `SaveIntoDataSourceCommand`. ### Why are the changes needed? The behavior of `SaveIntoDataSourceCommand` is similar to `InsertIntoDataSourceCommand`, which supports appending or overwriting data. In order to keep data consistent, we should always recache the result relation with `recacheByPlan` as a post-hoc step. ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Added test Closes #37380 from ulysses-you/refresh. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5fe0b245f7891a05bc4e1e641fd0aa9130118ea4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 August 2022, 17:04:10 UTC
6e9a58f [SPARK-39947][BUILD] Upgrade Jersey to 2.36 ### What changes were proposed in this pull request? This PR upgrades Jersey from 2.35 to 2.36. ### Why are the changes needed? This version adapts to Jackson 2.13.3, which is also the version Spark currently uses - [Adopt Jackson 2.13](https://github.com/eclipse-ee4j/jersey/pull/4928) - [Update Jackson to 2.13.3](https://github.com/eclipse-ee4j/jersey/pull/5076) The release notes are as follows: - https://github.com/eclipse-ee4j/jersey/releases/tag/2.36 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes #37375 from LuciferYang/jersey-236. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit d1c145b0b0b892fcbf1e1adda7b8ecff75c56f6d) Signed-off-by: Sean Owen <srowen@gmail.com> # Conflicts: # dev/deps/spark-deps-hadoop-2-hive-2.3 # dev/deps/spark-deps-hadoop-3-hive-2.3 # pom.xml 03 August 2022, 13:32:23 UTC
630dc7e [SPARK-39900][SQL] Address partial or negated condition in binary format's predicate pushdown ### What changes were proposed in this pull request? fix `BinaryFileFormat` filter push down bug. Before modification, when Filter tree is: ```` -Not - - IsNotNull ```` Since `IsNotNull` cannot be matched, `IsNotNull` will return a result that is always true (that is, `case _ => (_ => true)`), that is, no filter pushdown is performed. But because there is still a `Not`, after negation, it will return a result that is always False, that is, no result can be returned. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? test suit in `BinaryFileFormatSuite` ``` testCreateFilterFunction( Seq(Not(IsNull(LENGTH))), Seq((t1, true), (t2, true), (t3, true))) ``` Closes #37350 from zzzzming95/SPARK-39900. Lead-authored-by: zzzzming95 <505306252@qq.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a0dc7d9117b66426aaa2257c8d448a2f96882ecd) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2022, 12:23:04 UTC
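A small sketch of the kind of predicate the fixed pushdown handles, using the built-in `binaryFile` source; the input path is illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The binaryFile source exposes path, modificationTime, length and content,
# and can push down filters on length and modificationTime.
df = spark.read.format("binaryFile").load("/tmp/some-binary-files")  # illustrative path

# A negated predicate such as NOT (length IS NULL) previously collapsed into an
# always-false pushed filter, so the scan returned no rows.
df.filter("NOT (length IS NULL)").select("path", "length").show()
```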
3b02351 [SPARK-39867][SQL][3.3] Fix scala style ### What changes were proposed in this pull request? fix scala style ### Why are the changes needed? fix failed test ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass CI Closes #37394 from ulysses-you/style. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2022, 08:48:12 UTC
bd3f36f [SPARK-39962][PYTHON][SQL] Apply projection when group attributes are empty ### What changes were proposed in this pull request? This PR proposes to apply the projection to respect the reordered columns in its child when group attributes are empty. ### Why are the changes needed? To respect the column order in the child. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug as below: ```python import pandas as pd from pyspark.sql import functions as f @f.pandas_udf("double") def AVG(x: pd.Series) -> float: return x.mean() abc = spark.createDataFrame([(1.0, 5.0, 17.0)], schema=["a", "b", "c"]) abc.agg(AVG("a"), AVG("c")).show() abc.select("c", "a").agg(AVG("a"), AVG("c")).show() ``` **Before** ``` +------+------+ |AVG(a)|AVG(c)| +------+------+ | 17.0| 1.0| +------+------+ ``` **After** ``` +------+------+ |AVG(a)|AVG(c)| +------+------+ | 1.0| 17.0| +------+------+ ``` ### How was this patch tested? Manually tested, and added a unit test. Closes #37390 from HyukjinKwon/SPARK-39962. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5335c784ae76c9cc0aaa7a4b57b3cd6b3891ad9a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2022, 07:11:33 UTC
2254240 [SPARK-39867][SQL] Global limit should not inherit OrderPreservingUnaryNode Make GlobalLimit inherit UnaryNode rather than OrderPreservingUnaryNode. A global limit cannot promise that its output ordering is the same as its child's; it actually depends on the chosen physical plan. For all physical plans with global limits: - CollectLimitExec: it does not promise output ordering - GlobalLimitExec: it requires all tuples, so it can assume the child is a shuffle or a single partition, and can therefore use the child's output ordering - TakeOrderedAndProjectExec: it does the sort inside its implementation This bug gets worse since we pulled out the V1 write required ordering. Yes, bug fix. Fixed and added tests. Closes #37284 from ulysses-you/sort. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e9cc1024df4d587a0f456842d495db91984ed9db) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 August 2022, 04:03:07 UTC
ea6d577 [SPARK-39911][SQL][3.3] Optimize global Sort to RepartitionByExpression This is a backport of https://github.com/apache/spark/pull/37330 into branch-3.3. ### What changes were proposed in this pull request? Optimize a global sort to RepartitionByExpression, for example: ``` Sort local Sort local Sort global => RepartitionByExpression ``` ### Why are the changes needed? If a global sort is below a local sort, the only meaningful thing it provides is its distribution, so this PR optimizes that global sort into a RepartitionByExpression to save a sort. ### Does this PR introduce _any_ user-facing change? We fixed a bug in https://github.com/apache/spark/pull/37250, and that PR was backported into branch-3.3. However, that fix may introduce a performance regression. This PR itself only improves performance, but in order to avoid the regression we also backport it; see the details at https://github.com/apache/spark/pull/37330#issuecomment-1201979396 ### How was this patch tested? Added test Closes #37330 from ulysses-you/optimize-sort. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> Closes #37373 from ulysses-you/SPARK-39911-3.3. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 August 2022, 03:11:22 UTC
41779ea [MINOR][DOCS] Remove generated statement about Scala version in docs homepage as Spark supports multiple versions ### What changes were proposed in this pull request? Remove this statement from the docs homepage: "For the Scala API, Spark 3.3.0 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x)." ### Why are the changes needed? It's misleading, as Spark supports 2.12 and 2.13. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #37381 from srowen/RemoveScalaStatement. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 73ef5432547e3e8e9b0cce0913200a94402aeb4c) Signed-off-by: Sean Owen <srowen@gmail.com> 03 August 2022, 02:18:52 UTC
a0242ea [SPARK-39951][SQL] Update Parquet V2 columnar check for nested fields ### What changes were proposed in this pull request? Update the `supportsColumnarReads` check for Parquet V2 to take into account support for nested fields. Also fixed a typo I saw in one of the tests. ### Why are the changes needed? Match Parquet V1 in returning columnar batches if nested field vectorization is enabled. ### Does this PR introduce _any_ user-facing change? Parquet V2 scans will return columnar batches with nested fields if the config is enabled. ### How was this patch tested? Added new UTs checking both V1 and V2 return columnar batches for nested fields when the config is enabled. Closes #37379 from Kimahriman/parquet-v2-columnar. Authored-by: Adam Binford <adamq43@gmail.com> Signed-off-by: Chao Sun <sunchao@apple.com> 02 August 2022, 23:50:46 UTC
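A hedged sketch of the configuration involved: `spark.sql.parquet.enableNestedColumnVectorizedReader` enables nested-field vectorization, and clearing `spark.sql.sources.useV1SourceList` routes the scan through the DSv2 path; the data path is illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable vectorized reads of nested fields and use the DSv2 Parquet reader.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")
spark.conf.set("spark.sql.sources.useV1SourceList", "")  # empty list => built-in sources use V2

df = spark.range(10).selectExpr("named_struct('x', id, 'y', id * 2) AS s")
df.write.mode("overwrite").parquet("/tmp/nested-parquet")  # illustrative path

# With the fix, the V2 scan over the struct column can return columnar batches.
spark.read.parquet("/tmp/nested-parquet").select("s.x", "s.y").show()
```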
fb9f85e [SPARK-39932][SQL] WindowExec should clear the final partition buffer ### What changes were proposed in this pull request? Explicitly clear final partition buffer if can not find next in `WindowExec`. The same fix in `WindowInPandasExec` ### Why are the changes needed? We do a repartition after a window, then we need do a local sort after window due to RoundRobinPartitioning shuffle. The error stack: ```java ExternalAppendOnlyUnsafeRowArray INFO - Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0 at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:352) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:435) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:455) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:226) at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:355) ``` `WindowExec` only clear buffer in `fetchNextPartition` so the final partition buffer miss to clear. It is not a big problem since we have task completion listener. ```scala taskContext.addTaskCompletionListener(context -> { cleanupResources(); }); ``` This bug only affects if the window is not the last operator for this task and the follow operator like sort. ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? N/A Closes #37358 from ulysses-you/window. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1fac870126c289a7ec75f45b6b61c93b9a4965d4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 August 2022, 09:06:08 UTC
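For context, a hedged PySpark sketch of the operator shape described above: a window whose output feeds a round-robin `repartition`, which makes the planner add a local sort of the window output; the data and partition count are illustrative:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000).withColumn("key", F.col("id") % 10)

w = Window.partitionBy("key").orderBy("id")
windowed = df.withColumn("rn", F.row_number().over(w))

# repartition(n) uses RoundRobinPartitioning, so a local sort is planned on the
# window output; without clearing the final partition buffer, the window buffer
# and the sorter compete for the same execution memory.
result = windowed.repartition(8)
result.write.mode("overwrite").parquet("/tmp/window-output")  # illustrative sink
```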
301d6e3 [SPARK-39857][SQL][TESTS][FOLLOW-UP] Make "translate complex expression" pass with ANSI mode on ### What changes were proposed in this pull request? This PR fixes `translate complex expression` to pass with ANSI mode on. We do push `Abs` with ANSI mode on (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/catalyst/util/V2ExpressionBuilder.scala#L93): ``` [info] - translate complex expression *** FAILED *** (22 milliseconds) [info] Expected None, but got Some((ABS(cint) - 2) <= 1) (DataSourceV2StrategySuite.scala:325) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563) [info] at org.scalatest.Assertions.assertResult(Assertions.scala:867) [info] at org.scalatest.Assertions.assertResult$(Assertions.scala:863) [info] at org.scalatest.funsuite.AnyFunSuite.assertResult(AnyFunSuite.scala:1563) [info] at org.apache.spark.sql.execution.datasources.v2.DataSourceV2StrategySuite.testTranslateFilter(DataSourceV2StrategySuite.scala:325) [info] at org.apache.spark.sql.execution.datasources.v2.DataSourceV2StrategySuite.$anonfun$new$4(DataSourceV2StrategySuite.scala:176) [info] at org.apache.spark.sql.execution.datasources.v2.DataSourceV2StrategySuite.$anonfun$new$4$adapted(DataSourceV2StrategySuite.scala:170) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.apache.spark.sql.execution.datasources.v2.DataSourceV2StrategySuite.$anonfun$new$3(DataSourceV2StrategySuite.scala:170) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:204) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:65) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:65) ``` https://github.com/apache/spark/runs/7595362617?check_suite_focus=true ### Why are the changes needed? To make the build pass with ANSI mode on. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually ran the unittest with ANSI mode on. Closes #37349 from HyukjinKwon/SPARK-39857-followup. 
Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c211abe970d9e88fd25cd859ea729e630d9491a7) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 01 August 2022, 00:12:48 UTC
1999104 [SPARK-39865][SQL][3.3] Show proper error messages on the overflow errors of table insert ### What changes were proposed in this pull request? In Spark 3.3, the error message of ANSI CAST is improved. However, the table insertion is using the same CAST expression: ``` > create table tiny(i tinyint); > insert into tiny values (1000); org.apache.spark.SparkArithmeticException[CAST_OVERFLOW]: The value 1000 of the type "INT" cannot be cast to "TINYINT" due to an overflow. Use `try_cast` to tolerate overflow and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. ``` Showing the hint of `If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error` doesn't help at all. This PR is to fix the error message. After changes, the error message of this example will become: ``` org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] Fail to insert a value of "INT" type into the "TINYINT" type column `i` due to an overflow. Use `try_cast` on the input value to tolerate overflow and return NULL instead. ``` ### Why are the changes needed? Show proper error messages on the overflow errors of table insert. The current message is super confusing. ### Does this PR introduce _any_ user-facing change? Yes, after changes it show proper error messages on the overflow errors of table insert. ### How was this patch tested? Unit test Closes #37311 from gengliangwang/PR_TOOL_PICK_PR_37283_BRANCH-3.3. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> 28 July 2022, 18:26:34 UTC
609efe1 [SPARK-39857][SQL][3.3] V2ExpressionBuilder uses the wrong LiteralValue data type for In predicate ### What changes were proposed in this pull request? When building a V2 In `Predicate` in `V2ExpressionBuilder`, `InSet.dataType` (which is BooleanType) is used to build the `LiteralValue`; `InSet.child.dataType` should be used instead. This is a backport of https://github.com/apache/spark/pull/37271 to 3.3. ### Why are the changes needed? Bug fix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New test Closes #37324 from huaxingao/backport. Authored-by: huaxingao <huaxin_gao@apple.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> 28 July 2022, 01:07:35 UTC
ee8cafb [SPARK-39839][SQL] Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check ### What changes were proposed in this pull request? Update the `UnsafeRow` structural integrity check in `UnsafeRowUtils.validateStructuralIntegrity` to handle a special case with null variable-length DecimalType value. ### Why are the changes needed? The check should follow the format that `UnsafeRowWriter` produces. In general, `UnsafeRowWriter` clears out a field with zero when the field is set to be null, c.f. `UnsafeRowWriter.setNullAt(ordinal)` and `UnsafeRow.setNullAt(ordinal)`. But there's a special case for `DecimalType` values: this is the only type that is both: - can be fixed-length or variable-length, depending on the precision, and - is mutable in `UnsafeRow`. To support a variable-length `DecimalType` to be mutable in `UnsafeRow`, the `UnsafeRowWriter` always leaves a 16-byte space in the variable-length section of the `UnsafeRow` (tail end of the row), regardless of whether the `Decimal` value being written is null or not. In the fixed-length part of the field, it would be an "OffsetAndSize", and the `offset` part always points to the start offset of the variable-length part of the field, while the `size` part will either be `0` for the null value, or `1` to at most `16` for non-null values. When `setNullAt(ordinal)` is called instead of passing a null value to `write(int, Decimal, int, int)`, however, the `offset` part gets zero'd out and this field stops being mutable. There's a comment on `UnsafeRow.setDecimal` that mentions to keep this field able to support updates, `setNullAt(ordinal)` cannot be called, but there's no code enforcement of that. So we need to recognize that in the structural integrity check and allow variable-length `DecimalType` to have non-zero field even for null. Note that for non-null values, the existing check does conform to the format from `UnsafeRowWriter`. It's only null value of variable-length `DecimalType` that'd trigger a bug, which can affect Structured Streaming's checkpoint file read where this check is applied. ### Does this PR introduce _any_ user-facing change? Yes, previously the `UnsafeRow` structural integrity validation will return false positive for correct data, when there's a null value in a variable-length `DecimalType` field. The fix will no longer return false positive. Because the Structured Streaming checkpoint file validation uses this check, previously a good checkpoint file may be rejected by the check, and the only workaround is to disable the check; with the fix, the correct checkpoint file will be allowed to load. ### How was this patch tested? Added new test case in `UnsafeRowUtilsSuite` Closes #37252 from rednaxelafx/fix-unsaferow-validation. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit c608ae2fc6a3a50f2e67f2a3dad8d4e4be1aaf9f) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 27 July 2022, 23:49:45 UTC
9fdd097 [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` ### What changes were proposed in this pull request? This pr change `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce test maximum memory usage. ### Why are the changes needed? Reduce the maximum memory usage of test cases. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Should monitor CI Closes #37298 from LuciferYang/reduce-local-cluster-memory. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 July 2022, 00:26:56 UTC
c9d5675 [SPARK-39856][TESTS][INFRA] Skip q72 at TPC-DS build at GitHub Actions ### What changes were proposed in this pull request? This PR reverts https://github.com/apache/spark/commit/7358253755762f9bfe6cedc1a50ec14616cfeace, https://github.com/apache/spark/commit/ae1f6a26ed39b297ace8d6c9420b72a3c01a3291 and https://github.com/apache/spark/commit/72b55ccf8327c00e173ab6130fdb428ad0d5aacc because they do not help fixing the TPC-DS build. In addition, this PR skips the problematic query in GitHub Actions to avoid OOM. ### Why are the changes needed? To make the build pass. ### Does this PR introduce _any_ user-facing change? No, dev and test-only. ### How was this patch tested? CI in this PR should test it out. Closes #37289 from HyukjinKwon/SPARK-39856-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit de9a4b0747a4127e320f80f5e1bf431429da70a9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 26 July 2022, 09:25:58 UTC
84ed614 [SPARK-39856][SQL][TESTS][FOLLOW-UP] Increase the number of partitions in TPC-DS build to avoid out-of-memory ### What changes were proposed in this pull request? This PR increases the number of partitions further more (see also https://github.com/apache/spark/pull/37270) ### Why are the changes needed? To make the build pass. At least, two builds (https://github.com/apache/spark/runs/7500542538?check_suite_focus=true and https://github.com/apache/spark/runs/7511748355?check_suite_focus=true) passed after https://github.com/apache/spark/pull/37273. I assume that the number of partitions helps, and this PR increases some more. ### Does this PR introduce _any_ user-facing change? No, test and dev-only. ### How was this patch tested? It's tested in https://github.com/LuciferYang/spark/runs/7497163716?check_suite_focus=true Closes #37286 from HyukjinKwon/SPARK-39856-follwup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 72b55ccf8327c00e173ab6130fdb428ad0d5aacc) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 26 July 2022, 06:03:28 UTC
aa53fca [SPARK-39856][SQL][TESTS][FOLLOW-UP] Increase the number of partitions in TPC-DS build to avoid out-of-memory ### What changes were proposed in this pull request? This PR increases the number of partitions further more (see also https://github.com/apache/spark/pull/37270) ### Why are the changes needed? To make the build pass. ### Does this PR introduce _any_ user-facing change? No, test and dev-only. ### How was this patch tested? It's tested in https://github.com/LuciferYang/spark/runs/7497163716?check_suite_focus=true Closes #37273 from LuciferYang/SPARK-39856-FOLLOWUP. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 July 2022, 13:08:47 UTC
7603f8d [SPARK-39835][SQL] Fix EliminateSorts remove global sort below the local sort ### What changes were proposed in this pull request? Correct `EliminateSorts` as follows: - If the upper sort is global then we can remove the global or local sort recursively. - If the upper sort is local then we can only remove the local sort recursively. ### Why are the changes needed? If a global sort is below a local sort, we should not remove the global sort because the output partitioning can be affected. This issue is going to get worse since we pulled out the V1 Write sort to the logical side. ### Does this PR introduce _any_ user-facing change? Yes, bug fix ### How was this patch tested? Added test Closes #37250 from ulysses-you/remove-sort. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5dca26d514a150bda58f7c4919624c9638498fec) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 July 2022, 10:35:24 UTC
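A minimal PySpark sketch of the plan shape in question, where a global sort (`orderBy`) sits below a local sort (`sortWithinPartitions`); the column names are arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100).selectExpr("id AS a", "id % 7 AS b")

# The global sort still determines the output partitioning, so EliminateSorts
# must not drop it even though a local sort sits on top.
plan = df.orderBy("a").sortWithinPartitions("b")
plan.explain()
```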
c7e2604 [SPARK-39856][SQL][TESTS] Increase the number of partitions in TPC-DS build to avoid out-of-memory ### What changes were proposed in this pull request? This PR proposes to avoid out-of-memory in TPC-DS build at GitHub Actions CI by: - Increasing the number of partitions being used in shuffle. - Truncating precisions after 10th in floats. The number of partitions was previously set to 1 because of different results in precisions that generally we can just ignore. - Sort the results regardless of join type since Apache Spark does not guarantee the order of results ### Why are the changes needed? One of the reasons for the large memory usage seems to be single partition that's being used in the shuffle. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? GitHub Actions in this CI will test it out. Closes #37270 from HyukjinKwon/deflake-tpcds. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7358253755762f9bfe6cedc1a50ec14616cfeace) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 July 2022, 03:45:05 UTC
0a6ed8a [SPARK-39847][SS] Fix race condition in RocksDBLoader.loadLibrary() if caller thread is interrupted ### What changes were proposed in this pull request? This PR fixes a race condition in `RocksDBLoader.loadLibrary()`, which can occur if the thread which calls that method is interrupted. One of our jobs experienced a failure in `RocksDBLoader`: ``` Caused by: java.lang.IllegalThreadStateException at java.lang.Thread.start(Thread.java:708) at org.apache.spark.sql.execution.streaming.state.RocksDBLoader$.loadLibrary(RocksDBLoader.scala:51) ``` After investigation, we determined that this was due to task cancellation/interruption: if the task which starts the RocksDB library loading is interrupted, another thread may begin a load and crash with the thread state exception: - Although the `loadLibraryThread` child thread is is uninterruptible, the task thread which calls loadLibrary is still interruptible. - Let's say we have two tasks, A and B, both of which will call `RocksDBLoader.loadLibrary()` - Say that Task A wins the race to perform the load and enters the `synchronized` block in `loadLibrary()`, starts the `loadLibraryThread`, then blocks in the `loadLibraryThread.join()` call. - If Task A is interrupted, an `InterruptedException` will be thrown and it will exit the loadLibrary synchronized block. - At this point, Task B enters the synchronized block of its `loadLibrary() call and sees that `exception == null` because the `loadLibraryThread` started by the other task is still running, so Task B calls `loadLibraryThread.start()` and hits the thread state error because it tries to start an already-started thread. This PR fixes this issue by adding code to check `loadLibraryThread`'s state before calling `start()`: if the thread has already been started then we will skip the `start()` call and proceed directly to the `join()`. I also modified the logging so that we can detect when this case occurs. ### Why are the changes needed? Fix a bug that can lead to task or job failures. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I reproduced the original race condition by adding a `Thread.sleep(10000)` to `loadLibraryThread.run()` (so it wouldn't complete instantly), then ran ```scala test("multi-threaded RocksDBLoader calls with interruption") { val taskThread = new Thread("interruptible Task Thread 1") { override def run(): Unit = { RocksDBLoader.loadLibrary() } } taskThread.start() // Give the thread time to enter the `loadLibrary()` call: Thread.sleep(1000) taskThread.interrupt() // Check that the load hasn't finished: assert(RocksDBLoader.exception == null) assert(RocksDBLoader.loadLibraryThread.getState != Thread.State.NEW) // Simulate the second task thread starting the load: RocksDBLoader.loadLibrary() // The load should finish successfully: RocksDBLoader.exception.isEmpty } ``` This test failed prior to my changes and succeeds afterwards. I don't want to actually commit this test because I'm concerned about flakiness and false-negatives: in order to ensure that the test would have failed before my change, we need to carefully control the thread interleaving. This code rarely changes and is relatively simple, so I think the ROI on spending time to write and commit a reliable test is low. Closes #37260 from JoshRosen/rocksdbloader-fix. 
Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 9cee1bb2527a496943ffedbd935dc737246a2d89) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 July 2022, 22:46:26 UTC
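A hedged sketch of the guard described in the entry above. The object below only approximates `RocksDBLoader` (the real loader uses Spark's uninterruptible thread and logging, and actually loads the native library), but it shows the state check that prevents a second `start()` on an already-started thread.

```scala
// Simplified stand-in for RocksDBLoader; the native-library call is elided.
object RocksDBLoaderSketch {
  @volatile var exception: Option[Throwable] = null

  val loadLibraryThread = new Thread("load-rocksdb") {
    override def run(): Unit = {
      exception = try {
        // RocksDB.loadLibrary() would go here in the real loader.
        None
      } catch {
        case t: Throwable => Some(t)
      }
    }
  }

  def loadLibrary(): Unit = synchronized {
    if (exception == null) {
      // The fix: an interrupted caller may already have started the thread,
      // so only call start() if it has never been started before.
      if (loadLibraryThread.getState == Thread.State.NEW) {
        loadLibraryThread.start()
      }
      loadLibraryThread.join()
      exception.foreach(e => throw e)
    }
  }
}
```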
e24e7bf [SPARK-39831][BUILD] Fix R dependencies installation failure ### What changes were proposed in this pull request? Move `libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev` from `Install dependencies for documentation generation` to `Install R linter dependencies and SparkR`. Update after https://github.com/apache/spark/pull/37243: **add `apt update` before installation.** ### Why are the changes needed? To make CI happy: `Install R linter dependencies and SparkR` started to fail after `devtools_2.4.4` was released. ``` --------------------------- [ANTICONF] -------------------------------- Configuration failed to find the fontconfig freetype2 library. Try installing: * deb: libfontconfig1-dev (Debian, Ubuntu, etc) * rpm: fontconfig-devel (Fedora, EPEL) * csw: fontconfig_dev (Solaris) * brew: freetype (OSX) ``` It seems that `libfontconfig1-dev` is needed now. Also refer to https://github.com/r-lib/systemfonts/issues/35#issuecomment-633560151 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Closes #37247 from Yikun/patch-25. Lead-authored-by: Ruifeng Zheng <ruifengz@apache.org> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ffa82c219029a7f6f3caf613dd1d0ab56d0c599e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 July 2022, 03:12:08 UTC
b54d985 Revert "[SPARK-39831][BUILD] Fix R dependencies installation failure" This reverts commit 29290306749f75eb96f51fc5b61114e9b8a3bf53. 22 July 2022, 00:05:09 UTC
25fdf93 [SPARK-39831][BUILD] Fix R dependencies installation failure ### What changes were proposed in this pull request? Move `libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev` from `Install dependencies for documentation generation` to `Install R linter dependencies and SparkR` ### Why are the changes needed? To make CI happy: `Install R linter dependencies and SparkR` started to fail after `devtools_2.4.4` was released. ``` --------------------------- [ANTICONF] -------------------------------- Configuration failed to find the fontconfig freetype2 library. Try installing: * deb: libfontconfig1-dev (Debian, Ubuntu, etc) * rpm: fontconfig-devel (Fedora, EPEL) * csw: fontconfig_dev (Solaris) * brew: freetype (OSX) ``` It seems that `libfontconfig1-dev` is needed now. Also refer to https://github.com/r-lib/systemfonts/issues/35#issuecomment-633560151 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #37243 from zhengruifeng/ci_add_dep. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 67efa318ec8cababdb5683ac262a8ebc3b3beefb) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 July 2022, 13:29:27 UTC
dcaa6e0 [MINOR][PYTHON][DOCS] Fix broken Example section in col/column functions This PR fixes a bug in the documentation. Trailing `'` breaks Example section in Python reference documentation. This PR removes it. To render the documentation as intended in NumPy documentation style. Yes, the documentation is updated. **Before** <img width="789" alt="Screen Shot 2022-07-19 at 12 20 55 PM" src="https://user-images.githubusercontent.com/6477701/179661216-715dec96-bff2-474f-ab48-41577bf4c15c.png"> **After** <img width="633" alt="Screen Shot 2022-07-19 at 12 48 04 PM" src="https://user-images.githubusercontent.com/6477701/179661245-72d15184-aeed-43c2-b9c9-5f3cab1ae28d.png"> Manually built the documentation and tested. Closes #37223 from HyukjinKwon/minor-doc-fx. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 2bdb5bfa48d1fc44358c49f7e379c2afc4a1a32f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 19 July 2022, 08:53:44 UTC
aeafb17 [SPARK-39806][SQL] Accessing `_metadata` on partitioned table can crash a query This change alters the projection used in `FileScanRDD` to attach file metadata to a row produced by the reader. This projection used to remove the partitioning columns from the produced row. The produced row had a different schema than expected by the consumers, and was missing part of the data, which resulted in query failure. This is a bug. `FileScanRDD` should produce rows matching the expected schema and containing all the requested data. Queries should not crash due to internal errors. No. Adds a new test in `FileMetadataStructSuite.scala` that reproduces the issue. Closes #37214 from ala/metadata-partition-by. Authored-by: Ala Luszczak <ala@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 385f1c8e4037928afafbf6664e30dc268510c05e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 July 2022, 01:07:13 UTC
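For illustration, a small end-to-end example of the user-facing scenario this fix covers; the path, app name, and column names are made up, and the comments reflect the entry's description of the bug rather than the exact internal error.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("metadata-example").master("local[*]").getOrCreate()
import spark.implicits._

// Write a tiny partitioned table (illustrative path).
Seq((1, "a"), (2, "b")).toDF("id", "part")
  .write.partitionBy("part").mode("overwrite").parquet("/tmp/partitioned_tbl")

// Before the fix, the FileScanRDD projection could drop the partition column
// from the produced row, causing queries like this to fail; after the fix it
// returns the data column, the partition column, and the file metadata.
spark.read.parquet("/tmp/partitioned_tbl")
  .select("id", "part", "_metadata.file_path", "_metadata.file_size")
  .show(truncate = false)
```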
7765b8a [SPARK-39777][DOCS] Remove Hive bucketing incompatibility documentation ### What changes were proposed in this pull request? Spark has supported Hive bucketing (with the Hive hash function - https://issues.apache.org/jira/browse/SPARK-32709 and https://issues.apache.org/jira/browse/SPARK-32712) since Spark 3.3.0, so we should also update the documentation to reflect the fact that we are no longer incompatible with Hive bucketing. ### Why are the changes needed? Update user-facing documentation to avoid confusion. ### Does this PR introduce _any_ user-facing change? Yes, the doc itself. ### How was this patch tested? Manually checked the doc file locally. Closes #37189 from c21/doc. Authored-by: Cheng Su <scnju13@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 528b9ebd165cb226e4365b5b17ceae49a3a7aa6f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 July 2022, 00:56:24 UTC
0e2758c [SPARK-39758][SQL][3.3] Fix NPE from the regexp functions on invalid patterns ### What changes were proposed in this pull request? In the PR, I propose to catch `PatternSyntaxException` while compiling the regexp pattern in `regexp_extract`, `regexp_extract_all` and `regexp_instr`, and replace the exception with Spark's exception w/ the error class `INVALID_PARAMETER_VALUE`. In this way, Spark SQL will output the error in the form: ```sql org.apache.spark.SparkRuntimeException [INVALID_PARAMETER_VALUE] The value of parameter(s) 'regexp' in `regexp_instr` is invalid: ) ? ``` instead of (on Spark 3.3.0): ```java java.lang.NullPointerException: null ``` Also I propose to set `lastRegex` only after the compilation of the regexp pattern completes successfully. This is a backport of https://github.com/apache/spark/pull/37171. ### Why are the changes needed? The changes fix the NPE demonstrated by the code below on Spark 3.3.0: ```sql spark-sql> SELECT regexp_extract('1a 2b 14m', '(?l)'); 22/07/12 19:07:21 ERROR SparkSQLDriver: Failed in [SELECT regexp_extract('1a 2b 14m', '(?l)')] java.lang.NullPointerException: null at org.apache.spark.sql.catalyst.expressions.RegExpExtractBase.getLastMatcher(regexpExpressions.scala:768) ~[spark-catalyst_2.12-3.3.0.jar:3.3.0] ``` This should improve user experience with Spark SQL. ### Does this PR introduce _any_ user-facing change? No. In regular cases, the behavior is the same but users will observe different exceptions (error messages) after the changes. ### How was this patch tested? By running new tests: ``` $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z regexp-functions.sql" $ build/sbt "test:testOnly *.RegexpExpressionsSuite" $ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite" ``` Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 5b96bd5cf8f44eee7a16cd027d37dec552ed5a6a) Signed-off-by: Max Gekk <max.gekk@gmail.com> Closes #37181 from MaxGekk/pattern-syntax-exception-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 14 July 2022, 14:45:39 UTC
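A minimal sketch, under assumed names (`RegexpCompileSketch`, and a plain `IllegalArgumentException` instead of Spark's error-class exception), of the two behaviors described above: converting `PatternSyntaxException` into a parameter-value error and caching `lastRegex` only after compilation succeeds.

```scala
import java.util.regex.{Pattern, PatternSyntaxException}

object RegexpCompileSketch {
  private var lastRegex: String = null
  private var pattern: Pattern = null

  def compile(funcName: String, regexp: String): Pattern = synchronized {
    if (regexp != lastRegex) {
      val compiled =
        try Pattern.compile(regexp)
        catch {
          case e: PatternSyntaxException =>
            throw new IllegalArgumentException(
              s"The value of parameter(s) 'regexp' in `$funcName` is invalid: $regexp", e)
        }
      // Update the cache only after successful compilation, so a bad pattern
      // never leaves a null compiled pattern behind for the next call.
      pattern = compiled
      lastRegex = regexp
    }
    pattern
  }
}
```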
2fe1601 [SPARK-39672][SQL][3.1] Fix removing project before filter with correlated subquery Add more checks to `removeProjectBeforeFilter` in `ColumnPruning` and only remove the project if 1. the filter condition contains a correlated subquery, and 2. the same attribute exists in both the output of the Project's child and the subquery. The following is a legitimate self-join query and should not throw an exception when de-duplicating attributes in the subquery and the outer values. ```sql select * from ( select v1.a, v1.b, v2.c from v1 inner join v2 on v1.a=v2.a) t3 where not exists ( select 1 from v2 where t3.a=v2.a and t3.b=v2.b and t3.c=v2.c ) ``` Here's what happens with the current code. The above query is analyzed into the following `LogicalPlan` before `ColumnPruning`. ``` Project [a#250, b#251, c#268] +- Filter NOT exists#272 [(a#250 = a#266) && (b#251 = b#267) && (c#268 = c#268#277)] : +- Project [1 AS 1#273, _1#259 AS a#266, _2#260 AS b#267, _3#261 AS c#268#277] : +- LocalRelation [_1#259, _2#260, _3#261] +- Project [a#250, b#251, c#268] +- Join Inner, (a#250 = a#266) :- Project [a#250, b#251] : +- Project [_1#243 AS a#250, _2#244 AS b#251] : +- LocalRelation [_1#243, _2#244, _3#245] +- Project [a#266, c#268] +- Project [_1#259 AS a#266, _3#261 AS c#268] +- LocalRelation [_1#259, _2#260, _3#261] ``` Then in `ColumnPruning`, the Project before Filter (between Filter and Join) is removed. This changes the `outputSet` of the child of Filter, in which the same attribute also exists in the subquery. Later, when `RewritePredicateSubquery` de-duplicates conflicting attributes, it would complain `Found conflicting attributes a#266 in the condition joining outer plan`. No. Add UT. Closes #37074 from manuzhang/spark-39672. Lead-authored-by: tianlzhang <tianlzhang@ebay.com> Co-authored-by: Manu Zhang <OwenZhang1990@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 36fc73e7c42b84e05b15b2caecc0f804610dce20) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 July 2022, 05:03:45 UTC
acf8f66 [SPARK-39647][CORE] Register the executor with ESS before registering the BlockManager ### What changes were proposed in this pull request? Currently the executors register with the ESS after the `BlockManager` registration with the `BlockManagerMaster`. This order creates a problem with the push-based shuffle. A registered BlockManager node is picked up by the driver as a merger but the shuffle service on that node is not yet ready to merge the data which causes block pushes to fail until the local executor registers with it. This fix is to reverse the order, that is, register with the ESS before registering the `BlockManager` ### Why are the changes needed? They are needed to fix the issue which causes block pushes to fail. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a UT. Closes #37052 from otterc/SPARK-39647. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 79ba2890f51c5f676b9cd6e3a6682c7969462999) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 12 July 2022, 05:21:20 UTC
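A schematic sketch of the ordering change; the function names below are illustrative stand-ins, not the real Executor/BlockManager internals. The point is only that the shuffle-service registration now happens before the node is announced to the BlockManagerMaster, so the driver can never pick a node as a merger before its shuffle service can accept pushed blocks.

```scala
object ExecutorStartupSketch {
  // Illustrative stand-ins for the real registration steps.
  def registerWithExternalShuffleService(): Unit =
    println("1. ESS on this node is ready to accept pushed shuffle blocks")
  def registerWithBlockManagerMaster(): Unit =
    println("2. node is now visible to the driver as a shuffle-merger candidate")

  def initialize(): Unit = {
    // After SPARK-39647, the ESS registration comes first.
    registerWithExternalShuffleService()
    registerWithBlockManagerMaster()
  }
}
```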
016dfeb [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` when they are used with `DISTINCT` https://github.com/apache/spark/pull/35145 compiles COVAR_POP, COVAR_SAMP and CORR in H2Dialect. Because H2 doesn't support COVAR_POP, COVAR_SAMP and CORR with DISTINCT, https://github.com/apache/spark/pull/35145 introduced a bug that compiles COVAR_POP, COVAR_SAMP and CORR even when these aggregate functions are used with DISTINCT. This fixes the bug so that COVAR_POP, COVAR_SAMP and CORR are not compiled when these aggregate functions are used with DISTINCT. Yes, the bug will be fixed. New test cases. Closes #37090 from beliefer/SPARK-37527_followup2. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 14f2bae208c093dea58e3f947fb660e8345fb256) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 July 2022, 01:59:04 UTC
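A hedged sketch of the guard described above; the method shape is an approximation of a JDBC dialect's aggregate push-down hook, not the exact `H2Dialect` code.

```scala
object H2AggregateCompileSketch {
  // Returns Some(sql) when the aggregate can be pushed down to H2, or None to
  // fall back to Spark-side evaluation.
  def compileAggregate(funcName: String, isDistinct: Boolean, inputs: Seq[String]): Option[String] =
    funcName match {
      // H2 cannot evaluate these with DISTINCT, so refuse to push them down.
      case "COVAR_POP" | "COVAR_SAMP" | "CORR" if isDistinct => None
      case "COVAR_POP" | "COVAR_SAMP" | "CORR" =>
        Some(s"$funcName(${inputs.mkString(", ")})")
      case _ => None
    }
}
```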
f9e3668 [SPARK-39656][SQL][3.3] Fix wrong namespace in DescribeNamespaceExec Backport of https://github.com/apache/spark/pull/37049 for branch-3.3 ### What changes were proposed in this pull request? DescribeNamespaceExec changes ns.last to ns.quoted ### Why are the changes needed? DescribeNamespaceExec should show the whole namespace rather than just its last part ### Does this PR introduce _any_ user-facing change? Yes, a small bug fix ### How was this patch tested? Fixed a test Closes #37071 from ulysses-you/desc-namespace-3.3. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> 05 July 2022, 19:43:27 UTC
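To illustrate the difference (with made-up namespace parts): `ns.last` drops the leading parts of a multi-level namespace, while the quoted form keeps the whole path.

```scala
val ns = Seq("my_catalog_db", "sub_db")      // hypothetical multi-part namespace
println(ns.last)                             // sub_db (old behavior, misleading)
println(ns.map(p => s"`$p`").mkString("."))  // `my_catalog_db`.`sub_db` (whole namespace, quoted)
```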
2edd344 [SPARK-39611][PYTHON][PS] Fix wrong aliases in __array_ufunc__ ### What changes were proposed in this pull request? This PR fixes the wrong aliases in `__array_ufunc__` ### Why are the changes needed? When running tests with numpy 1.23.0 (the current latest), we hit a bug: `NotImplementedError: pandas-on-Spark objects currently do not support <ufunc 'divide'>.` In `__array_ufunc__` we first call `maybe_dispatch_ufunc_to_dunder_op` to try the dunder methods, and then we try the PySpark API. `maybe_dispatch_ufunc_to_dunder_op` is from pandas code. pandas fixed a bug in https://github.com/pandas-dev/pandas/pull/44822#issuecomment-991166419 https://github.com/pandas-dev/pandas/pull/44822/commits/206b2496bc6f6aa025cb26cb42f52abeec227741 When upgrading to numpy 1.23.0, we need to sync this as well. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Current CI passed - The existing UT `test_series_datetime` already covers this; I also tested it in my local env with 1.23.0 ```shell pip install "numpy==1.23.0" python/run-tests --testnames 'pyspark.pandas.tests.test_series_datetime SeriesDateTimeTest.test_arithmetic_op_exceptions' ``` Closes #37078 from Yikun/SPARK-39611. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit fb48a14a67940b9270390b8ce74c19ae58e2880e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 July 2022, 11:52:44 UTC
364a4f5 [SPARK-39612][SQL][TESTS] DataFrame.exceptAll followed by count should work ### What changes were proposed in this pull request? This PR adds a test case for the behavior broken by https://github.com/apache/spark/commit/4b9343593eca780ca30ffda45244a71413577884, which was reverted in https://github.com/apache/spark/commit/161c596cafea9c235b5c918d8999c085401d73a9. ### Why are the changes needed? To prevent a regression in the future. This was a regression in Apache Spark 3.3 that used to work in Apache Spark 3.2. ### Does this PR introduce _any_ user-facing change? Yes, it makes `DataFrame.exceptAll` followed by `count` work. ### How was this patch tested? The unit test was added. Closes #37084 from HyukjinKwon/SPARK-39612. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 947e271402f749f6f58b79fecd59279eaf86db57) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 July 2022, 11:44:50 UTC
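The user-facing shape of the behavior being guarded, shown as a small self-contained example (the data and app name are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("exceptAll-count").master("local[*]").getOrCreate()
import spark.implicits._

val left = Seq(1, 1, 2, 3).toDF("v")
val right = Seq(1, 2).toDF("v")

// exceptAll keeps duplicates: {1, 1, 2, 3} minus {1, 2} = {1, 3}, so count is 2.
// Per the entry above, this count worked in 3.2 and regressed in early 3.3.
println(left.exceptAll(right).count())
```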
2069fd0 [SPARK-39677][SQL][DOCS] Fix args formatting of the regexp and like functions ### What changes were proposed in this pull request? In the PR, I propose to fix args formatting of some regexp functions by adding explicit new lines. That fixes the following items in arg lists. Before: <img width="745" alt="Screenshot 2022-07-05 at 09 48 28" src="https://user-images.githubusercontent.com/1580697/177274234-04209d43-a542-4c71-b5ca-6f3239208015.png"> After: <img width="704" alt="Screenshot 2022-07-05 at 11 06 13" src="https://user-images.githubusercontent.com/1580697/177280718-cb05184c-8559-4461-b94d-dfaaafda7dd2.png"> ### Why are the changes needed? To improve readability of Spark SQL docs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By building docs and checking manually: ``` $ SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 bundle exec jekyll build ``` Closes #37082 from MaxGekk/fix-regexp-docs. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 4e42f8b12e8dc57a15998f22d508a19cf3c856aa) Signed-off-by: Max Gekk <max.gekk@gmail.com> 05 July 2022, 10:37:59 UTC
4512e09 Revert "[SPARK-38531][SQL] Fix the condition of "Prune unrequired child index" branch of ColumnPruning" This reverts commit 17c56fc03b8e7269b293d6957c542eab9d723d52. 05 July 2022, 09:02:37 UTC
3f969ad [SPARK-39676][CORE][TESTS] Add task partition id for TaskInfo assertEquals method in JsonProtocolSuite ### What changes were proposed in this pull request? In https://github.com/apache/spark/pull/35185, task partition id was added to TaskInfo. However, JsonProtocolSuite#assertEquals for TaskInfo doesn't check partitionId. ### Why are the changes needed? We should assert whether partitionId is equal as well. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No need to add a unit test. Closes #37081 from dcoliversun/SPARK-39676. Authored-by: Qian.Sun <qian.sun2020@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 July 2022, 06:41:30 UTC
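A hedged sketch (with an abbreviated stand-in type, not the verbatim suite code) of the kind of equality check being extended: `partitionId` is now compared alongside the other `TaskInfo` fields.

```scala
// Stand-in with only the fields relevant to the example.
case class TaskInfoLite(taskId: Long, index: Int, attemptNumber: Int, partitionId: Int)

def assertEquals(info1: TaskInfoLite, info2: TaskInfoLite): Unit = {
  assert(info1.taskId == info2.taskId)
  assert(info1.index == info2.index)
  assert(info1.attemptNumber == info2.attemptNumber)
  assert(info1.partitionId == info2.partitionId)  // the newly-compared field
}
```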
3e28f33 [SPARK-39447][SQL] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast ### What changes were proposed in this pull request? Change `currentPhysicalPlan` to `inputPlan` when we restore the broadcast exchange for DPP. ### Why are the changes needed? The `currentPhysicalPlan` can be wrapped with a broadcast query stage, so it is not safe to match it. For example: the broadcast exchange which is added by DPP may run earlier than the normal broadcast exchange (e.g. one introduced by a join). ### Does this PR introduce _any_ user-facing change? Yes, bug fix ### How was this patch tested? Added a test Closes #36974 from ulysses-you/inputplan. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c320a5d51b2c8427fc5d6648984bfd266891b451) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 July 2022, 03:31:35 UTC
463a24d [SPARK-39650][SS] Fix incorrect value schema in streaming deduplication with backward compatibility ### What changes were proposed in this pull request? This PR proposes to fix the incorrect value schema in streaming deduplication. It stores the empty row having a single column with null (using NullType), but the value schema is specified as all columns, which leads to incorrect behavior from the state store schema compatibility checker. This PR proposes to set the schema of the value as `StructType(Array(StructField("__dummy__", NullType)))` to fit with the empty row. With this change, streaming queries creating their checkpoint after this fix will work smoothly. To not break the existing streaming queries having an incorrect value schema, this PR proposes to disable the check for the value schema on streaming deduplication. The ability to disable the value check was already there for format validation (we have two different checkers for the state store), but it has been missing for the state store schema compatibility check. To avoid adding more configs, this PR leverages the existing config that "format validation" is using. ### Why are the changes needed? This is a bug fix. Suppose the streaming query below: ``` // df has the columns `a`, `b`, `c` val df = spark.readStream.format("...").load() val query = df.dropDuplicates("a").writeStream.format("...").start() ``` While the query is running, df can produce a different set of columns (e.g. `a`, `b`, `c`, `d`) from the same source due to schema evolution. Since we only deduplicate the rows with column `a`, the change of schema should not matter for streaming deduplication, but the state store schema checker throws an error saying "value schema is not compatible" before this fix. ### Does this PR introduce _any_ user-facing change? No, this is basically a bug fix which end users wouldn't notice unless they encountered a bug. ### How was this patch tested? New tests. Closes #37041 from HeartSaVioR/SPARK-39650. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit fe536033bdd00d921b3c86af329246ca55a4f46a) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 02 July 2022, 13:46:20 UTC
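The corrected value schema quoted above, written out as a runnable snippet; only the schema itself comes from the entry, the variable name is illustrative.

```scala
import org.apache.spark.sql.types.{NullType, StructField, StructType}

// Streaming deduplication stores an "empty" value per key, so the value
// schema is a single dummy null-typed column rather than all input columns.
val dedupValueSchema = StructType(Array(StructField("__dummy__", NullType)))
println(dedupValueSchema.treeString)
```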
27f78e6 [SPARK-39657][YARN] YARN AM client should call the non-static setTokensConf method ### What changes were proposed in this pull request? This fixes a bug in the original SPARK-37205 PR, where we treat the method `setTokensConf` as a static method, but it should be non-static instead. ### Why are the changes needed? The method `setTokensConf` is non-static so the current code will fail: ``` 06/29/2022 - 17:28:16 - Exception in thread "main" java.lang.IllegalArgumentException: object is not an instance of declaring class 06/29/2022 - 17:28:16 - at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 06/29/2022 - 17:28:16 - at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 06/29/2022 - 17:28:16 - at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 06/29/2022 - 17:28:16 - at java.base/java.lang.reflect.Method.invoke(Method.java:566) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested this change internally and it now works. Closes #37050 from sunchao/SPARK-39657. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6624d91c9644526f1cb6fdfb4677604b40aa786f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 July 2022, 20:47:07 UTC
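A generic illustration of the underlying Java-reflection pitfall; the class below is a hypothetical stand-in, not the YARN API. A non-static method must be invoked on an instance of the declaring class; passing anything else as the receiver is what produces the `object is not an instance of declaring class` error shown in the log above.

```scala
import java.lang.reflect.Method

class AmContextSketch {  // hypothetical stand-in for the YARN context class
  def setTokensConf(conf: Array[Byte]): Unit = println(s"received ${conf.length} bytes")
}

val m: Method = classOf[AmContextSketch].getMethod("setTokensConf", classOf[Array[Byte]])
val payload = Array[Byte](1, 2, 3)

// Wrong receiver (e.g. the Class object, as static-style code might pass):
// m.invoke(classOf[AmContextSketch], payload)  // IllegalArgumentException
// Correct: invoke on an actual instance of the declaring class.
m.invoke(new AmContextSketch, payload)
```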
2707a5a Revert "[SPARK-37205][FOLLOWUP] Should call non-static setTokensConf method" This reverts commit 7e2a1827757a8c0e356ab795387f094e81f5f37e. 01 July 2022, 16:45:28 UTC