https://github.com/apache/spark

b53c341 Preparing Spark release v3.2.3-rc1 14 November 2022, 16:01:29 UTC
ddccb6f [SPARK-41091][BUILD][3.2] Fix Docker release tool for branch-3.2 ### What changes were proposed in this pull request? This tries to fix `do-release-docker.sh` for branch-3.2. ### Why are the changes needed? Currently the following error will occur if running the script in `branch-3.2`: ``` #5 917.4 g++ -std=gnu++14 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o testthat.so init.o reassign.o test-catch.o test-example.o test-runner.o -L/usr/lib/R/lib -lR #5 917.5 installing to /usr/local/lib/R/site-library/00LOCK-testthat/00new/testthat/libs #5 917.5 ** R #5 917.5 ** inst #5 917.5 ** byte-compile and prepare package for lazy loading #5 924.4 ** help #5 924.6 *** installing help indices #5 924.7 *** copying figures #5 924.7 ** building package indices #5 924.9 ** installing vignettes #5 924.9 ** testing if installed package can be loaded from temporary location #5 925.1 ** checking absolute paths in shared objects and dynamic libraries #5 925.1 ** testing if installed package can be loaded from final location #5 925.5 ** testing if installed package keeps a record of temporary installation path #5 925.5 * DONE (testthat) #5 925.8 ERROR: dependency 'pkgdown' is not available for package 'devtools' #5 925.8 * removing '/usr/local/lib/R/site-library/devtools' #5 925.8 #5 925.8 The downloaded source packages are in #5 925.8 '/tmp/Rtmp3nJI60/downloaded_packages' #5 925.8 Warning messages: #5 925.8 1: In install.packages(c("curl", "xml2", "httr", "devtools", "testthat", : #5 925.8 installation of package 'textshaping' had non-zero exit status #5 925.8 2: In install.packages(c("curl", "xml2", "httr", "devtools", "testthat", : #5 925.8 installation of package 'ragg' had non-zero exit status #5 925.8 3: In install.packages(c("curl", "xml2", "httr", "devtools", "testthat", : #5 925.8 installation of package 'pkgdown' had non-zero exit status #5 925.8 4: In install.packages(c("curl", "xml2", "httr", "devtools", "testthat", : #5 925.8 installation of package 'devtools' had non-zero exit status #5 926.0 Error in loadNamespace(x) : there is no package called 'devtools' #5 926.0 Calls: loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart #5 926.0 Execution halted ``` The same error doesn't happen on master. I checked the diff between the two and it seems the following line: ``` $APT_INSTALL libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev && \ ``` introduced in https://github.com/apache/spark/pull/34728 made the difference. I verified that after adding the line, `do-release-docker.sh` (dry run mode) was able to finish successfully. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually Closes #38643 from sunchao/fix-docker-release. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 November 2022, 06:27:12 UTC
f0f83b5 [SPARK-40588] FileFormatWriter materializes AQE plan before accessing outputOrdering The `FileFormatWriter` materializes an `AdaptiveQueryPlan` before accessing the plan's `outputOrdering`. This is required for Spark 3.0 to 3.3. Spark 3.4 does not need this because `FileFormatWriter` gets the final plan. `FileFormatWriter` enforces an ordering if the written plan does not provide that ordering. An `AdaptiveQueryPlan` does not know its final ordering (Spark 3.0 to 3.3), in which case `FileFormatWriter` enforces the ordering (e.g. by column `"a"`) even if the plan provides a compatible ordering (e.g. by columns `"a", "b"`). In the case of spilling, that order (e.g. by columns `"a", "b"`) gets broken (see SPARK-40588). This fixes SPARK-40588, which was introduced in 3.0. This restores behaviour from Spark 2.4. The final plan that is written to files cannot be extracted from `FileFormatWriter`. The bug explained in [SPARK-40588](https://issues.apache.org/jira/browse/SPARK-40588) can only be asserted on the result files when spilling occurs. This is very hard to control in a unit test scenario. Therefore, this was tested manually. The [example to reproduce this issue](https://issues.apache.org/jira/browse/SPARK-40588?focusedCommentId=17621032&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17621032) given in SPARK-40588 now produces sorted files. The actual plan written into the files changed from ``` Sort [input[0, bigint, false] ASC NULLS FIRST], false, 0 +- AdaptiveSparkPlan isFinalPlan=false +- Sort [day#2L ASC NULLS FIRST, id#4L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(day#2L, 2), REPARTITION_BY_NUM, [id=#30] +- BroadcastNestedLoopJoin BuildLeft, Inner :- BroadcastExchange IdentityBroadcastMode, [id=#28] : +- Project [id#0L AS day#2L] : +- Range (0, 2, step=1, splits=2) +- Range (0, 10000000, step=1, splits=2) ``` where `FileFormatWriter` enforces order with `Sort [input[0, bigint, false] ASC NULLS FIRST], false, 0`, to ``` *(3) Sort [day#2L ASC NULLS FIRST, id#4L ASC NULLS FIRST], false, 0 +- AQEShuffleRead coalesced +- ShuffleQueryStage 1 +- Exchange hashpartitioning(day#2L, 200), REPARTITION_BY_COL, [id=#68] +- *(2) BroadcastNestedLoopJoin BuildLeft, Inner :- BroadcastQueryStage 0 : +- BroadcastExchange IdentityBroadcastMode, [id=#42] : +- *(1) Project [id#0L AS day#2L] : +- *(1) Range (0, 2, step=1, splits=2) +- *(2) Range (0, 1000000, step=1, splits=2) ``` where the sort given by the user is now the outermost sort. Closes #38358 from EnricoMi/branch-3.3-materialize-aqe-plan. Authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f0cad7ad6c2618d2d0d8c8598bbd54c2ca366b6b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 November 2022, 08:01:27 UTC
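A minimal Scala sketch of the materialization idea above (the helper name is assumed, and this is not the actual `FileFormatWriter` patch): unwrap the adaptive plan to a concrete physical plan before inspecting `outputOrdering`, so the writer compares against the real ordering instead of the unknown ordering of an unexecuted adaptive plan.

```scala
// Hedged sketch: force an adaptive plan to a concrete physical plan before
// reading its outputOrdering. The real patch triggers final-plan computation;
// executedPlan here only illustrates unwrapping to the currently known plan.
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec

def materialize(plan: SparkPlan): SparkPlan = plan match {
  case adaptive: AdaptiveSparkPlanExec => adaptive.executedPlan
  case other => other
}

// The writer can then inspect materialize(plan).outputOrdering and skip the
// extra Sort when the user's ordering already satisfies the required one.
```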
f4ebe8f [SPARK-41035][SQL] Don't patch foldable children of aggregate functions in `RewriteDistinctAggregates` `RewriteDistinctAggregates` doesn't typically patch foldable children of the distinct aggregation functions except in one odd case (and seemingly by accident). This PR extends the policy of not patching foldables to that odd case. This query produces incorrect results: ``` select a, count(distinct 100) as cnt1, count(distinct b, 100) as cnt2 from values (1, 2), (4, 5) as data(a, b) group by a; +---+----+----+ |a |cnt1|cnt2| +---+----+----+ |1 |1 |0 | |4 |1 |0 | +---+----+----+ ``` The values for `cnt2` should be 1 and 1 (not 0 and 0). If you change the literal used in the first aggregate function, the second aggregate function now works correctly: ``` select a, count(distinct 101) as cnt1, count(distinct b, 100) as cnt2 from values (1, 2), (4, 5) as data(a, b) group by a; +---+----+----+ |a |cnt1|cnt2| +---+----+----+ |1 |1 |1 | |4 |1 |1 | +---+----+----+ ``` The bug is in the rule `RewriteDistinctAggregates`. When a distinct aggregation has only foldable children, `RewriteDistinctAggregates` uses the first child as the grouping key (_grouping key_ in this context means the function children of distinct aggregate functions: `RewriteDistinctAggregates` groups distinct aggregations by function children to determine the `Expand` projections it needs to create). Therefore, the first foldable child gets included in the `Expand` projection associated with the aggregation, with a corresponding output attribute that is also included in the map for patching aggregate functions in the final aggregation. The `Expand` projections for all other distinct aggregate groups will have `null` in the slot associated with that output attribute. If the same foldable expression is used in a distinct aggregation associated with a different group, `RewriteDistinctAggregates` will improperly patch the associated aggregate function to use the previous aggregation's output attribute. Since the output attribute is associated with a different group, the value of that slot in the `Expand` projection will always be `null`. In the example above, `count(distinct 100) as cnt1` is the aggregation with only foldable children, and `count(distinct b, 100) as cnt2` is the aggregation that gets inappropriately patched with the wrong group's output attribute. As a result `count(distinct b, 100) as cnt2` (from the first example above) essentially becomes `count(distinct b, null) as cnt2`, which is always zero. `RewriteDistinctAggregates` doesn't typically patch foldable children of the distinct aggregation functions in the final aggregation. It potentially patches foldable expressions only when there is a distinct aggregation with only foldable children, and even then it doesn't patch the aggregation that has only foldable children, but instead some other unlucky aggregate function that happened to use the same foldable expression. This PR skips patching any foldable expressions in the aggregate functions to avoid patching an aggregation with a different group's output attribute. No. New unit test. Closes #38565 from bersprockets/distinct_literal_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 0add57a1c0290a158666027afb3e035728d2dcee) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 November 2022, 01:42:38 UTC
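A hedged Scala sketch of the fix's core rule (names like `patchMap` are assumed for illustration, not taken from the actual patch): when rewriting an aggregate function to reference the `Expand` output, leave foldable children (literals) alone so they are never swapped for another group's always-null attribute.

```scala
// Sketch: patch an aggregate function's children against the Expand output
// map, but never touch foldable expressions (literals).
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}

def patchAggregateChildren(
    aggExpr: Expression,
    patchMap: Map[Expression, Attribute]): Expression =
  aggExpr.transformUp {
    // Skipping foldables avoids rewriting count(distinct b, 100) into
    // count(distinct b, <other-group's-null-slot>).
    case e if !e.foldable && patchMap.contains(e) => patchMap(e)
  }
```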
b834c3f [SPARK-32380][SQL] Fixing access of HBase table via Hive from Spark This is an update of https://github.com/apache/spark/pull/29178, which was closed because the root cause of the error was only vaguely defined there; here I will give an explanation of why `HiveHBaseTableInputFormat` does not work well with `NewHadoopRDD` (see the next section). The PR modifies `TableReader.scala` to create an `OldHadoopRDD` when the input format is 'org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat'. - environments (Cloudera distribution 7.1.7.SP1): hadoop 3.1.1 hive 3.1.300 spark 3.2.1 hbase 2.2.3 With the `NewHadoopRDD` the following exception is raised: ``` java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details. at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:253) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:446) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:429) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) at org.apache.spark.sql.Dataset.show(Dataset.scala:806) at org.apache.spark.sql.Dataset.show(Dataset.scala:765) at org.apache.spark.sql.Dataset.show(Dataset.scala:774) ... 
47 elided Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:557) at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:248) ... 86 more ``` There are two interfaces: - the new `org.apache.hadoop.mapreduce.InputFormat`: providing a one-arg method `getSplits(JobContext context)` (returning `List<InputSplit>`) - the old `org.apache.hadoop.mapred.InputFormat`: providing a two-arg method `getSplits(JobConf job, int numSplits)` (returning `InputSplit[]`) In Hive both are implemented by `HiveHBaseTableInputFormat`, but only the old method leads to the required initialisation, and this is why `NewHadoopRDD` fails here. All the links here refer to the latest commits of the master branches of the mentioned components at the time of writing this description (to keep the line numbers right in the future too, as `master` itself is a moving target). Spark in `NewHadoopRDD` uses the new interface providing the one-arg method: https://github.com/apache/spark/blob/5556cfc59aa97a3ad4ea0baacebe19859ec0bcb7/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L136 Hive on the other hand binds the initialisation to the two-arg method coming from the old interface. See [Hive#getSplits](https://github.com/apache/hive/blob/fd029c5b246340058aee513980b8bf660aee0227/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java#L268): ``` @Override public InputSplit[] getSplits(final JobConf jobConf, final int numSplits) throws IOException { ``` This calls `getSplitsInternal`, which contains the [initialisation](https://github.com/apache/hive/blob/fd029c5b246340058aee513980b8bf660aee0227/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java#L299) too: ``` initializeTable(conn, tableName); ``` Interestingly, Hive also uses the one-arg method internally within `getSplitsInternal` [here](https://github.com/apache/hive/blob/fd029c5b246340058aee513980b8bf660aee0227/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java#L356), but by then the initialisation has already been done. By calling the new interface method (which is what `NewHadoopRDD` does), the call goes straight to the HBase method [org.apache.hadoop.hbase.mapreduce.TableInputFormatBase#getSplits](https://github.com/apache/hbase/blob/63cdd026f08cdde6ac0fde1342ffd050e8e02441/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#L230), where there would be some `JobContext`-based [initialisation](https://github.com/apache/hbase/blob/63cdd026f08cdde6ac0fde1342ffd050e8e02441/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#L234-L237), which by default is an [empty method](https://github.com/apache/hbase/blob/63cdd026f08cdde6ac0fde1342ffd050e8e02441/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#L628-L640): ```java /** * Handle subclass specific set up. Each of the entry points used by the MapReduce framework, * {@link #createRecordReader(InputSplit, TaskAttemptContext)} and {@link #getSplits(JobContext)}, * will call {@link #initialize(JobContext)} as a convenient centralized location to handle * retrieving the necessary configuration information and calling * {@link #initializeTable(Connection, TableName)}. 
Subclasses should implement their initialize * call such that it is safe to call multiple times. The current TableInputFormatBase * implementation relies on a non-null table reference to decide if an initialize call is needed, * but this behavior may change in the future. In particular, it is critical that initializeTable * not be called multiple times since this will leak Connection instances. */ protected void initialize(JobContext context) throws IOException { } ``` This is not overridden by Hive, and it is hard to reason about why we would need that (it's an internal Hive class), so it is easier to fix this in Spark. No. 1) Create the HBase table: ``` hbase(main):001:0> create 'hbase_test', 'cf1' hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123' ``` 2) Create the Hive table related to the HBase table: hive> ``` CREATE EXTERNAL TABLE `hivetest.hbase_test`( `key` string COMMENT '', `value` string COMMENT '') ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 'hbase.columns.mapping'=':key,cf1:v1', 'serialization.format'='1') TBLPROPERTIES ( 'hbase.table.name'='hbase_test') ``` 3) spark-shell: query the Hive table while the data is in HBase: ``` scala> spark.sql("select * from hivetest.hbase_test").show() 22/11/05 01:14:16 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist 22/11/05 01:14:16 WARN client.HiveClientImpl: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic Hive Session ID = f05b6866-86df-4d88-9eea-f1c45043bb5f +---+-----+ |key|value| +---+-----+ | r1| 123| +---+-----+ ``` Closes #38516 from attilapiros/SPARK-32380. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7009ef0510dae444c72e7513357e681b08379603) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 November 2022, 12:30:35 UTC
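An illustrative Scala sketch of the dispatch this commit adds (the method and variable names here are assumed, not the actual `TableReader.scala` code): route `HiveHBaseTableInputFormat` through the old mapred-based `HadoopRDD`, since only the old two-arg `getSplits` performs the required `initializeTable` call.

```scala
// The fully qualified name of the problematic Hive input format.
val hiveHBaseInputFormatName = "org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat"

// Decide whether to fall back to the old mapred API (HadoopRDD) instead of
// NewHadoopRDD for a given input format class.
def shouldUseOldHadoopRDD(inputFormatClass: Class[_]): Boolean =
  inputFormatClass.getName == hiveHBaseInputFormatName
```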
5d62f47 [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10 ### What changes were proposed in this pull request? Upgrade Apache commons-text from 1.6 to 1.10.0 ### Why are the changes needed? [CVE-2022-42889](https://nvd.nist.gov/vuln/detail/CVE-2022-42889), which is rated [9.8 CRITICAL](https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator?name=CVE-2022-42889&vector=AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H&version=3.1&source=NIST). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA Closes #38352 from bjornjorgensen/patch-2. Lead-authored-by: Bjørn Jørgensen <bjornjorgensen@gmail.com> Co-authored-by: Bjørn <bjornjorgensen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> 03 November 2022, 22:05:37 UTC
1aef8b7 [SPARK-40869][K8S] Resource name prefix should not start with a hyphen ### What changes were proposed in this pull request? Strip leading `-` from the resource name prefix. ### Why are the changes needed? A leading `-` is not allowed in a resource name prefix (especially `spark.kubernetes.executor.podNamePrefix`). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #38331 from tobiasstadler/fix-SPARK-40869. Lead-authored-by: Tobias Stadler <ts.stadler@gmx.de> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7f3b5987de1f79434a861408e6c8bf55c5598031) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 03 November 2022, 17:34:25 UTC
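A minimal sketch of the sanitization described above (the helper name is assumed): Kubernetes resource names cannot start with `-`, so any leading hyphens are stripped from the generated prefix.

```scala
// Drop leading hyphens so the generated prefix is a valid Kubernetes name start.
def stripLeadingHyphens(prefix: String): String = prefix.dropWhile(_ == '-')

// Example: stripLeadingHyphens("-spark-app") returns "spark-app".
```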
b9d22ac [MINOR][BUILD] Correct the `files` contend in `checkstyle-suppressions.xml` ### What changes were proposed in this pull request? The pr aims to change the suppress files from `sql/core/src/main/java/org/apache/spark/sql/api.java/*` to `sql/core/src/main/java/org/apache/spark/sql/api/java/*`, the former seems to be a wrong code path. ### Why are the changes needed? Correct the `files` contend in `checkstyle-suppressions.xml` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes #38469 from LuciferYang/fix-java-supperessions. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 5457193dc095bc6c97259e31fa3df44184822f65) Signed-off-by: Sean Owen <srowen@gmail.com> 01 November 2022, 23:10:59 UTC
c12b4e2 [SPARK-40983][DOC] Remove Hadoop requirements for zstd mentioned in Parquet compression codec ### What changes were proposed in this pull request? Change the doc to remove Hadoop requirements for zstd mentioned in Parquet compression codec. ### Why are the changes needed? This requirement was removed by https://issues.apache.org/jira/browse/PARQUET-1866, and Spark uses Parquet 1.12.3 now. ### Does this PR introduce _any_ user-facing change? Yes, doc updated. ### How was this patch tested? <img width="1144" alt="image" src="https://user-images.githubusercontent.com/26535726/199180625-4e3a2ee1-3e4d-4d61-8842-f1d5b7b9321d.png"> Closes #38458 from pan3793/SPARK-40983. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 9c1bb41ca34229c87b463b4941b4e9c829a0e396) Signed-off-by: Yuming Wang <yumwang@ebay.com> 01 November 2022, 12:33:19 UTC
815f32f [SPARK-40963][SQL] Set nullable correctly in project created by `ExtractGenerator` ### What changes were proposed in this pull request? When creating the project list for the new projection in `ExtractGenerator`, take into account whether the generator is outer when setting nullable on generator-related output attributes. ### Why are the changes needed? This PR fixes an issue that can produce either incorrect results or a `NullPointerException`. It's a bit of an obscure issue in that I am hard-pressed to reproduce it without using a subquery that has an inline table. Example: ``` select c1, explode(c4) as c5 from ( select c1, array(c3) as c4 from ( select c1, explode_outer(c2) as c3 from values (1, array(1, 2)), (2, array(2, 3)), (3, null) as data(c1, c2) ) ); +---+---+ |c1 |c5 | +---+---+ |1 |1 | |1 |2 | |2 |2 | |2 |3 | |3 |0 | +---+---+ ``` In the last row, `c5` is 0, but should be `NULL`. Another example: ``` select c1, exists(c4, x -> x is null) as c5 from ( select c1, array(c3) as c4 from ( select c1, explode_outer(c2) as c3 from values (1, array(1, 2)), (2, array(2, 3)), (3, array()) as data(c1, c2) ) ); +---+-----+ |c1 |c5 | +---+-----+ |1 |false| |1 |false| |2 |false| |2 |false| |3 |false| +---+-----+ ``` In the last row, `false` should be `true`. In both cases, at the time `CreateArray(c3)` is instantiated, `c3`'s nullability is incorrect because the new projection created by `ExtractGenerator` uses `generatorOutput` from `explode_outer(c2)` as a projection list. `generatorOutput` doesn't take into account that `explode_outer(c2)` is an _outer_ explode, so the nullability setting is lost. `UpdateAttributeNullability` will eventually fix the nullable setting for attributes referring to `c3`, but it doesn't fix the `containsNull` setting for `c4` in `explode(c4)` (from the first example) or `exists(c4, x -> x is null)` (from the second example). This example fails with a `NullPointerException`: ``` select c1, inline_outer(c4) from ( select c1, array(c3) as c4 from ( select c1, explode_outer(c2) as c3 from values (1, array(named_struct('a', 1, 'b', 2))), (2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))), (3, array()) as data(c1, c2) ) ); 22/10/30 17:34:42 ERROR Executor: Exception in task 1.0 in stage 8.0 (TID 14) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #38440 from bersprockets/SPARK-40963. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 90d31541fb0313d762cc36067060e6445c04a9b6) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 31 October 2022, 01:45:41 UTC
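A hedged Scala sketch of the core idea (the helper name is assumed): when `ExtractGenerator` builds the new projection, the generator's output attributes must be marked nullable if the generate is an *outer* generate, otherwise downstream expressions such as `CreateArray(c3)` see the wrong nullability.

```scala
// Sketch: propagate outer-ness into the nullability of generator output.
import org.apache.spark.sql.catalyst.expressions.Attribute

def generatorOutputFor(output: Seq[Attribute], outer: Boolean): Seq[Attribute] =
  if (outer) output.map(_.withNullability(true)) // outer generate may emit NULL rows
  else output
```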
2df8bfb [SPARK-38697][SQL][3.2] Extend SparkSessionExtensions to inject rules into AQE Optimizer ### What changes were proposed in this pull request? Backport SPARK-38697 to Spark 3.2.x ### Why are the changes needed? Allows users to inject logical plan optimizer rules into AQE ### Does this PR introduce _any_ user-facing change? Yes, new API method to inject logical plan optimizer rules into AQE ### How was this patch tested? Backport includes a unit test Closes #38401 from andygrove/backport-SPARK-38697-spark32. Authored-by: Andy Grove <andygrove73@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 October 2022, 20:34:33 UTC
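A usage sketch of the backported extension point, assuming the hook is named `injectRuntimeOptimizerRule` on `SparkSessionExtensions` as in the upstream SPARK-38697 change; the rule itself is a hypothetical no-op stand-in.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A no-op logical rule standing in for a real AQE-time optimization.
case class MyAqeRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder()
  .master("local[*]")
  // Register the rule so AQE's runtime optimizer applies it per query stage.
  .withExtensions(_.injectRuntimeOptimizerRule(MyAqeRule))
  .getOrCreate()
```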
e5744b9 [SPARK-40902][MESOS][TESTS] Fix issue with mesos tests failing due to quick submission of drivers ### What changes were proposed in this pull request? ##### Quick submission of drivers to the Mesos scheduler in tests results in dropped drivers Queued drivers in `MesosClusterScheduler` are ordered based on `MesosDriverDescription` - the ordering checks priority (if different), followed by a comparison of submission time. For two driver submissions with the same priority, made in quick succession (such that the submission time is the same due to the millisecond granularity of `Date`), this results in dropping the second `MesosDriverDescription` from `queuedDrivers` (since `driverOrdering` returns `0` when comparing the descriptions). This PR fixes the more immediate issue with tests. ### Why are the changes needed? Flaky tests, [see here](https://lists.apache.org/thread/jof098qxp0s6qqmt9qwv52f9665b1pjg) for an example. ### Does this PR introduce _any_ user-facing change? No. Fixing only tests for now - as Mesos support is deprecated, the scheduler itself is not being changed to address this. ### How was this patch tested? Fixes unit tests Closes #38378 from mridulm/fix_MesosClusterSchedulerSuite. Authored-by: Mridul <mridulatgmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 60b1056307b3ee9d880a936f3a97c5fb16a2b698) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 October 2022, 17:52:07 UTC
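A self-contained Scala demonstration of the collision described above (not the scheduler's actual code): an ordering over (priority, submission time) compares equal for two drivers submitted in the same millisecond, and a sorted-set collection then silently drops the second one.

```scala
import java.util.Date
import scala.collection.mutable

case class Driver(priority: Double, submitted: Date, id: String)

// Higher priority first, then earlier submission first.
val ordering = Ordering.fromLessThan[Driver] { (x, y) =>
  if (x.priority != y.priority) x.priority > y.priority
  else x.submitted.before(y.submitted)
}

val queued = mutable.TreeSet.empty(ordering)
val now = new Date()
queued += Driver(1.0, now, "driver-1")
queued += Driver(1.0, now, "driver-2") // compare == 0, treated as a duplicate
println(queued.size)                   // 1 -- the second submission was dropped
```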
8448038 [SPARK-40851][INFRA][SQL][TESTS][3.2] Make GA run successfully with the latest Java 8/11/17 ### What changes were proposed in this pull request? The main changes of this PR are as follows: - Replace `Antarctica/Vostok` with `Asia/Urumqi` in Spark code - Replace `Europe/Amsterdam` with `Europe/Brussels` in Spark code - Regenerate `gregorian-julian-rebase-micros.json` using the `generate 'gregorian-julian-rebase-micros.json'` test in `RebaseDateTimeSuite` with Java 8u352 - Regenerate `julian-gregorian-rebase-micros.json` using the `generate 'julian-gregorian-rebase-micros.json'` test in `RebaseDateTimeSuite` with Java 8u352 ### Why are the changes needed? Make GA run successfully with the latest Java 8/11/17: - Java 8u352 - Java 11.0.17 - Java 17.0.5 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual test: the following commands pass with Java 8u345, 8u352, 11.0.16, 11.0.17, 17.0.4 and 17.0.5 - `build/sbt "catalyst/test"` - `build/sbt "sql/testOnly *SQLQueryTestSuite -- -t \"timestamp-ltz.sql\""` - `build/sbt "sql/testOnly *SQLQueryTestSuite -- -t \"timestamp-ntz.sql\""` Closes #38365 from LuciferYang/SPARK-40851-32. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 October 2022, 07:32:23 UTC
b6b4945 [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled This PR fixes a bug in broadcast handling in `PythonRunner` when encryption is enabled. Due to this bug the following pyspark script: ``` bin/pyspark --conf spark.io.encryption.enabled=true ... bar = {"a": "aa", "b": "bb"} foo = spark.sparkContext.broadcast(bar) spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "") spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect() ``` fails with: ``` 22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1] org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 811, in main func, profiler, deserializer, serializer = read_command(pickleSer, infile) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 87, in read_command command = serializer._read_with_length(file) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length return self.loads(obj) File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 471, in loads return cloudpickle.loads(obj, encoding=encoding) EOFError: Ran out of input ``` The reason for this failure is that we have multiple Python UDFs referencing the same broadcast, and in the current code: https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L385-L420 the number of broadcasts (`cnt`) is correct (1) but the broadcast id is serialized 2 times from the JVM to Python, ruining the next item that Python expects from the JVM side. Please note that the example above works in Spark 3.3 without this fix. That is because https://github.com/apache/spark/pull/36121 in Spark 3.4 modified `ExpressionSet` and so `udfs` in `ExtractPythonUDFs`: https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala#L239-L242 changed from `Stream` to `Vector`. When `broadcastVars` (and so `idsAndFiles`) is a `Stream` the example accidentally works, as the broadcast id is written to `dataOut` once (`oldBids.add(id)` in `idsAndFiles.foreach` is called before the 2nd item is calculated in `broadcastVars.flatMap`). But that doesn't mean that https://github.com/apache/spark/pull/36121 introduced the regression, as `EncryptedPythonBroadcastServer` shouldn't serve the broadcast data 2 times (which it does now, but it goes unnoticed), as that could fail other cases where more than one broadcast is used in UDFs. To fix a bug. No. Added new UT. Closes #38334 from peter-toth/SPARK-40874-fix-broadcasts-in-python-udf. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8a96f69bb536729eaa59fae55160f8a6747efbe3) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 October 2022, 01:29:19 UTC
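A hedged, simplified Scala sketch of the protocol invariant this fix restores (the real `PythonRunner` framing differs; the function and parameter names are assumed): the broadcast count written to the Python worker must match the number of broadcast entries that follow, so ids are deduplicated before being serialized.

```scala
import java.io.DataOutputStream

// Write each broadcast exactly once so the count and the entries agree.
def writeBroadcasts(dataOut: DataOutputStream, idsAndFiles: Seq[(Long, String)]): Unit = {
  val unique = idsAndFiles.distinct   // a broadcast used by several UDFs appears once
  dataOut.writeInt(unique.size)       // the count the Python side will expect
  unique.foreach { case (id, path) =>
    dataOut.writeLong(id)
    dataOut.writeUTF(path)
  }
}
```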
db2974b [SPARK-40829][SQL] STORED AS serde in CREATE TABLE LIKE view does not work ### What changes were proposed in this pull request? After [SPARK-29839](https://issues.apache.org/jira/browse/SPARK-29839), we could create a table with a specified serde based on an existing view, but the serde of the created table is always parquet. However, if we use the USING syntax ([SPARK-29421](https://issues.apache.org/jira/browse/SPARK-29421)) to create a table with a specified serde based on a view, we get the correct serde. ### Why are the changes needed? We should apply the specified serde to the created table when using the `create table like view stored as` syntax. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit Test Closes #38295 from zhangbutao/SPARK-40829. Authored-by: zhangbutao <zhangbutao@cmss.chinamobile.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 4ad29829bf53fff26172845312b334008bc4cb68) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 October 2022, 21:54:59 UTC
3cc2ba9 [SPARK-8731] Beeline doesn't work with -e option when started in background ### What changes were proposed in this pull request? Append the jline option "-Djline.terminal=jline.UnsupportedTerminal" to enable the Beeline process to run in the background. ### Why are the changes needed? Currently, if we execute Spark Beeline in the background, the Beeline process stops immediately. <img width="1350" alt="image" src="https://user-images.githubusercontent.com/88070094/194742935-8235b1ba-386e-4470-b182-873ef185e19f.png"> ### Does this PR introduce _any_ user-facing change? Users will be able to execute Spark Beeline in the background. ### How was this patch tested? 1. Start Spark ThriftServer 2. Execute command `./bin/beeline -u "jdbc:hive2://localhost:10000" -e "select 1;" &` 3. Verify Beeline process output in console: <img width="1407" alt="image" src="https://user-images.githubusercontent.com/88070094/194743153-ff3f1d19-ac23-443b-97a6-f024719008cd.png"> ### Note Beeline works fine on Windows when backgrounded: ![image](https://user-images.githubusercontent.com/88070094/194743797-7dc4fc21-dec6-4056-8b13-21fc96f1476e.png) Closes #38172 from zhouyifan279/SPARK-8731. Authored-by: zhouyifan279 <zhouyifan279@gmail.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit cb0d6ed46acee7271597764e018558b86aa8c29b) Signed-off-by: Kent Yao <yao@apache.org> 12 October 2022, 03:35:36 UTC
ad2fb0e [SPARK-40660][SQL][3.3] Switch to XORShiftRandom to distribute elements ### What changes were proposed in this pull request? Cherry-picked from #38106 and reverted changes in RDD.scala: https://github.com/apache/spark/blob/d2952b671a3579759ad9ce326ed8389f5270fd9f/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L507 ### Why are the changes needed? The number of output files has changed since SPARK-40407. [Some downstream projects](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java#L578-L579) use repartition to determine the number of output files in the test. ``` bin/spark-shell --master "local[2]" spark.range(10).repartition(10).write.mode("overwrite").parquet("/tmp/spark/repartition") ``` Before this PR and after SPARK-40407, the number of output files is 8. After this PR or before SPARK-40407, the number of output files is 10. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #38110 from wangyum/branch-3.3-SPARK-40660. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 5fe895a65a4a9d65f81d43af473b5e3a855ed8c8) Signed-off-by: Yuming Wang <yumwang@ebay.com> 06 October 2022, 05:02:51 UTC
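A quick, runnable Scala illustration of the seed problem behind this switch: `java.util.Random` with consecutive small seeds yields nearly identical first draws, while hashing the seed first (which is effectively what Spark's `private[spark]` `XORShiftRandom` does; mimicked here, not reused) spreads the start positions.

```scala
import java.util.Random
import scala.util.hashing.MurmurHash3

// First draw per partition id with a raw seed vs. a scrambled seed.
val plain = (1 to 20).map(p => new Random(p).nextInt(4))
val mixed = (1 to 20).map(p => new Random(MurmurHash3.stringHash(p.toString)).nextInt(4))

println(plain.mkString(","))  // dominated by a single value (always 2 here)
println(mixed.mkString(","))  // well spread across 0..3
```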
b43e38b [SPARK-40617][CORE][3.2] Fix race condition at the handling of ExecutorMetricsPoller's stageTCMP entries ### What changes were proposed in this pull request? Fix a race condition in `ExecutorMetricsPoller` between the `getExecutorUpdates()` and `onTaskStart()` methods by avoiding removing entries when another stage has not started yet. ### Why are the changes needed? Spurious failures are reported because of the following assert: ``` 22/09/29 09:46:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 3063.0 in stage 1997.0 (TID 677249),5,main] java.lang.AssertionError: assertion failed: task count shouldn't below 0 at scala.Predef$.assert(Predef.scala:223) at org.apache.spark.executor.ExecutorMetricsPoller.decrementCount$1(ExecutorMetricsPoller.scala:130) at org.apache.spark.executor.ExecutorMetricsPoller.$anonfun$onTaskCompletion$3(ExecutorMetricsPoller.scala:135) at java.base/java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1822) at org.apache.spark.executor.ExecutorMetricsPoller.onTaskCompletion(ExecutorMetricsPoller.scala:135) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:737) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) 22/09/29 09:46:24 INFO MemoryStore: MemoryStore cleared 22/09/29 09:46:24 INFO BlockManager: BlockManager stopped 22/09/29 09:46:24 INFO ShutdownHookManager: Shutdown hook called 22/09/29 09:46:24 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1664443624160_0001/spark-93efc2d4-84de-494b-a3b7-2cb1c3a45426 ``` I have checked the code, and the basic assumption of having at least as many `onTaskStart()` calls as `onTaskCompletion()` calls for the same `stageId` & `stageAttemptId` pair is correct. But there is a race condition between `getExecutorUpdates()` and `onTaskStart()`. First of all, we have two different threads: - task runner: executes the task and informs `ExecutorMetricsPoller` about task starts and completions - heartbeater: uses the `ExecutorMetricsPoller` to get the metrics To show the race condition, assume a task just finished which was running on its own (no other tasks were running). This will decrease the `count` from 1 to 0. On the task runner thread, let's say a new task starts. The execution is in the `onTaskStart()` method; let's assume `countAndPeaks` is already computed, the counter is 0, and the execution has not yet incremented the counter. So we are in between the following two lines: ```scala val countAndPeaks = stageTCMP.computeIfAbsent((stageId, stageAttemptId), _ => TCMP(new AtomicLong(0), new AtomicLongArray(ExecutorMetricType.numMetrics))) val stageCount = countAndPeaks.count.incrementAndGet() ``` Let's look at the other thread (heartbeater), where `getExecutorUpdates()` is running and is at the `removeIfInactive()` method: ```scala def removeIfInactive(k: StageKey, v: TCMP): TCMP = { if (v.count.get == 0) { logDebug(s"removing (${k._1}, ${k._2}) from stageTCMP") null } else { v } } ``` Here this entry is removed from `stageTCMP`, as the count is 0. Let's go back to the task runner thread, where we increase the counter to 1, but that value will be lost as we have no entry in `stageTCMP` for this stage and attempt. 
So if a new task comes, instead of 2 we will have 1 in `stageTCMP`, and when those two tasks finish, the second one will decrease the counter from 0 to -1. This is when the assert is raised. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. I managed to reproduce the issue with a temporary test: ```scala test("reproduce assert failure") { val testMemoryManager = new TestMemoryManager(new SparkConf()) val taskId = new AtomicLong(0) val runFlag = new AtomicBoolean(true) val poller = new ExecutorMetricsPoller(testMemoryManager, 1000, None) val callUpdates = new Thread("getExecutorUpdates") { override def run() { while (runFlag.get()) { poller.getExecutorUpdates.size } } } val taskStartRunner1 = new Thread("taskRunner1") { override def run() { while (runFlag.get) { var l = taskId.incrementAndGet() poller.onTaskStart(l, 0, 0) poller.onTaskCompletion(l, 0, 0) } } } val taskStartRunner2 = new Thread("taskRunner2") { override def run() { while (runFlag.get) { var l = taskId.incrementAndGet() poller.onTaskStart(l, 0, 0) poller.onTaskCompletion(l, 0, 0) } } } val taskStartRunner3 = new Thread("taskRunner3") { override def run() { while (runFlag.get) { var l = taskId.incrementAndGet() var m = taskId.incrementAndGet() poller.onTaskStart(l, 0, 0) poller.onTaskStart(m, 0, 0) poller.onTaskCompletion(l, 0, 0) poller.onTaskCompletion(m, 0, 0) } } } callUpdates.start() taskStartRunner1.start() taskStartRunner2.start() taskStartRunner3.start() Thread.sleep(1000 * 20) runFlag.set(false) callUpdates.join() taskStartRunner1.join() taskStartRunner2.join() taskStartRunner3.join() } ``` The assert which raised is: ``` Exception in thread "taskRunner3" java.lang.AssertionError: assertion failed: task count shouldn't below 0 at scala.Predef$.assert(Predef.scala:223) at org.apache.spark.executor.ExecutorMetricsPoller.decrementCount$1(ExecutorMetricsPoller.scala:130) at org.apache.spark.executor.ExecutorMetricsPoller.$anonfun$onTaskCompletion$3(ExecutorMetricsPoller.scala:135) at java.base/java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1828) at org.apache.spark.executor.ExecutorMetricsPoller.onTaskCompletion(ExecutorMetricsPoller.scala:135) at org.apache.spark.executor.ExecutorMetricsPollerSuite$$anon$4.run(ExecutorMetricsPollerSuite.scala:64) ``` But when I switch off `removeIfInactive()` by using the following code: ```scala if (false && v.count.get == 0) { logDebug(s"removing (${k._1}, ${k._2}) from stageTCMP") null } else { v } ``` Then no assert is raised. Closes #38056 from attilapiros/SPARK-40617. Authored-by: attilapiros <piros.attila.zsoltgmail.com> Signed-off-by: attilapiros <piros.attila.zsoltgmail.com> (cherry picked from commit 564a51b64e71f7402c2674de073b3b18001df56f) Signed-off-by: attilapiros <piros.attila.zsoltgmail.com> (cherry picked from commit 90a27757ec17c2511049114a437f365326e51225) Closes #38083 from attilapiros/SPARK-40617-3.2. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com> 04 October 2022, 21:41:47 UTC
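One way (not the author's exact patch, which instead avoids removing entries for stages that may still start tasks) to close such a start/remove window, sketched in self-contained Scala: perform the increment inside `ConcurrentHashMap.compute`, so a concurrent "remove if count == 0" can never interleave between lookup and increment.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

val stageTCMP = new ConcurrentHashMap[(Int, Int), AtomicLong]()

// Atomically create-or-increment the counter for a (stageId, stageAttemptId) key.
def onTaskStart(stageId: Int, stageAttemptId: Int): Long =
  stageTCMP.compute((stageId, stageAttemptId), (_, existing) => {
    val count = if (existing == null) new AtomicLong(0) else existing
    count.incrementAndGet()
    count // the entry is (re)installed in the same atomic step
  }).get()
```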
4b1e06b [SPARK-40636][CORE] Fix wrong remained shuffles log in BlockManagerDecommissioner ### What changes were proposed in this pull request? Fix the wrong "remained shuffles" log in `BlockManagerDecommissioner`. ### Why are the changes needed? `BlockManagerDecommissioner` should log the correct number of remaining shuffles. The current log used the total number of shuffles as the remaining count. ``` 4 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:15.035 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. 2022-09-30 17:42:45.069 PDT 0 of 24 local shuffles are added. In total, 24 shuffles are remained. ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested Closes #38078 from warrenzhu25/deco-log. Authored-by: Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b39f2d6acf25726d99bf2c2fa84ba6a227d0d909) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 04 October 2022, 20:38:37 UTC
ac772e9 [SPARK-40612][CORE] Fixing the principal used for delegation token renewal on non-YARN resource managers ### What changes were proposed in this pull request? When the delegation token is fetched for the first time (see the `fetchDelegationTokens()` call in `HadoopFSDelegationTokenProvider#getTokenRenewalInterval()`), the principal is the current user, but the subsequent token renewals (see `obtainDelegationTokens()`, where `getTokenRenewer()` is used to identify the principal) use a MapReduce/YARN-specific principal even on resource managers different from YARN. This PR fixes `getTokenRenewer()` to use the current user instead of `org.apache.hadoop.mapred.Master.getMasterPrincipal(hadoopConf)` when the resource manager is not YARN. The condition `(master != null && master.contains("yarn"))` is the very same one we already have in `hadoopFSsToAccess()`. I would like to say thank you to squito, who did the investigation regarding the problem which led to this PR. ### Why are the changes needed? To avoid `org.apache.hadoop.security.AccessControlException: Permission denied.` for long running applications. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually. Closes #38048 from attilapiros/SPARK-40612. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6484992535767ae8dc93df1c79efc66420728155) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 September 2022, 21:53:00 UTC
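A hedged Scala sketch of the corrected renewer selection (written as a standalone function; the real change lives inside `HadoopFSDelegationTokenProvider`): only consult the MapReduce/YARN master principal when the master is actually YARN, mirroring the `master.contains("yarn")` condition quoted above.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.Master
import org.apache.hadoop.security.UserGroupInformation

// Pick the token renewer: YARN's master principal on YARN, the current user elsewhere.
def getTokenRenewer(master: Option[String], hadoopConf: Configuration): String =
  if (master.exists(_.contains("yarn"))) Master.getMasterPrincipal(hadoopConf)
  else UserGroupInformation.getCurrentUser.getUserName
```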
d046b98 [SPARK-40574][DOCS] Enhance DROP TABLE documentation ### What changes were proposed in this pull request? This PR adds `PURGE` in `DROP TABLE` documentation. Related documentation and code: 1. Hive `DROP TABLE` documentation: https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl <img width="877" alt="image" src="https://user-images.githubusercontent.com/5399861/192425153-63ac5373-dd34-48b3-864c-324cf5ba5db9.png"> 2. Hive code: https://github.com/apache/hive/blob/rel/release-2.3.9/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L1185-L1209 3. Spark code: https://github.com/apache/spark/blob/v3.3.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1317-L1327 ### Why are the changes needed? Enhance documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? manual test. Closes #38011 from wangyum/SPARK-40574. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 11eefc81e5c1f3ec7db6df8ba068a7155f7abda3) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 September 2022, 22:32:56 UTC
b441d43 [SPARK-40583][DOCS] Fixing artifactId name in `cloud-integration.md` ### What changes were proposed in this pull request? I am changing the name of the artifactId that enables the integration with several cloud infrastructures. ### Why are the changes needed? The name of the package is wrong and it does not exist. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It is not needed. Closes #38021 from danitico/fix/SPARK-40583. Authored-by: Daniel Ranchal Parrado <daniel.ranchal@vlex.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit dac58f82d1c10fb91f85fd9670f88d88dbe2feea) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 September 2022, 22:25:34 UTC
dc9041a [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy` This PR aims to add a new legacy configuration to keep the `grouping__id` value the same as in the released Apache Spark 3.2 and 3.3. Please note that this syntax is non-standard SQL and even Hive doesn't support it. ```SQL hive> SELECT version(); OK 3.1.3 r4df4d75bf1e16fe0af75aad0b4179c34c07fc975 Time taken: 0.111 seconds, Fetched: 1 row(s) hive> SELECT count(*), grouping__id from t GROUP BY a GROUPING SETS (b); FAILED: SemanticException 1:63 [Error 10213]: Grouping sets expression is not in GROUP BY key. Error encountered near token 'b' ``` SPARK-40218 fixed a bug caused by SPARK-34932 (at Apache Spark 3.2.0). As a side effect, `grouping__id` values changed. - Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.3.0: ```scala scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show() +--------+------------+ |count(1)|grouping__id| +--------+------------+ | 1| 1| | 1| 1| +--------+------------+ ``` - After SPARK-40218: Apache Spark 3.4.0, 3.3.1, 3.2.3 ```scala scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show() +--------+------------+ |count(1)|grouping__id| +--------+------------+ | 1| 2| | 1| 2| +--------+------------+ ``` - This PR (Apache Spark 3.4.0, 3.3.1, 3.2.3): ```scala scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show() +--------+------------+ |count(1)|grouping__id| +--------+------------+ | 1| 2| | 1| 2| +--------+------------+ scala> sql("set spark.sql.legacy.groupingIdWithAppendedUserGroupBy=true") res1: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show() +--------+------------+ |count(1)|grouping__id| +--------+------------+ | 1| 1| | 1| 1| +--------+------------+ ``` No, this simply adds back the previous behavior behind a legacy configuration. Pass the CIs. Closes #38001 from dongjoon-hyun/SPARK-40562. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5c0ebf3d97ae49b6e2bd2096c2d590abf4d725bd) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 September 2022, 09:16:44 UTC
f9a05dd [SPARK-39200][CORE] Make Fallback Storage readFully on content From the bug description, it looks like the fallback storage doesn't read fully, which then causes `org.apache.spark.shuffle.FetchFailedException: Decompression error: Corrupted block detected`. This is an attempt to fix this by reading the underlying stream fully. Fix a bug documented in SPARK-39200. No. Wrote a unit test. Closes #37960 from ukby1234/SPARK-39200. Authored-by: Frank Yin <franky@ziprecruiter.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 07061f1a07a96f59ae42c9df6110eb784d2f3dab) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 September 2022, 11:27:16 UTC
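A minimal Scala read-fully loop in the spirit of the fix (a sketch, not the actual FallbackStorage code): a single `InputStream.read` may return fewer bytes than requested, so loop until the requested range is filled.

```scala
import java.io.{EOFException, InputStream}

// Fill buf[offset, offset + length) completely, or fail loudly on EOF.
def readFully(in: InputStream, buf: Array[Byte], offset: Int, length: Int): Unit = {
  var pos = offset
  val end = offset + length
  while (pos < end) {
    val n = in.read(buf, pos, end - pos)
    if (n < 0) throw new EOFException(s"Reached EOF with ${end - pos} bytes left")
    pos += n
  }
}
```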
18e83f9 [SPARK-40407][SQL] Fix the potential data skew caused by df.repartition ### What changes were proposed in this pull request? ``` scala val df = spark.range(0, 100, 1, 50).repartition(4) val v = df.rdd.mapPartitions { iter => Iterator.single(iter.length) }.collect() println(v.mkString(",")) ``` The above simple code outputs `50,0,0,50`, which means there is no data in partition 1 and partition 2. The RoundRobin partitioning seems to distribute the records evenly *within the same partition*, but does not guarantee it between partitions. Below is the code that generates the key: ``` scala case RoundRobinPartitioning(numPartitions) => // Distributes elements evenly across output partitions, starting from a random partition. var position = new Random(TaskContext.get().partitionId()).nextInt(numPartitions) (row: InternalRow) => { // The HashPartitioner will handle the `mod` by the number of partitions position += 1 position } ``` In this case, there are 50 partitions, and each partition will only compute 2 elements. The issue for RoundRobin here is that it always starts with position=2. See the output of Random: ``` scala scala> (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(4) + " ")) // the position is always 2. 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ``` Similarly, the Random calls below also each output the same value for every seed: ``` scala (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(2) + " ")) (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(4) + " ")) (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(8) + " ")) (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(16) + " ")) (1 to 200).foreach(partitionId => print(new Random(partitionId).nextInt(32) + " ")) ``` Consider partition 0: the elements are [0, 1], so at shuffle write, for element 0 the key will be (position + 1) = (2 + 1) % 4 = 3, and for element 1 the key will be (3 + 1) % 4 = 0. Consider partition 1: the elements are [2, 3], so at shuffle write, for element 2 the key will be (2 + 1) % 4 = 3, and for element 3 the key will be (3 + 1) % 4 = 0. The same calculation applies to the remaining partitions, since the starting position is always 2 in this case. So, as you can see, each partition writes its elements only to partitions [0, 3], which leaves partitions [1, 2] without any data. This PR changes the starting position of RoundRobin. The default position calculated by `new Random(partitionId).nextInt(numPartitions)` may always be the same for different partitions, which means each partition will output the data into the same keys at shuffle write, and some keys may not have any data in some special cases. ### Why are the changes needed? The PR can fix the data skew issue for the special cases. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Will add some tests and watch CI pass Closes #37855 from wbo4958/roundrobin-data-skew. 
Authored-by: Bobby Wang <wbo4958@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f6c4e58b85d7486c70cd6d58aae208f037e657fa) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 September 2022, 13:00:11 UTC
3ac1c3d [SPARK-40490][YARN][TESTS][3.2] Ensure YarnShuffleIntegrationSuite tests registeredExecFile reload scenarios ### What changes were proposed in this pull request? After SPARK-17321, `YarnShuffleService` will persist data to the local shuffle state db/reload data from the local shuffle state db only when the Yarn NodeManager starts with `YarnConfiguration#NM_RECOVERY_ENABLED = true`. `YarnShuffleIntegrationSuite` does not set `YarnConfiguration#NM_RECOVERY_ENABLED`, and the default value of the configuration is false, so `YarnShuffleIntegrationSuite` will neither trigger data persistence to the db nor verify the reload of data. This PR aims to let `YarnShuffleIntegrationSuite` restore the verification of registeredExecFile reload scenarios; to achieve this goal, this PR makes the following changes: 1. Add a new undocumented configuration `spark.yarn.shuffle.testing` to `YarnShuffleService`, and initialize `_recoveryPath` when `_recoveryPath == null && spark.yarn.shuffle.testing == true`. 2. Only set `spark.yarn.shuffle.testing = true` in `YarnShuffleIntegrationSuite`, and add assertions to check `registeredExecFile` is not null to ensure that registeredExecFile reload scenarios will be verified. ### Why are the changes needed? Fix the registeredExecFile reload test scenarios. Why not test by configuring `YarnConfiguration#NM_RECOVERY_ENABLED` as true? This configuration has been tried: **Hadoop 3.3.1** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite -Phadoop-3 ``` ``` YarnShuffleIntegrationSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/org/iq80/leveldb/DBException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:105) at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) ... 
Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.iq80.leveldb.DBException at java.net.URLClassLoader.findClass(URLClassLoader.java:387) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109) ``` **Hadoop 2.7.4** ``` build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite -Phadoop-2 ``` ``` YarnShuffleIntegrationSuite: org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite *** ABORTED *** java.lang.IllegalArgumentException: Cannot support recovery with an ephemeral server port. Check the setting of yarn.nodemanager.address at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... Run completed in 3 seconds, 992 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 *** 1 SUITE ABORTED *** ``` From the above tests, we would need to use a fixed port to enable Yarn NodeManager recovery, but this is difficult to guarantee in a UT, so this PR tries a workaround. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions Closes #37963 from LuciferYang/SPARK-40490-32. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 September 2022, 11:14:50 UTC
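A hedged Scala sketch of the test hook described above (the flag and file names follow the PR text; the helper wiring is assumed): when YARN recovery supplied no path and the undocumented testing flag is on, fall back to a temporary directory so the state-DB persistence/reload path still executes in the suite.

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.Path

// Resolve _recoveryPath: prefer what YARN provides; under the testing flag,
// substitute a temp dir instead of leaving it null.
def resolveRecoveryPath(fromYarn: Path, testing: Boolean): Path =
  if (fromYarn == null && testing) {
    new Path(Files.createTempDirectory("spark-shuffle-recovery").toUri)
  } else {
    fromYarn
  }
```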
e90a57e [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema ### What changes were proposed in this pull request? Currently, in the Parquet V1 read path, Spark will push down data filters even if they have no reference in the Parquet read schema. This can cause correctness issues as described in [SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833). The root cause seems to be that in the V1 path, we first use `AttributeReference` equality to filter partition columns out of the data columns, and then use `AttributeSet` equality to find filters that only reference data columns. The two steps are inconsistent when the case-sensitivity check is false. Take the following scenario as an example: - data column: `[COL, a]` - partition column: `[col]` - filter: `col > 10` With `AttributeReference` equality, `COL` is not considered equal to `col` (because their names are different), and thus the filtered data column set is still `[COL, a]`. However, when calculating filters with only references to data columns, `COL` is **considered equal** to `col`. Consequently, the filter `col > 10`, when checked against `[COL, a]`, is considered to reference only data columns, and thus will be pushed down to Parquet as a data filter. On the Parquet side, since `col` doesn't exist in the file schema (it only has `COL`), when the column index is enabled, it will incorrectly return the wrong number of rows. See [PARQUET-2170](https://issues.apache.org/jira/browse/PARQUET-2170) for more detail. In general, when data columns overlap with partition columns and case sensitivity is false, partition filters are not filtered out before we calculate the filters that only reference data columns, which is incorrect. ### Why are the changes needed? This fixes the correctness bug described in [SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? There are existing test cases for this issue from [SPARK-39833](https://issues.apache.org/jira/browse/SPARK-39833). This PR also modifies them to test the scenarios where case sensitivity is on or off. Closes #37881 from sunchao/SPARK-40169. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> 16 September 2022, 17:50:45 UTC
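A toy Scala sketch of the inconsistency described in the entry above (illustration only, not Spark internals): a name-based, case-sensitive comparison and a case-insensitive one disagree on whether `COL` and `col` are the same column.

```scala
// Toy illustration: two equality notions disagree when spark.sql.caseSensitive=false.
val dataColumns = Seq("COL", "a")
val partitionColumns = Seq("col")

// Step 1 (name equality, case-sensitive): nothing is filtered out.
val afterStep1 = dataColumns.filterNot(partitionColumns.contains) // Seq("COL", "a")

// Step 2 (case-insensitive resolution): the filter column "col" matches "COL",
// so the partition filter looks like a data filter and gets pushed down.
val looksLikeDataFilter = afterStep1.exists(_.equalsIgnoreCase("col")) // true
```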
1b84e44 [SPARK-40470][SQL] Handle GetArrayStructFields and GetMapValue in "arrays_zip" function ### What changes were proposed in this pull request? This is a follow-up for https://github.com/apache/spark/pull/37833. The PR fixes column names in the `arrays_zip` function for the cases when `GetArrayStructFields` and `GetMapValue` expressions are used (see unit tests for more details). Before the patch, the column names would be indexes, or an AnalysisException would be thrown in the `GetArrayStructFields` case. ### Why are the changes needed? Fixes an inconsistency issue in Spark 3.2 and onwards where the fields would be labeled as indexes instead of column names. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added unit tests that reproduce the issue and confirmed that the patch fixes them. Closes #37911 from sadikovi/SPARK-40470. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 9b0f979141ba2c4124d96bc5da69ea5cac51df0d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 September 2022, 13:05:27 UTC
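A hedged example of the two expression shapes this follow-up covers; under the fix one would expect the zipped struct fields to be named `x` and `k` (assumed output, not verified here) rather than indexes:

```scala
// s.x is a GetArrayStructFields extraction, m['k'] is a GetMapValue lookup;
// both feed arrays_zip, whose result struct should carry the original names.
spark.sql("""
  SELECT arrays_zip(s.x, m['k']) AS zipped
  FROM (SELECT array(named_struct('x', 1)) AS s, map('k', array(2)) AS m)
""").printSchema()
```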
8068bd3 [SPARK-40461][INFRA] Set upperbound for pyzmq 24.0.0 for Python linter This PR sets the upperbound for `pyzmq` as `<24.0.0` in our CI Python linter job. The new release seems having a problem (https://github.com/zeromq/pyzmq/commit/2d3327d2e50c2510d45db2fc51488578a737b79b). To fix the linter build failure. See https://github.com/apache/spark/actions/runs/3063515551/jobs/4947782771 ``` /tmp/timer_created_0ftep6.c: In function ‘main’: /tmp/timer_created_0ftep6.c:2:5: warning: implicit declaration of function ‘timer_create’ [-Wimplicit-function-declaration] 2 | timer_create(); | ^~~~~~~~~~~~ x86_64-linux-gnu-gcc -pthread tmp/timer_created_0ftep6.o -L/usr/lib/x86_64-linux-gnu -o a.out /usr/bin/ld: tmp/timer_created_0ftep6.o: in function `main': /tmp/timer_created_0ftep6.c:2: undefined reference to `timer_create' collect2: error: ld returned 1 exit status no timer_create, linking librt ************************************************ building 'zmq.libzmq' extension creating build/temp.linux-x86_64-cpython-39/buildutils creating build/temp.linux-x86_64-cpython-39/bundled creating build/temp.linux-x86_64-cpython-39/bundled/zeromq creating build/temp.linux-x86_64-cpython-39/bundled/zeromq/src x86_64-linux-gnu-g++ -pthread -std=c++11 -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DZMQ_HAVE_CURVE=1 -DZMQ_USE_TWEETNACL=1 -DZMQ_USE_EPOLL=1 -DZMQ_IOTHREADS_USE_EPOLL=1 -DZMQ_POLL_BASED_ON_POLL=1 -Ibundled/zeromq/include -Ibundled -I/usr/include/python3.9 -c buildutils/initlibzmq.cpp -o build/temp.linux-x86_64-cpython-39/buildutils/initlibzmq.o buildutils/initlibzmq.cpp:10:10: fatal error: Python.h: No such file or directory 10 | #include "Python.h" | ^~~~~~~~~~ compilation terminated. error: command '/usr/bin/x86_64-linux-gnu-g++' failed with exit code 1 [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for pyzmq ERROR: Could not build wheels for pyzmq, which is required to install pyproject.toml-based projects ``` No, test-only. CI in this PRs should validate it. Closes #37904 from HyukjinKwon/fix-linter. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 254bd80278843b3bc13584ca2f04391a770a78c7) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 September 2022, 01:13:14 UTC
a1b1ac7 [SPARK-40459][K8S] `recoverDiskStore` should not stop by existing recomputed files ### What changes were proposed in this pull request? This PR aims to ignore `FileExistsException` during `recoverDiskStore` processing. ### Why are the changes needed? Although `recoverDiskStore` is already wrapped by `tryLogNonFatalError`, a single file recovery exception should not block the whole `recoverDiskStore`. https://github.com/apache/spark/blob/5938e84e72b81663ccacf0b36c2f8271455de292/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/shuffle/KubernetesLocalDiskShuffleExecutorComponents.scala#L45-L47 ``` org.apache.commons.io.FileExistsException: ... at org.apache.commons.io.FileUtils.requireAbsent(FileUtils.java:2587) at org.apache.commons.io.FileUtils.moveFile(FileUtils.java:2305) at org.apache.commons.io.FileUtils.moveFile(FileUtils.java:2283) at org.apache.spark.storage.DiskStore.moveFileToBlock(DiskStore.scala:150) at org.apache.spark.storage.BlockManager$TempFileBasedBlockStoreUpdater.saveToDiskStore(BlockManager.scala:487) at org.apache.spark.storage.BlockManager$BlockStoreUpdater.$anonfun$save$1(BlockManager.scala:407) at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1445) at org.apache.spark.storage.BlockManager$BlockStoreUpdater.save(BlockManager.scala:380) at org.apache.spark.storage.BlockManager$TempFileBasedBlockStoreUpdater.save(BlockManager.scala:490) at org.apache.spark.shuffle.KubernetesLocalDiskShuffleExecutorComponents$.$anonfun$recoverDiskStore$14(KubernetesLocalDiskShuffleExecutorComponents.scala:95) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at org.apache.spark.shuffle.KubernetesLocalDiskShuffleExecutorComponents$.recoverDiskStore(KubernetesLocalDiskShuffleExecutorComponents.scala:91) ``` ### Does this PR introduce _any_ user-facing change? No, this will improve the recovery rate. ### How was this patch tested? Pass the CIs. Closes #37903 from dongjoon-hyun/SPARK-40459. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit f24bb430122eaa311070cfdefbc82d34b0341701) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 September 2022, 01:04:28 UTC
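An illustrative Scala pattern for the fix above (the helper `moveFileToBlock` is hypothetical and this is not the actual Spark code): recover each file independently so one `FileExistsException` cannot abort the loop.

```scala
import org.apache.commons.io.FileExistsException

// Hypothetical helper standing in for the real per-file recovery step.
def moveFileToBlock(f: java.io.File): Unit = ???

def recoverAll(files: Seq[java.io.File]): Unit = files.foreach { f =>
  try moveFileToBlock(f)
  catch {
    // The block was already recomputed and saved; skip it instead of failing the whole recovery.
    case _: FileExistsException => ()
  }
}
```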
ce55a8f [SPARK-38017][FOLLOWUP][3.2] Hide TimestampNTZ in the doc ### What changes were proposed in this pull request? This PR removes `TimestampNTZ` from the doc about `TimeWindow` and `SessionWindow`. ### Why are the changes needed? As we discussed, it's better to hide `TimestampNTZ` from the doc. https://github.com/apache/spark/pull/35313#issuecomment-1185192162 ### Does this PR introduce _any_ user-facing change? The document will be changed, but there is no compatibility problem. ### How was this patch tested? Built the doc with `SKIP_RDOC=1 SKIP_SQLDOC=1 bundle exec jekyll build` in the `docs` directory. Then, confirmed the generated HTML. Closes #37883 from sarutak/fix-window-doc-3.2. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 15 September 2022, 00:27:20 UTC
b39b721 [SPARK-40292][SQL] Fix column names in "arrays_zip" function when arrays are referenced from nested structs ### What changes were proposed in this pull request? This PR fixes an issue in the `arrays_zip` function where a field index was used as a column name in the resulting schema, which was a regression from Spark 3.1. With this change, the original behaviour is restored: the corresponding struct field name is used instead of a field index. Example: ```sql with q as ( select named_struct( 'my_array', array(1, 2, 3), 'my_array2', array(4, 5, 6) ) as my_struct ) select arrays_zip(my_struct.my_array, my_struct.my_array2) from q ``` would return schema: ``` root |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false) | |-- element: struct (containsNull = false) | | |-- 0: integer (nullable = true) | | |-- 1: integer (nullable = true) ``` which is somewhat inaccurate. The PR adds handling of the `GetStructField` expression to return the struct field names, like this: ``` root |-- arrays_zip(my_struct.my_array, my_struct.my_array2): array (nullable = false) | |-- element: struct (containsNull = false) | | |-- my_array: integer (nullable = true) | | |-- my_array2: integer (nullable = true) ``` ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? Yes, the `arrays_zip` function now returns struct field names, as in Spark 3.1, instead of field indices. Some users might have worked around this issue, so this patch would affect them by bringing back the original behaviour. ### How was this patch tested? Existing unit tests. I also added a test case that reproduces the problem. Closes #37833 from sadikovi/SPARK-40292. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 443eea97578c41870c343cdb88cf69bfdf27033a) Signed-off-by: Max Gekk <max.gekk@gmail.com> 12 September 2022, 04:34:11 UTC
8a882d5 [SPARK-40280][SQL][FOLLOWUP][3.2] Fix 'ParquetFilterSuite' issue ### What changes were proposed in this pull request? Fix a 'ParquetFilterSuite' issue after merging #37747: `org.apache.parquet.filter2.predicate.Operators.In` was added in Parquet 1.12.3, but Spark branch-3.2 uses Parquet 1.12.2. Use `Operators.And` instead of `Operators.In`. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #37846 from zzcclp/SPARK-40280-hotfix. Authored-by: Zhichao Zhang <zhangzc@apache.org> Signed-off-by: huaxingao <huaxin_gao@apple.com> 09 September 2022, 18:28:54 UTC
24818bf [SPARK-40280][SQL] Add support for parquet push down for annotated int and long ### What changes were proposed in this pull request? This fixes SPARK-40280 by normalizing a Parquet int/long that carries optional metadata to look like the expected version that does not have the extra metadata. ### Why are the changes needed? This allows predicate push down in Parquet to work when reading files that are compliant with the Parquet specification, but different from what Spark writes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I added unit tests that cover this use case. I also did some manual testing on some queries to verify that less data is actually read after this change. Closes #37747 from revans2/normalize_int_long_parquet_push. Authored-by: Robert (Bobby) Evans <bobby@apache.org> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit 24b3baf0177fc1446bf59bb34987296aefd4b318) Signed-off-by: Thomas Graves <tgraves@apache.org> 08 September 2022, 13:56:26 UTC
d566017 [SPARK-40149][SQL][3.2] Propagate metadata columns through Project backport https://github.com/apache/spark/pull/37758 to 3.2 ### What changes were proposed in this pull request? This PR fixes a regression caused by https://github.com/apache/spark/pull/32017. In https://github.com/apache/spark/pull/32017, we tried to be more conservative and decided to not propagate metadata columns in certain operators, including `Project`. However, the decision was made only considering the SQL API, not the DataFrame API. In fact, it's very common to chain `Project` operators in DataFrame code, e.g. `df.withColumn(...).withColumn(...)...`, and it's very inconvenient if metadata columns are not propagated through `Project`. This PR makes 2 changes: 1. Project should propagate metadata columns 2. SubqueryAlias should only propagate metadata columns if the child is a leaf node or also a SubqueryAlias The second change is needed to still forbid weird queries like `SELECT m from (SELECT a from t)`, which was the main motivation of https://github.com/apache/spark/pull/32017. After propagating metadata columns, a problem from https://github.com/apache/spark/pull/31666 is exposed: the natural join metadata columns may confuse the analyzer and lead to a wrong analyzed plan. For example, `SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY key`: how shall we resolve `ORDER BY key`? It should be resolved to `t1.key` via the rule `ResolveMissingReferences`, which is in the output of the left join. However, if `Project` can propagate metadata columns, `ORDER BY key` will be resolved to `t2.key`. To solve this problem, this PR only allows qualified access to the metadata columns of a natural join. This introduces no breaking change, as people could only do qualified access to natural join metadata columns before, in the `Project` right after `Join`. It actually enables more use cases, as people can now access natural join metadata columns in ORDER BY. I've added a test for it. ### Why are the changes needed? fix a regression ### Does this PR introduce _any_ user-facing change? For the SQL API, there is no change, as a `SubqueryAlias` always comes with a `Project` or `Aggregate`, so we still don't propagate metadata columns through a SELECT group. For the DataFrame API, the behavior becomes more lenient. The only breaking case is an operator that can propagate metadata columns followed by a `SubqueryAlias`, e.g. `df.filter(...).as("t").select("t.metadata_col")`. But this is a weird use case and I don't think we should have supported it in the first place. ### How was this patch tested? new tests Closes #37818 from cloud-fan/backport. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 September 2022, 15:44:54 UTC
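A hedged DataFrame-API illustration of what this backport unblocks (`t` and its metadata column `index` are hypothetical, standing in for a DSv2 source that exposes one):

```scala
import org.apache.spark.sql.functions.col

// Each withColumn/select adds a Project to the plan; before the fix the metadata
// column was dropped by the first Project and could no longer be resolved.
spark.table("t")
  .withColumn("doubled", col("value") * 2)
  .select("index", "doubled")
```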
c7cc0ae [SPARK-40315][SQL] Add hashCode() for Literal of ArrayBasedMapData ### What changes were proposed in this pull request? There is no explicit `hashCode()` function override for `ArrayBasedMapData`. As a result, there is a non-deterministic error where the `hashCode()` computed for `Literal`s of `ArrayBasedMapData` can be different for two equal objects (`Literal`s of `ArrayBasedMapData` with equal keys and values). In this PR, we add a `hashCode` function so that it works exactly as we expect. ### Why are the changes needed? This is a bug fix for a non-deterministic error. It is also more consistent with the rest of Spark if we implement the `hashCode` method instead of relying on defaults. We can't add the `hashCode` directly to `ArrayBasedMapData` because of SPARK-9415. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? A simple unit test was added. Closes #37807 from c27kwan/SPARK-40315-lit. Authored-by: Carmen Kwan <carmen.kwan@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e85a4ffbdfa063c8da91b23dfbde77e2f9ed62e9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 September 2022, 13:14:04 UTC
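A minimal Scala sketch of the contract at stake in the entry above (illustration only, not the Spark internals): two objects that compare equal must also hash equal, or hash-based collections and hash computations over such values misbehave non-deterministically.

```scala
class MapLit(val keys: Seq[Int], val values: Seq[Int]) {
  override def equals(other: Any): Boolean = other match {
    case m: MapLit => keys == m.keys && values == m.values
    case _ => false
  }
  // Without this override, two equal MapLits get different default (identity) hash
  // codes, which is exactly the class of bug described above.
  override def hashCode(): Int = 31 * keys.hashCode + values.hashCode
}

val a = new MapLit(Seq(1), Seq(2))
val b = new MapLit(Seq(1), Seq(2))
assert(a == b && a.hashCode == b.hashCode)
```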
efe12ae Revert "[SPARK-33861][SQL] Simplify conditional in predicate" This reverts commits 32d4a2b and 3aa4e11. Closes #37729 from wangyum/SPARK-33861. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 43cbdc6ec9dbcf9ebe0b48e14852cec4af18b4ec) Signed-off-by: Yuming Wang <yumwang@ebay.com> 03 September 2022, 08:41:03 UTC
58375a8 [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style This PR makes the `compute.max_rows` option work as expected when it is `None` in `DataFrame.style`, by collecting all the data to a pandas DataFrame instead of throwing an exception. To make the configuration work as expected. Yes. ```python import pyspark.pandas as ps ps.set_option("compute.max_rows", None) ps.get_option("compute.max_rows") ps.range(1).style ``` **Before:** ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style pdf = self.head(max_results + 1)._to_internal_pandas() TypeError: unsupported operand type(s) for +: 'NoneType' and 'int' ``` **After:** ``` <pandas.io.formats.style.Styler object at 0x7fdf78250430> ``` Manually tested, and a unit test was added. Closes #37718 from HyukjinKwon/SPARK-40270. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 0f0e8cc26b6c80cc179368e3009d4d6c88103a64) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 30 August 2022, 07:28:25 UTC
817c203 [SPARK-40124][SQL][TEST][3.2] Update TPCDS v1.4 q32 for Plan Stability tests ### What changes were proposed in this pull request? This is a port of SPARK-40124 to Spark 3.2. Fix query 32 for TPCDS v1.4. ### Why are the changes needed? The current q32.sql seems to be wrong: it just selects `1`. Reference for the query template: https://github.com/databricks/tpcds-kit/blob/eff5de2c30337b71cc0dc1976147742d2c65d378/query_templates/query32.tpl#L41 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Test change only Closes #37675 from mskapilks/change-q32-3.2. Authored-by: Kapil Kumar Singh <kapilsingh@microsoft.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 August 2022, 19:59:11 UTC
96dc433 [SPARK-40241][DOCS] Correct the link of GenericUDTF ### What changes were proposed in this pull request? Correct the link. ### Why are the changes needed? The existing link was wrong. ### Does this PR introduce _any_ user-facing change? Yes, a link was updated. ### How was this patch tested? Manually checked. Closes #37685 from zhengruifeng/doc_fix_udtf. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 8ffcecb68fafd0466e839281588aab50cd046b49) Signed-off-by: Yuming Wang <yumwang@ebay.com> 27 August 2022, 10:40:20 UTC
de649f0 [SPARK-40218][SQL] GROUPING SETS should preserve the grouping columns ### What changes were proposed in this pull request? This PR fixes a bug caused by https://github.com/apache/spark/pull/32022 . Although we deprecate `GROUP BY ... GROUPING SETS ...`, it should still work if it worked before. https://github.com/apache/spark/pull/32022 made the mistake of not preserving the order of the user-specified group by columns. Usually it's not a problem, as `GROUP BY a, b` is no different from `GROUP BY b, a`. However, the `grouping_id(...)` function requires its input to be exactly the same as the group by columns. This PR fixes the problem by preserving the order of the user-specified group by columns, as shown in the sketch below. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? Yes, now a query that worked before 3.2 can work again. ### How was this patch tested? new test Closes #37655 from cloud-fan/grouping. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1ed592ef28abdb14aa1d8c8a129d6ac3084ffb0c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 August 2022, 07:25:06 UTC
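A hedged example of the deprecated-but-still-supported syntax this fix restores; `grouping_id()`'s arguments must line up exactly with the user-specified GROUP BY columns, which is why their order must be preserved:

```scala
spark.sql("""
  SELECT a, b, grouping_id(a, b)
  FROM VALUES (1, 2) AS t(a, b)
  GROUP BY a, b GROUPING SETS ((a, b), (a))
""").show()
```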
94f3b11 [SPARK-40213][SQL] Support ASCII value conversion for Latin-1 characters ### What changes were proposed in this pull request? This PR proposes to support ASCII value conversion for Latin-1 Supplement characters. ### Why are the changes needed? `ascii()` should be the inverse of `chr()`. But for a Latin-1 character, we get an incorrect ASCII value. For example: ```sql select ascii('§') -- output: -62, expect: 167 select chr(167) -- output: '§' ``` ### Does this PR introduce _any_ user-facing change? Yes, this fixes the incorrect ASCII conversion for Latin-1 Supplement characters. ### How was this patch tested? UT Closes #37651 from linhongliu-db/SPARK-40213. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c07852380471f02955d6d17cddb3150231daa71f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 August 2022, 04:19:13 UTC
0a2a568 [SPARK-40172][ML][TESTS] Temporarily disable flaky test cases in ImageFileFormatSuite ### What changes were proposed in this pull request? 3 test cases in ImageFileFormatSuite became flaky in the GitHub Actions tests: https://github.com/apache/spark/runs/7941765326?check_suite_focus=true https://github.com/gengliangwang/spark/runs/7928658069 Before they are fixed (https://issues.apache.org/jira/browse/SPARK-40171), I suggest disabling them in OSS. ### Why are the changes needed? Disable flaky tests before they are fixed. The test cases keep failing from time to time, while they always pass in a local environment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing CI Closes #37605 from gengliangwang/disableFlakyTest. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 50f2f506327b7d51af9fb0ae1316135905d2f87d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6572c66d01e3db00858f0b4743670a1243d3c44f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 August 2022, 17:55:56 UTC
19e9085 [SPARK-39184][SQL][FOLLOWUP] Make interpreted and codegen paths for date/timestamp sequences the same ### What changes were proposed in this pull request? Change how the length of the new result array is calculated in `InternalSequenceBase.eval` to match how it is calculated in the generated code. ### Why are the changes needed? This change brings the interpreted mode code in line with the generated code. Although I am not aware of any case where the current interpreted mode code fails, the generated code is more correct (it handles the case where the result array must grow more than once, whereas the current interpreted mode code does not). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #37542 from bersprockets/date_sequence_array_size_issue_follow_up. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit d718867a16754c62cb8c30a750485f4856481efc) Signed-off-by: Max Gekk <max.gekk@gmail.com> 22 August 2022, 16:17:58 UTC
f14db2c [SPARK-40089][SQL] Fix sorting for some Decimal types ### What changes were proposed in this pull request? This fixes https://issues.apache.org/jira/browse/SPARK-40089 where the prefix can overflow in some cases and the code assumes that the overflow is always on the negative side, not the positive side. ### Why are the changes needed? This adds a check so that, when the overflow does happen, the proper prefix is returned. ### Does this PR introduce _any_ user-facing change? No, unless you consider getting the sort order correct a user-facing change. ### How was this patch tested? I tested manually with the file in the JIRA and I added a small unit test. Closes #37540 from revans2/fix_dec_sort. Authored-by: Robert (Bobby) Evans <bobby@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 8dfd3dfc115d6e249f00a9a434b866d28e2eae45) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 August 2022, 08:35:06 UTC
e8a578a [SPARK-39833][SQL] Disable Parquet column index in DSv1 to fix a correctness issue in the case of overlapping partition and data columns ### What changes were proposed in this pull request? This PR fixes a correctness issue in the Parquet DSv1 FileFormat when the projection does not contain columns referenced in pushed filters. This typically happens when partition columns and data columns overlap. This could produce an empty result when in fact there were records matching the predicate, as can be seen in the provided examples. The problem is especially visible with `count()` and `show()` reporting different results; for example, show() would return 1+ records where count() would return 0. In Parquet, when the predicate is provided and the column index is enabled, we would try to filter row ranges to figure out what the count should be. Unfortunately, there is an issue that if the projection is empty or is not in the set of filter columns, any checks on columns would fail and 0 rows are returned (`RowRanges.EMPTY`) even though there is data matching the filter. Note that this is rather a mitigation, a quick fix. The actual fix needs to go into Parquet-MR: https://issues.apache.org/jira/browse/PARQUET-2170. The fix is not required in DSv2, where the overlapping columns are removed in `FileScanBuilder::readDataSchema()`. ### Why are the changes needed? Fixes a correctness issue when projection columns are not referenced by columns in pushed-down filters, or when the schema is empty, in Parquet DSv1. Downside: the Parquet column index would be disabled unless it has been explicitly enabled, which could affect performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a unit test that reproduces this behaviour. The test fails without the fix and passes with the fix. Closes #37419 from sadikovi/SPARK-39833. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit cde71aaf173aadd14dd6393b09e9851b5caad903) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 August 2022, 10:00:15 UTC
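If the performance downside matters more than the mitigated risk, one could opt back in per read; a hedged sketch, assuming `parquet.filter.columnindex.enabled` is the Parquet-MR key and that read options are passed through to the reader's Hadoop conf, as is usual for the Parquet source:

```scala
// Explicitly re-enable Parquet column-index filtering for this read.
val df = spark.read
  .option("parquet.filter.columnindex.enabled", "true")
  .parquet("/path/to/data")
```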
d5f0be3 [SPARK-40065][K8S] Mount ConfigMap on executors with non-default profile as well ### What changes were proposed in this pull request? This fixes a bug where the ConfigMap is not mounted on executors if they are under a non-default resource profile. ### Why are the changes needed? When `spark.kubernetes.executor.disableConfigMap` is `false`, the expected behavior is that the ConfigMap is mounted regardless of the executor's resource profile. However, it is not mounted if the resource profile is non-default. ### Does this PR introduce _any_ user-facing change? Executors with a non-default resource profile will now have the ConfigMap mounted, which was missing before, if `spark.kubernetes.executor.disableConfigMap` is `false` or default. If certain users need to keep the old behavior for some reason, they would need to explicitly set `spark.kubernetes.executor.disableConfigMap` to `true`. ### How was this patch tested? A new test case is added just below the existing ConfigMap test case. Closes #37504 from nsuke/SPARK-40065. Authored-by: Aki Sukegawa <nsuke@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 41ca6299eff4155aa3ac28656fe96501a7573fb0) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 August 2022, 19:33:57 UTC
3943427 [SPARK-35542][ML] Fix: Bucketizer created for multiple columns with parameters splitsArray, inputCols and outputCols cannot be loaded after saving it Signed-off-by: Weichen Xu <weichen.xu@databricks.com> ### What changes were proposed in this pull request? Fix: a Bucketizer created for multiple columns with the parameters splitsArray, inputCols and outputCols cannot be loaded after saving it. ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #37568 from WeichenXu123/SPARK-35542. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> (cherry picked from commit 876ce6a5df118095de51c3c4789d6db6da95eb23) Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 19 August 2022, 04:27:23 UTC
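A minimal Scala sketch of the round-trip that failed before this fix (paths and split values are arbitrary):

```scala
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("b1", "b2"))
  .setSplitsArray(Array(
    Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity),
    Array(Double.NegativeInfinity, 10.0, Double.PositiveInfinity)))

// Saving worked, but loading threw before the fix when splitsArray/inputCols/outputCols
// were all set together.
bucketizer.write.overwrite().save("/tmp/bucketizer-demo")
val loaded = Bucketizer.load("/tmp/bucketizer-demo")
```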
fcec11a [SPARK-40121][PYTHON][SQL] Initialize projection used for Python UDF This PR proposes to initialize the projection so non-deterministic expressions can be evaluated with Python UDFs. To make Python UDFs work with non-deterministic expressions. Yes. ```python from pyspark.sql.functions import udf, rand spark.range(10).select(udf(lambda x: x, "double")(rand())).show() ``` **Before** ``` java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source) at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:126) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1161) at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176) at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1213) ``` **After** ``` +----------------------------------+ |<lambda>rand(-2507211707257730645)| +----------------------------------+ | 0.7691724424045242| | 0.09602244075319044| | 0.3006471278112862| | 0.4182649571961977| | 0.29349096650900974| | 0.7987097908937618| | 0.5324802583101007| | 0.72460930912789| | 0.1367749768412846| | 0.17277322931919348| +----------------------------------+ ``` Manually tested, and a unit test was added. Closes #37552 from HyukjinKwon/SPARK-40121. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 336c9bc535895530cc3983b24e7507229fa9570d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 18 August 2022, 03:24:40 UTC
f96fc06 [SPARK-39887][SQL][FOLLOW-UP] Do not exclude Union's first child attributes when traversing other children in RemoveRedundantAliases ### What changes were proposed in this pull request? Do not exclude `Union`'s first child attributes when traversing other children in `RemoveRedundantAliases`. ### Why are the changes needed? We don't need to exclude those attributes that `Union` inherits from its first child. See discussion here: https://github.com/apache/spark/pull/37496#discussion_r944509115 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. Closes #37534 from peter-toth/SPARK-39887-keep-attributes-of-unions-first-child-follow-up. Authored-by: Peter Toth <ptoth@cloudera.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e732232dac420826af269d8cf5efacb52933f59a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 August 2022, 06:58:18 UTC
6ec86e5 [SPARK-40117][PYTHON][SQL] Convert condition to java in DataFrameWriterV2.overwrite ### What changes were proposed in this pull request? Fix DataFrameWriterV2.overwrite() failing to convert the condition parameter to Java, which prevents the function from being called. It was caused by the following commit, which deleted the `_to_java_column` call instead of fixing it: https://github.com/apache/spark/commit/a1e459ed9f6777fb8d5a2d09fda666402f9230b9 ### Why are the changes needed? DataFrameWriterV2.overwrite() cannot be called. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually checked whether the arguments are sent to the JVM. Closes #37547 from looi/fix-overwrite. Authored-by: Wenli Looi <wlooi@ucalgary.ca> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 46379863ab0dd2ee8fcf1e31e76476ff18397f60) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 August 2022, 06:29:23 UTC
06857f3 [SPARK-39184][SQL] Handle undersized result array in date and timestamp sequences ### What changes were proposed in this pull request? Add code to defensively check if the pre-allocated result array is big enough to handle the next element in a date or timestamp sequence. ### Why are the changes needed? `InternalSequenceBase.getSequenceLength` is a fast method for estimating the size of the result array. It uses an estimated step size in micros which is not always entirely accurate for the date/time/time-zone combination. As a result, `getSequenceLength` occasionally overestimates the size of the result array and also occasionally underestimates the size of the result array. `getSequenceLength` sometimes overestimates the size of the result array when the step size is in months (because `InternalSequenceBase` assumes 28 days per month). This case is handled: `InternalSequenceBase` will slice the array, if needed. `getSequenceLength` sometimes underestimates the size of the result array when the sequence crosses a DST "spring forward" without a corresponding "fall backward". This case is not handled (thus, this PR). For example: ``` select sequence( timestamp'2022-03-13 00:00:00', timestamp'2022-03-14 00:00:00', interval 1 day) as x; ``` In the America/Los_Angeles time zone, this results in the following error: ``` java.lang.ArrayIndexOutOfBoundsException: 1 at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:77) ``` This happens because `InternalSequenceBase` calculates an estimated step size of 24 hours. If you add 24 hours to 2022-03-13 00:00:00 in the America/Los_Angeles time zone, you get 2022-03-14 01:00:00 (because 2022-03-13 has only 23 hours due to "spring forward"). Since 2022-03-14 01:00:00 is later than the specified stop value, `getSequenceLength` assumes the stop value is not included in the result. Therefore, `getSequenceLength` estimates an array size of 1. However, when actually creating the sequence, `InternalSequenceBase` does not use a step of 24 hours, but of 1 day. When you add 1 day to 2022-03-13 00:00:00, you get 2022-03-14 00:00:00. Now the stop value *is* included, and we overrun the end of the result array. The new unit test includes examples of problematic date sequences. This PR adds code to handle the underestimation case: it checks if we're about to overrun the array, and if so, gets a new array that's larger by 1. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #37513 from bersprockets/date_sequence_array_size_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 3a1136aa05dd5e16de81c7ec804416b3498ca967) Signed-off-by: Max Gekk <max.gekk@gmail.com> 16 August 2022, 08:54:09 UTC
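An illustrative Scala shape of the defensive check added here (an assumed form, not the exact Spark code): pre-size the array from the estimate, then grow it if the next element would overrun.

```scala
def fillDefensively(estimatedLength: Int, elems: Iterator[Long]): Array[Long] = {
  var result = new Array[Long](math.max(estimatedLength, 0))
  var i = 0
  for (e <- elems) {
    // The estimate can be one short across a DST "spring forward"; grow instead of crashing.
    if (i >= result.length) result = java.util.Arrays.copyOf(result, i + 1)
    result(i) = e
    i += 1
  }
  // Slice down if the estimate was too large (the 28-days-per-month case).
  if (i < result.length) java.util.Arrays.copyOf(result, i) else result
}
```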
2b54b48 [SPARK-40079] Add Imputer inputCols validation for empty input case Signed-off-by: Weichen Xu <weichen.xu@databricks.com> ### What changes were proposed in this pull request? Add Imputer inputCols validation for the empty input case. ### Why are the changes needed? If Imputer inputCols is empty, `fit` works fine, but when saving the model, an error is raised: > AnalysisException: Datasource does not support writing empty or nested empty schemas. Please make sure the data schema has at least one or more column(s). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #37518 from WeichenXu123/imputer-param-validation. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com> (cherry picked from commit 87094f89655b7df09cdecb47c653461ae855b0ac) Signed-off-by: Weichen Xu <weichen.xu@databricks.com> 15 August 2022, 10:04:47 UTC
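A hedged Scala sketch of the failure mode (the exact validation point and message are assumptions):

```scala
import org.apache.spark.ml.feature.Imputer

val imputer = new Imputer()
  .setInputCols(Array.empty[String])
  .setOutputCols(Array.empty[String])

// Before the fix: fit() succeeded and the failure surfaced only at model.save(path)
// with "Datasource does not support writing empty or nested empty schemas".
// After the fix: the empty inputCols are rejected up front by parameter validation.
```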
d04c788 [SPARK-39887][SQL][3.2] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique ### What changes were proposed in this pull request? Keep the output attributes of a `Union` node's first child in the `RemoveRedundantAliases` rule to avoid correctness issues. ### Why are the changes needed? To fix the result of the following query: ``` SELECT a, b AS a FROM ( SELECT a, a AS b FROM (SELECT a FROM VALUES (1) AS t(a)) UNION ALL SELECT a, b FROM (SELECT a, b FROM VALUES (1, 2) AS t(a, b)) ) ``` Before this PR the query returns the incorrect result: ``` +---+---+ | a| a| +---+---+ | 1| 1| | 2| 2| +---+---+ ``` After this PR it returns the expected result: ``` +---+---+ | a| a| +---+---+ | 1| 1| | 1| 2| +---+---+ ``` ### Does this PR introduce _any_ user-facing change? Yes, fixes a correctness issue. ### How was this patch tested? Added new UTs. Closes #37491 from peter-toth/SPARK-39887-keep-attributes-of-unions-first-child-3.2. Authored-by: Peter Toth <ptoth@cloudera.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 August 2022, 00:01:41 UTC
45ab804 Revert "[SPARK-40047][TEST] Exclude unused `xalan` transitive dependency from `htmlunit`" ### What changes were proposed in this pull request? This PR reverts SPARK-40047 because mvn test still needs this dependency. ### Why are the changes needed? mvn test still needs the `xalan` dependency, although GitHub Actions passed before this PR. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual test: ``` mvn clean install -DskipTests -pl core -am build/mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.ui.UISeleniumSuite ``` **Before** ``` UISeleniumSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/apache/xml/utils/PrefixResolver at java.lang.Class.getDeclaredFields0(Native Method) at java.lang.Class.privateGetDeclaredFields(Class.java:2583) at java.lang.Class.getField0(Class.java:2975) at java.lang.Class.getField(Class.java:1701) at com.gargoylesoftware.htmlunit.svg.SvgElementFactory.<clinit>(SvgElementFactory.java:64) at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.<clinit>(HtmlUnitNekoHtmlParser.java:77) at com.gargoylesoftware.htmlunit.DefaultPageCreator.<clinit>(DefaultPageCreator.java:93) at com.gargoylesoftware.htmlunit.WebClient.<init>(WebClient.java:191) at com.gargoylesoftware.htmlunit.WebClient.<init>(WebClient.java:273) at com.gargoylesoftware.htmlunit.WebClient.<init>(WebClient.java:263) ... Cause: java.lang.ClassNotFoundException: org.apache.xml.utils.PrefixResolver at java.net.URLClassLoader.findClass(URLClassLoader.java:387) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) at java.lang.Class.getDeclaredFields0(Native Method) at java.lang.Class.privateGetDeclaredFields(Class.java:2583) at java.lang.Class.getField0(Class.java:2975) at java.lang.Class.getField(Class.java:1701) at com.gargoylesoftware.htmlunit.svg.SvgElementFactory.<clinit>(SvgElementFactory.java:64) at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.<clinit>(HtmlUnitNekoHtmlParser.java:77) ... ``` **After** ``` UISeleniumSuite: - all jobs page should be rendered even though we configure the scheduling mode to fair - effects of unpersist() / persist() should be reflected - failed stages should not appear to be active - spark.ui.killEnabled should properly control kill button display - jobs page should not display job group name unless some job was submitted in a job group - job progress bars should handle stage / task failures - job details page should display useful information for stages that haven't started - job progress bars / cells reflect skipped stages / tasks - stages that aren't run appear as 'skipped stages' after a job finishes - jobs with stages that are skipped should show correct link descriptions on all jobs page - attaching and detaching a new tab - kill stage POST/GET response is correct - kill job POST/GET response is correct - stage & job retention - live UI json application list - job stages should have expected dotfile under DAG visualization - stages page should show skipped stages - Staleness of Spark UI should not last minutes or hours - description for empty jobs - Support disable event timeline Run completed in 17 seconds, 986 milliseconds. Total number of tests run: 20 Suites: completed 2, aborted 0 Tests: succeeded 20, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #37508 from LuciferYang/revert-40047. 
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit afd7098c7fb6c95aece39acc32cdad764c984cd2) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 13 August 2022, 18:42:26 UTC
45d42e1 [SPARK-40047][TEST] Exclude unused `xalan` transitive dependency from `htmlunit` ### What changes were proposed in this pull request? This PR excludes `xalan` from `htmlunit` to clean up the warning for CVE-2022-34169: ``` Provides transitive vulnerable dependency xalan:xalan:2.7.2 CVE-2022-34169 7.5 Integer Coercion Error vulnerability with medium severity found Results powered by Checkmarx(c) ``` `xalan:xalan:2.7.2` is the latest version; the code base has not been updated for 5 years, so this can't be solved by upgrading `xalan`. ### Why are the changes needed? The vulnerability is described in [CVE-2022-34169](https://github.com/advisories/GHSA-9339-86wc-4qgf); it is better to exclude it, although it's just a test dependency for Spark. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GitHub Actions - Manual test: run `mvn dependency:tree -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive | grep xalan` to check that `xalan` is no longer matched after this PR Closes #37481 from LuciferYang/exclude-xalan. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7f3baa77acbf7747963a95d0f24e3b8868c7b16a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 August 2022, 22:12:28 UTC
0f3107d [SPARK-40043][PYTHON][SS][DOCS] Document DataStreamWriter.toTable and DataStreamReader.table ### What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/30835 that adds `DataStreamWriter.toTable` and `DataStreamReader.table` to the PySpark documentation. ### Why are the changes needed? To document both features. ### Does this PR introduce _any_ user-facing change? Yes, both APIs will be shown in the PySpark reference documentation. ### How was this patch tested? Manually built the documentation and checked. Closes #37477 from HyukjinKwon/SPARK-40043. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 447003324d2cf9f2bfa799ef3a1e744a5bc9277d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 11 August 2022, 06:01:38 UTC
a1fd1a2 [SPARK-40022][YARN][TESTS] Ignore pyspark suites in `YarnClusterSuite` when python3 is unavailable ### What changes were proposed in this pull request? This PR adds `assume(isPythonAvailable)` to the `testPySpark` method in `YarnClusterSuite` to make the `YarnClusterSuite` tests succeed in an environment without Python 3 configured. ### Why are the changes needed? `YarnClusterSuite` should not be `ABORTED` when `python3` is not configured. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GitHub Actions - Manual test Run ``` mvn clean test -pl resource-managers/yarn -am -Pyarn -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -Dtest=none ``` in an environment without Python 3 configured: **Before** ``` YarnClusterSuite: org.apache.spark.deploy.yarn.YarnClusterSuite *** ABORTED *** java.lang.RuntimeException: Unable to load a Suite class that was discovered in the runpath: org.apache.spark.deploy.yarn.YarnClusterSuite at org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:81) at org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.map(TraversableLike.scala:286) ... Run completed in 833 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 *** 1 SUITE ABORTED *** ``` **After** ``` YarnClusterSuite: - run Spark in yarn-client mode - run Spark in yarn-cluster mode - run Spark in yarn-client mode with unmanaged am - run Spark in yarn-client mode with different configurations, ensuring redaction - run Spark in yarn-cluster mode with different configurations, ensuring redaction - yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630) - SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'local' - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' - SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'local' and gateway-replacement path - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' and gateway-replacement path - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'local' and gateway-replacement path containing an environment variable - SPARK-35672: run Spark in yarn-client mode with additional jar using URI scheme 'file' - SPARK-35672: run Spark in yarn-cluster mode with additional jar using URI scheme 'file' - run Spark in yarn-cluster mode unsuccessfully - run Spark in yarn-cluster mode failure after sc initialized - run Python application in yarn-client mode !!! CANCELED !!! YarnClusterSuite.this.isPythonAvailable was false (YarnClusterSuite.scala:376) - run Python application in yarn-cluster mode !!! CANCELED !!! YarnClusterSuite.this.isPythonAvailable was false (YarnClusterSuite.scala:376) - run Python application in yarn-cluster mode using spark.yarn.appMasterEnv to override local envvar !!! CANCELED !!! 
YarnClusterSuite.this.isPythonAvailable was false (YarnClusterSuite.scala:376) - user class path first in client mode - user class path first in cluster mode - monitor app using launcher library - running Spark in yarn-cluster mode displays driver log links - timeout to get SparkContext in cluster mode triggers failure - executor env overwrite AM env in client mode - executor env overwrite AM env in cluster mode - SPARK-34472: ivySettings file with no scheme or file:// scheme should be localized on driver in cluster mode - SPARK-34472: ivySettings file with no scheme or file:// scheme should retain user provided path in client mode - SPARK-34472: ivySettings file with non-file:// schemes should throw an error Run completed in 7 minutes, 2 seconds. Total number of tests run: 25 Suites: completed 2, aborted 0 Tests: succeeded 25, failed 0, canceled 3, ignored 0, pending 0 All tests passed. ``` Closes #37454 from LuciferYang/yarnclustersuite. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8e472443081342a0e0dc37aa154e30a0a6df39b7) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 10 August 2022, 01:00:03 UTC
16a6788 [SPARK-40002][SQL] Don't push down limit through window using ntile ### What changes were proposed in this pull request? Change `LimitPushDownThroughWindow` so that it no longer supports pushing down a limit through a window using ntile. ### Why are the changes needed? In an unpartitioned window, the ntile function is currently applied to the result of the limit. This behavior produces results that conflict with Spark 3.1.3, Hive 2.3.9 and Prestodb 0.268 #### Example Assume this data: ``` create table t1 stored as parquet as select * from range(101); ``` Also assume this query: ``` select id, ntile(10) over (order by id) as nt from t1 limit 10; ``` With Spark 3.2.2, Spark 3.3.0, and master, the limit is applied before the ntile function: ``` +---+---+ |id |nt | +---+---+ |0 |1 | |1 |2 | |2 |3 | |3 |4 | |4 |5 | |5 |6 | |6 |7 | |7 |8 | |8 |9 | |9 |10 | +---+---+ ``` With Spark 3.1.3, and Hive 2.3.9, and Prestodb 0.268, the limit is applied _after_ ntile. Spark 3.1.3: ``` +---+---+ |id |nt | +---+---+ |0 |1 | |1 |1 | |2 |1 | |3 |1 | |4 |1 | |5 |1 | |6 |1 | |7 |1 | |8 |1 | |9 |1 | +---+---+ ``` Hive 2.3.9: ``` +-----+-----+ | id | nt | +-----+-----+ | 0 | 1 | | 1 | 1 | | 2 | 1 | | 3 | 1 | | 4 | 1 | | 5 | 1 | | 6 | 1 | | 7 | 1 | | 8 | 1 | | 9 | 1 | +-----+-----+ 10 rows selected (1.72 seconds) ``` Prestodb 0.268: ``` id | nt ----+---- 0 | 1 1 | 1 2 | 1 3 | 1 4 | 1 5 | 1 6 | 1 7 | 1 8 | 1 9 | 1 (10 rows) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Two new unit tests. Closes #37443 from bersprockets/pushdown_ntile. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c9156e5a3b9cb290c7cdda8db298c9875e67aa5e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 August 2022, 02:40:25 UTC
509bf3b [SPARK-39965][K8S] Skip PVC cleanup when driver doesn't own PVCs ### What changes were proposed in this pull request? This PR aims to skip the PVC cleanup logic when `spark.kubernetes.driver.ownPersistentVolumeClaim=false`. ### Why are the changes needed? To simplify the Spark termination log by removing an unnecessary log entry containing an exception message when Spark jobs have no PVC permission and, at the same time, `spark.kubernetes.driver.ownPersistentVolumeClaim` is `false`. ### Does this PR introduce _any_ user-facing change? Only in the termination logs of Spark jobs that have no PVC permission. ### How was this patch tested? Manually. Closes #37433 from dongjoon-hyun/SPARK-39965. Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: pralabhkumar <pralabhkumar@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 87b312a9c9273535e22168c3da73834c22e1fbbb) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 August 2022, 16:58:44 UTC
0f609ff [SPARK-38034][SQL] Optimize TransposeWindow rule ### What changes were proposed in this pull request? Optimize the TransposeWindow rule to extend the applicable cases and to improve its time complexity. The TransposeWindow rule tries to eliminate unnecessary shuffles, but the function compatiblePartitions will only take the first n elements of the window2 partition sequence; for some cases this will not take effect, like the case below: ``` val df = spark.range(10).selectExpr("id AS a", "id AS b", "id AS c", "id AS d") df.selectExpr( "sum(`d`) OVER(PARTITION BY `b`,`a`) as e", "sum(`c`) OVER(PARTITION BY `a`) as f" ).explain ``` Current plan: ``` == Physical Plan == *(5) Project [e#10L, f#11L] +- Window [sum(c#4L) windowspecdefinition(a#2L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS f#11L], [a#2L] +- *(4) Sort [a#2L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(a#2L, 200), true, [id=#41] +- *(3) Project [a#2L, c#4L, e#10L] +- Window [sum(d#5L) windowspecdefinition(b#3L, a#2L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS e#10L], [b#3L, a#2L] +- *(2) Sort [b#3L ASC NULLS FIRST, a#2L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(b#3L, a#2L, 200), true, [id=#33] +- *(1) Project [id#0L AS d#5L, id#0L AS b#3L, id#0L AS a#2L, id#0L AS c#4L] +- *(1) Range (0, 10, step=1, splits=10) ``` Expected plan: ``` == Physical Plan == *(4) Project [e#924L, f#925L] +- Window [sum(d#43L) windowspecdefinition(b#41L, a#40L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS e#924L], [b#41L, a#40L] +- *(3) Sort [b#41L ASC NULLS FIRST, a#40L ASC NULLS FIRST], false, 0 +- *(3) Project [d#43L, b#41L, a#40L, f#925L] +- Window [sum(c#42L) windowspecdefinition(a#40L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS f#925L], [a#40L] +- *(2) Sort [a#40L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(a#40L, 200), true, [id=#282] +- *(1) Project [id#38L AS d#43L, id#38L AS b#41L, id#38L AS a#40L, id#38L AS c#42L] +- *(1) Range (0, 10, step=1, splits=10) ``` Also, the permutations method has O(n!) time complexity, which is very expensive when there are many partition columns; we could try to optimize it. ### Why are the changes needed? We could apply the rule to more cases, which could improve execution performance by eliminating unnecessary shuffles; and by reducing the time complexity from O(n!) to O(n^2), the performance of the rule itself could improve. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? UT Closes #35334 from constzhou/SPARK-38034_optimize_transpose_window_rule. Authored-by: xzhou <15210830305@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 0cc331dc7e51e53000063052b0c8ace417eb281b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2022, 10:43:37 UTC
7a254d9 [SPARK-39867][SQL][3.2] Global limit should not inherit OrderPreservingUnaryNode backport branch-3.2 for https://github.com/apache/spark/pull/37284 ### What changes were proposed in this pull request? Make GlobalLimit inherit UnaryNode rather than OrderPreservingUnaryNode. ### Why are the changes needed? A global limit cannot promise that its output ordering is the same as its child's; it actually depends on the chosen physical plan. For all physical plans with global limits: - CollectLimitExec: it does not promise output ordering - GlobalLimitExec: it requires all tuples, so it can assume the child is a shuffle or a single partition; then it can use the child's output ordering - TakeOrderedAndProjectExec: it does the sort inside its implementation This bug gets worse since we pulled out the v1 write required ordering. ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? fix test and add test Closes #37397 from ulysses-you/SPARK-39867-3.2. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2022, 10:23:46 UTC
0e5812c [SPARK-39775][CORE][AVRO] Disable validate default values when parsing Avro schemas ### What changes were proposed in this pull request? This PR disables validation of default values when parsing Avro schemas. ### Why are the changes needed? Spark will throw an exception after upgrading to Spark 3.2. We fixed the Hive serde tables before: SPARK-34512. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #37191 from wangyum/SPARK-39775. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5c1b99f441ec5e178290637a9a9e7902aaa116e1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 August 2022, 03:26:47 UTC
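A minimal sketch of the Avro-side knob involved; `Schema.Parser#setValidateDefaults` is public Avro API, though exactly how Spark wires it is an assumption here:

```scala
import org.apache.avro.Schema

// A default of null on a non-nullable string field fails default validation;
// disabling validation lets such a pre-existing schema parse as it did before 3.2.
val json =
  """{"type":"record","name":"r","fields":
    |  [{"name":"f","type":"string","default":null}]}""".stripMargin
val schema = new Schema.Parser().setValidateDefaults(false).parse(json)
```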
bc9a5d7 [SPARK-39972][PYTHON][SQL][TESTS] Revert the test case of SPARK-39962 in branch-3.2 and branch-3.1 ### What changes were proposed in this pull request? This PR reverts the test in https://github.com/apache/spark/pull/37390 in branch-3.2 and branch-3.1 because the testing util does not exist in those branches. ### Why are the changes needed? See https://github.com/apache/spark/pull/37390#issuecomment-1204658808 ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Logically clean revert. Closes #37401 from HyukjinKwon/SPARK-39972. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 04 August 2022, 01:51:11 UTC
7d9cbf5 [SPARK-39952][SQL] SaveIntoDataSourceCommand should recache result relation ### What changes were proposed in this pull request? recacheByPlan the result relation inside `SaveIntoDataSourceCommand` ### Why are the changes needed? The behavior of `SaveIntoDataSourceCommand` is similar to that of `InsertIntoDataSourceCommand`, which supports appending or overwriting data. To keep cached data consistent, we should always recacheByPlan the result relation as a post-hoc step. ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? add test Closes #37380 from ulysses-you/refresh. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5fe0b245f7891a05bc4e1e641fd0aa9130118ea4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 August 2022, 17:05:06 UTC
45a9e03 [SPARK-39900][SQL] Address partial or negated condition in binary format's predicate pushdown ### What changes were proposed in this pull request? Fix a `BinaryFileFormat` filter pushdown bug. Before this change, when the filter tree was `Not(IsNotNull(...))`, the inner `IsNotNull` could not be matched, so it fell back to a function that is always true (that is, `case _ => (_ => true)`), meaning no filter pushdown is performed for it. But because the wrapping `Not` still negates that result, the composed filter became always false, so no rows could be returned. ### Why are the changes needed? This fixes a correctness bug: a negated, partially matched predicate silently dropped all rows. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? test suite in `BinaryFileFormatSuite` ``` testCreateFilterFunction( Seq(Not(IsNull(LENGTH))), Seq((t1, true), (t2, true), (t3, true))) ``` Closes #37350 from zzzzming95/SPARK-39900. Lead-authored-by: zzzzming95 <505306252@qq.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a0dc7d9117b66426aaa2257c8d448a2f96882ecd) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2022, 12:23:17 UTC
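A hedged sketch of the intended pushdown shape, assuming the builder returns an `Option` so that an untranslatable leaf yields "no pushdown" instead of a constant-true function that a wrapping `Not` can flip into constant-false (the `Long`-valued `length` field here is hypothetical):

```scala
import org.apache.spark.sql.sources.{Filter, IsNotNull, IsNull, Not}

// Sketch: unknown predicates map to None, and Not simply propagates that
// "cannot push down" outcome instead of negating a guessed constant.
def createFilterFunction(filter: Filter): Option[Long => Boolean] = filter match {
  case Not(child)          => createFilterFunction(child).map(f => (len: Long) => !f(len))
  case IsNull("length")    => Some(_ => false) // the length field is never null
  case IsNotNull("length") => Some(_ => true)
  case _                   => None // cannot translate: skip pushdown entirely
}
```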
1ac74a1 [SPARK-39962][PYTHON][SQL] Apply projection when group attributes are empty ### What changes were proposed in this pull request? This PR proposes to apply the projection to respect the reordered columns in its child when group attributes are empty. ### Why are the changes needed? To respect the column order in the child. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug as below: ```python import pandas as pd from pyspark.sql import functions as f @f.pandas_udf("double") def AVG(x: pd.Series) -> float: return x.mean() abc = spark.createDataFrame([(1.0, 5.0, 17.0)], schema=["a", "b", "c"]) abc.agg(AVG("a"), AVG("c")).show() abc.select("c", "a").agg(AVG("a"), AVG("c")).show() ``` **Before** ``` +------+------+ |AVG(a)|AVG(c)| +------+------+ | 17.0| 1.0| +------+------+ ``` **After** ``` +------+------+ |AVG(a)|AVG(c)| +------+------+ | 1.0| 17.0| +------+------+ ``` ### How was this patch tested? Manually tested, and added a unit test. Closes #37390 from HyukjinKwon/SPARK-39962. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5335c784ae76c9cc0aaa7a4b57b3cd6b3891ad9a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 August 2022, 07:11:56 UTC
265bd21 [SPARK-39835][SQL][3.2] Fix EliminateSorts removing a global sort below a local sort backport https://github.com/apache/spark/pull/37250 into branch-3.2 ### What changes were proposed in this pull request? Correct `EliminateSorts` as follows: - If the upper sort is global, then we can remove a global or local sort recursively. - If the upper sort is local, then we can only remove a local sort recursively. ### Why are the changes needed? If a global sort is below a local sort, we should not remove the global sort because the output partitioning can be affected. This issue is going to get worse since we pull out the V1 write sort to the logical side. ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? add test Closes #37275 from ulysses-you/SPARK-39835-3.2. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 August 2022, 03:15:55 UTC
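To make the hazard concrete, here is an illustrative query shape (assuming an active SparkSession `spark`; column names arbitrary): the inner global sort determines the range partitioning of the output, so removing it would change which partition each row lands in even though every partition would still be locally sorted.

```scala
import org.apache.spark.sql.functions.col

// Illustrative only: a global sort below a local sort. EliminateSorts must
// keep the orderBy, because it fixes the output partitioning rather than
// just the per-partition order.
val df = spark.range(0, 100)
  .orderBy(col("id").desc)         // global sort: range-partitions the data
  .sortWithinPartitions(col("id")) // local sort on top; must not drop the orderBy below
```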
d4f6541 [SPARK-39932][SQL] WindowExec should clear the final partition buffer ### What changes were proposed in this pull request? Explicitly clear the final partition buffer when no next partition can be found in `WindowExec`. The same fix is applied in `WindowInPandasExec`. ### Why are the changes needed? We do a repartition after a window, and then we need to do a local sort after the window due to the RoundRobinPartitioning shuffle. The error stack: ```java ExternalAppendOnlyUnsafeRowArray INFO - Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes of memory, got 0 at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:97) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:352) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.allocateMemoryForRecordIfNecessary(UnsafeExternalSorter.java:435) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:455) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:138) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:226) at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:355) ``` `WindowExec` only clears the buffer in `fetchNextPartition`, so the final partition buffer is missed and never cleared there. It is not a big problem since we have a task completion listener. ```scala taskContext.addTaskCompletionListener(context -> { cleanupResources(); }); ``` This bug only has an effect if the window is not the last operator of the task and a following operator, such as a sort, needs memory. ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? N/A Closes #37358 from ulysses-you/window. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1fac870126c289a7ec75f45b6b61c93b9a4965d4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 August 2022, 09:06:23 UTC
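A sketch of the fix as described, with the iterator internals assumed rather than copied from `WindowExec`:

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch: the window iterator buffers one partition of rows at a time. Once
// neither a buffered row nor a further partition exists, eagerly clear the
// buffer so a downstream operator (e.g. the local sort) can acquire memory,
// instead of holding it until the task-completion listener fires.
abstract class WindowIteratorSketch[T] extends Iterator[T] {
  protected def buffer: ArrayBuffer[T] // stand-in for the partition row buffer
  protected def bufferHasNext: Boolean
  protected def nextPartitionAvailable: Boolean

  final override def hasNext: Boolean = {
    val found = bufferHasNext || nextPartitionAvailable
    if (!found) buffer.clear() // SPARK-39932: release the final partition buffer
    found
  }
}
```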
57f1bb7 [SPARK-39839][SQL] Handle special case of null variable-length Decimal with non-zero offsetAndSize in UnsafeRow structural integrity check ### What changes were proposed in this pull request? Update the `UnsafeRow` structural integrity check in `UnsafeRowUtils.validateStructuralIntegrity` to handle a special case with null variable-length DecimalType values. ### Why are the changes needed? The check should follow the format that `UnsafeRowWriter` produces. In general, `UnsafeRowWriter` clears out a field with zero when the field is set to be null, c.f. `UnsafeRowWriter.setNullAt(ordinal)` and `UnsafeRow.setNullAt(ordinal)`. But there's a special case for `DecimalType` values: it is the only type that both: - can be fixed-length or variable-length, depending on the precision, and - is mutable in `UnsafeRow`. To support a variable-length `DecimalType` being mutable in `UnsafeRow`, the `UnsafeRowWriter` always leaves a 16-byte space in the variable-length section of the `UnsafeRow` (the tail end of the row), regardless of whether the `Decimal` value being written is null or not. In the fixed-length part of the field, it writes an "OffsetAndSize": the `offset` part always points to the start offset of the variable-length part of the field, while the `size` part is either `0` for the null value, or `1` to at most `16` for non-null values. When `setNullAt(ordinal)` is called instead of passing a null value to `write(int, Decimal, int, int)`, however, the `offset` part gets zeroed out and the field stops being mutable. There's a comment on `UnsafeRow.setDecimal` that mentions that, to keep this field able to support updates, `setNullAt(ordinal)` must not be called; but there's no code enforcement of that. So we need to recognize this in the structural integrity check and allow a variable-length `DecimalType` to have a non-zero field even when null. Note that for non-null values, the existing check does conform to the format from `UnsafeRowWriter`. It's only a null value of a variable-length `DecimalType` that would trigger the bug, which can affect Structured Streaming's checkpoint file reads where this check is applied. ### Does this PR introduce _any_ user-facing change? Yes. Previously the `UnsafeRow` structural integrity validation would return a false positive for correct data when there was a null value in a variable-length `DecimalType` field; the fix no longer returns that false positive. Because the Structured Streaming checkpoint file validation uses this check, previously a good checkpoint file might have been rejected by the check, and the only workaround was to disable the check; with the fix, the correct checkpoint file is allowed to load. ### How was this patch tested? Added a new test case in `UnsafeRowUtilsSuite` Closes #37252 from rednaxelafx/fix-unsaferow-validation. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit c608ae2fc6a3a50f2e67f2a3dad8d4e4be1aaf9f) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 28 July 2022, 00:05:33 UTC
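As a small illustration of the special case (bit layout per the description above; this is a sketch, not the exact Catalyst check): for a null variable-length decimal, only the size half of the fixed-length word must be zero, while the offset half may legitimately still point at the reserved 16-byte slot.

```scala
// Sketch: the fixed-length word of a variable-length field packs
// (offset << 32) | size. UnsafeRowWriter writes a null decimal as a valid
// offset with size 0; only setNullAt() zeroes the entire word.
def plausibleNullVarLenDecimal(offsetAndSize: Long): Boolean = {
  val size = (offsetAndSize & 0xFFFFFFFFL).toInt
  size == 0 // the upper 32 bits (the offset) are allowed to be non-zero
}
```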
421918d [SPARK-39879][SQL][TESTS] Reduce local-cluster maximum memory size in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` ### What changes were proposed in this pull request? This PR changes `local-cluster[2, 1, 1024]` in `BroadcastJoinSuite*` and `HiveSparkSubmitSuite` to `local-cluster[2, 1, 512]` to reduce the maximum memory used by these tests. ### Why are the changes needed? Reduce the maximum memory usage of test cases. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Should monitor CI Closes #37298 from LuciferYang/reduce-local-cluster-memory. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 01d41e7de418d0a40db7b16ddd0d8546f0794d17) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 27 July 2022, 00:27:09 UTC
0d11085 Revert "[SPARK-39856][SQL][TESTS] Increase the number of partitions in TPC-DS build to avoid out-of-memory" This reverts commit e7aa9671276c435cca851fc53931db302b64bbac. 25 July 2022, 06:55:59 UTC
e7aa967 [SPARK-39856][SQL][TESTS] Increase the number of partitions in TPC-DS build to avoid out-of-memory This PR proposes to avoid out-of-memory in the TPC-DS build at GitHub Actions CI by: - Increasing the number of partitions used in the shuffle. - Truncating floats after the 10th decimal place. The number of partitions was previously set to 1 because of precision differences in the results that we can generally just ignore. - Sorting the results regardless of join type, since Apache Spark does not guarantee the order of results. One of the reasons for the large memory usage seems to be the single partition used in the shuffle. No, test-only. GitHub Actions in this CI will test it out. Closes #37270 from HyukjinKwon/deflake-tpcds. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7358253755762f9bfe6cedc1a50ec14616cfeace) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 July 2022, 03:48:11 UTC
f80559f [SPARK-39847][SS] Fix race condition in RocksDBLoader.loadLibrary() if caller thread is interrupted ### What changes were proposed in this pull request? This PR fixes a race condition in `RocksDBLoader.loadLibrary()`, which can occur if the thread which calls that method is interrupted. One of our jobs experienced a failure in `RocksDBLoader`: ``` Caused by: java.lang.IllegalThreadStateException at java.lang.Thread.start(Thread.java:708) at org.apache.spark.sql.execution.streaming.state.RocksDBLoader$.loadLibrary(RocksDBLoader.scala:51) ``` After investigation, we determined that this was due to task cancellation/interruption: if the task which starts the RocksDB library loading is interrupted, another thread may begin a load and crash with the thread state exception: - Although the `loadLibraryThread` child thread is uninterruptible, the task thread which calls loadLibrary is still interruptible. - Let's say we have two tasks, A and B, both of which will call `RocksDBLoader.loadLibrary()` - Say that Task A wins the race to perform the load and enters the `synchronized` block in `loadLibrary()`, starts the `loadLibraryThread`, then blocks in the `loadLibraryThread.join()` call. - If Task A is interrupted, an `InterruptedException` will be thrown and it will exit the loadLibrary synchronized block. - At this point, Task B enters the synchronized block of its `loadLibrary()` call and sees that `exception == null` because the `loadLibraryThread` started by the other task is still running, so Task B calls `loadLibraryThread.start()` and hits the thread state error because it tries to start an already-started thread. This PR fixes this issue by adding code to check `loadLibraryThread`'s state before calling `start()`: if the thread has already been started then we will skip the `start()` call and proceed directly to the `join()`. I also modified the logging so that we can detect when this case occurs. ### Why are the changes needed? Fix a bug that can lead to task or job failures. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I reproduced the original race condition by adding a `Thread.sleep(10000)` to `loadLibraryThread.run()` (so it wouldn't complete instantly), then ran ```scala test("multi-threaded RocksDBLoader calls with interruption") { val taskThread = new Thread("interruptible Task Thread 1") { override def run(): Unit = { RocksDBLoader.loadLibrary() } } taskThread.start() // Give the thread time to enter the `loadLibrary()` call: Thread.sleep(1000) taskThread.interrupt() // Check that the load hasn't finished: assert(RocksDBLoader.exception == null) assert(RocksDBLoader.loadLibraryThread.getState != Thread.State.NEW) // Simulate the second task thread starting the load: RocksDBLoader.loadLibrary() // The load should finish successfully: assert(RocksDBLoader.exception.isEmpty) } ``` This test failed prior to my changes and succeeds afterwards. I don't want to actually commit this test because I'm concerned about flakiness and false negatives: in order to ensure that the test would have failed before my change, we need to carefully control the thread interleaving. This code rarely changes and is relatively simple, so I think the ROI on spending time to write and commit a reliable test is low. Closes #37260 from JoshRosen/rocksdbloader-fix. 
Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 9cee1bb2527a496943ffedbd935dc737246a2d89) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 July 2022, 22:46:37 UTC
03eba09 [SPARK-39831][BUILD] Fix R dependencies installation failure ### What changes were proposed in this pull request? Move `libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev` from `Install dependencies for documentation generation` to `Install R linter dependencies and SparkR`. Update after https://github.com/apache/spark/pull/37243: **add `apt update` before installation.** ### Why are the changes needed? To make CI happy. `Install R linter dependencies and SparkR` started to fail after `devtools_2.4.4` was released: ``` --------------------------- [ANTICONF] -------------------------------- Configuration failed to find the fontconfig freetype2 library. Try installing: * deb: libfontconfig1-dev (Debian, Ubuntu, etc) * rpm: fontconfig-devel (Fedora, EPEL) * csw: fontconfig_dev (Solaris) * brew: freetype (OSX) ``` It seems that `libfontconfig1-dev` is needed now. Also refer to https://github.com/r-lib/systemfonts/issues/35#issuecomment-633560151 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Closes #37247 from Yikun/patch-25. Lead-authored-by: Ruifeng Zheng <ruifengz@apache.org> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ffa82c219029a7f6f3caf613dd1d0ab56d0c599e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 July 2022, 03:12:21 UTC
f344bf9 Revert "[SPARK-39831][BUILD] Fix R dependencies installation failure" This reverts commit 29290306749f75eb96f51fc5b61114e9b8a3bf53. 22 July 2022, 00:05:53 UTC
2929030 [SPARK-39831][BUILD] Fix R dependencies installation failure ### What changes were proposed in this pull request? move `libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev` from `Install dependencies for documentation generation` to `Install R linter dependencies and SparkR` ### Why are the changes needed? to make CI happy `Install R linter dependencies and SparkR` started to fail after `devtools_2.4.4` was released. ``` --------------------------- [ANTICONF] -------------------------------- Configuration failed to find the fontconfig freetype2 library. Try installing: * deb: libfontconfig1-dev (Debian, Ubuntu, etc) * rpm: fontconfig-devel (Fedora, EPEL) * csw: fontconfig_dev (Solaris) * brew: freetype (OSX) ``` it seems that `libfontconfig1-dev` is needed now. also refer to https://github.com/r-lib/systemfonts/issues/35#issuecomment-633560151 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #37243 from zhengruifeng/ci_add_dep. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 67efa318ec8cababdb5683ac262a8ebc3b3beefb) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 July 2022, 13:29:40 UTC
7774019 [MINOR][PYTHON][DOCS] Fix broken Example section in col/column functions This PR fixes a bug in the documentation: a trailing `'` breaks the Example section in the Python reference documentation, so this PR removes it. This is needed to render the documentation as intended in NumPy documentation style. Yes, the documentation is updated. **Before** (screenshot of the broken Example section omitted) **After** (screenshot of the fixed rendering omitted) Manually built the documentation and tested. Closes #37223 from HyukjinKwon/minor-doc-fx. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 2bdb5bfa48d1fc44358c49f7e379c2afc4a1a32f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 19 July 2022, 08:54:19 UTC
b36b214 [SPARK-39647][CORE][3.2] Register the executor with ESS before registering the BlockManager ### What changes were proposed in this pull request? Backport of https://github.com/apache/spark/pull/37052 to branch-3.2. Currently the executors register with the ESS after the `BlockManager` registration with the `BlockManagerMaster`. This order creates a problem with push-based shuffle. A registered BlockManager node is picked up by the driver as a merger, but the shuffle service on that node is not yet ready to merge the data, which causes block pushes to fail until the local executor registers with it. The fix is to reverse the order, that is, register with the ESS before registering the `BlockManager`. ### Why are the changes needed? They are needed to fix the issue which causes block pushes to fail. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a UT. Closes #37218 from otterc/branch-3.2. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 July 2022, 21:47:50 UTC
ea5af85 [SPARK-39758][SQL][3.2] Fix NPE from the regexp functions on invalid patterns ### What changes were proposed in this pull request? In the PR, I propose to catch `PatternSyntaxException` while compiling the regexp pattern in `regexp_extract`, `regexp_extract_all` and `regexp_instr`, and substitute the exception with Spark's exception carrying the error class `INVALID_PARAMETER_VALUE`. In this way, Spark SQL will output the error in the form: ```sql org.apache.spark.SparkRuntimeException The value of parameter(s) 'regexp' in `regexp_instr` is invalid: ') ?' ``` instead of (on Spark 3.3.0): ```java java.lang.NullPointerException: null ``` Also I propose to set `lastRegex` only after the compilation of the regexp pattern completes successfully. This is a backport of https://github.com/apache/spark/pull/37171. ### Why are the changes needed? The changes fix the NPE demonstrated by the code below on Spark 3.3.0: ```sql spark-sql> SELECT regexp_extract('1a 2b 14m', '(?l)'); 22/07/12 19:07:21 ERROR SparkSQLDriver: Failed in [SELECT regexp_extract('1a 2b 14m', '(?l)')] java.lang.NullPointerException: null at org.apache.spark.sql.catalyst.expressions.RegExpExtractBase.getLastMatcher(regexpExpressions.scala:768) ~[spark-catalyst_2.12-3.3.0.jar:3.3.0] ``` This should improve user experience with Spark SQL. ### Does this PR introduce _any_ user-facing change? No. In regular cases, the behavior is the same, but users will observe different exceptions (error messages) after the changes. ### How was this patch tested? By running new tests: ``` $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z regexp-functions.sql" $ build/sbt "test:testOnly *.RegexpExpressionsSuite" $ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite" ``` Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 5b96bd5cf8f44eee7a16cd027d37dec552ed5a6a) Signed-off-by: Max Gekk <max.gekk@gmail.com> Closes #37182 from MaxGekk/pattern-syntax-exception-3.2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 14 July 2022, 14:49:39 UTC
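A hedged sketch of the described fix (the error type below is a stand-in; the real code raises a Spark exception with the `INVALID_PARAMETER_VALUE` error class):

```scala
import java.util.regex.{Pattern, PatternSyntaxException}

// Sketch: compile eagerly, translate pattern errors into a descriptive
// message, and let callers cache the compiled pattern only on success, so a
// broken pattern never leaves a half-initialized lastRegex behind.
def compilePattern(funcName: String, regexp: String): Pattern =
  try {
    Pattern.compile(regexp)
  } catch {
    case e: PatternSyntaxException =>
      throw new IllegalArgumentException(
        s"The value of parameter(s) 'regexp' in `$funcName` is invalid: '$regexp'", e)
  }
```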
816682e [SPARK-39672][SQL][3.1] Fix removing project before filter with correlated subquery Add more checks to `removeProjectBeforeFilter` in `ColumnPruning` and only remove the project if: 1. the filter condition contains a correlated subquery, and 2. the same attribute exists in both the output of the Project's child and the subquery. The following is a legitimate self-join query and should not throw an exception when de-duplicating attributes in the subquery and outer values. ```sql select * from ( select v1.a, v1.b, v2.c from v1 inner join v2 on v1.a=v2.a) t3 where not exists ( select 1 from v2 where t3.a=v2.a and t3.b=v2.b and t3.c=v2.c ) ``` Here's what happens with the current code. The above query is analyzed into the following `LogicalPlan` before `ColumnPruning`. ``` Project [a#250, b#251, c#268] +- Filter NOT exists#272 [(a#250 = a#266) && (b#251 = b#267) && (c#268 = c#268#277)] : +- Project [1 AS 1#273, _1#259 AS a#266, _2#260 AS b#267, _3#261 AS c#268#277] : +- LocalRelation [_1#259, _2#260, _3#261] +- Project [a#250, b#251, c#268] +- Join Inner, (a#250 = a#266) :- Project [a#250, b#251] : +- Project [_1#243 AS a#250, _2#244 AS b#251] : +- LocalRelation [_1#243, _2#244, _3#245] +- Project [a#266, c#268] +- Project [_1#259 AS a#266, _3#261 AS c#268] +- LocalRelation [_1#259, _2#260, _3#261] ``` Then in `ColumnPruning`, the Project before the Filter (between Filter and Join) is removed. This changes the `outputSet` of the child of the Filter, in which the same attribute also exists in the subquery. Later, when `RewritePredicateSubquery` de-duplicates conflicting attributes, it complains: `Found conflicting attributes a#266 in the condition joining outer plan`. No. Add UT. Closes #37074 from manuzhang/spark-39672. Lead-authored-by: tianlzhang <tianlzhang@ebay.com> Co-authored-by: Manu Zhang <OwenZhang1990@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 36fc73e7c42b84e05b15b2caecc0f804610dce20) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 July 2022, 05:02:25 UTC
1cae70d Preparing development version 3.2.3-SNAPSHOT 11 July 2022, 15:13:06 UTC
78a5825 Preparing Spark release v3.2.2-rc1 11 July 2022, 15:13:02 UTC
ba978b3 [SPARK-39099][BUILD] Add dependencies to Dockerfile for building Spark releases Add missing dependencies to `dev/create-release/spark-rm/Dockerfile` so that Spark releases can be built. No user-facing change. Tested by building the Spark 3.3 release via: ``` $ dev/create-release/do-release-docker.sh -d /home/ubuntu/max/spark-3.3-rc1 ``` Closes #36449 from MaxGekk/deps-Dockerfile. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 4b1c2fb7a27757ebf470416c8ec02bb5c1f7fa49) Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 6a61f95a359e6aa9d09f8044019074dc7effcf30) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 July 2022, 00:36:26 UTC
001d8b0 [SPARK-37554][BUILD] Add PyArrow, pandas and plotly to release Docker image dependencies ### What changes were proposed in this pull request? This PR proposes to add plotly, pyarrow and pandas dependencies for generating the API documentation for pandas API on Spark. The versions of `pandas==1.1.5 pyarrow==3.0.0 plotly==5.4.0` match the current versions being used in branch-3.2 at Python 3.6. ### Why are the changes needed? Currently, the function references for pandas API on Spark are all missing: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/series.html due to missing dependencies when building the docs. ### Does this PR introduce _any_ user-facing change? Yes, the broken links of documentation at https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/series.html will all be recovered. ### How was this patch tested? To be honest, it has not been tested. I don't have the nerve to run the Docker releasing script for the sake of testing, so I defer to the next release manager. The combinations of the dependency versions are being tested in GitHub Actions at `branch-3.2`. Closes #34813 from HyukjinKwon/SPARK-37554. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 03750c046b55f60b43646c8108e5f2e540782755) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 July 2022, 00:27:52 UTC
9dd4c07 [SPARK-37730][PYTHON][FOLLOWUP] Split comments to comply pycodestyle check ### What changes were proposed in this pull request? This is a follow-up of a backporting commit, https://github.com/apache/spark/commit/bc54a3f0c2e08893702c3929bfe7a9d543a08cdb . ### Why are the changes needed? The original commit doesn't pass the linter check because there was no lint check between SPARK-37380 and SPARK-37834. The content of this PR is a part of SPARK-37834. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Python Linter. Closes #37146 from dongjoon-hyun/SPARK-37730. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 July 2022, 00:24:11 UTC
bc54a3f [SPARK-37730][PYTHON] Replace use of MPLPlot._add_legend_handle with MPLPlot._append_legend_handles_labels ### What changes were proposed in this pull request? Replace use of MPLPlot._add_legend_handle (removed in pandas) with MPLPlot._append_legend_handles_labels in histogram and KDE plots. Based on: https://github.com/pandas-dev/pandas/commit/029907c9d69a0260401b78a016a6c4515d8f1c40 ### Why are the changes needed? Fix of SPARK-37730. plot.hist and plot.kde don't throw AttributeError for pandas=1.3.5. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ~~Tested with existing plot test on CI (for older pandas only).~~ (it seems that CI doesn't run matplotlib tests, see https://github.com/apache/spark/pull/35000#issuecomment-1001267197) I've run tests on a local computer, see https://github.com/apache/spark/pull/35000#issuecomment-1001494019 : ``` $ python python/pyspark/pandas/tests/plot/test_series_plot_matplotlib.py ``` :question: **QUESTION:** Maybe add plot testing for pandas 1.3.5 on CI? (I've noticed that CI uses `pandas=1.3.4`, maybe update it to `1.3.5`?) Closes #35000 from mslapek/fixpythonplot. Authored-by: Michał Słapek <28485371+mslapek@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 371e307686debc4f7b44a37d2345a1a512f3fdcc) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 July 2022, 21:39:38 UTC
c5983c1 [SPARK-38018][SQL][3.2] Fix ColumnVectorUtils.populate to handle CalendarIntervalType correctly ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/35314 to branch 3.2. See that original PR for context. ### Why are the changes needed? To fix a potential correctness issue. ### Does this PR introduce _any_ user-facing change? No, but it fixes the existing correctness issue when reading a partition column with CalendarInterval type. ### How was this patch tested? Added a unit test in `ColumnVectorSuite.scala`. Closes #37114 from c21/branch-3.2. Authored-by: Cheng Su <scnju13@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 July 2022, 14:51:40 UTC
32aff86 [SPARK-39447][SQL][3.2] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast This is a backport of https://github.com/apache/spark/pull/36974 for branch-3.2 ### What changes were proposed in this pull request? Change `currentPhysicalPlan` to `inputPlan` when we restore the broadcast exchange for DPP. ### Why are the changes needed? The currentPhysicalPlan can be wrapped with a broadcast query stage, so it is not safe to match it. For example: the broadcast exchange which is added by DPP runs earlier than the normal broadcast exchange (e.g. one introduced by a join). ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? add test Closes #37087 from ulysses-you/inputplan-3.2. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 07 July 2022, 14:49:03 UTC
be891ad [SPARK-39551][SQL][3.2] Add AQE invalid plan check ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/36953 This PR adds a check for invalid plans in AQE replanning process. The check will throw exceptions when it detects an invalid plan, causing AQE to void the current replanning result and keep using the latest valid plan. ### Why are the changes needed? AQE logical optimization rules can lead to invalid physical plans and cause runtime exceptions as certain physical plan nodes are not compatible with others. E.g., `BroadcastExchangeExec` can only work as a direct child of broadcast join nodes, but it could appear under other incompatible physical plan nodes because of empty relation propagation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UT. Closes #37108 from dongjoon-hyun/SPARK-39551. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 July 2022, 04:35:04 UTC
1c0bd4c [SPARK-39656][SQL][3.2] Fix wrong namespace in DescribeNamespaceExec backport of https://github.com/apache/spark/pull/37049 for branch-3.2 ### What changes were proposed in this pull request? DescribeNamespaceExec changes `ns.last` to `ns.quoted`. ### Why are the changes needed? DescribeNamespaceExec should show the whole namespace rather than only its last name part. ### Does this PR introduce _any_ user-facing change? yes, a small bug fix ### How was this patch tested? fix test Closes #37072 from ulysses-you/desc-namespace-3.2. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> 05 July 2022, 19:40:26 UTC
3d084fe [SPARK-39677][SQL][DOCS][3.2] Fix args formatting of the regexp and like functions ### What changes were proposed in this pull request? In the PR, I propose to fix the args formatting of some regexp functions by adding explicit new lines. That fixes the following items in arg lists. Before: (screenshot of the broken arg list omitted) After: (screenshot of the fixed arg list omitted) ### Why are the changes needed? To improve readability of Spark SQL docs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By building docs and checking manually: ``` $ SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 bundle exec jekyll build ``` Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 4e42f8b12e8dc57a15998f22d508a19cf3c856aa) Signed-off-by: Max Gekk <max.gekk@gmail.com> Closes #37093 from MaxGekk/fix-regexp-docs-3.2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 05 July 2022, 15:28:35 UTC
6ae97e2 [SPARK-39611][PYTHON][PS] Fix wrong aliases in __array_ufunc__ ### What changes were proposed in this pull request? This PR fixes the wrong aliases in `__array_ufunc__`. ### Why are the changes needed? When running tests with numpy 1.23.0 (the current latest), we hit a bug: `NotImplementedError: pandas-on-Spark objects currently do not support <ufunc 'divide'>.` In `__array_ufunc__` we first call `maybe_dispatch_ufunc_to_dunder_op` to try the dunder methods, and then we try the pyspark API. `maybe_dispatch_ufunc_to_dunder_op` comes from pandas code. pandas fixed a bug in https://github.com/pandas-dev/pandas/pull/44822#issuecomment-991166419 https://github.com/pandas-dev/pandas/pull/44822/commits/206b2496bc6f6aa025cb26cb42f52abeec227741 ; when upgrading to numpy 1.23.0, we need to sync this as well. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Current CI passed - The existing UT `test_series_datetime` already covers this; I also tested it in my local env with 1.23.0 ```shell pip install "numpy==1.23.0" python/run-tests --testnames 'pyspark.pandas.tests.test_series_datetime SeriesDateTimeTest.test_arithmetic_op_exceptions' ``` Closes #37078 from Yikun/SPARK-39611. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit fb48a14a67940b9270390b8ce74c19ae58e2880e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 July 2022, 11:52:56 UTC
9adfc3a [SPARK-39650][SS] Fix incorrect value schema in streaming deduplication with backward compatibility ### What changes were proposed in this pull request? This PR proposes to fix the incorrect value schema in streaming deduplication. It stores an empty row having a single column with null (using NullType), but the value schema is specified as all columns, which leads to incorrect behavior from the state store schema compatibility checker. This PR proposes to set the schema of the value as `StructType(Array(StructField("__dummy__", NullType)))` to fit the empty row. With this change, streaming queries creating a checkpoint after this fix will work smoothly. To avoid breaking existing streaming queries having the incorrect value schema, this PR proposes to disable the check for the value schema on streaming deduplication. The ability to disable the value check already existed for format validation (we have two different checkers for the state store), but it was missing from the state store schema compatibility check. To avoid adding more config, this PR leverages the existing config that format validation uses. ### Why are the changes needed? This is a bug fix. Suppose the streaming query below: ``` # df has the columns `a`, `b`, `c` val df = spark.readStream.format("...").load() val query = df.dropDuplicates("a").writeStream.format("...").start() ``` While the query is running, df can produce a different set of columns (e.g. `a`, `b`, `c`, `d`) from the same source due to schema evolution. Since we only deduplicate the rows with column `a`, the change of schema should not matter for streaming deduplication, but the state store schema checker throws an error saying "value schema is not compatible" before this fix. ### Does this PR introduce _any_ user-facing change? No, this is basically a bug fix which end users wouldn't notice unless they encountered the bug. ### How was this patch tested? New tests. Closes #37041 from HeartSaVioR/SPARK-39650. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit fe536033bdd00d921b3c86af329246ca55a4f46a) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 02 July 2022, 14:07:54 UTC
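Concretely, the value schema the commit describes is small enough to quote as code (this mirrors the description above; the surrounding state store wiring is omitted):

```scala
import org.apache.spark.sql.types.{NullType, StructField, StructType}

// The fixed value schema for streaming deduplication: a single null column,
// matching the empty row that is actually stored per deduplication key.
val dedupValueSchema = StructType(Array(StructField("__dummy__", NullType)))
```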
1387af7 [SPARK-39553][CORE] Multi-thread unregister shuffle shouldn't throw NPE when using Scala 2.13 This PR adds a `shuffleStatus != null` condition to the `o.a.s.MapOutputTrackerMaster#unregisterShuffle` method to avoid throwing an NPE when using Scala 2.13. Ensure that no NPE is thrown when `o.a.s.MapOutputTrackerMaster#unregisterShuffle` is called by multiple threads; this PR is only for Scala 2.13. The `o.a.s.MapOutputTrackerMaster#unregisterShuffle` method will be called concurrently by the following two paths: - BlockManagerStorageEndpoint: https://github.com/apache/spark/blob/6f1046afa40096f477b29beecca5ca6286dfa7f3/core/src/main/scala/org/apache/spark/storage/BlockManagerStorageEndpoint.scala#L56-L62 - ContextCleaner: https://github.com/apache/spark/blob/6f1046afa40096f477b29beecca5ca6286dfa7f3/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L234-L241 When testing with Scala 2.13, for example the `sql/core` module, there are many logs like the following, although they did not cause UT failures: ``` 17:44:09.957 WARN org.apache.spark.storage.BlockManagerMaster: Failed to remove shuffle 87 - null java.lang.NullPointerException at org.apache.spark.MapOutputTrackerMaster.$anonfun$unregisterShuffle$1(MapOutputTracker.scala:882) at org.apache.spark.MapOutputTrackerMaster.$anonfun$unregisterShuffle$1$adapted(MapOutputTracker.scala:881) at scala.Option.foreach(Option.scala:437) at org.apache.spark.MapOutputTrackerMaster.unregisterShuffle(MapOutputTracker.scala:881) at org.apache.spark.storage.BlockManagerStorageEndpoint$$anonfun$receiveAndReply$1.$anonfun$applyOrElse$3(BlockManagerStorageEndpoint.scala:59) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.scala:17) at org.apache.spark.storage.BlockManagerStorageEndpoint.$anonfun$doAsync$1(BlockManagerStorageEndpoint.scala:89) at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:678) at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:467) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 17:44:09.958 ERROR org.apache.spark.ContextCleaner: Error cleaning shuffle 94 java.lang.NullPointerException at org.apache.spark.MapOutputTrackerMaster.$anonfun$unregisterShuffle$1(MapOutputTracker.scala:882) at org.apache.spark.MapOutputTrackerMaster.$anonfun$unregisterShuffle$1$adapted(MapOutputTracker.scala:881) at scala.Option.foreach(Option.scala:437) at org.apache.spark.MapOutputTrackerMaster.unregisterShuffle(MapOutputTracker.scala:881) at org.apache.spark.ContextCleaner.doCleanupShuffle(ContextCleaner.scala:241) at org.apache.spark.ContextCleaner.$anonfun$keepCleaning$3(ContextCleaner.scala:202) at org.apache.spark.ContextCleaner.$anonfun$keepCleaning$3$adapted(ContextCleaner.scala:195) at scala.Option.foreach(Option.scala:437) at org.apache.spark.ContextCleaner.$anonfun$keepCleaning$1(ContextCleaner.scala:195) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1432) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:189) at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:79) ``` I think this is a bug of Scala 2.13.8 and have already submitted an issue to https://github.com/scala/bug/issues/12613; this PR is only for protection, and we should remove this protection after Scala 2.13 (maybe https://github.com/scala/scala/pull/9957) fixes this issue. 
### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GA - Add a new test `SPARK-39553: Multi-thread unregister shuffle shouldn't throw NPE` to `MapOutputTrackerSuite`; we can test manually as follows: ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -pl core -am -Pscala-2.13 mvn clean test -pl core -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.MapOutputTrackerSuite ``` **Before** ``` - SPARK-39553: Multi-thread unregister shuffle shouldn't throw NPE *** FAILED *** 3 did not equal 0 (MapOutputTrackerSuite.scala:971) Run completed in 17 seconds, 505 milliseconds. Total number of tests run: 25 Suites: completed 2, aborted 0 Tests: succeeded 24, failed 1, canceled 0, ignored 1, pending 0 *** 1 TEST FAILED *** ``` **After** ``` - SPARK-39553: Multi-thread unregister shuffle shouldn't throw NPE Run completed in 17 seconds, 996 milliseconds. Total number of tests run: 25 Suites: completed 2, aborted 0 Tests: succeeded 25, failed 0, canceled 0, ignored 1, pending 0 All tests passed. ``` Closes #37024 from LuciferYang/SPARK-39553. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 29258964cae45cea43617ade971fb4ea9fe2902a) Signed-off-by: Sean Owen <srowen@gmail.com> 29 June 2022, 23:11:38 UTC
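A sketch of the guard this commit adds (the map and status types below are stand-ins shaped after the description, not the real MapOutputTrackerMaster internals):

```scala
import scala.collection.concurrent.TrieMap

final class ShuffleStatusSketch { def invalidateCache(): Unit = () }
val shuffleStatuses = new TrieMap[Int, ShuffleStatusSketch]()

// Sketch: under the Scala 2.13 race (scala/bug#12613), Option.foreach can
// observe a null element, so check before dereferencing.
def unregisterShuffle(shuffleId: Int): Unit =
  shuffleStatuses.remove(shuffleId).foreach { status =>
    if (status != null) status.invalidateCache()
  }
```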
c0d9109 [SPARK-39570][SQL] Inline table should allow expressions with alias ### What changes were proposed in this pull request? `ResolveInlineTables` requires the column expressions to be foldable; however, `Alias` is not foldable. An inline table does not use the names in the column expressions, so we should trim aliases before checking foldability. We did something similar in `ResolvePivot`. ### Why are the changes needed? To make inline tables handle more cases, and also to fix a regression caused by https://github.com/apache/spark/pull/31844 . After https://github.com/apache/spark/pull/31844 , we always add an alias for function literals like `current_timestamp`, which breaks inline tables. ### Does this PR introduce _any_ user-facing change? Yes, some previously failing queries can run after this PR. ### How was this patch tested? new tests Closes #36967 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1df992f03fd935ac215424576530ab57d1ab939b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 June 2022, 14:41:40 UTC
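For instance, an illustrative query of the shape the commit describes (assuming an active SparkSession `spark`; the exact failing query is not quoted in the message):

```scala
// Before the fix, the alias wrapped around current_timestamp() made the
// column expression non-foldable, so the VALUES list was rejected; trimming
// aliases before the foldability check lets this resolve again.
spark.sql("SELECT * FROM VALUES (1, current_timestamp()) AS t(id, ts)").show()
```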
359bc3f [SPARK-39621][PYTHON][TESTS] Make `run-tests.py` robust by avoiding `rmtree` on macOS This PR aims to make `run-tests.py` robust by avoiding `rmtree` on macOS. There exists a race condition in Python and it causes flakiness on macOS - https://bugs.python.org/issue29699 - https://github.com/python/cpython/issues/73885 No user-facing change. After passing CIs, this should be tested on macOS. Closes #37010 from dongjoon-hyun/SPARK-39621. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 432945db743965f1beb59e3a001605335ca2df83) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 June 2022, 19:53:45 UTC
61dc08d [SPARK-39575][AVRO] Add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer ### What changes were proposed in this pull request? Add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer. ### Why are the changes needed? - HeapByteBuffer.get(bytes) copies the data from the current position to the end into bytes, and sets the position to the end, so the next call will return empty bytes. - The second call of AvroDeserializer will therefore return an InternalRow with an empty binary column when the Avro record has a binary column. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a unit test in AvroCatalystDataConversionSuite. Closes #36973 from wzx140/avro-fix. Authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 558b395880673ec45bf9514c98983e50e21d9398) Signed-off-by: Sean Owen <srowen@gmail.com> 27 June 2022, 02:05:27 UTC
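The buffer behavior behind this bug is easy to demonstrate in isolation (plain JDK APIs, independent of Spark):

```scala
import java.nio.ByteBuffer

// get() advances the buffer's position to the limit, so a second read sees
// nothing unless the buffer is rewound; per the commit, AvroDeserializer now
// rewinds after draining the buffer.
val buf = ByteBuffer.wrap(Array[Byte](1, 2, 3))
val first = new Array[Byte](buf.remaining())
buf.get(first)                  // reads 1, 2, 3; position is now at the end
buf.rewind()                    // the fix: reset position to 0
val second = new Array[Byte](buf.remaining())
buf.get(second)                 // reads 1, 2, 3 again instead of nothing
```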
f5bc48b [SPARK-39548][SQL] CreateView command with a window clause query hits a wrong "window definition not found" issue 1. In the inline CTE code path, fix a bug where the top-down style check for unresolved window expressions leads to misclassification of a defined window expression. 2. Move the unresolved window expression check in Project to `CheckAnalysis`. This bug fails a correct query. No. UT Closes #36947 from amaliujia/improvewindow. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4718d59c6c4e201bf940303a4311dfb753372395) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 June 2022, 05:32:59 UTC
725ce33 [SPARK-39543] The options of DataFrameWriterV2 should be passed to storage properties when falling back to v1 The options of DataFrameWriterV2 should be passed to storage properties when falling back to v1, to support features such as compression formats. Example: `spark.range(0, 100).writeTo("t1").option("compression", "zstd").using("parquet").create` **before** gen: part-00000-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.**snappy**.parquet ... **after** gen: part-00000-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.**zstd**.parquet ... No. Tested with a new test. Closes #36941 from Yikf/writeV2option. Authored-by: Yikf <yikaifei1@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e5b7fb85b2d91f2e84dc60888c94e15b53751078) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 June 2022, 05:07:28 UTC