https://github.com/apache/spark

8c2b331 Preparing Spark release v3.3.3-rc1 04 August 2023, 04:33:19 UTC
407bb57 [SPARK-44653][SQL] Non-trivial DataFrame unions should not break caching We have a long-standing tricky optimization in `Dataset.union`, which invokes the optimizer rule `CombineUnions` to pre-optimize the analyzed plan. This is to avoid an overly large analyzed plan for a specific dataframe query pattern `df1.union(df2).union(df3).union...`. This tricky optimization can break dataframe caching, but we thought it was fine as people usually won't cache the intermediate dataframe in a union chain. However, `CombineUnions` gets improved from time to time (e.g. https://github.com/apache/spark/pull/35214) and now it can optimize a wide range of Union patterns. Now it's possible that people union two dataframes, do something with `select`, and cache it. Then the dataframe is unioned again with other dataframes and people expect the df cache to work. However, the cache won't work due to the tricky optimization in `Dataset.union`. This PR updates `Dataset.union` to only combine adjacent Unions to match the original purpose. Fix perf regression due to breaking df caching. No new test. Closes #42315 from cloud-fan/union. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ce1fe57cdd7004a891ef8b97c77ac96b3719efcd) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 August 2023, 03:32:30 UTC
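A minimal, hedged sketch of the caching pattern this change is protecting (column names and data are made up; runnable e.g. in a Spark 3.3.3 spark-shell or a small app):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "a"), (2, "b")).toDF("id", "v")
val df2 = Seq((3, "c")).toDF("id", "v")
val df3 = Seq((4, "d")).toDF("id", "v")

// Cache an intermediate union + select, then union it again. After this fix,
// Dataset.union only combines adjacent Unions, so the cached plan for `step`
// should still be matched instead of being rewritten away by CombineUnions.
val step = df1.union(df2).select($"id", $"v")
step.cache()
step.count()                 // materialize the cache

val combined = step.union(df3)
combined.explain()           // look for an InMemoryRelation covering `step`
```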
9e86aed [SPARK-44588][CORE][3.3] Fix double encryption issue for migrated shuffle blocks ### What changes were proposed in this pull request? Fix double encryption issue for migrated shuffle blocks Shuffle blocks upon migration are sent without decryption when io.encryption is enabled. The code on the receiving side ends up using serializer.wrapStream on the OutputStream to the file which results in the already encrypted bytes being encrypted again when the bytes are written out. This patch removes the usage of serializerManager.wrapStream on the receiving side and also adds tests that validate that this works as expected. I have also validated that the added tests will fail if the fix is not in place. Jira ticket with more details: https://issues.apache.org/jira/browse/SPARK-44588 ### Why are the changes needed? Migrated shuffle blocks will be double encrypted when `spark.io.encryption = true` without this fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests were added to test shuffle block migration with spark.io.encryption enabled and also fixes a test helper method to properly construct the SerializerManager with the encryption key. Closes #42277 from henrymai/branch-3.3_backport_double_encryption. Authored-by: Henry Mai <henrymai@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 02 August 2023, 04:07:51 UTC
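For context, a hedged sketch of the configuration under which the bug could appear — shuffle block migration during executor decommissioning combined with I/O encryption; these are standard Spark config keys, not something added by this patch:

```
import org.apache.spark.SparkConf

// Migrated shuffle blocks arrive already encrypted, so the receiving side
// must not wrap its output stream with the serializer manager again.
val conf = new SparkConf()
  .set("spark.io.encryption.enabled", "true")
  .set("spark.decommission.enabled", "true")
  .set("spark.storage.decommission.enabled", "true")
  .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
```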
073d0b6 [SPARK-44251][SQL] Set nullable correctly on coalesced join key in full outer USING join ### What changes were proposed in this pull request? For full outer joins employing USING, set the nullability of the coalesced join columns to true. ### Why are the changes needed? The following query produces incorrect results: ``` create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2); create or replace temp view v2 as values (2, 3) as (c1, c2); select explode(array(c1)) as x from v1 full outer join v2 using (c1); -1 <== should be null 1 2 ``` The following query fails with a `NullPointerException`: ``` create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2); create or replace temp view v2 as values ('2', 3) as (c1, c2); select explode(array(c1)) as x from v1 full outer join v2 using (c1); 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43) ... ``` The above full outer joins implicitly add an aliased coalesce to the parent projection of the join: `coalesce(v1.c1, v2.c1) as c1`. In the case where only one side's key is nullable, the coalesce's nullability is false. As a result, the generator's output has nullable set as false. But this is incorrect: If one side has a row with explicit null key values, the other side's row will also have null key values (because the other side's row will be "made up"), and both the `coalesce` and the `explode` will return a null value. While `UpdateNullability` actually repairs the nullability of the `coalesce` before execution, it doesn't recreate the generator output, so the nullability remains incorrect in `Generate#output`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #41809 from bersprockets/using_oddity2. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 7a27bc68c849041837e521285e33227c3d1f9853) Signed-off-by: Yuming Wang <yumwang@ebay.com> 11 July 2023, 03:19:49 UTC
edc13fd [SPARK-44215][3.3][SHUFFLE] If num chunks are 0, then server should throw a RuntimeException ### What changes were proposed in this pull request? The executor expects `numChunks` to be > 0. If it is zero, then we see that the executor fails with ``` 23/06/20 19:07:37 ERROR task 2031.0 in stage 47.0 (TID 25018) Executor: Exception in task 2031.0 in stage 47.0 (TID 25018) java.lang.ArithmeticException: / by zero at org.apache.spark.storage.PushBasedFetchHelper.createChunkBlockInfosFromMetaResponse(PushBasedFetchHelper.scala:128) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:1047) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:90) at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) ``` Because this is an `ArithmeticException`, the executor doesn't fallback. It's not a `FetchFailure` either, so the stage is not retried and the application fails. ### Why are the changes needed? The executor should fallback to fetch original blocks and not fail because this suggests that there is an issue with push-merged block. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Modified the existing UTs to validate that RuntimeException is thrown when numChunks are 0. Closes #41859 from otterc/SPARK-44215-branch-3.3. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> 06 July 2023, 01:48:05 UTC
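A rough illustration — not the actual patch — of the guard described here: a push-merged meta response reporting zero chunks should surface as a RuntimeException (which the fetch path treats as a cue to fall back to the original blocks) rather than reach the chunk arithmetic and die with a division by zero:

```
// Hypothetical helper; names are illustrative only.
def validateNumChunks(blockId: String, numChunks: Int): Unit = {
  if (numChunks <= 0) {
    throw new RuntimeException(
      s"Push-merged block $blockId reported $numChunks chunks; " +
        "falling back to fetching the original shuffle blocks")
  }
}
```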
e9b525e [SPARK-42784] should still create subDir when the number of subDir in merge dir is less than conf ### What changes were proposed in this pull request? Fixed a minor issue with diskBlockManager after push-based shuffle is enabled ### Why are the changes needed? this bug will affect the efficiency of push based shuffle ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #40412 from Stove-hust/feature-42784. Authored-by: meifencheng <meifencheng@meituan.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 35d51571a803b8fa7d14542236276425b517d3af) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 01 July 2023, 03:51:05 UTC
84620f2 [SPARK-44241][CORE] Mistakenly set io.connectionTimeout/connectionCreationTimeout to zero or negative will cause incessant executor cons/destructions ### What changes were proposed in this pull request? This PR clamps io.connectionTimeout/connectionCreationTimeout to zero when they are set to a negative value. Zero here means - connectionCreationTimeout = 0: an unlimited connection timeout for connection establishment - connectionTimeout = 0: the `IdleStateHandler` for triggering `IdleStateEvent` is disabled. ### Why are the changes needed? 1. This PR fixes a bug when connectionCreationTimeout is 0, which means unlimited to netty, but ChannelFuture.await(0) fails directly and inappropriately. 2. This PR fixes a bug when connectionCreationTimeout is less than 0, which causes meaningless transport client reconnections and endless executor reconstructions. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new unit tests Closes #41785 from yaooqinn/SPARK-44241. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 38645fa470b5af7c2e41efa4fb092bdf2463fbbd) Signed-off-by: Kent Yao <yao@apache.org> 30 June 2023, 10:33:54 UTC
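For reference, a hedged example of the settings involved; the keys follow Spark's `spark.<module>.io.*` pattern (the exact key names here are an assumption), and after this fix a zero value means "no limit"/"disabled" instead of an immediate failure:

```
import org.apache.spark.SparkConf

// 0 = unbounded connection establishment; 0 also disables the
// IdleStateHandler-based idle-connection timeout, instead of making
// ChannelFuture.await(0) fail or triggering endless reconnect loops.
val conf = new SparkConf()
  .set("spark.shuffle.io.connectionTimeout", "0")
  .set("spark.shuffle.io.connectionCreationTimeout", "0")
```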
b7d8ddb [SPARK-44184][PYTHON][DOCS] Remove a wrong doc about `ARROW_PRE_0_15_IPC_FORMAT` ### What changes were proposed in this pull request? This PR aims to remove a wrong documentation about `ARROW_PRE_0_15_IPC_FORMAT`. ### Why are the changes needed? Since Apache Spark 3.0.0, Spark doesn't allow `ARROW_PRE_0_15_IPC_FORMAT` environment variable at all. https://github.com/apache/spark/blob/2407183cb8637b6ac2d1b76320cae9cbde3411da/python/pyspark/sql/pandas/utils.py#L69-L73 ### Does this PR introduce _any_ user-facing change? No. This is a removal of outdated wrong documentation. ### How was this patch tested? Manual review. Closes #41730 from dongjoon-hyun/SPARK-44184. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 00e7c08606d0b6de22604d2a7350ea0711355300) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 June 2023, 01:53:43 UTC
d52aa4f [SPARK-44158][K8S] Remove unused `spark.kubernetes.executor.lostCheckmaxAttempts` ### What changes were proposed in this pull request? This PR aims to remove `spark.kubernetes.executor.lostCheckmaxAttempts` because it was not used after SPARK-24248 (Apache Spark 2.4.0) ### Why are the changes needed? To clean up this from documentation and code. ### Does this PR introduce _any_ user-facing change? No because it was no-op already. ### How was this patch tested? Pass the CIs. Closes #41713 from dongjoon-hyun/SPARK-44158. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6590e7db5212bb0dc90f22133a96e3d5e385af65) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 June 2023, 20:49:45 UTC
a7bbaca [SPARK-44134][CORE] Fix setting resources (GPU/FPGA) to 0 when they are set in spark-defaults.conf ### What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-44134 With resource aware scheduling, if you specify a default value in the spark-defaults.conf, a user can't override that to set it to 0. Meaning spark-defaults.conf has something like: spark.executor.resource.{resourceName}.amount=1 spark.task.resource.{resourceName}.amount=1 If the user tries to override when submitting an application with spark.executor.resource.{resourceName}.amount=0 and spark.task.resource.{resourceName}.amount=0, the application fails to submit. It should submit and just not try to allocate those resources. This worked back in Spark 3.0 but was broken when the stage level scheduling feature was added. Here I fixed it by simply removing any task resources from the list if they are set to 0. Note I also fixed a typo in the exception message when no executor resources are specified but task resources are. Note, ideally this is backported to all of the maintenance releases ### Why are the changes needed? Fix the bug described above ### Does this PR introduce _any_ user-facing change? no api changes ### How was this patch tested? Added unit test and then ran manually on standalone and YARN clusters to verify overriding the configs now works. Closes #41703 from tgravescs/fixResource0. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit cf6e90c205bd76ff9e1fc2d88757d9d44ec93162) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 June 2023, 00:41:06 UTC
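A small hedged sketch of the override this fix re-enables, expressed via SparkConf for brevity (the same keys can be passed with `--conf` at submit time); `gpu` is just the usual example resource name:

```
import org.apache.spark.SparkConf

// spark-defaults.conf may carry:
//   spark.executor.resource.gpu.amount=1
//   spark.task.resource.gpu.amount=1
// An application that doesn't need GPUs can now set the amounts back to 0
// and still submit, instead of failing.
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "0")
  .set("spark.task.resource.gpu.amount", "0")
```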
6868d6c [SPARK-44142][PYTHON] Replace type with tpe in utility to convert python types to spark types ### What changes were proposed in this pull request? In the typehints utility to convert python types to spark types, use the variable `tpe` for comparison to the string representation of categorical types rather than `type`. ### Why are the changes needed? Currently, the Python keyword `type` is used in the comparison, which will always be false. The user's type is stored in variable `tpe`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing `test_categorical.py` tests Closes #41697 from ted-jenks/patch-3. Authored-by: Ted <77486246+ted-jenks@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 5feab1072ba057da3811069c3eb8efec0de1044c) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 23 June 2023, 00:09:33 UTC
b969906 [SPARK-44040][SQL] Fix compute stats when AggregateExec node above QueryStageExec ### What changes were proposed in this pull request? This PR fixes compute stats when `BaseAggregateExec` nodes above `QueryStageExec`. For aggregation, when the number of shuffle output rows is 0, the final result may be 1. For example: ```sql SELECT count(*) FROM tbl WHERE false; ``` The number of shuffle output rows is 0, and the final result is 1. Please see the [UI](https://github.com/apache/spark/assets/5399861/9d9ad999-b3a9-433e-9caf-c0b931423891). ### Why are the changes needed? Fix data issue. `OptimizeOneRowPlan` will use stats to remove `Aggregate`: ``` === Applying Rule org.apache.spark.sql.catalyst.optimizer.OptimizeOneRowPlan === !Aggregate [id#5L], [id#5L] Project [id#5L] +- Union false, false +- Union false, false :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)]) :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)]) +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)]) +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)]) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #41576 from wangyum/SPARK-44040. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 55ba63c257b6617ec3d2aca5bc1d0989d4f29de8) Signed-off-by: Yuming Wang <yumwang@ebay.com> 16 June 2023, 04:06:37 UTC
3586ca3 [SPARK-32559][SQL] Fix the trim logic that didn't handle ASCII control characters correctly ### What changes were proposed in this pull request? The trim logic in the Cast expression introduced in https://github.com/apache/spark/pull/29375 trims ASCII control characters unexpectedly. Before this patch ![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7) And hive ![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe) ### Why are the changes needed? The behavior described above isn't consistent with the behavior of Hive ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? add ut Closes #41535 from Kwafoor/trim_bugfix. Lead-authored-by: wangjunbo <wangjunbo@qiyi.com> Co-authored-by: Junbo wang <1042815068@qq.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 80588e4ddfac1d2e2fdf4f9a7783c56be6a97cdd) Signed-off-by: Kent Yao <yao@apache.org> 13 June 2023, 03:44:40 UTC
8840614 [SPARK-43398][CORE] Executor timeout should be max of idle shuffle and rdd timeout ### What changes were proposed in this pull request? Executor timeout should be max of idle, shuffle and rdd timeout ### Why are the changes needed? Wrong timeout value when combining idle, shuffle and rdd timeout ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added test in `ExecutorMonitorSuite` Closes #41082 from warrenzhu25/max-timeout. Authored-by: Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 7107742a381cde2e6de9425e3e436282a8c0d27c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 June 2023, 07:40:57 UTC
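A hedged example of the three dynamic-allocation timeouts whose maximum now decides when an executor is released (standard config keys, illustrative values):

```
import org.apache.spark.SparkConf

// An executor holding shuffle data or cached RDD blocks should only be
// timed out after the largest applicable timeout, not the smallest.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.dynamicAllocation.shuffleTracking.timeout", "600s")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "1200s")
```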
ababc57 [SPARK-43976][CORE] Handle the case where modifiedConfigs doesn't exist in event logs ### What changes were proposed in this pull request? This prevents NPE by handling the case where `modifiedConfigs` doesn't exist in event logs. ### Why are the changes needed? Basically, this is the same solution for that case. - https://github.com/apache/spark/pull/34907 The new code was added here, but we missed the corner case. - https://github.com/apache/spark/pull/35972 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #41472 from dongjoon-hyun/SPARK-43976. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b4ab34bf9b22d0f0ca4ab13f9b6106f38ccfaebe) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 June 2023, 16:35:01 UTC
0c509be [SPARK-41958][CORE][3.3] Disallow arbitrary custom classpath with proxy user in cluster mode Backporting fix for SPARK-41958 to 3.3 branch from #39474 Below description from original PR. -------------------------- ### What changes were proposed in this pull request? This PR proposes to disallow arbitrary custom classpath with proxy user in cluster mode by default. ### Why are the changes needed? To avoid arbitrary classpath in spark cluster. ### Does this PR introduce _any_ user-facing change? Yes. User should reenable this feature by `spark.submit.proxyUser.allowCustomClasspathInClusterMode`. ### How was this patch tested? Manually tested. Closes #39474 from Ngone51/dev. Lead-authored-by: Peter Toth <peter.tothgmail.com> Co-authored-by: Yi Wu <yi.wudatabricks.com> Signed-off-by: Hyukjin Kwon <gurwls223apache.org> (cherry picked from commit 909da96e1471886a01a9e1def93630c4fd40e74a) ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #41428 from degant/spark-41958-3.3. Lead-authored-by: Degant Puri <depuri@microsoft.com> Co-authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 June 2023, 00:14:36 UTC
86d22f4 [SPARK-43956][SQL][3.3] Fix the bug doesn't display column's sql for Percentile[Cont|Disc] ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/41436 to 3.3 ### Why are the changes needed? Fix the bug where the column's SQL is not displayed for Percentile[Cont|Disc]. ### Does this PR introduce _any_ user-facing change? Yes. Users can see the correct SQL information. ### How was this patch tested? Test cases updated. Closes #41446 from beliefer/SPARK-43956_followup2. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 04 June 2023, 00:30:16 UTC
11d2f73 [SPARK-43751][SQL][DOC] Document `unbase64` behavior change ### What changes were proposed in this pull request? After SPARK-37820, `select unbase64("abcs==")` (malformed input) always throws an exception; that PR does not help in that case, as it only improves the error message for `to_binary()`. So, `unbase64()`'s behavior for malformed input changed silently after SPARK-37820: - before: return a best-effort result, because it uses the [LENIENT](https://github.com/apache/commons-codec/blob/rel/commons-codec-1.15/src/main/java/org/apache/commons/codec/binary/Base64InputStream.java#L46) policy: any trailing bits are composed into 8-bit bytes where possible. The remainder are discarded. - after: throw an exception And there is no way to restore the previous behavior. To tolerate the malformed input, the user should migrate `unbase64(<input>)` to `try_to_binary(<input>, 'base64')` to get NULL instead of being interrupted by an exception. ### Why are the changes needed? Add the behavior change to the migration guide. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Manually reviewed. Closes #41280 from pan3793/SPARK-43751. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit af6c1ec7c795584c28e15e4963eed83917e2f06a) Signed-off-by: Kent Yao <yao@apache.org> 26 May 2023, 03:34:25 UTC
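A hedged illustration of the documented migration, run from Scala against a spark-shell session where `spark` is the predefined SparkSession (behavior is as described in the commit message, not re-verified here):

```
// Malformed base64 input: after SPARK-37820, unbase64 raises an error
// instead of returning a best-effort result.
spark.sql("SELECT unbase64('abcs==')").show()

// Migration path for callers that want to tolerate malformed input:
// try_to_binary returns NULL instead of failing.
spark.sql("SELECT try_to_binary('abcs==', 'base64')").show()
```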
457bb34 [MINOR][PS][TESTS] Fix `SeriesDateTimeTests.test_quarter` to work properly ### What changes were proposed in this pull request? This PR proposes to fix `SeriesDateTimeTests.test_quarter` to work properly. ### Why are the changes needed? Test has not been properly testing ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested, and the existing CI should pass Closes #41274 from itholic/minor_quarter_test. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a5c53384def22b01b8ef28bee6f2d10648bce1a1) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2023, 07:04:42 UTC
a3e79f4 [SPARK-43719][WEBUI] Handle `missing row.excludedInStages` field ### What changes were proposed in this pull request? This PR aims to handle a corner case when `row.excludedInStages` field is missing. ### Why are the changes needed? To fix the following type error when Spark loads some very old 2.4.x or 3.0.x logs. ![missing](https://github.com/apache/spark/assets/9700541/f402df10-bf92-4c9f-8bd1-ec9b98a67966) We have two places and this PR protects both places. ``` $ git grep row.excludedInStages core/src/main/resources/org/apache/spark/ui/static/executorspage.js: if (typeof row.excludedInStages === "undefined" || row.excludedInStages.length == 0) { core/src/main/resources/org/apache/spark/ui/static/executorspage.js: return "Active (Excluded in Stages: [" + row.excludedInStages.join(", ") + "])"; ``` ### Does this PR introduce _any_ user-facing change? No, this will remove the error case only. ### How was this patch tested? Manual review. Closes #41266 from dongjoon-hyun/SPARK-43719. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit eeab2e701330f7bc24e9b09ce48925c2c3265aa8) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2023, 05:17:33 UTC
ebf1572 [SPARK-43718][SQL] Set nullable correctly for keys in USING joins ### What changes were proposed in this pull request? In `Anaylzer#commonNaturalJoinProcessing`, set nullable correctly when adding the join keys to the Project's hidden_output tag. ### Why are the changes needed? The value of the hidden_output tag will be used for resolution of attributes in parent operators, so incorrect nullabilty can cause problems. For example, assume this data: ``` create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); ``` The following query produces incorrect results: ``` spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> ``` Similar issues occur with right outer join and left outer join. `t1.c1` and `t2.c1` have the wrong nullability at the time the array is resolved, so the array's `containsNull` value is incorrect. `UpdateNullability` will update the nullability of `t1.c1` and `t2.c1` in the `CreateArray` arguments, but will not update `containsNull` in the function's data type. Queries that don't use arrays also can get wrong results. Assume this data: ``` create or replace temp view t1 as values (0), (1), (2) as (c1); create or replace temp view t2 as values (1), (2), (3) as (c1); create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b); ``` The following query produces incorrect results: ``` select t1.c1 as t1_c1, t2.c1 as t2_c1, b from t1 full outer join t2 using (c1), lateral ( select b from t3 where a = coalesce(t2.c1, 1) ) lt3; 1 1 2 NULL 3 4 Time taken: 2.395 seconds, Fetched 2 row(s) spark-sql (default)> ``` The result should be the following: ``` 0 NULL 2 1 1 2 NULL 3 4 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. Closes #41267 from bersprockets/using_anomaly. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 217b30a4f7ca18ade19a9552c2b87dd4caeabe57) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 May 2023, 04:47:56 UTC
99f10aa [SPARK-43589][SQL][3.3] Fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString` ### What changes were proposed in this pull request? This is a logical backporting of #41232 This PR aims to fix `cannotBroadcastTableOverMaxTableBytesError` to use `bytesToString` instead of shift operations. ### Why are the changes needed? To avoid user confusion by giving more accurate values. For example, `maxBroadcastTableBytes` is 1GB and `dataSize` is `2GB - 1 byte`. **BEFORE** ``` Cannot broadcast the table that is larger than 1GB: 1 GB. ``` **AFTER** ``` Cannot broadcast the table that is larger than 1024.0 MiB: 2048.0 MiB. ``` ### Does this PR introduce _any_ user-facing change? Yes, but only error message. ### How was this patch tested? Pass the CIs with newly added test case. Closes #41234 from dongjoon-hyun/SPARK-43589-3.3. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 May 2023, 15:34:07 UTC
b4d454f [SPARK-43587][CORE][TESTS] Run `HealthTrackerIntegrationSuite` in a dedicated JVM ### What changes were proposed in this pull request? This PR aims to run `HealthTrackerIntegrationSuite` in a dedicated JVM to mitigate a flaky tests. ### Why are the changes needed? `HealthTrackerIntegrationSuite` has been flaky and SPARK-25400 and SPARK-37384 increased the timeout `from 1s to 10s` and `10s to 20s`, respectively. The usual suspect of this flakiness is some unknown side-effect like GCs. In this PR, we aims to run this in a separate JVM instead of increasing the timeout more. https://github.com/apache/spark/blob/abc140263303c409f8d4b9632645c5c6cbc11d20/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala#L56-L58 This is the recent failure. - https://github.com/apache/spark/actions/runs/5020505360/jobs/9002039817 ``` [info] HealthTrackerIntegrationSuite: [info] - If preferred node is bad, without excludeOnFailure job will fail (92 milliseconds) [info] - With default settings, job can succeed despite multiple bad executors on node (3 seconds, 163 milliseconds) [info] - Bad node with multiple executors, job will still succeed with the right confs *** FAILED *** (20 seconds, 43 milliseconds) [info] java.util.concurrent.TimeoutException: Futures timed out after [20 seconds] [info] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259) [info] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187) [info] at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:355) [info] at org.apache.spark.scheduler.SchedulerIntegrationSuite.awaitJobTermination(SchedulerIntegrationSuite.scala:276) [info] at org.apache.spark.scheduler.HealthTrackerIntegrationSuite.$anonfun$new$9(HealthTrackerIntegrationSuite.scala:92) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #41229 from dongjoon-hyun/SPARK-43587. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit eb2456bce2522779bf6b866a5fbb728472d35097) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 May 2023, 06:31:41 UTC
132678f [SPARK-43541][SQL][3.3] Propagate all `Project` tags in resolving of expressions and missing columns ### What changes were proposed in this pull request? In the PR, I propose to propagate all tags in a `Project` while resolving of expressions and missing columns in `ColumnResolutionHelper.resolveExprsAndAddMissingAttrs()`. This is a backport of https://github.com/apache/spark/pull/41204. ### Why are the changes needed? To fix the bug reproduced by the query below: ```sql spark-sql (default)> WITH > t1 AS (select key from values ('a') t(key)), > t2 AS (select key from values ('a') t(key)) > SELECT t1.key > FROM t1 FULL OUTER JOIN t2 USING (key) > WHERE t1.key NOT LIKE 'bb.%'; [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `t1`.`key` cannot be resolved. Did you mean one of the following? [`key`].; line 4 pos 7; ``` ### Does this PR introduce _any_ user-facing change? No. It fixes a bug, and outputs the expected result: `a`. ### How was this patch tested? By new test added to `using-join.sql`: ``` $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z using-join.sql" ``` and the related test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.HiveContextCompatibilitySuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Max Gekk <max.gekkgmail.com> (cherry picked from commit 09d5742a8679839d0846f50e708df98663a6d64c) Closes #41221 from MaxGekk/fix-using-join-3.3. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 May 2023, 00:24:50 UTC
9110c05 [SPARK-37829][SQL][3.3] Dataframe.joinWith outer-join should return a null value for unmatched row ### What changes were proposed in this pull request? This is a pull request to port the fix from the master branch to version 3.3. [PR](https://github.com/apache/spark/pull/40755) When doing an outer join with joinWith on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not a expected behavior, and it's a regression introduced in [this commit](https://github.com/apache/spark/commit/cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59). This pull request aims to fix the regression, note this is not a full rollback of the commit, do not add back "schema" variable. ``` case class ClassData(a: String, b: Int) val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF left.joinWith(right, left("b") === right("b"), "left_outer").collect ``` ``` Wrong results (current behavior): Array(([a,1],[null,null]), ([b,2],[x,2])) Correct results: Array(([a,1],null), ([b,2],[x,2])) ``` ### Why are the changes needed? We need to address the regression mentioned above. It results in unexpected behavior changes in the Dataframe joinWith API between versions 2.4.8 and 3.0.0+. This could potentially cause data correctness issues for users who expect the old behavior when using Spark 3.0.0+. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test (use the same test in previous [closed pull request](https://github.com/apache/spark/pull/35140), credit to Clément de Groc) Run sql-core and sql-catalyst submodules locally with ./build/mvn clean package -pl sql/core,sql/catalyst Closes #40755 from kings129/encoder_bug_fix. Authored-by: --global <xuqiang129gmail.com> Closes #40858 from kings129/fix_encoder_branch_33. Authored-by: --global <xuqiang129@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 May 2023, 21:14:30 UTC
85ff71f [SPARK-43395][BUILD] Exclude macOS tar extended metadata in make-distribution.sh ### What changes were proposed in this pull request? Add args `--no-mac-metadata --no-xattrs --no-fflags` to `tar` on macOS in `dev/make-distribution.sh` to exclude macOS-specific extended metadata. ### Why are the changes needed? The binary tarball created on macOS includes extended macOS-specific metadata and xattrs, which causes warnings when unarchiving it on Linux. Step to reproduce 1. create tarball on macOS (13.3.1) ``` ➜ apache-spark git:(master) tar --version bsdtar 3.5.3 - libarchive 3.5.3 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.8 ``` ``` ➜ apache-spark git:(master) dev/make-distribution.sh --tgz ``` 2. unarchive the binary tarball on Linux (CentOS-7) ``` ➜ ~ tar --version tar (GNU tar) 1.26 Copyright (C) 2011 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>. This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Written by John Gilmore and Jay Fenlason. ``` ``` ➜ ~ tar -xzf spark-3.5.0-SNAPSHOT-bin-3.3.5.tgz tar: Ignoring unknown extended header keyword `SCHILY.fflags' tar: Ignoring unknown extended header keyword `LIBARCHIVE.xattr.com.apple.FinderInfo' ``` ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Create binary tarball on macOS then unarchive on Linux, warnings disappear after this change. Closes #41074 from pan3793/SPARK-43395. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 2d0240df3c474902e263f67b93fb497ca13da00f) Signed-off-by: Sean Owen <srowen@gmail.com> 06 May 2023, 14:38:19 UTC
bc2c955 [SPARK-43337][UI][3.3] Asc/desc arrow icons for sorting column does not get displayed in the table column ### What changes were proposed in this pull request? Remove css `!important` tag for asc/desc arrow icons in jquery.dataTables.1.10.25.min.css ### Why are the changes needed? Upgrading to DataTables 1.10.25 broke asc/desc arrow icons for sorting column. The sorting icon is not displayed when the column is clicked to sort by asc/desc. This is because the new DataTables 1.10.25's jquery.dataTables.1.10.25.min.css file added `!important` rule preventing the override set in webui-dataTables.css ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. ![image](https://user-images.githubusercontent.com/52679095/236394863-e0004e7b-5173-495a-af23-32c1343e0ee6.png) ![image](https://user-images.githubusercontent.com/52679095/236394879-db0e5e0e-f6b3-48c3-9c79-694dd9abcb76.png) Closes #41060 from maytasm/fix-arrow-3. Authored-by: Maytas Monsereenusorn <maytasm@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> 05 May 2023, 19:07:59 UTC
e9aab41 [SPARK-43293][SQL] `__qualified_access_only` should be ignored in normal columns This is a followup of https://github.com/apache/spark/pull/39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns. It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag. No new analysis test Closes #40961 from cloud-fan/col. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 021f02e02fb88bbbccd810ae000e14e0c854e2e6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 April 2023, 02:55:21 UTC
058dcbf [SPARK-43240][SQL][3.3] Fix the wrong result issue when calling df.describe() method ### What changes were proposed in this pull request? The df.describe() method caches the RDD. If the cached RDD is RDD[UnsafeRow], whose rows may be released after they are used, then the result will be wrong. Here we need to copy the RDD before caching, as the [TakeOrderedAndProjectExec](https://github.com/apache/spark/blob/d68d46c9e2cec04541e2457f4778117b570d8cdb/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L204) operator does. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Closes #40914 from JkSelf/describe. Authored-by: Jia Ke <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 April 2023, 09:24:46 UTC
8f0c75c [SPARK-43113][SQL][3.3] Evaluate stream-side variables when generating code for a bound condition ### What changes were proposed in this pull request? This is a back-port of #40766 and #40881. In `JoinCodegenSupport#getJoinCondition`, evaluate any referenced stream-side variables before using them in the generated code. This patch doesn't evaluate the passed stream-side variables directly, but instead evaluates a copy (`streamVars2`). This is because `SortMergeJoin#codegenFullOuter` will want to evaluate the stream-side vars within a different scope than the condition check, so we mustn't delete the initialization code from the original `ExprCode` instances. ### Why are the changes needed? When a bound condition of a full outer join references the same stream-side column more than once, wholestage codegen generates bad code. For example, the following query fails with a compilation error: ``` create or replace temp view v1 as select * from values (1, 1), (2, 2), (3, 1) as v1(key, value); create or replace temp view v2 as select * from values (1, 22, 22), (3, -1, -1), (7, null, null) as v2(a, b, c); select * from v1 full outer join v2 on key = a and value > b and value > c; ``` The error is: ``` org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 277, Column 9: Redefinition of local variable "smj_isNull_7" ``` The same error occurs with code generated from ShuffleHashJoinExec: ``` select /*+ SHUFFLE_HASH(v2) */ * from v1 full outer join v2 on key = a and value > b and value > c; ``` In this case, the error is: ``` org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 174, Column 5: Redefinition of local variable "shj_value_1" ``` Neither `SortMergeJoin#codegenFullOuter` nor `ShuffledHashJoinExec#doProduce` evaluate the stream-side variables before calling `consumeFullOuterJoinRow#getJoinCondition`. As a result, `getJoinCondition` generates definition/initialization code for each referenced stream-side variable at the point of use. If a stream-side variable is used more than once in the bound condition, the definition/initialization code is generated more than once, resulting in the "Redefinition of local variable" error. In the end, the query succeeds, since Spark disables wholestage codegen and tries again. (In the case other join-type/strategy pairs, either the implementations don't call `JoinCodegenSupport#getJoinCondition`, or the stream-side variables are pre-evaluated before the call is made, so no error happens in those cases). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. Closes #40917 from bersprockets/full_join_codegen_issue_br33. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 April 2023, 00:58:15 UTC
6d0a271 [SPARK-41660][SQL][3.3] Only propagate metadata columns if they are used ### What changes were proposed in this pull request? backporting https://github.com/apache/spark/pull/39152 to 3.3 ### Why are the changes needed? bug fixing ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #40889 from huaxingao/metadata. Authored-by: huaxingao <huaxin_gao@apple.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> 21 April 2023, 14:46:00 UTC
cd16624 [SPARK-43158][DOCS] Set upperbound of pandas version for Binder integration This PR proposes to set the upperbound for pandas in Binder integration. We don't currently support pandas 2.0.0 properly, see also https://issues.apache.org/jira/browse/SPARK-42618 To make the quickstarts working. Yes, it fixes the quickstart. Tested in: - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb Closes #40814 from HyukjinKwon/set-lower-bound-pandas. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a2592e6d46caf82c642f4a01a3fd5c282bbe174e) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 April 2023, 03:59:18 UTC
cc6d3f9 [SPARK-43050][SQL] Fix construct aggregate expressions by replacing grouping functions This PR fixes constructing aggregate expressions by replacing grouping functions if an expression is part of aggregation. In the following example, the second `b` should also be replaced: <img width="545" alt="image" src="https://user-images.githubusercontent.com/5399861/230415618-84cd6334-690e-4b0b-867b-ccc4056226a8.png"> Fix bug: ``` spark-sql (default)> SELECT CASE WHEN a IS NULL THEN count(b) WHEN b IS NULL THEN count(c) END > FROM grouping > GROUP BY GROUPING SETS (a, b, c); [MISSING_AGGREGATION] The non-aggregating expression "b" is based on columns which are not participating in the GROUP BY clause. ``` No. Unit test. Closes #40685 from wangyum/SPARK-43050. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 45b84cd37add1b9ce274273ad5e519e6bc1d8013) Signed-off-by: Yuming Wang <yumwang@ebay.com> 15 April 2023, 02:17:43 UTC
5ce51c3 [SPARK-43069][BUILD] Use `sbt-eclipse` instead of `sbteclipse-plugin` ### What changes were proposed in this pull request? This PR aims to use `sbt-eclipse` instead of `sbteclipse-plugin`. ### Why are the changes needed? Thanks to SPARK-34959, Apache Spark 3.2+ uses SBT 1.5.0 and we can use `set-eclipse` instead of old `sbteclipse-plugin`. - https://github.com/sbt/sbt-eclipse/releases/tag/6.0.0 ### Does this PR introduce _any_ user-facing change? No, this is a dev-only plugin. ### How was this patch tested? Pass the CIs and manual tests. ``` $ build/sbt eclipse Using /Users/dongjoon/.jenv/versions/1.8 as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. Using SPARK_LOCAL_IP=localhost Attempting to fetch sbt Launching sbt from build/sbt-launch-1.8.2.jar [info] welcome to sbt 1.8.2 (AppleJDK-8.0.302.8.1 Java 1.8.0_302) [info] loading settings for project spark-merge-build from plugins.sbt ... [info] loading project definition from /Users/dongjoon/APACHE/spark-merge/project [info] Updating https://repo1.maven.org/maven2/com/github/sbt/sbt-eclipse_2.12_1.0/6.0.0/sbt-eclipse-6.0.0.pom 100.0% [##########] 2.5 KiB (4.5 KiB / s) ... ``` Closes #40708 from dongjoon-hyun/SPARK-43069. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 9cba5529d1fc3faf6b743a632df751d84ec86a07) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 April 2023, 19:54:28 UTC
f228f84 [SPARK-39696][CORE] Fix data race in access to TaskMetrics.externalAccums ### What changes were proposed in this pull request? This PR fixes a data race around concurrent access to `TaskMetrics.externalAccums`. The race occurs between the `executor-heartbeater` thread and the thread executing the task. This data race is not known to cause issues on 2.12 but in 2.13 ~due this change https://github.com/scala/scala/pull/9258~ (LuciferYang bisected this to first cause failures in scala 2.13.7 one possible reason could be https://github.com/scala/scala/pull/9786) leads to an uncaught exception in the `executor-heartbeater` thread, which means that the executor will eventually be terminated due to missing hearbeats. This fix of using of using `CopyOnWriteArrayList` is cherry picked from https://github.com/apache/spark/pull/37206 where is was suggested as a fix by LuciferYang since `TaskMetrics.externalAccums` is also accessed from outside the class `TaskMetrics`. The old PR was closed because at that point there was no clear understanding of the race condition. JoshRosen commented here https://github.com/apache/spark/pull/37206#issuecomment-1189930626 saying that there should be no such race based on because all accumulators should be deserialized as part of task deserialization here: https://github.com/apache/spark/blob/0cc96f76d8a4858aee09e1fa32658da3ae76d384/core/src/main/scala/org/apache/spark/executor/Executor.scala#L507-L508 and therefore no writes should occur while the hearbeat thread will read the accumulators. But my understanding is that is incorrect as accumulators will also be deserialized as part of the taskBinary here: https://github.com/apache/spark/blob/169f828b1efe10d7f21e4b71a77f68cdd1d706d6/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala#L87-L88 which will happen while the heartbeater thread is potentially reading the accumulators. This can both due to user code using accumulators (see the new test case) but also when using the Dataframe/Dataset API as sql metrics will also be `externalAccums`. One way metrics will be sent as part of the taskBinary is when the dep is a `ShuffleDependency`: https://github.com/apache/spark/blob/fbbcf9434ac070dd4ced4fb9efe32899c6db12a9/core/src/main/scala/org/apache/spark/Dependency.scala#L85 with a ShuffleWriteProcessor that comes from https://github.com/apache/spark/blob/fbbcf9434ac070dd4ced4fb9efe32899c6db12a9/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L411-L422 ### Why are the changes needed? The current code has a data race. ### Does this PR introduce _any_ user-facing change? It will fix an uncaught exception in the `executor-hearbeater` thread when using scala 2.13. ### How was this patch tested? This patch adds a new test case, that before the fix was applied consistently produces the uncaught exception in the heartbeater thread when using scala 2.13. Closes #40663 from eejbyfeldt/SPARK-39696. Lead-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com> Co-authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6ce0822f76e11447487d5f6b3cce94a894f2ceef) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit b2ff4c4f7ec21d41cb173b413bd5aa5feefd7eee) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 07 April 2023, 03:52:08 UTC
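A minimal, self-contained sketch — deliberately not Spark code — of why a `CopyOnWriteArrayList` suits this access pattern: one thread keeps appending (like task/taskBinary deserialization registering accumulators) while another iterates (like the executor heartbeater), and each reader gets a consistent snapshot:

```
import java.util.concurrent.CopyOnWriteArrayList

val accums = new CopyOnWriteArrayList[Long]()

// Writer: keeps registering values, the way task deserialization keeps
// adding to TaskMetrics.externalAccums.
val writer = new Thread(() => (1L to 100000L).foreach(v => accums.add(v)))

// Reader: periodically walks the list, like the heartbeater thread.
val reader = new Thread(() => (1 to 1000).foreach { _ =>
  val it = accums.iterator()   // snapshot iterator: never sees a torn resize
  var sum = 0L
  while (it.hasNext) sum += it.next()
})

writer.start(); reader.start()
writer.join(); reader.join()
```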
b40e408 [SPARK-43005][PYSPARK] Fix typo in pyspark/pandas/config.py By comparing compute.isin_limit and plotting.max_rows, `v is v` is likely to be a typo. ### What changes were proposed in this pull request? fix `v is v >= 0` with `v >= 0`. ### Why are the changes needed? By comparing compute.isin_limit and plotting.max_rows, `v is v` is likely to be a typo. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By GitHub Actions. Closes #40620 from thyecust/patch-2. Authored-by: thyecust <thy@mail.ecust.edu.cn> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 5ac2b0fc024ae499119dfd5ab2ee4d038418c5fd) Signed-off-by: Sean Owen <srowen@gmail.com> 03 April 2023, 13:24:39 UTC
92e8d26 [SPARK-43004][CORE] Fix typo in ResourceRequest.equals() vendor == vendor is always true, this is likely to be a typo. ### What changes were proposed in this pull request? fix `vendor == vendor` with `that.vendor == vendor`, and `discoveryScript == discoveryScript` with `that.discoveryScript == discoveryScript` ### Why are the changes needed? vendor == vendor is always true, this is likely to be a typo. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By GitHub Actions. Closes #40622 from thyecust/patch-4. Authored-by: thyecust <thy@mail.ecust.edu.cn> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 52c000ece27c9ef34969a7fb252714588f395926) Signed-off-by: Sean Owen <srowen@gmail.com> 03 April 2023, 03:36:21 UTC
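A generic hedged sketch (class and field names hypothetical, not Spark's actual `ResourceRequest`) of the equality pattern the typo broke — the comparison must use the other instance's fields, since `vendor == vendor` is trivially true:

```
class ResourceSpec(val amount: Long, val discoveryScript: String, val vendor: String) {
  def sameAs(that: ResourceSpec): Boolean =
    that.amount == amount &&
      that.discoveryScript == discoveryScript &&   // was: discoveryScript == discoveryScript
      that.vendor == vendor                        // was: vendor == vendor
}
```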
6f266ee [SPARK-42967][CORE][3.2][3.3][3.4] Fix SparkListenerTaskStart.stageAttemptId when a task is started after the stage is cancelled ### What changes were proposed in this pull request? The PR fixes a bug that SparkListenerTaskStart can have `stageAttemptId = -1` when a task is launched after the stage is cancelled. Actually, we should use the information within `Task` to update the `stageAttemptId` field. ### Why are the changes needed? -1 is not a legal stageAttemptId value, thus it can lead to unexpected problem if a subscriber try to parse the stage information from the SparkListenerTaskStart event. ### Does this PR introduce _any_ user-facing change? No, it's a bugfix. ### How was this patch tested? Manually verified. Closes #40592 from jiangxb1987/SPARK-42967. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 1a6b1770c85f37982b15d261abf9cc6e4be740f4) Signed-off-by: Gengliang Wang <gengliang@apache.org> 30 March 2023, 22:48:32 UTC
794bea5 [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true Change `PlanSubqueries` to set `shouldBroadcast` to true when instantiating an `InSubqueryExec` instance. The below left outer join gets an error: ``` create or replace temp view v1 as select * from values (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, value9, value10); create or replace temp view v2 as select * from values (1, 2), (3, 8), (7, 9) as v2(a, b); create or replace temp view v3 as select * from values (3), (8) as v3(col1); set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100 set spark.sql.adaptive.enabled=false; select * from v1 left outer join v2 on key = a and key in (select col1 from v3); ``` The join fails during predicate codegen: ``` 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to interpreter mode java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) at org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278) at scala.collection.immutable.List.map(List.scala:293) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33) at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73) at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70) at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51) at org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86) at org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40) ``` It fails again after fallback to interpreter mode: ``` 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) at org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151) at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146) at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205) ``` Both the predicate codegen and the evaluation fail for the same reason: `PlanSubqueries` creates `InSubqueryExec` with `shouldBroadcast=false`. The driver waits for the subquery to finish, but it's the executor that uses the results of the subquery (for predicate codegen or evaluation). Because `shouldBroadcast` is set to false, the result is stored in a transient field (`InSubqueryExec#result`), so the result of the subquery is not serialized when the `InSubqueryExec` instance is sent to the executor. The issue occurs, as far as I can tell, only when both whole stage codegen is disabled and adaptive execution is disabled. When wholestage codegen is enabled, the predicate codegen happens on the driver, so the subquery's result is available. When adaptive execution is enabled, `PlanAdaptiveSubqueries` always sets `shouldBroadcast=true`, so the subquery's result is available on the executor, if needed. No. New unit test. Closes #40569 from bersprockets/join_subquery_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5b20f3d94095f54017be3d31d11305e597334d8b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 March 2023, 12:33:22 UTC
4aa8860 [SPARK-42922][SQL] Move from Random to SecureRandom ### What changes were proposed in this pull request? Most uses of `Random` in spark are either in testcases or where we need a pseudo random number which is repeatable. Use `SecureRandom`, instead of `Random` for the cases where it impacts security. ### Why are the changes needed? Use of `SecureRandom` in more security sensitive contexts. This was flagged in our internal scans as well. ### Does this PR introduce _any_ user-facing change? Directly no. Would improve security posture of Apache Spark. ### How was this patch tested? Existing unit tests Closes #40568 from mridulm/SPARK-42922. Authored-by: Mridul Muralidharan <mridulatgmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 744434358cb0c687b37d37dd62f2e7d837e52b2d) Signed-off-by: Sean Owen <srowen@gmail.com> 28 March 2023, 03:48:23 UTC
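A small generic example of the swap being described — JDK `SecureRandom` for security-sensitive values, keeping plain `Random` for repeatable test data; this is illustrative usage, not the exact Spark call sites:

```
import java.security.SecureRandom

// Use SecureRandom for anything that guards security (salts, keys, secrets).
val rng = new SecureRandom()
val salt = new Array[Byte](16)
rng.nextBytes(salt)

// scala.util.Random remains appropriate for repeatable pseudo-random data,
// e.g. seeded test fixtures.
val testData = new scala.util.Random(42L).shuffle((1 to 10).toList)
```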
fc126b4 [SPARK-42906][K8S] Replace a starting digit with `x` in resource name prefix ### What changes were proposed in this pull request? Change the generated resource name prefix to meet K8s requirements > DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name', or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?') ### Why are the changes needed? In current implementation, the following app name causes error ``` bin/spark-submit \ --master k8s://https://*.*.*.*:6443 \ --deploy-mode cluster \ --name 你好_187609 \ ... ``` ``` Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://*.*.*.*:6443/api/v1/namespaces/spark/services. Message: Service "187609-f19020870d12c349-driver-svc" is invalid: metadata.name: Invalid value: "187609-f19020870d12c349-driver-svc": a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name', or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?'). ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT. Closes #40533 from pan3793/SPARK-42906. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 0b9a3017005ccab025b93d7b545412b226d4e63c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 March 2023, 22:31:45 UTC
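A hedged sketch of the sanitization the title describes — if the generated resource-name prefix starts with a digit, swap that first character for `x` so it satisfies the DNS-1035 rule quoted in the error; this is not the exact Spark helper:

```
// DNS-1035: a label must match [a-z]([-a-z0-9]*[a-z0-9])? and therefore
// cannot start with a digit.
def fixLeadingDigit(prefix: String): String =
  if (prefix.nonEmpty && prefix.head.isDigit) "x" + prefix.tail else prefix

fixLeadingDigit("187609-f19020870d12c349")  // "x87609-f19020870d12c349"
```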
13702f0 [SPARK-42934][BUILD] Add `spark.hadoop.hadoop.security.key.provider.path` to `scalatest-maven-plugin` ### What changes were proposed in this pull request? When testing `OrcEncryptionSuite` using maven, all test suites are always skipped. So this pr add `spark.hadoop.hadoop.security.key.provider.path` to `systemProperties` of `scalatest-maven-plugin` to make `OrcEncryptionSuite` can test by maven. ### Why are the changes needed? Make `OrcEncryptionSuite` can test by maven. ### Does this PR introduce _any_ user-facing change? No, just for maven test ### How was this patch tested? - Pass GitHub Actions - Manual testing: run ``` build/mvn clean install -pl sql/core -DskipTests -am build/mvn test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite ``` **Before** ``` Discovery starting. Discovery completed in 3 seconds, 218 milliseconds. Run starting. Expected test count is: 4 OrcEncryptionSuite: 21:57:58.344 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable - Write and read an encrypted file !!! CANCELED !!! [] was empty org.apache.orc.impl.NullKeyProvider5af5d76f doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:37) - Write and read an encrypted table !!! CANCELED !!! [] was empty org.apache.orc.impl.NullKeyProvider5ad6cc21 doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:65) - SPARK-35325: Write and read encrypted nested columns !!! CANCELED !!! [] was empty org.apache.orc.impl.NullKeyProvider691124ee doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:116) - SPARK-35992: Write and read fully-encrypted columns with default masking !!! CANCELED !!! [] was empty org.apache.orc.impl.NullKeyProvider5403799b doesn't has the test keys. ORC shim is created with old Hadoop libraries (OrcEncryptionSuite.scala:166) 21:58:00.035 WARN org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.orc.OrcEncryptionSuite, threads: rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true) ===== Run completed in 5 seconds, 41 milliseconds. Total number of tests run: 0 Suites: completed 2, aborted 0 Tests: succeeded 0, failed 0, canceled 4, ignored 0, pending 0 No tests were executed. ``` **After** ``` Discovery starting. Discovery completed in 3 seconds, 185 milliseconds. Run starting. Expected test count is: 4 OrcEncryptionSuite: 21:58:46.540 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable - Write and read an encrypted file - Write and read an encrypted table - SPARK-35325: Write and read encrypted nested columns - SPARK-35992: Write and read fully-encrypted columns with default masking 21:58:51.933 WARN org.apache.spark.sql.execution.datasources.orc.OrcEncryptionSuite: ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.datasources.orc.OrcEncryptionSuite, threads: rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true) ===== Run completed in 8 seconds, 708 milliseconds. Total number of tests run: 4 Suites: completed 2, aborted 0 Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #40566 from LuciferYang/SPARK-42934-2. 
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a3d9e0ae0f95a55766078da5d0bf0f74f3c3cfc3) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 March 2023, 16:42:59 UTC
07a1563 [SPARK-42799][BUILD] Update SBT build `xercesImpl` version to match with `pom.xml` This PR aims to update the `XercesImpl` version from `2.12.0` to `2.12.2` in order to match the version in `pom.xml`. https://github.com/apache/spark/blob/149e020a5ca88b2db9c56a9d48e0c1c896b57069/pom.xml#L1429-L1433 When we updated this version via SPARK-39183, we missed updating `SparkBuild.scala`. - https://github.com/apache/spark/pull/36544 No, this is a dev-only change because the release artifact's dependencies are managed by Maven. Pass the CIs. Closes #40431 from dongjoon-hyun/SPARK-42799. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 049aa380b8b1361c2898bc499e64613d329c6f72) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 March 2023, 07:40:29 UTC
aa84c06 [SPARK-42785][K8S][CORE] When spark submit without `--deploy-mode`, avoid facing NPE in Kubernetes Case ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/37880, when a user runs spark-submit without `--deploy-mode XXX` or `--conf spark.submit.deployMode=XXXX`, they may hit an NPE in this code. ### Why are the changes needed? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#164 ```scala args.deployMode.equals("client") && ``` Of course, submitting without `deployMode` is not allowed and will throw an exception that terminates the application, but we should leave it to the later logic to give the appropriate hint instead of throwing an NPE. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ![popo_2023-03-14 17-50-46](https://user-images.githubusercontent.com/52876270/224965310-ba9ec82f-e668-4a06-b6ff-34c3e80ca0b4.jpg) Closes #40414 from zwangsheng/SPARK-42785. Authored-by: zwangsheng <2213335496@qq.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 767253bb6219f775a8a21f1cdd0eb8c25fa0b9de) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 March 2023, 15:49:34 UTC
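A minimal sketch of the null-safe pattern (illustrative only; the actual patch may structure the check differently): calling `.equals` on a possibly-null `deployMode` throws, while flipping the comparison or wrapping it in `Option` does not.
```scala
object DeployModeCheck {
  def main(args: Array[String]): Unit = {
    // Hypothetical value: in SparkSubmit, deployMode comes from --deploy-mode / spark.submit.deployMode.
    val deployMode: String = null

    // deployMode.equals("client") would throw a NullPointerException here.

    // Null-safe alternatives:
    val isClient1 = "client".equals(deployMode)            // false, no NPE
    val isClient2 = Option(deployMode).contains("client")  // false, no NPE
    println(s"$isClient1 $isClient2")
  }
}
```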
aa86093 [SPARK-42747][ML] Fix incorrect internal status of LoR and AFT ### What changes were proposed in this pull request? Add a hook `onParamChange` in `Params.{set, setDefault, clear}`, so that subclasses can update their internal status within it. ### Why are the changes needed? In 3.1, we added internal auxiliary variables in LoR and AFT to optimize prediction/transformation. In LoR, when users call `model.{setThreshold, setThresholds}`, the internal status will be correctly updated. But users can still call `model.set(model.threshold, value)`, and then the status will not be updated. And when users call `model.clear(model.threshold)`, the status should be updated with the default threshold value 0.5. For example: ``` import org.apache.spark.ml.linalg._ import org.apache.spark.ml.classification._ val df = Seq((1.0, 1.0, Vectors.dense(0.0, 5.0)), (0.0, 2.0, Vectors.dense(1.0, 2.0)), (1.0, 3.0, Vectors.dense(2.0, 1.0)), (0.0, 4.0, Vectors.dense(3.0, 3.0))).toDF("label", "weight", "features") val lor = new LogisticRegression().setWeightCol("weight") val model = lor.fit(df) val vec = Vectors.dense(0.0, 5.0) val p0 = model.predict(vec) // return 0.0 model.setThreshold(0.05) // change status val p1 = model.set(model.threshold, 0.5).predict(vec) // return 1.0; but should be 0.0 val p2 = model.clear(model.threshold).predict(vec) // return 1.0; but should be 0.0 ``` What makes it even worse is that `pyspark.ml` always sets params via `model.set(model.threshold, value)`, so the internal status is easily out of sync, see the example in [SPARK-42747](https://issues.apache.org/jira/browse/SPARK-42747) ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? added ut Closes #40367 from zhengruifeng/ml_param_hook. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 5a702f22f49ca6a1b6220ac645e3fce70ec5189d) Signed-off-by: Sean Owen <srowen@gmail.com> 11 March 2023, 14:46:10 UTC
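A simplified sketch of the hook idea (illustrative types, not the real `org.apache.spark.ml.param.Params` API): every mutation path funnels through one callback, so derived state such as a cached threshold cannot silently go stale.
```scala
// Illustrative sketch of the onParamChange hook pattern.
trait ParamsLike {
  private val paramMap = scala.collection.mutable.Map.empty[String, Any]

  // Subclasses override this to keep derived/internal state in sync.
  protected def onParamChange(name: String, value: Option[Any]): Unit = {}

  def set(name: String, value: Any): this.type = {
    paramMap(name) = value
    onParamChange(name, Some(value))   // hook fires for every mutation path
    this
  }

  def clear(name: String): this.type = {
    paramMap.remove(name)
    onParamChange(name, None)          // reset internal state to the default
    this
  }

  def get(name: String): Option[Any] = paramMap.get(name)
}

class LoRModelLike extends ParamsLike {
  private var internalThreshold: Double = 0.5   // derived state used by predict()

  override protected def onParamChange(name: String, value: Option[Any]): Unit = {
    if (name == "threshold") {
      internalThreshold = value.map(_.asInstanceOf[Double]).getOrElse(0.5)
    }
  }

  def currentThreshold: Double = internalThreshold
}
```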
cfcd453 [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field ### What changes were proposed in this pull request? Fix /api/v1/applications to return total uptime instead of 0 for duration ### Why are the changes needed? Fix REST API OneApplicationResource ### Does this PR introduce _any_ user-facing change? yes, /api/v1/applications will return the total uptime instead of 0 for the duration ### How was this patch tested? locally build and run ```json [ { "id" : "local-1678183638394", "name" : "SparkSQL::10.221.102.180", "attempts" : [ { "startTime" : "2023-03-07T10:07:17.754GMT", "endTime" : "1969-12-31T23:59:59.999GMT", "lastUpdated" : "2023-03-07T10:07:17.754GMT", "duration" : 20317, "sparkUser" : "kentyao", "completed" : false, "appSparkVersion" : "3.5.0-SNAPSHOT", "startTimeEpoch" : 1678183637754, "endTimeEpoch" : -1, "lastUpdatedEpoch" : 1678183637754 } ] } ] ``` Closes #40313 from yaooqinn/SPARK-42697. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit d3d8fdc2882f5c084897ca9b2af9a063358f3a21) Signed-off-by: Kent Yao <yao@apache.org> 09 March 2023, 05:34:27 UTC
9740739 [SPARK-39399][CORE][K8S] Fix proxy-user authentication for Spark on k8s in cluster deploy mode ### What changes were proposed in this pull request? The PR fixes the authentication failure of the proxy user on driver side while accessing kerberized hdfs through spark on k8s job. It follows the similar approach as it was done for Mesos: https://github.com/mesosphere/spark/pull/26 ### Why are the changes needed? When we try to access the kerberized HDFS through a proxy user in Spark Job running in cluster deploy mode with Kubernetes resource manager, we encounter AccessControlException. This is because authentication in driver is done using tokens of the proxy user and since proxy user doesn't have any delegation tokens on driver, auth fails. Further details: https://issues.apache.org/jira/browse/SPARK-25355?focusedCommentId=17532063&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17532063 https://issues.apache.org/jira/browse/SPARK-25355?focusedCommentId=17532135&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17532135 ### Does this PR introduce _any_ user-facing change? Yes, user will now be able to use proxy-user to access kerberized hdfs with Spark on K8s. ### How was this patch tested? The patch was tested by: 1. Running job which accesses kerberized hdfs with proxy user in cluster mode and client mode with kubernetes resource manager. 2. Running job which accesses kerberized hdfs without proxy user in cluster mode and client mode with kubernetes resource manager. 3. Build and run test github action : https://github.com/shrprasa/spark/actions/runs/3051203625 Closes #37880 from shrprasa/proxy_user_fix. Authored-by: Shrikant Prasad <shrprasa@visa.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit b3b3557ccbe53e34e0d0dbe3d21f49a230ee621b) Signed-off-by: Kent Yao <yao@apache.org> 08 March 2023, 03:34:54 UTC
2bd20a9 [SPARK-42478][SQL][3.3] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory This is a backport of https://github.com/apache/spark/pull/40064 for branch-3.3 ### What changes were proposed in this pull request? Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory ### Why are the changes needed? [SPARK-41448](https://issues.apache.org/jira/browse/SPARK-41448) made MR job IDs consistent in FileBatchWriter and FileFormatWriter, but it introduced a serialization issue: JobID is not serializable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA Closes #40290 from Yikf/backport-SPARK-42478-3.3. Authored-by: Yikf <yikaifei@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 March 2023, 22:06:12 UTC
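A hedged sketch of the general fix pattern: ship only a plain `String` tracker id and rebuild the non-serializable object lazily where it is used. Class and field names here are illustrative, not the actual `FileWriterFactory` code.
```scala
// Stand-in for a non-serializable Hadoop JobID.
class NonSerializableJobId(val trackerId: String, val id: Int) {
  override def toString: String = s"job_${trackerId}_$id"
}

// Ship only serializable pieces; rebuild the job id on first use on the executor.
class WriterFactoryLike(jobTrackerId: String, jobIdValue: Int) extends Serializable {
  @transient private lazy val jobId = new NonSerializableJobId(jobTrackerId, jobIdValue)
  def describe(): String = jobId.toString
}
```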
210ed7d [SPARK-42673][BUILD] Make `build/mvn` build Spark only with the verified maven version ### What changes were proposed in this pull request? `build/mvn` tends to use the new maven version to build Spark now, and GA starts to use 3.9.0 as the default maven version. But there may be some uncertain factors when building Spark with unverified version. For example, `java-11-17` GA build task build with maven 3.9.0 has many error logs in master like follow: ``` Error: [ERROR] An error occurred attempting to read POM org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen <?xml version="1.0" encoding="ISO-8859-1"... 1:42) at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDeclWithVersion (MXParser.java:3423) at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDecl (MXParser.java:3345) at org.codehaus.plexus.util.xml.pull.MXParser.parsePI (MXParser.java:3197) at org.codehaus.plexus.util.xml.pull.MXParser.parseProlog (MXParser.java:1828) at org.codehaus.plexus.util.xml.pull.MXParser.nextImpl (MXParser.java:1757) at org.codehaus.plexus.util.xml.pull.MXParser.next (MXParser.java:1375) at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read (MavenXpp3Reader.java:3940) at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read (MavenXpp3Reader.java:612) at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read (MavenXpp3Reader.java:627) at org.cyclonedx.maven.BaseCycloneDxMojo.readPom (BaseCycloneDxMojo.java:759) at org.cyclonedx.maven.BaseCycloneDxMojo.readPom (BaseCycloneDxMojo.java:746) at org.cyclonedx.maven.BaseCycloneDxMojo.retrieveParentProject (BaseCycloneDxMojo.java:694) at org.cyclonedx.maven.BaseCycloneDxMojo.getClosestMetadata (BaseCycloneDxMojo.java:524) at org.cyclonedx.maven.BaseCycloneDxMojo.convert (BaseCycloneDxMojo.java:481) at org.cyclonedx.maven.CycloneDxMojo.execute (CycloneDxMojo.java:70) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:126) at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 (MojoExecutor.java:342) at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute (MojoExecutor.java:330) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:213) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:175) at org.apache.maven.lifecycle.internal.MojoExecutor.access$000 (MojoExecutor.java:76) at org.apache.maven.lifecycle.internal.MojoExecutor$1.run (MojoExecutor.java:163) at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute (DefaultMojosExecutionStrategy.java:39) at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:160) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:105) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:73) at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:53) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:118) at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:260) at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:172) at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:100) at org.apache.maven.cli.MavenCli.execute (MavenCli.java:821) at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:270) at org.apache.maven.cli.MavenCli.main (MavenCli.java:192) at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:77) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke (Method.java:568) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282) at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406) at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347) ``` So this pr change the version check condition of `build/mvn` to make it build Spark only with the verified maven version. ### Why are the changes needed? Make `build/mvn` build Spark only with the verified maven version ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - `java-11-17` GA build task pass and no error message as above - Manual test: 1. Make the system use maven 3.9.0( > 3.8.7 ) by default: run `mvn -version` ``` Apache Maven 3.9.0 (9b58d2bad23a66be161c4664ef21ce219c2c8584) Maven home: /Users/${userName}/Tools/maven Java version: 1.8.0_362, vendor: Azul Systems, Inc., runtime: /Users/${userName}/Tools/zulu8/zulu-8.jdk/Contents/Home/jre Default locale: zh_CN, platform encoding: UTF-8 OS name: "mac os x", version: "13.2.1", arch: "aarch64", family: "mac" ``` and run `build/mvn -version` ``` Using `mvn` from path: /${basedir}/spark/build/apache-maven-3.8.7/bin/mvn Using SPARK_LOCAL_IP=localhost Apache Maven 3.8.7 (b89d5959fcde851dcb1c8946a785a163f14e1e29) Maven home: /${basedir}/spark/build/apache-maven-3.8.7 Java version: 1.8.0_362, vendor: Azul Systems, Inc., runtime: /Users/${userName}/Tools/zulu8/zulu-8.jdk/Contents/Home/jre Default locale: zh_CN, platform encoding: UTF-8 OS name: "mac os x", version: "13.2.1", arch: "aarch64", family: "mac" ``` We can see Spark use 3.8.7 in build directory when the system default maven > 3.8.7 2. Make the system use maven 3.8.7 by default: run `mvn -version` ``` mvn -version Apache Maven 3.8.7 (b89d5959fcde851dcb1c8946a785a163f14e1e29) Maven home: /Users/${userName}/Tools/maven Java version: 1.8.0_362, vendor: Azul Systems, Inc., runtime: /Users/${userName}/Tools/zulu8/zulu-8.jdk/Contents/Home/jre Default locale: zh_CN, platform encoding: UTF-8 OS name: "mac os x", version: "13.2.1", arch: "aarch64", family: "mac" ``` and run `build/mvn -version` ``` Using `mvn` from path: /Users/${userName}/Tools/maven/bin/mvn Using SPARK_LOCAL_IP=localhost Apache Maven 3.8.7 (b89d5959fcde851dcb1c8946a785a163f14e1e29) Maven home: /Users/${userName}/Tools/maven Java version: 1.8.0_362, vendor: Azul Systems, Inc., runtime: /Users/${userName}/Tools/zulu8/zulu-8.jdk/Contents/Home/jre Default locale: zh_CN, platform encoding: UTF-8 OS name: "mac os x", version: "13.2.1", arch: "aarch64", family: "mac" ``` We can see Spark use system default maven 3.8.7 when the system default maven is 3.8.7. 3. 
Make the system use maven 3.8.6( < 3.8.7 ) by default: run `mvn -version` ``` mvn -version Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63) Maven home: /Users/${userName}/Tools/maven Java version: 1.8.0_362, vendor: Azul Systems, Inc., runtime: /Users/${userName}/Tools/zulu8/zulu-8.jdk/Contents/Home/jre Default locale: zh_CN, platform encoding: UTF-8 OS name: "mac os x", version: "13.2.1", arch: "aarch64", family: "mac" ``` and run `build/mvn -version` ``` Using `mvn` from path: /Users/${userName}/Tools/maven/bin/mvn Using SPARK_LOCAL_IP=localhost Apache Maven 3.8.7 (b89d5959fcde851dcb1c8946a785a163f14e1e29) Maven home: /Users/${userName}/Tools/maven Java version: 1.8.0_362, vendor: Azul Systems, Inc., runtime: /Users/${userName}/Tools/zulu8/zulu-8.jdk/Contents/Home/jre Default locale: zh_CN, platform encoding: UTF-8 OS name: "mac os x", version: "13.2.1", arch: "aarch64", family: "mac" ``` We can see Spark use 3.8.7 in build directory when the system default maven < 3.8.7. Closes #40283 from LuciferYang/ban-maven-3.9.x. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit f70b8cf1a00002b6c6b96ec4e6ad4d6c2f0ab392) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 March 2023, 19:06:07 UTC
d98a1d8 [SPARK-42635][SQL][3.3] Fix the TimestampAdd expression This is a backport of #40237. ### What changes were proposed in this pull request? This PR fixed the counter-intuitive behaviors of the `TimestampAdd` expression mentioned in https://issues.apache.org/jira/browse/SPARK-42635. See the following *user-facing* changes for details. ### Does this PR introduce _any_ user-facing change? Yes. This PR fixes the three problems mentioned in SPARK-42635: 1. When the time is close to daylight saving time transition, the result may be discontinuous and not monotonic. 2. Adding month, quarter, and year silently ignores `Int` overflow during unit conversion. 3. Adding sub-month units (week, day, hour, minute, second, millisecond, microsecond)silently ignores `Long` overflow during unit conversion. Some examples of the result changes: Old results: ``` // In America/Los_Angeles timezone: timestampadd(DAY, 1, 2011-03-12 03:00:00) = 2011-03-13 03:00:00 (this is correct, put it here for comparison) timestampadd(HOUR, 23, 2011-03-12 03:00:00) = 2011-03-13 03:00:00 timestampadd(HOUR, 24, 2011-03-12 03:00:00) = 2011-03-13 03:00:00 timestampadd(SECOND, 86400 - 1, 2011-03-12 03:00:00) = 2011-03-13 03:59:59 timestampadd(SECOND, 86400, 2011-03-12 03:00:00) = 2011-03-13 03:00:00 // In UTC timezone: timestampadd(quarter, 1431655764, 1970-01-01 00:00:00) = 1969-09-01 00:00:00 timestampadd(day, 106751992, 1970-01-01 00:00:00) = -290308-12-22 15:58:10.448384 ``` New results: ``` // In America/Los_Angeles timezone: timestampadd(DAY, 1, 2011-03-12 03:00:00) = 2011-03-13 03:00:00 timestampadd(HOUR, 23, 2011-03-12 03:00:00) = 2011-03-13 03:00:00 timestampadd(HOUR, 24, 2011-03-12 03:00:00) = 2011-03-13 04:00:00 timestampadd(SECOND, 86400 - 1, 2011-03-12 03:00:00) = 2011-03-13 03:59:59 timestampadd(SECOND, 86400, 2011-03-12 03:00:00) = 2011-03-13 04:00:00 // In UTC timezone: timestampadd(quarter, 1431655764, 1970-01-01 00:00:00) = throw overflow exception timestampadd(day, 106751992, 1970-01-01 00:00:00) = throw overflow exception ``` ### How was this patch tested? Pass existing tests and some new tests. Closes #40264 from chenhao-db/cherry-pick-SPARK-42635. Authored-by: Chenhao Li <chenhao.li@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 04 March 2023, 19:15:45 UTC
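A usage sketch reproducing one row of the "new results" table above (assumes a `spark` session and the America/Los_Angeles time zone, as in those examples):
```scala
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT timestampadd(HOUR, 24, TIMESTAMP'2011-03-12 03:00:00') AS ts").show(false)
// With this fix: 2011-03-13 04:00:00 (the old, non-monotonic result was 2011-03-13 03:00:00)
```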
20870c3 [SPARK-42647][PYTHON] Change alias for numpy deprecated and removed types ### Problem description Numpy has started changing the alias to some of its data-types. This means that users with the latest version of numpy they will face either warnings or errors according to the type that they are using. This affects all the users using numoy > 1.20.0 One of the types was fixed back in September with this [pull](https://github.com/apache/spark/pull/37817) request [numpy 1.24.0](https://github.com/numpy/numpy/pull/22607): The scalar type aliases ending in a 0 bit size: np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0 as well as np.bool8 are now deprecated and will eventually be removed. [numpy 1.20.0](https://github.com/numpy/numpy/pull/14882): Using the aliases of builtin types like np.int is deprecated ### What changes were proposed in this pull request? From numpy 1.20.0 we receive a deprecattion warning on np.object(https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations) and from numpy 1.24.0 we received an attribute error: ``` attr = 'object' def __getattr__(attr): # Warn for expired attributes, and return a dummy function # that always raises an exception. import warnings try: msg = __expired_functions__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) def _expired(*args, **kwds): raise RuntimeError(msg) return _expired # Emit warnings for deprecated attributes try: val, msg = __deprecated_attrs__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) return val if attr in __future_scalars__: # And future warnings for those that will change, but also give # the AttributeError warnings.warn( f"In the future `np.{attr}` will be defined as the " "corresponding NumPy scalar.", FutureWarning, stacklevel=2) if attr in __former_attrs__: > raise AttributeError(__former_attrs__[attr]) E AttributeError: module 'numpy' has no attribute 'object'. E `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. E The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: E https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` From numpy version 1.24.0 we receive a deprecation warning on np.object0 and every np.datatype0 and np.bool8 >>> np.object0(123) <stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24) ### Why are the changes needed? The changes are needed so pyspark can be compatible with the latest numpy and avoid - attribute errors on data types being deprecated from version 1.20.0: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations - warnings on deprecated data types from version 1.24.0: https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations ### Does this PR introduce _any_ user-facing change? The change will suppress the warning coming from numpy 1.24.0 and the error coming from numpy 1.22.0 ### How was this patch tested? I assume that the existing tests should catch this. (see all section Extra questions) I found this to be a problem in my work's project where we use for our unit tests the toPandas() function to convert to np.object. 
Attaching the run result of our test: ``` _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ /usr/local/lib/python3.9/dist-packages/<my-pkg>/unit/spark_test.py:64: in run_testcase self.handler.compare_df(result, expected, config=self.compare_config) /usr/local/lib/python3.9/dist-packages/<my-pkg>/spark_test_handler.py:38: in compare_df actual_pd = actual.toPandas().sort_values(by=sort_columns, ignore_index=True) /usr/local/lib/python3.9/dist-packages/pyspark/sql/pandas/conversion.py:232: in toPandas corrected_dtypes[index] = np.object # type: ignore[attr-defined] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ attr = 'object' def __getattr__(attr): # Warn for expired attributes, and return a dummy function # that always raises an exception. import warnings try: msg = __expired_functions__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) def _expired(*args, **kwds): raise RuntimeError(msg) return _expired # Emit warnings for deprecated attributes try: val, msg = __deprecated_attrs__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) return val if attr in __future_scalars__: # And future warnings for those that will change, but also give # the AttributeError warnings.warn( f"In the future `np.{attr}` will be defined as the " "corresponding NumPy scalar.", FutureWarning, stacklevel=2) if attr in __former_attrs__: > raise AttributeError(__former_attrs__[attr]) E AttributeError: module 'numpy' has no attribute 'object'. E `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. E The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: E https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations /usr/local/lib/python3.9/dist-packages/numpy/__init__.py:305: AttributeError ``` Although i cannot provide the code doing in python the following should show the problem: ``` >>> import numpy as np >>> np.object0(123) <stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24) 123 >>> np.object(123) <stdin>:1: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar. Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.9/dist-packages/numpy/__init__.py", line 305, in __getattr__ raise AttributeError(__former_attrs__[attr]) AttributeError: module 'numpy' has no attribute 'object'. `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. 
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` I do not have a use-case in my tests for np.object0 but I fixed like the suggestion from numpy ### Supported Versions: I propose this fix to be included in all pyspark 3.3 and onwards ### JIRA I know a JIRA ticket should be created I sent an email and I am waiting for the answer to document the case also there. ### Extra questions: By grepping for np.bool and np.object I see that the tests include them. Shall we change them also? Data types with _ I think they are not affected. ``` git grep np.object python/pyspark/ml/functions.py: return data.dtype == np.object_ and isinstance(data.iloc[0], (np.ndarray, list)) python/pyspark/ml/functions.py: return any(data.dtypes == np.object_) and any( python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[1], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[4], np.object) # datetime.date python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[1], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[6], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[7], np.object) git grep np.bool python/docs/source/user_guide/pandas_on_spark/types.rst:np.bool BooleanType python/pyspark/pandas/indexing.py: isinstance(key, np.bool_) for key in cols_sel python/pyspark/pandas/tests/test_typedef.py: np.bool: (np.bool, BooleanType()), python/pyspark/pandas/tests/test_typedef.py: bool: (np.bool, BooleanType()), python/pyspark/pandas/typedef/typehints.py: elif tpe in (bool, np.bool_, "bool", "?"): python/pyspark/sql/connect/expressions.py: assert isinstance(value, (bool, np.bool_)) python/pyspark/sql/connect/expressions.py: elif isinstance(value, np.bool_): python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[2], np.bool) python/pyspark/sql/tests/test_functions.py: (np.bool_, [("true", "boolean")]), ``` If yes concerning bool was merged already should we fix it too? Closes #40220 from aimtsou/numpy-patch. Authored-by: Aimilios Tsouvelekakis <aimtsou@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit b3c26b8b3aa90c829aec50ba170d14873ca5bde9) Signed-off-by: Sean Owen <srowen@gmail.com> 03 March 2023, 00:50:41 UTC
7e90f42 [SPARK-42553][SQL][3.3] Ensure at least one time unit after "interval" ### What changes were proposed in this pull request? This PR aims to ensure "at least one time unit should be given for interval literal" by modifying SqlBaseParser. This is a backport of https://github.com/apache/spark/pull/40195 ### Why are the changes needed? INTERVAL is a Non-Reserved keyword in spark. But when I run ```shell scala> spark.sql("select interval from mytable") ``` I get ``` org.apache.spark.sql.catalyst.parser.ParseException: at least one time unit should be given for interval literal(line 1, pos 7)== SQL == select interval from mytable -------^^^ at org.apache.spark.sql.errors.QueryParsingErrors$.invalidIntervalLiteralError(QueryParsingErrors.scala:196) ...... ``` It is a bug because "Non-Reserved keywords" have a special meaning in particular contexts and can be used as identifiers in other contexts. So by design, INTERVAL can be used as a column name. Currently the interval's grammar is ``` interval : INTERVAL (errorCapturingMultiUnitsInterval | errorCapturingUnitToUnitInterval)? ; ``` There is no need to make the time unit nullable, we can ensure "at least one time unit should be given for interval literal" if the interval's grammar is ``` interval : INTERVAL (errorCapturingMultiUnitsInterval | errorCapturingUnitToUnitInterval) ; ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test: PlanParsserSuite."SPARK-42553: NonReserved keyword 'interval' can be column name" Local test ```shell scala> val myDF = spark.sparkContext.makeRDD(1 to 5).toDF("interval") myDF: org.apache.spark.sql.DataFrame = [interval: int] scala> myDF.createOrReplaceTempView("mytable") scala> spark.sql("select interval from mytable;").show() +--------+ |interval| +--------+ | 1| | 2| | 3| | 4| | 5| +--------+ ``` Closes #40253 from jiang13021/branch-3.3-42553. Authored-by: jiangyzanze <jiangyanze.jyz@alibaba-inc.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> 02 March 2023, 15:25:02 UTC
037ec0c [SPARK-42649][CORE] Remove the standard Apache License header from the top of third-party source files This PR aims to remove the standard Apache License header from the top of third-party source files. According to LICENSE file, I found two files. - https://github.com/apache/spark/blob/master/LICENSE This was requested via `devspark` mailing list. - https://lists.apache.org/thread/wfy9sykncw2znhzlvyd18bkyjr7l9x43 Here is the ASF legal policy. - https://www.apache.org/legal/src-headers.html#3party > Do not add the standard Apache License header to the top of third-party source files. No. This is a source code distribution. Manual review. Closes #40249 from dongjoon-hyun/SPARK-42649. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 2c9f67ca5d1bb5de0fe4418ebcf95f2d1a8e3371) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 02 March 2023, 09:04:21 UTC
b091b65 [SPARK-42596][CORE][YARN] OMP_NUM_THREADS not set to number of executor cores by default ### What changes were proposed in this pull request? The PR fixes a mistake in SPARK-41188 that removed the PythonRunner code setting OMP_NUM_THREADS to the number of executor cores by default. The author and reviewers thought it was a duplicate. ### Why are the changes needed? SPARK-41188 stopped setting OMP_NUM_THREADS to the number of executor cores by default when running Python UDFs on YARN. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual testing Closes #40199 from jzhuge/SPARK-42596. Authored-by: John Zhuge <jzhuge@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 43b15b31d26bbf1e539728e6c64aab4eda7ade62) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 28 February 2023, 02:19:45 UTC
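A hedged sketch of the restored behaviour (illustrative, not the exact PythonRunner code): default `OMP_NUM_THREADS` to the executor-core count unless the user has already set it.
```scala
// Illustrative only: parameter names mirror the description above.
def defaultOmpNumThreads(
    envVars: scala.collection.mutable.Map[String, String],
    executorCores: Option[String]): Unit = {
  if (!envVars.contains("OMP_NUM_THREADS")) {
    // Without a cap, native libraries (NumPy/OpenBLAS, etc.) may spawn one thread
    // per physical core and oversubscribe the container.
    executorCores.foreach(cores => envVars.put("OMP_NUM_THREADS", cores))
  }
}
```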
52d52a1 [SPARK-40376][PYTHON] Avoid Numpy deprecation warning ### What changes were proposed in this pull request? Use `bool` instead of `np.bool` as `np.bool` will be deprecated (see: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations) Using `np.bool` generates this warning: ``` UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation. 3070E `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. 3071E Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` ### Why are the changes needed? Deprecation soon: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. ### Does this PR introduce _any_ user-facing change? The warning will be suppressed ### How was this patch tested? Existing tests should suffice. Closes #37817 from ELHoussineT/patch-1. Authored-by: ELHoussineT <elhoussinetalab@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> 25 February 2023, 16:50:24 UTC
11eb66e [SPARK-42286][SPARK-41991][SPARK-42473][3.3][SQL] Fallback to previous codegen code path for complex expr with CAST ### What changes were proposed in this pull request? This PR fixes the internal error `Child is not Cast or ExpressionProxy of Cast` for valid `CaseWhen` expr with `Cast` expr in its branches. Specifically, after SPARK-39865, an improved error msg for overflow exception during table insert was introduced. The improvement covers `Cast` expr and `ExpressionProxy` expr, but `CaseWhen` and other complex ones are not covered. An example below hits an internal error today. ``` create table t1 as select x FROM values (1), (2), (3) as tab(x); create table t2 (x Decimal(9, 0)); insert into t2 select 0 - (case when x = 1 then 1 else x end) from t1 where x = 1; ``` To fix the query failure, we decide to fall back to the previous handling if the expr is not a simple `Cast` expr or `ExpressionProxy` expr. ### Why are the changes needed? To fix the query regression introduced in SPARK-39865. ### Does this PR introduce _any_ user-facing change? No. We just fall back to the previous error msg if the expression involving `Cast` is not a simple one. ### How was this patch tested? - Added Unit test. - Removed one test case for the `Child is not Cast or ExpressionProxy of Cast` internal error, as now we do not check if the child has a `Cast` expression and fall back to the previous error message. Closes #40140 from RunyaoChen/cast_fix_branch_3.3_cp. Lead-authored-by: RunyaoChen <runyao.chen@databricks.com> Co-authored-by: Runyao Chen <runyao.chen@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> 24 February 2023, 18:56:50 UTC
82fe394 [SPARK-42516][SQL] Always capture the session time zone config while creating views ### What changes were proposed in this pull request? In the PR, I propose to capture the session time zone config (`spark.sql.session.timeZone`) as a view property, and use it while re-parsing/analysing the view. If the SQL config is not set while creating a view, use the default value of the config. ### Why are the changes needed? To improve user experience with Spark SQL. The current behaviour might confuse users because query results depend on whether or not the session time zone was set explicitly while creating a view. ### Does this PR introduce _any_ user-facing change? Yes. Before the changes, the current value of the session time zone was used in view analysis, but this behaviour can still be restored via another SQL config `spark.sql.legacy.useCurrentConfigsForView`. ### How was this patch tested? By running the new test via: ``` $ build/sbt "test:testOnly *.PersistedViewTestSuite" ``` Closes #40103 from MaxGekk/view-tz-conf. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 00e56905f77955f67e3809d724b33aebcc79cb5e) Signed-off-by: Max Gekk <max.gekk@gmail.com> 22 February 2023, 11:03:49 UTC
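A hedged usage sketch of the intended behaviour (the view body and column name are illustrative; the exact expressions affected depend on how the view is re-analysed):
```scala
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("CREATE OR REPLACE VIEW v AS SELECT unix_timestamp('2023-01-01 00:00:00') AS epoch")

// Later, in a session configured with a different time zone:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT epoch FROM v").show()
// With this change the view is analysed with the time zone captured at creation
// time (America/Los_Angeles), so the result no longer depends on the querying session.
```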
9b0d49b [MINOR][TESTS] Avoid NPE in an anonym SparkListener in DataFrameReaderWriterSuite ### What changes were proposed in this pull request? Avoid the following NPE in an anonymous SparkListener in DataFrameReaderWriterSuite, as the job description may be absent ``` java.lang.NullPointerException at java.util.concurrent.ConcurrentLinkedQueue.checkNotNull(ConcurrentLinkedQueue.java:920) at java.util.concurrent.ConcurrentLinkedQueue.offer(ConcurrentLinkedQueue.java:327) at java.util.concurrent.ConcurrentLinkedQueue.add(ConcurrentLinkedQueue.java:297) at org.apache.spark.sql.test.DataFrameReaderWriterSuite$$anon$2.onJobStart(DataFrameReaderWriterSuite.scala:1151) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1462) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) ``` ### Why are the changes needed? Test Improvement ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #40102 from yaooqinn/test-minor. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 088ebdeea67dd509048a7559f1c92a3636e18ce6) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 February 2023, 12:08:17 UTC
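A minimal sketch of the guard (the listener and queue are illustrative; `ConcurrentLinkedQueue.add(null)` is what throws the NPE shown above): only enqueue the job description when it is present.
```scala
import java.util.concurrent.ConcurrentLinkedQueue
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

val jobDescriptions = new ConcurrentLinkedQueue[String]()

val listener = new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // The description property may be absent, so guard against null before adding.
    Option(jobStart.properties.getProperty("spark.job.description"))
      .foreach(jobDescriptions.add)
  }
}
```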
6da0f88 [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160 ### What changes were proposed in this pull request? SPARK-41952 was raised for a while, but unfortunately, the Parquet community does not publish the patched version yet, as a workaround, we can fix the issue on the Spark side first. We encountered this memory issue when migrating data from parquet/snappy to parquet/zstd, Spark executors always occupy unreasonable off-heap memory and have a high risk of being killed by NM. See more discussions at https://github.com/apache/parquet-mr/pull/982 and https://github.com/apache/iceberg/pull/5681 ### Why are the changes needed? The issue is fixed in the parquet community [PARQUET-2160](https://issues.apache.org/jira/browse/PARQUET-2160), but the patched version is not available yet. ### Does this PR introduce _any_ user-facing change? Yes, it's bug fix. ### How was this patch tested? The existing UT should cover the correctness check, I also verified this patch by scanning a large parquet/zstd table. ``` spark-shell --executor-cores 4 --executor-memory 6g --conf spark.executor.memoryOverhead=2g ``` ``` spark.sql("select sum(hash(*)) from parquet_zstd_table ").show(false) ``` - before this patch All executors get killed by NM quickly. ``` ERROR YarnScheduler: Lost executor 1 on hadoop-xxxx.****.org: Container killed by YARN for exceeding physical memory limits. 8.2 GB of 8 GB physical memory used. Consider boosting spark.executor.memoryOverhead. ``` <img width="1872" alt="image" src="https://user-images.githubusercontent.com/26535726/220031678-e9060244-5586-4f0c-8fe7-55bb4e20a580.png"> - after this patch Query runs well, no executor gets killed. <img width="1881" alt="image" src="https://user-images.githubusercontent.com/26535726/220031917-4fe38c07-b38f-49c6-a982-2091a6c2a8ed.png"> Closes #40091 from pan3793/SPARK-41952. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Chao Sun <sunchao@apple.com> 20 February 2023, 17:42:10 UTC
8a490ea [SPARK-41741][SQL] Encode the string using the UTF_8 charset in ParquetFilters This PR makes it encode the string using the `UTF_8` charset in `ParquetFilters`. Fix data issue where the default charset is not `UTF_8`. No. Manual test. Closes #40090 from wangyum/SPARK-41741. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit d5fa41efe2b1aa0aa41f558c1bef048b4632cf5c) Signed-off-by: Yuming Wang <yumwang@ebay.com> 20 February 2023, 11:30:34 UTC
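An illustration of the underlying pitfall (not the `ParquetFilters` code itself): `String.getBytes()` without an argument uses the JVM's default charset, which varies across environments, while passing `StandardCharsets.UTF_8` is deterministic.
```scala
import java.nio.charset.StandardCharsets

object CharsetDemo {
  def main(args: Array[String]): Unit = {
    val s = "café"

    // Depends on -Dfile.encoding / the platform default; may not be UTF-8 everywhere.
    val platformBytes = s.getBytes

    // Always the same bytes, regardless of where the code runs.
    val utf8Bytes = s.getBytes(StandardCharsets.UTF_8)

    println(s"platform: ${platformBytes.length} bytes, utf-8: ${utf8Bytes.length} bytes")
  }
}
```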
dd0a41a [SPARK-42462][K8S] Prevent `docker-image-tool.sh` from publishing OCI manifests ### What changes were proposed in this pull request? This is found during Apache Spark 3.3.2 docker image publishing. It's not an Apache Spark but important for `docker-image-tool.sh` to provide backward compatibility during cross-building. This PR targets for all **future releases**, Apache Spark 3.4.0/3.3.3/3.2.4. ### Why are the changes needed? Docker `buildx` v0.10.0 publishes OCI Manifests by default which is not supported by `docker manifest` command like the following. https://github.com/docker/buildx/issues/1509 ``` $ docker manifest inspect apache/spark:v3.3.2 no such manifest: docker.io/apache/spark:v3.3.2 ``` Note that the published images are working on both AMD64/ARM64 machines, but `docker manifest` cannot be used. For example, we cannot create `latest` tag. ### Does this PR introduce _any_ user-facing change? This will fix the regression of Docker `buildx`. ### How was this patch tested? Manually builds the multi-arch image and check `manifest`. ``` $ docker manifest inspect apache/spark:v3.3.2 { "schemaVersion": 2, "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json", "manifests": [ { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 3444, "digest": "sha256:30ae5023fc384ae3b68d2fb83adde44b1ece05f926cfceecac44204cdc9e79cb", "platform": { "architecture": "amd64", "os": "linux" } }, { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 3444, "digest": "sha256:aac13b5b5a681aefa91036d2acae91d30a743c2e78087c6df79af4de46a16e1b", "platform": { "architecture": "arm64", "os": "linux" } } ] } ``` Closes #40051 from dongjoon-hyun/SPARK-42462. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 2ac70ae5381333aa899d82f6cd4c3bbae524e1c2) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 February 2023, 05:52:44 UTC
a3c27b6 [SPARK-42445][R] Fix SparkR `install.spark` function ### What changes were proposed in this pull request? This PR fixes `SparkR` `install.spark` method. ``` $ curl -LO https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/SparkR_3.3.2.tar.gz $ R CMD INSTALL SparkR_3.3.2.tar.gz $ R R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid" Copyright (C) 2022 The R Foundation for Statistical Computing Platform: aarch64-apple-darwin20 (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > library(SparkR) Attaching package: ‘SparkR’ The following objects are masked from ‘package:stats’: cov, filter, lag, na.omit, predict, sd, var, window The following objects are masked from ‘package:base’: as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind, sample, startsWith, subset, summary, transform, union > install.spark() Spark not found in the cache directory. Installation will start. MirrorUrl not provided. Looking for preferred site from apache website... Preferred mirror site found: https://dlcdn.apache.org/spark Downloading spark-3.3.2 for Hadoop 2.7 from: - https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz trying URL 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.7.tgz' simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196 > install.spark(hadoopVersion="3") Spark not found in the cache directory. Installation will start. MirrorUrl not provided. Looking for preferred site from apache website... Preferred mirror site found: https://dlcdn.apache.org/spark Downloading spark-3.3.2 for Hadoop 3 from: - https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-3.tgz trying URL 'https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-3.tgz' simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196 ``` Note that this is a regression at Spark 3.3.0 and not a blocker for on-going Spark 3.3.2 RC vote. ### Why are the changes needed? https://spark.apache.org/docs/latest/api/R/reference/install.spark.html#ref-usage ![Screenshot 2023-02-14 at 10 07 49 PM](https://user-images.githubusercontent.com/9700541/218946460-ab7eab1b-65ae-4cb2-bc7c-5810ad359ac9.png) First, the existing Spark 2.0.0 link is broken. - https://spark.apache.org/docs/latest/api/R/reference/install.spark.html#details - http://apache.osuosl.org/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz (Broken) Second, Spark 3.3.0 changed the Hadoop postfix pattern from the distribution files so that the function raises errors as described before. - http://archive.apache.org/dist/spark/spark-3.2.3/spark-3.2.3-bin-hadoop2.7.tgz (Old Pattern) - http://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz (New Pattern) ### Does this PR introduce _any_ user-facing change? No, this fixes a bug like Spark 3.2.3 and older versions. ### How was this patch tested? Pass the CI and manual testing. Please note that the link pattern is correct although it fails because 3.5.0 is not published yet. 
``` $ NO_MANUAL=1 ./dev/make-distribution.sh --r $ R CMD INSTALL R/SparkR_3.5.0-SNAPSHOT.tar.gz $ R > library(SparkR) > install.spark() Spark not found in the cache directory. Installation will start. MirrorUrl not provided. Looking for preferred site from apache website... Preferred mirror site found: https://dlcdn.apache.org/spark Downloading spark-3.5.0 for Hadoop 3 from: - https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz trying URL 'https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz' simpleWarning in download.file(remotePath, localPath): downloaded length 0 != reported length 196 ``` Closes #40031 from dongjoon-hyun/SPARK-42445. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b47d29a5299620bf4a87e33bb2de4db81a572edf) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 February 2023, 17:31:38 UTC
b370674 Preparing development version 3.3.3-SNAPSHOT 10 February 2023, 17:25:43 UTC
5103e00 Preparing Spark release v3.3.2-rc1 10 February 2023, 17:25:33 UTC
307ec98 [MINOR][SS] Fix setTimeoutTimestamp doc ### What changes were proposed in this pull request? This patch updates the API doc of `setTimeoutTimestamp` of `GroupState`. ### Why are the changes needed? Update incorrect API doc. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Doc change only. Closes #39958 from viirya/fix_group_state. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit a180e67d3859a4e145beaf671c1221fb4d6cbda7) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 10 February 2023, 02:17:51 UTC
7205567 [SPARK-40819][SQL][FOLLOWUP] Update SqlConf version for nanosAsLong configuration As requested by HyukjinKwon in https://github.com/apache/spark/pull/38312 NB: This change needs to be backported ### What changes were proposed in this pull request? Update version set for "spark.sql.legacy.parquet.nanosAsLong" configuration in SqlConf. This update is required because the previous PR set version to `3.2.3` which has already been released. Updating to version `3.2.4` will correctly reflect when this configuration element was added ### Why are the changes needed? Correctness and to complete SPARK-40819 ### Does this PR introduce _any_ user-facing change? No, this is merely so this configuration element has the correct version ### How was this patch tested? N/A Closes #39943 from awdavidson/SPARK-40819_sql-conf. Authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 409c661542c4b966876f0af4119803de25670649) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 February 2023, 00:02:36 UTC
3ec9b05 [SPARK-40819][SQL][3.3] Timestamp nanos behaviour regression As per HyukjinKwon's request on https://github.com/apache/spark/pull/38312 to backport the fix into 3.3 ### What changes were proposed in this pull request? Handle `TimeUnit.NANOS` for parquet `Timestamps`, addressing a regression in behaviour since 3.2 ### Why are the changes needed? Since version 3.2, reading parquet files that contain attributes with type `TIMESTAMP(NANOS,true)` is not possible as ParquetSchemaConverter returns ``` Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true)) ``` https://issues.apache.org/jira/browse/SPARK-34661 introduced a change matching on the `LogicalTypeAnnotation` which only covers Timestamp cases for `TimeUnit.MILLIS` and `TimeUnit.MICROS`, meaning `TimeUnit.NANOS` would return `illegalType()`. Prior to 3.2 the matching used the `originalType`, which for `TIMESTAMP(NANOS,true)` returned `null` and therefore resulted in a `LongType`. The proposed change is to consider `TimeUnit.NANOS` and return `LongType`, making the behaviour the same as before. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit test covering this scenario. Internally deployed to read parquet files that contain `TIMESTAMP(NANOS,true)` Closes #39904 from awdavidson/ts-nanos-fix-3.3. Lead-authored-by: alfreddavidson <alfie.davidson9@gmail.com> Co-authored-by: awdavidson <54780428+awdavidson@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 08 February 2023, 02:07:25 UTC
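A simplified sketch of the described mapping (not the real `ParquetSchemaConverter` code; in the actual backport the nanos-as-long behaviour is tied to `spark.sql.legacy.parquet.nanosAsLong`, see the follow-up entry above): treat `TimeUnit.NANOS` timestamps as plain 64-bit integers, as Spark did before 3.2.
```scala
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit
import org.apache.spark.sql.types.{DataType, LongType, TimestampType}

// Illustrative only: map a parquet INT64 timestamp unit to a Catalyst type.
def timestampTypeFor(unit: TimeUnit): DataType = unit match {
  case TimeUnit.MILLIS | TimeUnit.MICROS => TimestampType
  case TimeUnit.NANOS => LongType   // behave as before 3.2: expose nanos as a raw long
}
```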
51ed6ba [SPARK-41962][MINOR][SQL] Update the order of imports in class SpecificParquetRecordReaderBase ### What changes were proposed in this pull request? Update the order of imports in class SpecificParquetRecordReaderBase. ### Why are the changes needed? Follow the code style. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Passed GA. Closes #39906 from wayneguow/import. Authored-by: wayneguow <guow93@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit d6134f78d3d448a990af53beb8850ff91b71aef6) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 07 February 2023, 07:11:29 UTC
17b7123 [SPARK-42346][SQL] Rewrite distinct aggregates after subquery merge ### What changes were proposed in this pull request? Unfortunately https://github.com/apache/spark/pull/32298 introduced a regression from Spark 3.2 to 3.3 as after that change a merged subquery can contain multiple distict type aggregates. Those aggregates need to be rewritten by the `RewriteDistinctAggregates` rule to get the correct results. This PR fixed that. ### Why are the changes needed? The following query: ``` SELECT (SELECT count(distinct c1) FROM t1), (SELECT count(distinct c2) FROM t1) ``` currently fails with: ``` java.lang.IllegalStateException: You hit a query analyzer bug. Please report your query to Spark user mailing list. at org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:538) ``` but works again after this PR. ### Does this PR introduce _any_ user-facing change? Yes, the above query works again. ### How was this patch tested? Added new UT. Closes #39887 from peter-toth/SPARK-42346-rewrite-distinct-aggregates-after-subquery-merge. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> (cherry picked from commit 5940b9884b4b172f65220da7857d2952b137bc51) Signed-off-by: Yuming Wang <yumwang@ebay.com> 06 February 2023, 13:22:58 UTC
cdb494b [SPARK-42344][K8S] Change the default size of the CONFIG_MAP_MAXSIZE The default size of the CONFIG_MAP_MAXSIZE should not be greater than 1048576 ### What changes were proposed in this pull request? This PR changed the default size of the CONFIG_MAP_MAXSIZE from 1572864 (1.5 MiB) to 1048576 (1.0 MiB) ### Why are the changes needed? When a job is submitted by Spark to K8S with a ConfigMap, spark-submit calls the K8S POST API "api/v1/namespaces/default/configmaps". The size of the ConfigMap is validated by this K8S API; the max value should not be greater than 1048576. In the previous comment, the explanation in https://etcd.io/docs/v3.4/dev-guide/limit/ is: "etcd is designed to handle small key value pairs typical for metadata. Larger requests will work, but may increase the latency of other requests. By default, the maximum size of any request is 1.5 MiB. This limit is configurable through --max-request-bytes flag for etcd server." This explanation is from the perspective of etcd, not K8S. So I think the default value of the configmap in Spark should not be greater than 1048576. ### Does this PR introduce _any_ user-facing change? Yes. Generally, the size of the configmap will not exceed 1572864 or even 1048576, so the problem solved here may not be perceived by users. ### How was this patch tested? local test Closes #39884 from ninebigbig/master. Authored-by: Yan Wei <ninebigbig@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 9ac46408ec943d5121bbc14f2ce0d8b2ff453de5) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit d07a0e9b476e1d846e7be13394c8251244cc832e) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 05 February 2023, 11:12:19 UTC
2d539c5 [SPARK-41554] Fix changing of Decimal scale when scale decreased by more than 18 This is a backport PR for https://github.com/apache/spark/pull/39099 Closes #39813 from fe2s/branch-3.3-fix-decimal-scaling. Authored-by: oleksii.diagiliev <oleksii.diagiliev@workday.com> Signed-off-by: Sean Owen <srowen@gmail.com> 03 February 2023, 16:48:56 UTC
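A hedged illustration of the kind of conversion involved, assuming a spark-shell session; the backport message does not show the triggering values, so the literal below is only an example:
```
// Example only: the scale drops from 19 to 0, i.e. by more than 18, which is
// the class of Decimal scale changes that SPARK-41554 addresses.
spark.sql("SELECT CAST(CAST(1.2345678901234567890 AS DECIMAL(38,19)) AS DECIMAL(18,0))").show()
```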
6e0dfa9 [MINOR][DOCS][PYTHON][PS] Fix the `.groupby()` method docstring ### What changes were proposed in this pull request? Update the docstring for the `.groupby()` method. ### Why are the changes needed? The `.groupby()` method accepts a list of columns (or a single column), and a column is defined by a `Series` or name (`Label`). It's a bit confusing to say "using a Series of columns", because `Series` (capitalized) is a specific object that isn't actually used/reasonable to use here. ### Does this PR introduce _any_ user-facing change? Yes (documentation) ### How was this patch tested? N/A Closes #38625 from deepyaman/patch-3. Authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 71154dc1b35c7227ef9033fe5abc2a8b3f2d0990) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 February 2023, 06:51:15 UTC
80e8df1 [SPARK-42259][SQL] ResolveGroupingAnalytics should take care of Python UDAF This is a long-standing correctness issue with Python UDAF and grouping analytics. The rule `ResolveGroupingAnalytics` should take care of Python UDAF when matching aggregate expressions. Bug fix. Yes, the query result was wrong before. Existing tests. Closes #39824 from cloud-fan/python. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1219c8492376e038894111cd5d922229260482e7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 February 2023, 09:40:56 UTC
0bb8f22 [SPARK-42230][INFRA][FOLLOWUP] Add `GITHUB_PREV_SHA` and `APACHE_SPARK_REF` to lint job ### What changes were proposed in this pull request? Like the other jobs, this PR aims to add the `GITHUB_PREV_SHA` and `APACHE_SPARK_REF` environment variables to the `lint` job. ### Why are the changes needed? This is required to detect the changed module accurately. ### Does this PR introduce _any_ user-facing change? No, this is an infra-only bug fix. ### How was this patch tested? Manual review. Closes #39809 from dongjoon-hyun/SPARK-42230-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 1304a3329d7feb1bd6f9a9dba09f37494c9bb4a2) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 January 2023, 19:59:45 UTC
9d7373a [SPARK-42230][INFRA] Improve `lint` job by skipping PySpark and SparkR docs if unchanged ### What changes were proposed in this pull request? This PR aims to improve the `GitHub Action lint` job by skipping `PySpark` and `SparkR` documentation generation if the PySpark and R modules are unchanged. ### Why are the changes needed? `Documentation Generation` took over 53 minutes because it always generates all Java/Scala/SQL/PySpark/R documentation. However, the `R` module is not changed frequently, so the documentation is usually identical. The `PySpark` module changes more frequently, but we can still skip it in many cases. This PR shows the reduction from `53` minutes to `18` minutes. **BEFORE** ![Screenshot 2023-01-29 at 4 36 07 PM](https://user-images.githubusercontent.com/9700541/215365573-bf83717b-cd9e-46e2-912f-5c9d2f359b08.png) **AFTER** ![Screenshot 2023-01-29 at 10 13 27 PM](https://user-images.githubusercontent.com/9700541/215401795-3f810e52-2fe3-44fd-99f4-b5750964c3b6.png) ### Does this PR introduce _any_ user-facing change? No, this is an infra only change. ### How was this patch tested? Manual review. Closes #39792 from dongjoon-hyun/SPARK-42230. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 1d3c2681d26bf6034d15ee261e5395e9f45d67f8) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 January 2023, 07:38:48 UTC
6cd4ff6 [SPARK-42168][3.3][SQL][PYTHON][FOLLOW-UP] Test FlatMapCoGroupsInPandas with Window function ### What changes were proposed in this pull request? This ports tests from #39717 in branch-3.2 to branch-3.3. See https://github.com/apache/spark/pull/39752#issuecomment-1407157253. ### Why are the changes needed? To make sure this use case is tested. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? E2E test in `test_pandas_cogrouped_map.py` and analysis test in `EnsureRequirementsSuite.scala`. Closes #39781 from EnricoMi/branch-3.3-cogroup-window-bug-test. Authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 January 2023, 07:25:13 UTC
289e650 [SPARK-42201][BUILD] `build/sbt` should allow `SBT_OPTS` to override JVM memory setting ### What changes were proposed in this pull request? This PR aims to fix a bug where `build/sbt` doesn't allow JVM memory settings via `SBT_OPTS`. ### Why are the changes needed? `SBT_OPTS` is supposed to be used in this way in the community. https://github.com/apache/spark/blob/e30bb538e480940b1963eb14c3267662912d8584/appveyor.yml#L54 However, an `SBT_OPTS` memory setting like the following is ignored because `-Xms4096m -Xmx4096m -XX:ReservedCodeCacheSize=512m` is injected by default after `SBT_OPTS`. We should switch the order. ``` $ SBT_OPTS="-Xmx6g" build/sbt package ``` https://github.com/apache/spark/blob/e30bb538e480940b1963eb14c3267662912d8584/build/sbt-launch-lib.bash#L124 ### Does this PR introduce _any_ user-facing change? No. This is a dev-only change. ### How was this patch tested? Manually run the following. ``` $ SBT_OPTS="-Xmx6g" build/sbt package ``` While running the above command, check the JVM options. ``` $ ps aux | grep java dongjoon 36683 434.3 3.1 418465456 1031888 s001 R+ 1:11PM 0:19.86 /Users/dongjoon/.jenv/versions/temurin17/bin/java -Xms4096m -Xmx4096m -XX:ReservedCodeCacheSize=512m -Xmx6g -jar build/sbt-launch-1.8.2.jar package ``` Closes #39758 from dongjoon-hyun/SPARK-42201. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 66ec1eb630a4682f5ad2ed2ee989ffcce9031608) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 January 2023, 00:39:35 UTC
518a24c [SPARK-42188][BUILD][3.3] Force SBT protobuf version to match Maven ### What changes were proposed in this pull request? Update `SparkBuild.scala` to force SBT use of `protobuf-java` to match the Maven version. The Maven dependencyManagement section forces `protobuf-java` to use `2.5.0`, but SBT is using `3.14.0`. ### Why are the changes needed? Define `protoVersion` in `SparkBuild.scala` and use it in `DependencyOverrides` to force the SBT version of `protobuf-java` to match the setting defined in the Maven top-level `pom.xml`. Add comments to both `pom.xml` and `SparkBuild.scala` to ensure that the values are kept in sync. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Before the update, SBT reported using `3.14.0`: ``` % build/sbt dependencyTree | grep proto | sed 's/^.*-com/com/' | sort | uniq -c 8 com.google.protobuf:protobuf-java:2.5.0 (evicted by: 3.14.0) 70 com.google.protobuf:protobuf-java:3.14.0 ``` After the patch is applied, SBT reports using `2.5.0`: ``` % build/sbt dependencyTree | grep proto | sed 's/^.*-com/com/' | sort | uniq -c 70 com.google.protobuf:protobuf-java:2.5.0 ``` Closes #39746 from snmvaughan/feature/SPARK-42188-3.3. Authored-by: Steve Vaughan Jr <s_vaughan@apple.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> 26 January 2023, 02:17:39 UTC
d69e7b6 [SPARK-42179][BUILD][SQL][3.3] Upgrade ORC to 1.7.8 ### What changes were proposed in this pull request? This PR aims to upgrade ORC to 1.7.8 for Apache Spark 3.3.2. ### Why are the changes needed? Apache ORC 1.7.8 is a maintenance release with important bug fixes. - https://orc.apache.org/news/2023/01/21/ORC-1.7.8/ - [ORC-1332](https://issues.apache.org/jira/browse/ORC-1332) Avoid NegativeArraySizeException when using searchArgument - [ORC-1343](https://issues.apache.org/jira/browse/ORC-1343) Ignore orc.create.index ### Does this PR introduce _any_ user-facing change? The ORC dependency is going to be changed from 1.7.6 (Apache Spark 3.3.1) to 1.7.8 (Apache Spark 3.3.2) ### How was this patch tested? Pass the CIs. Closes #39735 from dongjoon-hyun/SPARK-42179. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 25 January 2023, 11:24:22 UTC
04edc7e [SPARK-42176][SQL] Fix cast of a boolean value to timestamp The PR fixes an issue when casting a boolean to timestamp. While `select cast(true as timestamp)` works and returns `1970-01-01 00:00:00.000001`, casting `false` to timestamp fails with the following error: > IllegalArgumentException: requirement failed: Literal must have a corresponding value to timestamp, but class Integer found. SBT test also fails with this error: ``` [info] java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long [info] at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) [info] at org.apache.spark.sql.catalyst.InternalRow$.$anonfun$getWriter$5(InternalRow.scala:178) [info] at org.apache.spark.sql.catalyst.InternalRow$.$anonfun$getWriter$5$adapted(InternalRow.scala:178) ``` The issue was that we need to return `0L` instead of `0` when converting `false` to a long. Fixes a small bug in cast. No. I added a unit test to verify the fix. Closes #39729 from sadikovi/fix_spark_boolean_to_timestamp. Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 866343c7be47d71b88ae9a6b4dda26f8c4f5964b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 January 2023, 03:33:15 UTC
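A quick check, assuming a spark-shell session, that exercises both boolean values; the expected values follow the description above:
```
// After the fix both casts succeed: true -> 1970-01-01 00:00:00.000001 and
// false -> the epoch (0L microseconds), instead of an IllegalArgumentException
// for false.
spark.sql("SELECT CAST(true AS TIMESTAMP), CAST(false AS TIMESTAMP)").show(truncate = false)
```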
b5cbee8 [SPARK-42090][3.3] Introduce sasl retry count in RetryingBlockTransferor ### What changes were proposed in this pull request? This PR introduces sasl retry count in RetryingBlockTransferor. ### Why are the changes needed? Previously a boolean variable, saslTimeoutSeen, was used. However, the boolean variable wouldn't cover the following scenario: 1. SaslTimeoutException 2. IOException 3. SaslTimeoutException 4. IOException Even though IOException at #2 is retried (resulting in increment of retryCount), the retryCount would be cleared at step #4. Since the intention of saslTimeoutSeen is to undo the increment due to retrying SaslTimeoutException, we should keep a counter for SaslTimeoutException retries and subtract the value of this counter from retryCount. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New test is added, courtesy of Mridul. Closes #39611 from tedyu/sasl-cnt. Authored-by: Ted Yu <yuzhihonggmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> Closes #39709 from akpatnam25/SPARK-42090-backport-3.3. Authored-by: Ted Yu <yuzhihong@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> 24 January 2023, 18:23:25 UTC
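A minimal sketch of the bookkeeping described above, not the actual `RetryingBlockTransferor` code; the class and method names are invented for illustration:
```
// Sketch only: track SASL-timeout retries in their own counter and subtract
// them from the overall retry count, instead of a single boolean flag that a
// later IOException retry would effectively reset.
class RetryBookkeeping(maxRetries: Int) {
  private var retryCount = 0
  private var saslRetryCount = 0

  // Returns true if another retry should be attempted.
  def onFailure(isSaslTimeout: Boolean): Boolean = {
    if (isSaslTimeout) saslRetryCount += 1
    retryCount += 1
    // SASL timeouts do not consume the IOException retry budget.
    (retryCount - saslRetryCount) < maxRetries
  }
}
```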
0fff27d [MINOR][K8S][DOCS] Add all resource managers in `Scheduling Within an Application` section ### What changes were proposed in this pull request? The `Job Scheduling` document doesn't mention the K8s resource manager so far because the `Scheduling Across Applications` section mentions all resource managers except K8s. This PR aims to add all supported resource managers to the `Scheduling Within an Application` section. ### Why are the changes needed? K8s also supports `FAIR` scheduling within an application. ### Does this PR introduce _any_ user-facing change? No. This is a doc-only update. ### How was this patch tested? N/A Closes #39704 from dongjoon-hyun/minor_job_scheduling. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 45dbc44410f9bf74c7fb4431aad458db32960461) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 January 2023, 08:06:32 UTC
41e6875 [SPARK-42157][CORE] `spark.scheduler.mode=FAIR` should provide FAIR scheduler ### What changes were proposed in this pull request? As our documentation states, `spark.scheduler.mode=FAIR` should provide `FAIR Scheduling Within an Application`. https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application ![Screenshot 2023-01-22 at 2 59 22 PM](https://user-images.githubusercontent.com/9700541/213944956-931e3a3c-d094-4455-8990-233c7966194b.png) This bug is hidden in our CI because we always have `fairscheduler.xml` as one of the test resources. - https://github.com/apache/spark/blob/master/core/src/test/resources/fairscheduler.xml ### Why are the changes needed? Currently, when `spark.scheduler.mode=FAIR` is given without a scheduler allocation file, Spark creates `Fair Scheduler Pools` with a `FIFO` scheduler, which is wrong. We need to switch the mode to `FAIR` from `FIFO`. **BEFORE** ``` $ bin/spark-shell -c spark.scheduler.mode=FAIR Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/01/22 14:47:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 23/01/22 14:47:38 WARN FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration. Spark context Web UI available at http://localhost:4040 ``` ![Screenshot 2023-01-22 at 2 50 38 PM](https://user-images.githubusercontent.com/9700541/213944555-6e367a33-ca58-4daf-9ba4-b0319fbe4516.png) **AFTER** ``` $ bin/spark-shell -c spark.scheduler.mode=FAIR Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/01/22 14:48:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Spark context Web UI available at http://localhost:4040 ``` ![Screenshot 2023-01-22 at 2 50 14 PM](https://user-images.githubusercontent.com/9700541/213944551-660aa298-638b-450c-ad61-db9e42a624b0.png) ### Does this PR introduce _any_ user-facing change? Yes, but this is a bug fix to match the official Apache Spark documentation. ### How was this patch tested? Pass the CIs. Closes #39703 from dongjoon-hyun/SPARK-42157. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 4d51bfa725c26996641f566e42ae392195d639c5) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 January 2023, 07:48:27 UTC
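A hedged setup sketch for fair scheduling within an application; the pool name and the commented-out allocation-file path are placeholders:
```
// Sketch only: enable FAIR mode and route jobs from this thread to a pool.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  // Optionally point at an allocation file; with this fix, omitting it still
  // yields FAIR scheduling in the default pool instead of FIFO.
  // .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

val spark = SparkSession.builder().config(conf).master("local[*]").getOrCreate()
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl")
```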
b7ababf [SPARK-41415][3.3] SASL Request Retries ### What changes were proposed in this pull request? Add the ability to retry SASL requests. We will also add this as a metric soon to track SASL retries. ### Why are the changes needed? We are seeing increased SASL timeouts internally, and this change would mitigate the issue. We already have this feature enabled for our 2.3 jobs, and we have seen failures significantly decrease. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests, and tested on a cluster to ensure the retries are being triggered correctly. Closes #38959 from akpatnam25/SPARK-41415. Authored-by: Aravind Patnam <apatnamlinkedin.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> Closes #39644 from akpatnam25/SPARK-41415-backport-3.3. Authored-by: Aravind Patnam <apatnam@linkedin.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> 21 January 2023, 03:30:51 UTC
8f09a69 [SPARK-42134][SQL] Fix getPartitionFiltersAndDataFilters() to handle filters without referenced attributes ### What changes were proposed in this pull request? This is a small correctness fix to `DataSourceUtils.getPartitionFiltersAndDataFilters()` to handle filters without any referenced attributes correctly. E.g. without the fix, the following query on a ParquetV2 source: ``` spark.conf.set("spark.sql.sources.useV1SourceList", "") spark.range(1).write.mode("overwrite").format("parquet").save(path) df = spark.read.parquet(path).toDF("i") f = udf(lambda x: False, "boolean")(lit(1)) r = df.filter(f) r.show() ``` returns ``` +---+ | i| +---+ | 0| +---+ ``` but it should return empty results. The root cause of the issue is that during `V2ScanRelationPushDown` a filter that doesn't reference any column is incorrectly identified as a partition filter. ### Why are the changes needed? To fix a correctness issue. ### Does this PR introduce _any_ user-facing change? Yes, fixes a correctness issue. ### How was this patch tested? Added new UT. Closes #39676 from peter-toth/SPARK-42134-fix-getpartitionfiltersanddatafilters. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit dcdcb80c53681d1daff416c007cf8a2810155625) Signed-off-by: huaxingao <huaxin_gao@apple.com> 21 January 2023, 02:35:57 UTC
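A Scala sketch of the same reproduction, assuming a spark-shell session; the output path is a placeholder:
```
// Scala version of the reproduction above (the path is a placeholder). The
// filter references only a literal, not a column of df, and must not be
// treated as a partition filter.
import org.apache.spark.sql.functions.{lit, udf}

spark.conf.set("spark.sql.sources.useV1SourceList", "")
spark.range(1).write.mode("overwrite").parquet("/tmp/spark-42134-sketch")

val df = spark.read.parquet("/tmp/spark-42134-sketch").toDF("i")
val f = udf((_: Int) => false).apply(lit(1))
df.filter(f).show()  // should print an empty result, not the row containing 0
```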
ffa6cbf [SPARK-40817][K8S][3.3] `spark.files` should preserve remote files ### What changes were proposed in this pull request? Backport https://github.com/apache/spark/pull/38376 to `branch-3.3` You can find a detailed description of the issue and an example reproduction on the Jira card: https://issues.apache.org/jira/browse/SPARK-40817 The idea for this fix is to update the logic which uploads user-specified files (via `spark.jars`, `spark.files`, etc) to `spark.kubernetes.file.upload.path`. After uploading local files, it used to overwrite the initial list of URIs passed by the user and it would thus erase all remote URIs which were specified there. Small example of this behaviour: 1. User set the value of `spark.jars` to `s3a://some-bucket/my-application.jar,/tmp/some-local-jar.jar` when running `spark-submit` in cluster mode 2. `BasicDriverFeatureStep.getAdditionalPodSystemProperties()` gets called at one point while running `spark-submit` 3. This function would set `spark.jars` to a new value of `${SPARK_KUBERNETES_UPLOAD_PATH}/spark-upload-${RANDOM_STRING}/some-local-jar.jar`. Note that `s3a://some-bucket/my-application.jar` has been discarded. With the logic proposed in this PR, the new value of `spark.jars` would be `s3a://some-bucket/my-application.jar,${SPARK_KUBERNETES_UPLOAD_PATH}/spark-upload-${RANDOM_STRING}/some-local-jar.jar`, so in other words we are making sure that remote URIs are no longer discarded. ### Why are the changes needed? We encountered this issue in production when trying to launch Spark on Kubernetes jobs in cluster mode with a mix of local and remote dependencies. ### Does this PR introduce _any_ user-facing change? Yes, see description of the new behaviour above. ### How was this patch tested? - Added a unit test for the new behaviour - Added an integration test for the new behaviour - Tried this patch in our Kubernetes environment with `SparkPi`: ``` spark-submit \ --master k8s://https://$KUBERNETES_API_SERVER_URL:443 \ --deploy-mode cluster \ --name=spark-submit-test \ --class org.apache.spark.examples.SparkPi \ --conf spark.jars=/opt/my-local-jar.jar,s3a://$BUCKET_NAME/my-remote-jar.jar \ --conf spark.kubernetes.file.upload.path=s3a://$BUCKET_NAME/my-upload-path/ \ [...] /opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar ``` Before applying the patch, `s3a://$BUCKET_NAME/my-remote-jar.jar` was discarded from the final value of `spark.jars`. After applying the patch and launching the job again, I confirmed that `s3a://$BUCKET_NAME/my-remote-jar.jar` was no longer discarded by looking at the Spark config for the running job. Closes #39669 from antonipp/spark-40817-branch-3.3. Authored-by: Anton Ippolitov <anton.ippolitov@datadoghq.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 January 2023, 22:39:56 UTC
0d37682 [SPARK-42110][SQL][TESTS] Reduce the number of repetitions in ParquetDeltaEncodingSuite.`random data test` ### What changes were proposed in this pull request? `random data test` consumes about 4 minutes in GitHub Actions and is worse in some other environments. ### Why are the changes needed? - https://github.com/apache/spark/actions/runs/3948081724/jobs/6757667891 ``` ParquetDeltaEncodingInt - random data test (1 minute, 51 seconds) ... ParquetDeltaEncodingLong ... - random data test (1 minute, 54 seconds) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the CIs. Closes #39648 from dongjoon-hyun/SPARK-42110. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit d4f757baca64b4b66dd8f4e0b09bf085cce34af5) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 January 2023, 00:36:05 UTC
9a8b652 [SPARK-42084][SQL] Avoid leaking the qualified-access-only restriction This is a better fix than https://github.com/apache/spark/pull/39077 and https://github.com/apache/spark/pull/38862. The special attribute metadata `__qualified_access_only` is very risky, as it breaks normal column resolution. The aforementioned 2 PRs remove the restriction in `SubqueryAlias` and `Alias`, but that's not good enough, as we may forget to do the same thing for new logical plans/expressions in the future. It's also problematic if advanced users manipulate logical plans and expressions directly, when there is no `SubqueryAlias` and `Alias` to remove the restriction. To be safe, we should only apply this restriction when resolving join hidden columns, which means the plan node right above `Project(Join(using or natural join))`. This PR simply removes the restriction when a column is resolved from a sequence of `Attributes`, or from star expansion, and also when adding the `Project` hidden columns to its output. This makes sure that the qualified-access-only restriction will not leak into normal column resolution, only into metadata column resolution. To make the join hidden column feature more robust. No. Existing tests. Closes #39596 from cloud-fan/join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 January 2023, 10:58:56 UTC
408b583 [SPARK-42071][CORE] Register `scala.math.Ordering$Reverse` to KryoSerializer ### What changes were proposed in this pull request? This PR aims to register `scala.math.Ordering$Reverse` to KryoSerializer. ### Why are the changes needed? Scala 2.12.12 added a new class 'Reverse' via https://github.com/scala/scala/pull/8965. This affects Apache Spark 3.2.0+. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with a newly added test case. Closes #39578 from dongjoon-hyun/SPARK-42071. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit e3c0fbeadfe5242fa6265cb0646d72d3b5f6ef35) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 January 2023, 09:10:03 UTC
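A hedged sketch of how such a registration gap surfaces: with strict registration, serializing an unregistered class fails fast. The config keys are real Spark settings; the reversed-ordering comment is an assumption about how `Ordering$Reverse` instances can appear at runtime:
```
// Sketch only: strict Kryo registration turns a missing registration (such as
// scala.math.Ordering$Reverse, produced e.g. by Ordering.Int.reverse on
// Scala 2.12.12+) into a fast failure instead of silently serializing the
// fully-qualified class name.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")

val spark = SparkSession.builder().config(conf).master("local[*]").getOrCreate()
```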
ff30903 [SPARK-41989][PYTHON] Avoid breaking logging config from pyspark.pandas See https://issues.apache.org/jira/browse/SPARK-41989 for in depth explanation Short summary: `pyspark/pandas/__init__.py` uses, at import time, `logging.warning()` which might silently call `logging.basicConfig()`. So by importing `pyspark.pandas` (directly or indirectly) a user might unknowingly break their own logging setup (e.g. when based on `logging.basicConfig()` or related). `logging.getLogger(...).warning()` does not trigger this behavior. User-defined logging setups will be more predictable. Manual testing so far. I'm not sure it's worthwhile to cover this with a unit test Closes #39516 from soxofaan/SPARK-41989-pyspark-pandas-logging-setup. Authored-by: Stefaan Lippens <stefaan.lippens@vito.be> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 04836babb7a1a2aafa7c65393c53c42937ef75a4) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 12 January 2023, 09:25:30 UTC
b97f79d [SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with aggregations ### What changes were proposed in this pull request? Backport #39131 to branch-3.3. Rule `PushDownLeftSemiAntiJoin` should not push an anti-join below an `Aggregate` when the join condition references an attribute that exists in its right plan and its left plan's child. This usually happens when the anti-join / semi-join is a self-join while `DeduplicateRelations` cannot deduplicate those attributes (in this example due to the projection of `value` to `id`). This behaviour already exists for `Project` and `Union`, but `Aggregate` lacks this safety guard. ### Why are the changes needed? Without this change, the optimizer creates an incorrect plan. This example fails with `distinct()` (an aggregation), and succeeds without `distinct()`, even though both variants are semantically equivalent: ```scala val ids = Seq(1, 2, 3).toDF("id").distinct() val result = ids.withColumn("id", $"id" + 1).join(ids, Seq("id"), "left_anti").collect() assert(result.length == 1) ``` With `distinct()`, rule `PushDownLeftSemiAntiJoin` creates a join condition `(value#907 + 1) = value#907`, which can never be true. This effectively removes the anti-join. **Before this PR:** The anti-join is fully removed from the plan. ``` == Physical Plan == AdaptiveSparkPlan (16) +- == Final Plan == LocalTableScan (1) (16) AdaptiveSparkPlan Output [1]: [id#900] Arguments: isFinalPlan=true ``` This is caused by `PushDownLeftSemiAntiJoin` adding join condition `(value#907 + 1) = value#907`, which is wrong because `id#910` in `(id#910 + 1) AS id#912` exists in the right child of the join as well as in the left grandchild: ``` === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin === !Join LeftAnti, (id#912 = id#910) Aggregate [id#910], [(id#910 + 1) AS id#912] !:- Aggregate [id#910], [(id#910 + 1) AS id#912] +- Project [value#907 AS id#910] !: +- Project [value#907 AS id#910] +- Join LeftAnti, ((value#907 + 1) = value#907) !: +- LocalRelation [value#907] :- LocalRelation [value#907] !+- Aggregate [id#910], [id#910] +- Aggregate [id#910], [id#910] ! +- Project [value#914 AS id#910] +- Project [value#914 AS id#910] ! +- LocalRelation [value#914] +- LocalRelation [value#914] ``` The right child of the join and the left grandchild would become the children of the pushed-down join, which creates an invalid join condition. **After this PR:** The join condition built from `(id#910 + 1) AS id#912` is understood to become ambiguous, as both sides of the prospective join contain `id#910`. Hence, the join is not pushed down. The rule is then not applied any more. The final plan contains the anti-join: ``` == Physical Plan == AdaptiveSparkPlan (24) +- == Final Plan == * BroadcastHashJoin LeftSemi BuildRight (14) :- * HashAggregate (7) : +- AQEShuffleRead (6) : +- ShuffleQueryStage (5), Statistics(sizeInBytes=48.0 B, rowCount=3) : +- Exchange (4) : +- * HashAggregate (3) : +- * Project (2) : +- * LocalTableScan (1) +- BroadcastQueryStage (13), Statistics(sizeInBytes=1024.0 KiB, rowCount=3) +- BroadcastExchange (12) +- * HashAggregate (11) +- AQEShuffleRead (10) +- ShuffleQueryStage (9), Statistics(sizeInBytes=48.0 B, rowCount=3) +- ReusedExchange (8) (8) ReusedExchange [Reuses operator id: 4] Output [1]: [id#898] (24) AdaptiveSparkPlan Output [1]: [id#900] Arguments: isFinalPlan=true ``` ### Does this PR introduce _any_ user-facing change? It fixes correctness. ### How was this patch tested? Unit tests in `DataFrameJoinSuite` and `LeftSemiAntiJoinPushDownSuite`. 
Closes #39409 from EnricoMi/branch-antijoin-selfjoin-fix-3.3. Authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 January 2023, 03:32:45 UTC
977e445 [SPARK-41864][INFRA][PYTHON] Fix mypy linter errors Currently, the GitHub Action Python linter job is broken. This PR recovers the Python linter failure. There are two kinds of failures. 1. https://github.com/apache/spark/actions/runs/3829330032/jobs/6524170799 ``` python/pyspark/pandas/sql_processor.py:221: error: unused "type: ignore" comment Found 1 error in 1 file (checked 380 source files) ``` 2. After fixing (1), we hit the following. ``` ModuleNotFoundError: No module named 'py._path'; 'py' is not a package ``` No. Pass the GitHub CI on this PR. Or, manually run the following. ``` $ dev/lint-python starting python compilation test... python compilation succeeded. starting black test... black checks passed. starting flake8 test... flake8 checks passed. starting mypy annotations test... annotations passed mypy checks. starting mypy examples test... examples passed mypy checks. starting mypy data test... annotations passed data checks. all lint-python tests passed! ``` Closes #39373 from dongjoon-hyun/SPARK-41864. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 13b2856e6e77392a417d2bb2ce804f873ee72b28) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 04 January 2023, 03:10:04 UTC
2da30ad [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available ### What changes were proposed in this pull request? This PR aims to skip `flake8` tests if the command is not available. ### Why are the changes needed? Linters are optional modules and can be skipped on some systems, like `mypy`. ``` $ dev/lint-python starting python compilation test... python compilation succeeded. The Python library providing 'black' module was not found. Skipping black checks for now. The flake8 command was not found. Skipping for now. The mypy command was not found. Skipping for now. all lint-python tests passed! ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual tests. Closes #39372 from dongjoon-hyun/SPARK-41863. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 1a5ef40a4d59b377b028b55ea3805caf5d55f28f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 03 January 2023, 23:01:52 UTC
02a7fda [SPARK-41732][SQL][SS][3.3] Apply tree-pattern based pruning for the rule SessionWindowing This PR ports back #39245 to branch-3.3. ### What changes were proposed in this pull request? This PR proposes to apply tree-pattern based pruning for the rule SessionWindowing, to minimize the evaluation of rule with SessionWindow node. ### Why are the changes needed? The rule SessionWindowing is unnecessarily evaluated multiple times without proper pruning. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #39253 from HeartSaVioR/SPARK-41732-3.3. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 28 December 2022, 11:51:09 UTC
0887a2f Revert "[MINOR][TEST][SQL] Add a CTE subquery scope test case" This reverts commit aa39b06462a98f37be59e239d12edd9f09a25b88. 28 December 2022, 08:57:18 UTC
aa39b06 [MINOR][TEST][SQL] Add a CTE subquery scope test case ### What changes were proposed in this pull request? I noticed we were missing a test case for this in SQL tests, so I added one. ### Why are the changes needed? To ensure we scope CTEs properly in subqueries. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a test case change. Closes #39189 from rxin/cte_test. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: Reynold Xin <rxin@databricks.com> (cherry picked from commit 24edf8ecb5e47af294f89552dfd9957a2d9f193b) Signed-off-by: Reynold Xin <rxin@databricks.com> 23 December 2022, 22:55:54 UTC
19824cf [SPARK-41686][SPARK-41030][BUILD][3.3] Upgrade Apache Ivy to 2.5.1 ### What changes were proposed in this pull request? Upgrade Apache Ivy from 2.5.0 to 2.5.1 ### Why are the changes needed? [CVE-2022-37865](https://www.cve.org/CVERecord?id=CVE-2022-37865) and [CVE-2022-37866](https://nvd.nist.gov/vuln/detail/CVE-2022-37866) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA Closes #39176 from tobiasstadler/SPARK-41686. Authored-by: Tobias Stadler <ts.stadler@gmx.de> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 December 2022, 12:53:35 UTC
9934b56 [SPARK-41350][3.3][SQL][FOLLOWUP] Allow simple name access of join hidden columns after alias backport https://github.com/apache/spark/pull/39077 to 3.3 Closes #39121 from cloud-fan/backport. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 December 2022, 03:26:19 UTC
7cd6907 [SPARK-41668][SQL] DECODE function returns wrong results when passed NULL ### What changes were proposed in this pull request? The DECODE function was implemented for Oracle compatibility. It works similarly to the CASE expression, but it is supposed to have one major difference: NULL == NULL (https://docs.oracle.com/database/121/SQLRF/functions057.htm#SQLRF00631). The Spark implementation does not observe this, however: ``` > select decode(null, 6, 'Spark', NULL, 'SQL', 4, 'rocks'); NULL ``` The result is supposed to be 'SQL'. This PR is to fix the issue. ### Why are the changes needed? Bug fix and Oracle compatibility. ### Does this PR introduce _any_ user-facing change? Yes, the DECODE function will return the matched value when passed null, instead of always returning null. ### How was this patch tested? New UT. Closes #39163 from gengliangwang/fixDecode. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit e09fcebdfaed22b28abbe5c9336f3a6fc92bd046) Signed-off-by: Gengliang Wang <gengliang@apache.org> 22 December 2022, 02:34:03 UTC
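The example from the description, runnable from a spark-shell session as a quick check of the NULL-matching behavior:
```
// With the fix, the NULL search value matches the NULL first argument, so this
// returns 'SQL' (Oracle-compatible NULL == NULL semantics for decode).
spark.sql("SELECT decode(null, 6, 'Spark', NULL, 'SQL', 4, 'rocks')").show()
```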
b0c9b73 [SPARK-41535][SQL] Set null correctly for calendar interval fields in `InterpretedUnsafeProjection` and `InterpretedMutableProjection` In `InterpretedUnsafeProjection`, use `UnsafeWriter.write`, rather than `UnsafeWriter.setNullAt`, to set null for interval fields. Also, in `InterpretedMutableProjection`, use `InternalRow.setInterval`, rather than `InternalRow.setNullAt`, to set null for interval fields. This returns the wrong answer: ``` set spark.sql.codegen.wholeStage=false; set spark.sql.codegen.factoryMode=NO_CODEGEN; select first(col1), last(col2) from values (make_interval(0, 0, 0, 7, 0, 0, 0), make_interval(17, 0, 0, 2, 0, 0, 0)) as data(col1, col2); +---------------+---------------+ |first(col1) |last(col2) | +---------------+---------------+ |16 years 2 days|16 years 2 days| +---------------+---------------+ ``` In the above case, `TungstenAggregationIterator` uses `InterpretedUnsafeProjection` to create the aggregation buffer and to initialize all the fields to null. `InterpretedUnsafeProjection` incorrectly calls `UnsafeRowWriter#setNullAt`, rather than `unsafeRowWriter#write`, for the two calendar interval fields. As a result, the writer never allocates memory from the variable length region for the two intervals, and the pointers in the fixed region get left as zero. Later, when `InterpretedMutableProjection` attempts to update the first field, `UnsafeRow#setInterval` picks up the zero pointer and stores interval data on top of the null-tracking bit set. The call to UnsafeRow#setInterval for the second field also stomps the null-tracking bit set. Later updates to the null-tracking bit set (e.g., calls to `setNotNullAt`) further corrupt the interval data, turning `interval 7 years 2 days` into `interval 16 years 2 days`. Even after one fixes the above bug in `InterpretedUnsafeProjection` so that the buffer is created correctly, `InterpretedMutableProjection` has a similar bug to SPARK-41395, except this time for calendar interval data: ``` set spark.sql.codegen.wholeStage=false; set spark.sql.codegen.factoryMode=NO_CODEGEN; select first(col1), last(col2), max(col3) from values (null, null, 1), (make_interval(0, 0, 0, 7, 0, 0, 0), make_interval(17, 0, 0, 2, 0, 0, 0), 3) as data(col1, col2, col3); +---------------+---------------+---------+ |first(col1) |last(col2) |max(col3)| +---------------+---------------+---------+ |16 years 2 days|16 years 2 days|3 | +---------------+---------------+---------+ ``` These two bugs could get exercised during codegen fallback. No. New unit tests. Closes #39117 from bersprockets/unsafe_interval_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 7f153842041d66e9cf0465262f4458cfffda4f43) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 20 December 2022, 00:30:25 UTC