https://github.com/apache/spark

97340c1 Preparing Spark release v3.1.0-rc1 05 January 2021, 10:51:32 UTC
e01c850 [SPARK-34010][SQL][DOCS] Use python3 instead of python in SQL documentation build ### What changes were proposed in this pull request? This PR proposes to use python3 instead of python in SQL documentation build. After SPARK-29672, we use `python3` everywhere in Spark dev. We should fix it in `sql/create-docs.sh` too. This blocks release because the release container does not have `python` but only `python3`. ### Why are the changes needed? To unblock the release. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? I manually ran the script. Closes #31041 from HyukjinKwon/SPARK-34010. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 8d09f9649510bf5d812c82b04f7711b9252a7db0) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 05 January 2021, 10:48:51 UTC
9528e2b Preparing development version 3.1.1-SNAPSHOT 05 January 2021, 08:25:46 UTC
4d5133e Preparing Spark release v3.1.0-rc1 05 January 2021, 08:25:39 UTC
62838cc [SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/29703. It renames `HADOOP_VERSION` environment variable to `PYSPARK_HADOOP_VERSION` in case `HADOOP_VERSION` is already being used somewhere. Arguably `HADOOP_VERSION` is a pretty common name. I see here and there: - https://www.ibm.com/support/knowledgecenter/SSZUMP_7.2.1/install_grid_sym/understanding_advanced_edition.html - https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support - http://crs4.github.io/pydoop/_pydoop1/installation.html ### Why are the changes needed? To avoid the environment variables is unexpectedly conflicted. ### Does this PR introduce _any_ user-facing change? It renames the environment variable but it's not released yet. ### How was this patch tested? Existing unittests will test. Closes #31028 from HyukjinKwon/SPARK-32017-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 329850c667305053e4433c4c6da0e47b231302d4) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 05 January 2021, 08:21:46 UTC
26fc56c [SPARK-34007][BUILD] Downgrade scala-maven-plugin to 4.3.0 ### What changes were proposed in this pull request? This PR is a partial revert of https://github.com/apache/spark/pull/30456 by downgrading scala-maven-plugin from 4.4.0 to 4.3.0. Currently, when you run the docker release script (`./dev/create-release/do-release-docker.sh`), it fails to compile as below during incremental compilation with zinc for an unknown reason: ``` [INFO] Compiling 21 Scala sources and 3 Java sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes ... [ERROR] ## Exception when compiling 24 sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes java.lang.SecurityException: class "javax.servlet.SessionCookieConfig"'s signer information does not match signer information of other classes in the same package java.lang.ClassLoader.checkCerts(ClassLoader.java:891) java.lang.ClassLoader.preDefineClass(ClassLoader.java:661) java.lang.ClassLoader.defineClass(ClassLoader.java:754) java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) java.net.URLClassLoader.defineClass(URLClassLoader.java:468) java.net.URLClassLoader.access$100(URLClassLoader.java:74) java.net.URLClassLoader$1.run(URLClassLoader.java:369) java.net.URLClassLoader$1.run(URLClassLoader.java:363) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:362) java.lang.ClassLoader.loadClass(ClassLoader.java:418) java.lang.ClassLoader.loadClass(ClassLoader.java:351) java.lang.Class.getDeclaredMethods0(Native Method) java.lang.Class.privateGetDeclaredMethods(Class.java:2701) java.lang.Class.privateGetPublicMethods(Class.java:2902) java.lang.Class.getMethods(Class.java:1615) sbt.internal.inc.ClassToAPI$.toDefinitions0(ClassToAPI.scala:170) sbt.internal.inc.ClassToAPI$.$anonfun$toDefinitions$1(ClassToAPI.scala:123) scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86) sbt.internal.inc.ClassToAPI$.toDefinitions(ClassToAPI.scala:123) sbt.internal.inc.ClassToAPI$.$anonfun$process$1(ClassToAPI.scala:3 ``` This happens when it builds Spark with Hadoop 2. It doesn't reproduce when you build this alone. It should follow the sequence of build in the release script. This is fixed by downgrading. Looks like there is a regression in scala-maven-plugin somewhere between 4.4.0 and 4.3.0. ### Why are the changes needed? To unblock the release. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? It can be tested as below: ```bash ./dev/create-release/do-release-docker.sh -d $WORKING_DIR ``` Closes #31031 from HyukjinKwon/SPARK-34007. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 356fdc9a7fc88fd07751c40b920043eaebeb0abf) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 05 January 2021, 08:20:49 UTC
498f548 [SPARK-33935][SQL] Fix CBO cost function ### What changes were proposed in this pull request? Changed the cost function in CBO to match the documentation. ### Why are the changes needed? The parameter `spark.sql.cbo.joinReorder.card.weight` is documented as: ``` The weight of cardinality (number of rows) for plan cost comparison in join reorder: rows * weight + size * (1 - weight). ``` The implementation in `JoinReorderDP.betterThan` does not match this documentation: ``` def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { if (other.planCost.card == 0 || other.planCost.size == 0) { false } else { val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card) val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size) relativeRows * conf.joinReorderCardWeight + relativeSize * (1 - conf.joinReorderCardWeight) < 1 } } ``` This different implementation has an unfortunate consequence: given two plans A and B, both A betterThan B and B betterThan A might give the same results. This happens when one plan has many rows with small sizes and the other has few rows with large sizes. Example values that exhibit this phenomenon with the default weight value (0.7): A.card = 500, B.card = 300 A.size = 30, B.size = 80 Both A betterThan B and B betterThan A would have a score above 1 and would return false. This happens with several of the TPCDS queries. The new implementation does not have this behavior. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New and existing UTs Closes #30965 from tanelk/SPARK-33935_cbo_cost_function. Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit f252a9334e49dc359dd9255fcfe17a6bc75b8781) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 05 January 2021, 07:00:45 UTC
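To make the comparison above concrete, here is a small self-contained Scala sketch (names such as `Cost` and `oldBetterThan` are illustrative, not Spark code) that plugs the example values A.card = 500, B.card = 300, A.size = 30, B.size = 80 and weight 0.7 into both the old ratio-based check and the documented formula; the old check scores above 1 in both directions, so neither plan wins:

```scala
object CboCostExample {
  // Simplified stand-in for JoinPlan's planCost (card = row count, size = byte size).
  case class Cost(card: BigDecimal, size: BigDecimal)

  // Old ratio-based comparison from JoinReorderDP.betterThan (as quoted above).
  def oldBetterThan(a: Cost, b: Cost, weight: BigDecimal): Boolean = {
    val relativeRows = a.card / b.card
    val relativeSize = a.size / b.size
    relativeRows * weight + relativeSize * (BigDecimal(1) - weight) < 1
  }

  // Documented formula: rows * weight + size * (1 - weight), lower is better.
  def documentedCost(c: Cost, weight: BigDecimal): BigDecimal =
    c.card * weight + c.size * (BigDecimal(1) - weight)

  def main(args: Array[String]): Unit = {
    val weight = BigDecimal("0.7")
    val a = Cost(card = 500, size = 30)
    val b = Cost(card = 300, size = 80)

    println(oldBetterThan(a, b, weight)) // false (score ~1.28 > 1)
    println(oldBetterThan(b, a, weight)) // false (score ~1.22 > 1) -> neither plan "wins"

    // With the documented formula the comparison is consistent:
    println(documentedCost(a, weight) < documentedCost(b, weight)) // false (359 vs 234)
    println(documentedCost(b, weight) < documentedCost(a, weight)) // true
  }
}
```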
f702a95 [SPARK-33100][SQL] Ignore a semicolon inside a bracketed comment in spark-sql ### What changes were proposed in this pull request? Currently, spark-sql does not support parsing SQL statements that contain bracketed comments. The SQL statement: ``` /* SELECT 'test'; */ SELECT 'test'; ``` would be split into two statements: The first one: `/* SELECT 'test'` The second one: `*/ SELECT 'test'` It would then throw an exception because the first one is illegal. In this PR, we ignore the content in bracketed comments while splitting the SQL statements. Besides, we ignore comments without any content. ### Why are the changes needed? spark-sql may split statements inside bracketed comments, which is not correct. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UT. Closes #29982 from turboFei/SPARK-33110. Lead-authored-by: fwang12 <fwang12@ebay.com> Co-authored-by: turbofei <fwang12@ebay.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit a071826f72cd717a58bf37b877f805490f7a147f) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 05 January 2021, 06:55:55 UTC
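The idea behind the fix can be illustrated with a standalone splitter sketch (hypothetical, not the actual `SparkSQLCLIDriver` code; quoted strings and `--` line comments are ignored here for brevity) that only treats `;` as a statement boundary when the cursor is outside a `/* ... */` bracketed comment:

```scala
object StatementSplitter {
  /** Split a script on ';', ignoring semicolons inside bracketed comments. */
  def split(script: String): Seq[String] = {
    val statements = scala.collection.mutable.ArrayBuffer.empty[String]
    val current = new StringBuilder
    var insideComment = false
    var i = 0
    while (i < script.length) {
      if (!insideComment && script.startsWith("/*", i)) {
        insideComment = true; current.append("/*"); i += 2
      } else if (insideComment && script.startsWith("*/", i)) {
        insideComment = false; current.append("*/"); i += 2
      } else if (!insideComment && script.charAt(i) == ';') {
        statements += current.toString; current.clear(); i += 1
      } else {
        current.append(script.charAt(i)); i += 1
      }
    }
    if (current.nonEmpty) statements += current.toString
    statements.map(_.trim).filter(_.nonEmpty).toSeq
  }

  def main(args: Array[String]): Unit = {
    // The example from the commit message: the semicolon inside the comment
    // is no longer a split point, so one statement comes back.
    split("/* SELECT 'test'; */ SELECT 'test';").foreach(println)
  }
}
```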
b3cd86d [SPARK-34000][CORE] Fix stageAttemptToNumSpeculativeTasks java.util.NoSuchElementException ### What changes were proposed in this pull request? From below log, Stage 600 could be removed from `stageAttemptToNumSpeculativeTasks` by `onStageCompleted()`, but the speculative task 306.1 in stage 600 threw `NoSuchElementException` when it entered into `onTaskEnd()`. ``` 21/01/04 03:00:32,259 WARN [task-result-getter-2] scheduler.TaskSetManager:69 : Lost task 306.1 in stage 600.0 (TID 283610, hdc49-mcc10-01-0510-4108-039-tess0097.stratus.rno.ebay.com, executor 27): TaskKilled (another attempt succeeded) 21/01/04 03:00:32,259 INFO [task-result-getter-2] scheduler.TaskSetManager:57 : Task 306.1 in stage 600.0 (TID 283610) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded). 21/01/04 03:00:32,259 INFO [task-result-getter-2] cluster.YarnClusterScheduler:57 : Removed TaskSet 600.0, whose tasks have all completed, from pool default 21/01/04 03:00:32,259 INFO [HiveServer2-Handler-Pool: Thread-5853] thriftserver.SparkExecuteStatementOperation:190 : Returning result set with 50 rows from offsets [5378600, 5378650) with 1fe245f8-a7f9-4ec0-bcb5-8cf324cbbb47 21/01/04 03:00:32,260 ERROR [spark-listener-group-executorManagement] scheduler.AsyncEventQueue:94 : Listener ExecutorAllocationListener threw an exception java.util.NoSuchElementException: key not found: Stage 600 (Attempt 0) at scala.collection.MapLike.default(MapLike.scala:235) at scala.collection.MapLike.default$(MapLike.scala:234) at scala.collection.AbstractMap.default(Map.scala:63) at scala.collection.mutable.HashMap.apply(HashMap.scala:69) at org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:621) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38) at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115) at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99) at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:116) at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:116) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:97) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320) at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:97) ``` ### Why are the changes needed? To avoid throwing the java.util.NoSuchElementException ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is a protective patch and it's not easy to reproduce in UT due to the event order is not fixed in a async queue. Closes #31025 from LantaoJin/SPARK-34000. 
Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a7d3fcd354289c1d0f5c80887b4f33beb3ad96a2) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 January 2021, 05:37:41 UTC
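The shape of the fix is a defensive map lookup: a late `onTaskEnd` for a stage that `onStageCompleted` already removed should be ignored rather than throw. A minimal hedged sketch (class and method names are illustrative, not the actual `ExecutorAllocationListener`):

```scala
import scala.collection.mutable

object SpeculativeTaskTracking {
  // Keyed by (stageId, stageAttemptId), mirroring stageAttemptToNumSpeculativeTasks.
  private val stageAttemptToNumSpeculativeTasks = mutable.HashMap.empty[(Int, Int), Int]

  def onStageCompleted(stage: (Int, Int)): Unit = {
    // The stage entry can be removed here before a straggling speculative task-end arrives.
    stageAttemptToNumSpeculativeTasks -= stage
  }

  def onSpeculativeTaskEnd(stage: (Int, Int)): Unit = {
    // map(stage) would throw NoSuchElementException if the stage was already completed;
    // a guarded lookup simply skips the update instead.
    stageAttemptToNumSpeculativeTasks.get(stage) match {
      case Some(n) if n > 1 => stageAttemptToNumSpeculativeTasks(stage) = n - 1
      case Some(_)          => stageAttemptToNumSpeculativeTasks -= stage
      case None             => // stage already gone: late event, nothing to do
    }
  }
}
```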
65326a5 [SPARK-33992][SQL] override transformUpWithNewOutput to add allowInvokingTransformsInAnalyzer ### What changes were proposed in this pull request? In https://github.com/apache/spark/pull/29643, we move the plan rewriting methods to QueryPlan. we need to override transformUpWithNewOutput to add allowInvokingTransformsInAnalyzer because it and resolveOperatorsUpWithNewOutput are called in the analyzer. For example, PaddingAndLengthCheckForCharVarchar could fail query when resolveOperatorsUpWithNewOutput with ```logtalk [info] - char/varchar resolution in sub query *** FAILED *** (367 milliseconds) [info] java.lang.RuntimeException: This method should not be called in the analyzer [info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule(AnalysisHelper.scala:150) [info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule$(AnalysisHelper.scala:146) [info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.assertNotAnalysisRule(LogicalPlan.scala:29) [info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:161) [info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:160) [info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) [info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$updateOuterReferencesInSubquery(QueryPlan.scala:267) ``` ### Why are the changes needed? trivial bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #31013 from yaooqinn/SPARK-33992. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f0ffe0cd652188873f2ec007e4e282744717a0b3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 January 2021, 05:34:21 UTC
95e3414 [SPARK-33894][SQL] Change visibility of private case classes in mllib to avoid runtime compilation errors with Scala 2.13 ### What changes were proposed in this pull request? Change visibility modifier of two case classes defined inside objects in mllib from private to private[OuterClass] ### Why are the changes needed? Without this change when running tests for Scala 2.13 you get runtime code generation errors. These errors look like this: ``` [info] Cause: java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: No applicable constructor/method found for zero actual parameters; candidates are: "public java.lang.String org.apache.spark.ml.feature.Word2VecModel$Data.word()" ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests now pass for Scala 2.13 Closes #31018 from koertkuipers/feat-visibility-scala213. Authored-by: Koert Kuipers <koert@tresata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 9b4173fa95047fed94e2fe323ad281fb48deffda) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 January 2021, 23:40:47 UTC
eef0e4c [SPARK-33950][SQL][3.1][3.0] Refresh cache in v1 `ALTER TABLE .. DROP PARTITION` ### What changes were proposed in this pull request? Invoke `refreshTable()` from `AlterTableDropPartitionCommand.run()` after partitions dropping. In particular, this re-creates the cache associated with the modified table. ### Why are the changes needed? This fixes the issues portrayed by the example: ```sql spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0); spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0; spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1; spark-sql> CACHE TABLE tbl1; spark-sql> SELECT * FROM tbl1; 0 0 1 1 spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0); spark-sql> SELECT * FROM tbl1; 0 0 1 1 ``` The last query must not return `0 0` since it was deleted by previous command. ### Does this PR introduce _any_ user-facing change? Yes. After the changes for the example above: ```sql ... spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0); spark-sql> SELECT * FROM tbl1; 1 1 ``` ### How was this patch tested? By running the affected test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite" ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Wenchen Fan <wenchendatabricks.com> (cherry picked from commit 67195d0d977caa5a458e8a609c434205f9b54d1b) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #31006 from MaxGekk/drop-partition-refresh-cache-3.1. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 January 2021, 21:33:41 UTC
390316e [SPARK-33980][SS] Invalidate char/varchar in spark.readStream.schema ### What changes were proposed in this pull request? invalidate char/varchar in `spark.readStream.schema` just like what we've done for `spark.read.schema` in da72b87374a7be5416b99ed016dc2fc9da0ed88a ### Why are the changes needed? bugfix, char/varchar is only for table schema while `spark.sql.legacy.charVarcharAsString=false` ### Does this PR introduce _any_ user-facing change? yes, char/varchar will fail to define ss readers when `spark.sql.legacy.charVarcharAsString=false` ### How was this patch tested? new tests Closes #31003 from yaooqinn/SPARK-33980. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit ac4651a7d19b248c86290d419ac3f6d69ed2b61e) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 January 2021, 21:00:00 UTC
368ce09 Preparing development version 3.1.1-SNAPSHOT 04 January 2021, 09:40:28 UTC
eaaa893 Preparing Spark release v3.1.0-rc1 04 January 2021, 09:40:22 UTC
9268392 [SPARK-33945][SQL][3.1] Handles a random seed consisting of an expr tree ### What changes were proposed in this pull request? This PR intends to fix the minor bug that throws an analysis exception when a seed param in `rand`/`randn` having a expr tree (e.g., `rand(1 + 1)`) with constant folding (`ConstantFolding` and `ReorderAssociativeOperator`) disabled. A query to reproduce this issue is as follows; ``` // v3.1.0, v3.0.2, and v2.4.8 $./bin/spark-shell scala> sql("select rand(1 + 2)").show() +-------------------+ | rand((1 + 2))| +-------------------+ |0.25738143505962285| +-------------------+ $./bin/spark-shell --conf spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding,org.apache.spark.sql.catalyst.optimizer.ReorderAssociativeOperator scala> sql("select rand(1 + 2)").show() org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer, long or null literal.; at org.apache.spark.sql.catalyst.expressions.RDG.seed$lzycompute(randomExpressions.scala:49) at org.apache.spark.sql.catalyst.expressions.RDG.seed(randomExpressions.scala:46) at org.apache.spark.sql.catalyst.expressions.Rand.doGenCode(randomExpressions.scala:98) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146) at scala.Option.getOrElse(Option.scala:189) ... ``` A root cause is that the match-case code below cannot handle the case described above: https://github.com/apache/spark/blob/42f5e62403469cec6da680b9fbedd0aa508dcbe5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala#L46-L51 ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Checking if GA/Jenkins can pass Closes #30977 from maropu/FixRandSeedIssue. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 January 2021, 05:36:25 UTC
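A hedged illustration of the direction of the fix (a hypothetical `resolveSeed` helper, not the exact patch): accept integer/long literals as before, and otherwise evaluate any foldable integral expression such as `1 + 2`, which is what survives when constant folding is disabled. Assumes spark-catalyst on the classpath:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.types.{IntegerType, LongType}

object SeedResolution {
  // Hypothetical helper: the original match only accepted integer/long/null literals,
  // so a constant expression tree like (1 + 2) failed when ConstantFolding was
  // excluded. (Null-literal handling omitted for brevity.)
  def resolveSeed(child: Expression): Long = child match {
    case Literal(s: Int, IntegerType) => s.toLong
    case Literal(s: Long, LongType) => s
    // Fall back to evaluating any other foldable integral expression, e.g. rand(1 + 2).
    case e if e.foldable && (e.dataType == IntegerType || e.dataType == LongType) =>
      e.eval() match {
        case i: Int => i.toLong
        case l: Long => l
        case other => throw new IllegalArgumentException(s"Unexpected seed value: $other")
      }
    case _ => throw new IllegalArgumentException(
      "Input argument to rand must be an integer, long or null literal.")
  }
}
```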
1fa052f [SPARK-33398] Fix loading tree models prior to Spark 3.0 ### What changes were proposed in this pull request? In https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47, a new field `rawCount` was added into `NodeData`, which means a tree model trained in 2.4 cannot be loaded in 3.0/3.1/master; field `rawCount` is only used in training, and not used in `transform`/`predict`/`featureImportance`. So I just set it to -1L. ### Why are the changes needed? To support loading old tree models in 3.0/3.1/master. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added test suites. Closes #30889 from zhengruifeng/fix_tree_load. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 6b7527e381591bcd51be205853aea3e349893139) Signed-off-by: Sean Owen <srowen@gmail.com> 03 January 2021, 17:52:58 UTC
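The compatibility trick is generic: when the persisted node data predates the `rawCount` column, fill it with the sentinel `-1L` at load time. A minimal DataFrame-level sketch (hypothetical helper, not the actual ml loader code):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

object TreeModelCompat {
  // If the persisted node data predates the rawCount field (written by Spark < 3.0),
  // add it with the sentinel -1L so the newer loader's expected schema still matches.
  def withRawCount(nodeData: DataFrame): DataFrame = {
    if (nodeData.columns.contains("rawCount")) nodeData
    else nodeData.withColumn("rawCount", lit(-1L))
  }
}
```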
1a4b6f4 [SPARK-33963][SQL] Canonicalize `HiveTableRelation` w/o table stats ### What changes were proposed in this pull request? Skip table stats in canonicalizing of `HiveTableRelation`. ### Why are the changes needed? The changes fix a regression comparing to Spark 3.0, see SPARK-33963. ### Does this PR introduce _any_ user-facing change? Yes. After changes Spark behaves as in the version 3.0.1. ### How was this patch tested? By running new UT: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite" ``` Closes #30995 from MaxGekk/fix-caching-hive-table. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit fc7d0165d29e04a8e78577c853a701bdd8a2af4a) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 January 2021, 02:24:01 UTC
e9f8b4d [SPARK-33961][BUILD] Upgrade SBT to 1.4.6 ### What changes were proposed in this pull request? This PR aims to upgrade SBT to 1.4.6 to fix the SBT regression. ### Why are the changes needed? [SBT 1.4.6](https://github.com/sbt/sbt/releases/tag/v1.4.6) has the following fixes - Updates to Coursier 2.0.8, which fixes the cache directory setting on Windows - Fixes performance regression in shell tab completion - Fixes match error when using withDottyCompat - Fixes thread-safety in AnalysisCallback handler ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #30993 from dongjoon-hyun/SPARK-SBT-1.4.6. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 1c25bea0bbe8365523d2a3d6b06da03d67f25794) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 02 January 2021, 22:49:16 UTC
b28e4ef [SPARK-33906][WEBUI] Fix the bug of UI Executor page stuck due to undefined peakMemoryMetrics ### What changes were proposed in this pull request? Check if the executorSummary.peakMemoryMetrics is defined before accessing it. Without checking, the UI has risked being stuck at the Executors page. ### Why are the changes needed? App live UI may stuck at Executors page without this fix. Steps to reproduce (with master branch): In mac OS standalone mode, open a spark-shell $SPARK_HOME/bin/spark-shell --master spark://localhost:7077 val x = sc.makeRDD(1 to 100000, 5) x.count() Then open the app UI in the browser, and click the Executors page, will get stuck at this page: ![image](https://user-images.githubusercontent.com/26694233/103105677-ca1a7380-45f4-11eb-9245-c69f4a4e816b.png) Also, the return JSON from API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors miss "peakMemoryMetrics" for executor objects. I attached the full json text in https://issues.apache.org/jira/browse/SPARK-33906. I debugged it and observed that ExecutorMetricsPoller .getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The possible reason for returning the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before the reportHeartbeat is called. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test, rerun the steps of bug reproduce and see the bug is gone. Closes #30920 from baohe-zhang/SPARK-33906. Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 45df6db906b39646f5b5f6b4a88addf1adcbe107) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 December 2020, 21:35:06 UTC
124e744 [SPARK-33944][SQL] Incorrect logging for warehouse keys in SharedState options ### What changes were proposed in this pull request? While using SparkSession's initial options to generate the sharable Spark conf and Hadoop conf in SharedState, we should emit the log message inside the code block where the warehouse keys are actually handled. ### Why are the changes needed? Bug fix: removes the ambiguous log emitted when setting spark.sql.warehouse.dir in SparkSession.builder.config, and only warns when hive.metastore.warehouse.dir is set. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New tests Closes #30978 from yaooqinn/SPARK-33944. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit ed9f7288019be4803cdb1ee570ca21ad76af371a) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 December 2020, 21:20:41 UTC
087b9ed [SPARK-33942][DOCS] Remove `hiveClientCalls.count` in `CodeGenerator` metrics docs ### What changes were proposed in this pull request? Removed the **hiveClientCalls.count** in CodeGenerator metrics in Component instance = Executor ### Why are the changes needed? Wrong information regarding metrics was being displayed on Monitoring Documentation. I had added referred documentation for adding metrics logging in Graphite. This metric was not being reported. I had to check if the issue was at my application end or spark code or documentation. Documentation had the wrong info. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual, checked it on my forked repository feature branch [SPARK-33942](https://github.com/coderbond007/spark/blob/SPARK-33942/docs/monitoring.md) Closes #30976 from coderbond007/SPARK-33942. Authored-by: Pradyumn Agrawal (pradyumn.ag) <pradyumn.ag@media.net> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 13e8c2840969a17d5ba113686501abd3c23e3c23) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 31 December 2020, 01:26:17 UTC
42f5e62 [SPARK-33927][BUILD] Fix Dockerfile for Spark release to work ### What changes were proposed in this pull request? This PR proposes to fix the `Dockerfile` for Spark release. - Port https://github.com/apache/spark/commit/b135db3b1a5c0b2170e98b97f6160bcf55903799 to `Dockerfile` - Upgrade Ubuntu 18.04 -> 20.04 (because of porting b135db3) - Remove Python 2 (because of Ubuntu upgrade) - Use built-in Python 3.8.5 (because of Ubuntu upgrade) - Node.js 11 -> 12 (because of Ubuntu upgrade) - Ruby 2.5 -> 2.7 (because of Ubuntu upgrade) - Python dependencies and Jekyll + plugins upgrade to the latest as it's used in GitHub Actions build (unrelated to the issue itself) ### Why are the changes needed? To make a Spark release :-). ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested via: ```bash cd dev/create-release/spark-rm docker build -t spark-rm --build-arg UID=$UID . ``` ``` ... Successfully built 516d7943634f Successfully tagged spark-rm:latest ``` Closes #30971 from HyukjinKwon/SPARK-33927. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 403bf55cbef1e4cf50dc868202cccfb867279bbd) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 30 December 2020, 07:37:39 UTC
a9004d1 [MINOR][SS] Call fetchEarliestOffsets when it is necessary ### What changes were proposed in this pull request? This minor patch changes two variables where calling `fetchEarliestOffsets` to `lazy` because these values are not always necessary. ### Why are the changes needed? To avoid unnecessary Kafka RPC calls. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30969 from viirya/ss-minor3. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 4a669f583089fc704cdc46cff8f1680470a068ee) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 30 December 2020, 07:15:57 UTC
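The change boils down to Scala's `lazy val` semantics: the expensive call is made at most once, and only if the value is actually used. A tiny self-contained sketch (the `fetchEarliestOffsets` stub is a stand-in for the real Kafka RPC):

```scala
object LazyOffsets {
  private var rpcCalls = 0

  // Stand-in for an expensive Kafka admin RPC like fetchEarliestOffsets().
  private def fetchEarliestOffsets(): Map[Int, Long] = {
    rpcCalls += 1
    Map(0 -> 0L, 1 -> 42L)
  }

  def main(args: Array[String]): Unit = {
    // `lazy val` defers the RPC until first use and caches the result,
    // so code paths that never touch it issue no RPC at all.
    lazy val earliestOffsets = fetchEarliestOffsets()
    println(rpcCalls)        // 0 - nothing fetched yet
    println(earliestOffsets) // triggers exactly one fetch
    println(earliestOffsets) // cached, no second RPC
    println(rpcCalls)        // 1
  }
}
```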
7f5c352 [SPARK-33874][K8S] Handle long lived sidecars ### What changes were proposed in this pull request? For liveness check when checkAllContainers is not set, we check the liveness status of the Spark container if we can find it. ### Why are the changes needed? Some environments may deploy long lived logs collecting side cars which outlive the Spark application. Just because they remain alive does not mean the Spark executor should keep running. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Extended the existing pod status tests. Closes #30892 from holdenk/SPARK-33874-handle-long-lived-sidecars. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 448494ebcf88b4cd0a89ee933bd042d5e45169a1) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 30 December 2020, 05:06:54 UTC
24fc9f0 [SPARK-33936][SQL][3.1] Add the version when connector's interfaces were added ### What changes were proposed in this pull request? Add the `since` tag to the interfaces added recently in version 3.1.0. ### Why are the changes needed? 1. To follow the existing convention for Spark API. 2. To inform devs when Spark API was changed. ### Does this PR introduce _any_ user-facing change? Should not. ### How was this patch tested? `dev/scalastyle` Closes #30967 from MaxGekk/spark-23889-interfaces-followup-3.1. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 December 2020, 20:27:30 UTC
902308c [SPARK-33775][FOLLOWUP][TEST-MAVEN][BUILD] Suppress maven compilation warnings in Scala 2.13 ### What changes were proposed in this pull request? This pr is followup of SPARK-33775, the main change of this pr this sync suppression rules from `SparkBuild.scala` to `pom.xml` to let maven build have the same suppression ability for compilation warnings in Scala 2.13 ### Why are the changes needed? Suppress unimportant compilation warnings in Scala 2.13 with maven build. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Local manual test:The suppressed compilation warnings are no longer printed to the console. Closes #30951 from LuciferYang/SPARK-33775-FOLLOWUP. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 9d6dbe0fe5f06c8ca18775fefc5c7f59e8b64f5c) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 29 December 2020, 12:42:11 UTC
fc04edb [SPARK-33931][INFRA] Recover GitHub Action `build_and_test` job ### What changes were proposed in this pull request? This PR aims to recover GitHub Action `build_and_test` job. ### Why are the changes needed? Currently, `build_and_test` job fails to start because of the following in master/branch-3.1 at least. ``` r-lib/actions/setup-rv1 is not allowed to be used in apache/spark. Actions in this workflow must be: created by GitHub, verified in the GitHub Marketplace, within a repository owned by apache or match the following: adoptopenjdk/*, apache/*, gradle/wrapper-validation-action. ``` - https://github.com/apache/spark/actions/runs/449826457 ![Screen Shot 2020-12-28 at 10 06 11 PM](https://user-images.githubusercontent.com/9700541/103262174-f1f13a80-4958-11eb-8ceb-631527155775.png) ### Does this PR introduce _any_ user-facing change? No. This is a test infra. ### How was this patch tested? To check GitHub Action `build_and_test` job on this PR. Closes #30959 from dongjoon-hyun/SPARK-33931. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 2627825647c32fab79cf356917ad99d3ff668a9b) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 29 December 2020, 11:52:17 UTC
122a83c [SPARK-33928][SPARK-23365][TEST][CORE] Fix flaky o.a.s.ExecutorAllocationManagerSuite - " Don't update target num executors when killing idle executors" ### What changes were proposed in this pull request? Use the testing mode for the test to fix the flaky. ### Why are the changes needed? The test is flaky: ```scala [info] - SPARK-23365 Don't update target num executors when killing idle executors *** FAILED *** (126 milliseconds) [info] 1 did not equal 2 (ExecutorAllocationManagerSuite.scala:1615) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) [info] at org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$84(ExecutorAllocationManagerSuite.scala:1617) ... ``` The root cause should be the same as https://github.com/apache/spark/pull/29773 since the test run under non-testing mode. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually checked. Flaky is gone by running the test hundreds of times after this fix. Closes #30956 from Ngone51/fix-flaky-SPARK-23365. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1ef7ddd38aa28dcd8166a60a485c722c5a8ded7a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 December 2020, 07:35:55 UTC
403a4c2 [SPARK-33916][CORE] Fix fallback storage offset and improve compression codec test coverage ### What changes were proposed in this pull request? This PR aims to fix offset bug and improve compression codec test coverage. ### Why are the changes needed? When the user choose a non-default codec, it causes a failure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the extended test suite. Closes #30934 from dongjoon-hyun/SPARK-33916. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 6497ccbbda1874187ee60a4f6368e6d9ae6580ff) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 29 December 2020, 00:33:11 UTC
0450a1e [SPARK-33899][SQL][3.1] Fix assert failure in v1 SHOW TABLES/VIEWS on `spark_catalog` ### What changes were proposed in this pull request? Remove `assert(ns.nonEmpty)` in `ResolveSessionCatalog` for: - `SHOW TABLES` - `SHOW TABLE EXTENDED` - `SHOW VIEWS` ### Why are the changes needed? Spark SQL shouldn't fail with internal assert failures even for invalid user inputs. For instance: ```sql spark-sql> show tables in spark_catalog; 20/12/24 11:19:46 ERROR SparkSQLDriver: Failed in [show tables in spark_catalog] java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:208) at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:366) at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:49) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) ``` ### Does this PR introduce _any_ user-facing change? Yes. After the changes, for the example above: ```sql spark-sql> show tables in spark_catalog; Error in query: multi-part identifier cannot be empty. ``` ### How was this patch tested? Added new UT to `DDLSuite`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite" ``` Closes #30950 from MaxGekk/remove-assert-ns-nonempty-3.1. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 December 2020, 18:21:46 UTC
37a7f86 [SPARK-33923][SQL][TESTS] Fix some tests with AQE enabled ### What changes were proposed in this pull request? * Remove the explicit AQE disable confs * Use `AdaptiveSparkPlanHelper` to check plans * No longer extending `DisableAdaptiveExecutionSuite` for `BucketedReadSuite` but only disable AQE for two certain tests there. ### Why are the changes needed? Some tests that are fixed in https://github.com/apache/spark/pull/30655 doesn't really require AQE off. Instead, they could use `AdaptiveSparkPlanHelper` to pass when AQE on. It's better to run tests with AQE on since we've turned it on by default. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass all tests and the updated tests. Closes #30941 from Ngone51/SPARK-33680-follow-up. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 00fa49aeaa601f50df81adb25184f141ba0a44ee) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 December 2020, 08:03:59 UTC
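For context, a sketch of what such a test looks like with AQE on: mixing in `AdaptiveSparkPlanHelper` and using its `collect` lets assertions see through the `AdaptiveSparkPlanExec` wrapper to the final physical plan (the suite name and query below are illustrative, not taken from the PR):

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.execution.SortExec
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
import org.apache.spark.sql.test.SharedSparkSession

class AqeAwarePlanSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlanHelper {
  test("sort is visible in the executed plan even with AQE on") {
    val df = spark.range(100).selectExpr("id % 7 AS m").sort("m")
    df.collect()
    // With AQE on, executedPlan is an AdaptiveSparkPlanExec wrapper; the helper's
    // collect() traverses into the adaptively re-optimized plan underneath it.
    val sorts = collect(df.queryExecution.executedPlan) { case s: SortExec => s }
    assert(sorts.nonEmpty)
  }
}
```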
844c4b4 [SPARK-33907][SQL][3.1][FOLLOWUP] Add test for corrupt record column ### What changes were proposed in this pull request? This patch adds one e2e test for corrupt record column case. ### Why are the changes needed? Improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30944 from viirya/SPARK-33907-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 December 2020, 07:51:54 UTC
93ed055 [SPARK-33901][SQL] Fix Char and Varchar display error after DDLs ### What changes were proposed in this pull request? After CTAS / CREATE TABLE LIKE / CVAS/ alter table add columns, the target tables will display string instead of char/varchar ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #30918 from yaooqinn/SPARK-33901. Lead-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3fdbc48373cdf12b8ba05632bc65ad49b7af1afb) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 December 2020, 06:48:40 UTC
9556d92 [SPARK-33824][PYTHON][DOCS][FOLLOW-UP] Clarify about PYSPARK_DRIVER_PYTHON and spark.yarn.appMasterEnv.PYSPARK_PYTHON ### What changes were proposed in this pull request? This PR proposes to clarify: - `PYSPARK_DRIVER_PYTHON` should not be set for cluster modes in YARN and Kubernetes. - `spark.yarn.appMasterEnv.PYSPARK_PYTHON` is not required in YARN. This is just another way to set `PYSPARK_PYTHON` that is specific to a Spark application. ### Why are the changes needed? To clarify what is required and what is not. ### Does this PR introduce _any_ user-facing change? Yes, this is a user-facing doc change. ### How was this patch tested? Manually tested. Credit to gaborgsomogyi, who tested this and raised the question offline; I also manually tested everything again to double-check. Closes #30938 from HyukjinKwon/SPARK-33824-followup. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 678294ddc20207a8e8f02dfdad71cd51819299c8) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 28 December 2020, 00:52:55 UTC
e848b5a [SPARK-33911][SQL][DOCS][3.1] Update the SQL migration guide about changes in `HiveClientImpl` ### What changes were proposed in this pull request? Update the SQL migration guide about the changes made by: - https://github.com/apache/spark/pull/30778 - https://github.com/apache/spark/pull/30711 ### Why are the changes needed? To inform users about the recent changes in the upcoming releases. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #30931 from MaxGekk/sql-migr-guide-hiveclientimpl-3.1. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 27 December 2020, 08:58:32 UTC
1b3c519 [SPARK-33560][TEST-MAVEN][BUILD] Add "unused-import" check to Maven compilation process ### What changes were proposed in this pull request? Similar to SPARK-33441, this pr add `unused-import` check to Maven compilation process. After this pr `unused-import` will trigger Maven compilation error. For Scala 2.13 profile, this pr also left TODO(SPARK-33499) similar to SPARK-33441 because `scala.language.higherKinds` no longer needs to be imported explicitly since Scala 2.13.1 ### Why are the changes needed? Let Maven build also check for unused imports as compilation error. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Local manual test:add an unused import intentionally to trigger maven compilation error. Closes #30784 from LuciferYang/SPARK-33560. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 37ae0a608670c660ba4c92b9ebb9cb9fb2bd67e6) Signed-off-by: Sean Owen <srowen@gmail.com> 26 December 2020, 23:40:31 UTC
943b057 [SPARK-33897][SQL] Can't set option 'cross' in join method ### What changes were proposed in this pull request? [The PySpark documentation](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join) says "Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti." However, I get the following error when I set the cross option. ``` scala> val df1 = spark.createDataFrame(Seq((1,"a"),(2,"b"))) df1: org.apache.spark.sql.DataFrame = [_1: int, _2: string] scala> val df2 = spark.createDataFrame(Seq((1,"A"),(2,"B"), (3, "C"))) df2: org.apache.spark.sql.DataFrame = [_1: int, _2: string] scala> df1.join(right = df2, usingColumns = Seq("_1"), joinType = "cross").show() java.lang.IllegalArgumentException: requirement failed: Unsupported using join type Cross at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.plans.UsingJoin.<init>(joinTypes.scala:106) at org.apache.spark.sql.Dataset.join(Dataset.scala:1025) ... 53 elided ``` ### Why are the changes needed? The documentation says cross option can be set, but when I try to set it, I get an java.lang.IllegalArgumentException. ### Does this PR introduce _any_ user-facing change? Accepting this PR fix will behave the same as the documentation. ### How was this patch tested? There is already a test for [JoinTypes](https://github.com/apache/spark/blob/1b9fd67904671ea08526bfb7a97d694815d47665/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/JoinTypesTest.scala), but I can't find a test for the join option itself. Closes #30803 from kozakana/allow_cross_option. Authored-by: kozakana <goki727@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 2553d53dc85fdf1127446941e2bc749e721c1b57) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 December 2020, 07:31:08 UTC
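The error comes from the `require` in `UsingJoin` not listing `Cross` among the supported types. A simplified, self-contained sketch of that shape (stand-in types, not the real `joinTypes.scala`), with `Cross` added to the supported list as the fix does:

```scala
// Simplified stand-ins for org.apache.spark.sql.catalyst.plans.JoinType and UsingJoin.
sealed trait JoinType
case object Inner extends JoinType
case object Cross extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType
case object LeftSemi extends JoinType
case object LeftAnti extends JoinType

case class UsingJoin(tpe: JoinType, usingColumns: Seq[String]) {
  // Before the fix the supported set omitted Cross, so
  // df1.join(df2, Seq("_1"), "cross") failed this require.
  require(Seq(Inner, Cross, LeftOuter, RightOuter, FullOuter, LeftSemi, LeftAnti).contains(tpe),
    s"Unsupported using join type $tpe")
}

object UsingJoinExample extends App {
  println(UsingJoin(Cross, Seq("_1"))) // now constructs fine
}
```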
704407b [SPARK-33907][SQL][3.1] Only prune columns of JsonToStructs if parsing options is empty ### What changes were proposed in this pull request? As a follow-up task to SPARK-32958, this patch takes safer approach to only prune columns from `JsonToStructs` if the parsing option is empty. It is to avoid unexpected behavior change regarding parsing. This patch also adds a few e2e tests to make sure failfast parsing behavior is not changed. ### Why are the changes needed? It is to avoid unexpected behavior change regarding parsing. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30921 from viirya/SPARK-33907. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 December 2020, 22:13:02 UTC
e60afd0 [SPARK-33621][SPARK-33784][SQL][3.1] Add a way to inject data source rewrite rules ### What changes were proposed in this pull request? This PR adds a way to inject data source rewrite rules to branch-3.1 via backporting two JIRA issues. - [SPARK-33621][SQL] Add a way to inject data source rewrite rules - [SPARK-33784][SQL] Rename dataSourceRewriteRules batch ### Why are the changes needed? Right now `SparkSessionExtensions` allow us to inject optimization rules but they are added to operator optimization batch. There are cases when users need to run rules after the operator optimization batch (e.g. cases when a rule relies on the fact that expressions have been optimized). Currently, this is not possible. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? This PR comes with a new test. Closes #30917 from aokolnychyi/backport-spark-33784. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 December 2020, 20:36:37 UTC
9a9a40f [SPARK-33900][WEBUI] Show shuffle read size / records correctly when only remotebytesread is available ### What changes were proposed in this pull request? Shuffle Read Size / Records can also be displayed in remoteBytesRead>0 localBytesRead=0. current: ![image](https://user-images.githubusercontent.com/3898450/103079421-c4ca2280-460e-11eb-9e2f-49d35b5d324d.png) fix: ![image](https://user-images.githubusercontent.com/3898450/103079439-cc89c700-460e-11eb-9a41-6b2882980d11.png) ### Why are the changes needed? At present, the page only displays the data of Shuffle Read Size / Records when localBytesRead>0. When there is only remote reading, metrics cannot be seen on the stage page. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? manual test Closes #30916 from cxzl25/SPARK-33900. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit 700f5ab65c1c84522302ce92d176adf229c34daa) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 24 December 2020, 15:55:38 UTC
e13d47a [SPARK-33892][SQL] Display char/varchar in DESC and SHOW CREATE TABLE ### What changes were proposed in this pull request? Display char/varchar in - DESC table - DESC column - SHOW CREATE TABLE ### Why are the changes needed? show the correct definition for users ### Does this PR introduce _any_ user-facing change? yes, char/varchar column's will print char/varchar instead of string ### How was this patch tested? new tests Closes #30908 from yaooqinn/SPARK-33892. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 29cca68e9e55fae8389378de6f30d0dfa7a74010) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 December 2020, 08:56:12 UTC
8da8e60 [SPARK-33895][SQL] Char and Varchar fail in MetaOperation of ThriftServer ### What changes were proposed in this pull request? ``` Caused by: java.lang.IllegalArgumentException: Unrecognized type name: CHAR(10) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.toJavaSQLType(SparkGetColumnsOperation.scala:187) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:203) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:195) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$4(SparkGetColumnsOperation.scala:99) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$4$adapted(SparkGetColumnsOperation.scala:98) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) ``` meta operation is targeting raw table schema, we need to handle these types there. ### Why are the changes needed? bugfix, see the above case ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests locally ![image](https://user-images.githubusercontent.com/8326978/103069196-cdfcc480-45f9-11eb-9c6a-d4c42123c6e3.png) Closes #30914 from yaooqinn/SPARK-33895. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d7dc42d5f6bbe861c7e4ac1bb49e0830af5e19f4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 December 2020, 07:40:49 UTC
ed9749b [SPARK-33659][SS] Document the current behavior for DataStreamWriter.toTable API ### What changes were proposed in this pull request? Follow up work for #30521, document the following behaviors in the API doc: - Figure out the effects when configurations are (provider/partitionBy) conflicting with the existing table. - Document the lack of functionality on creating a v2 table, and guide that the users should ensure a table is created in prior to avoid the behavior unintended/insufficient table is being created. ### Why are the changes needed? We didn't have full support for the V2 table created in the API now. (TODO SPARK-33638) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Document only. Closes #30885 from xuanyuanking/SPARK-33659. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 86c1cfc5791dae5f2ee8ccd5095dbeb2243baba6) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 24 December 2020, 03:44:50 UTC
10cb18f [SPARK-33877][SQL][FOLLOWUP] SQL reference documents for INSERT w/ a column list ### What changes were proposed in this pull request? followup of https://github.com/apache/spark/commit/a3dd8dacee8f6b316be90500f9fd8ec8997a5784 via suggestion https://github.com/apache/spark/pull/30888#discussion_r547822642 ### Why are the changes needed? doc improvement ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing GA doc Closes #30909 from yaooqinn/SPARK-33877-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 368a2c341d8f3315c759e1c2362439534a9d44e7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 December 2020, 23:38:42 UTC
2bde0bd [SPARK-33893][CORE] Exclude fallback block manager from executorList ### What changes were proposed in this pull request? This PR aims to exclude fallback block manager from `executorList` function. ### Why are the changes needed? When a fallback storage is used, the executors UI tab hangs because the executor list REST API result doesn't have `peakMemoryMetrics` of `ExecutorMetrics`. The root cause is that the block manager id used by fallback storage is included in the API result and it doesn't have `peakMemoryMetrics` because it's populated during HeartBeat reporting. We should hide it. ### Does this PR introduce _any_ user-facing change? No. This is a bug fix on UI. ### How was this patch tested? Manual. Run the following and visit Spark `executors` tab UI with browser. ``` bin/spark-shell -c spark.storage.decommission.fallbackStorage.path=file:///tmp/spark-storage/ ``` Closes #30911 from dongjoon-hyun/SPARK-33893. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit d467d817260d6ca605c34f493e68d0877209170f) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 December 2020, 23:32:07 UTC
1c32557 [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends ### What changes were proposed in this pull request? This is a retry of #30177. This is not a complete fix, but it would take long time to complete (#30242). As discussed offline, at least using `ContextAwareIterator` should be helpful enough for many cases. As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends. ### Why are the changes needed? Python/Pandas UDF right after off-heap vectorized reader could cause executor crash. E.g.,: ```py spark.range(0, 100000, 1, 1).write.parquet(path) spark.conf.set("spark.sql.columnVector.offheap.enabled", True) def f(x): return 0 fUdf = udf(f, LongType()) spark.read.parquet(path).select(fUdf('id')).head() ``` This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests, and manually. Closes #30899 from ueshin/issues/SPARK-33277/context_aware_iterator. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 5c9b421c3711ba373b4d5cbbd83a8ece91291ed0) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 December 2020, 22:48:11 UTC
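Conceptually the wrapper is tiny: keep delegating to the parent iterator, but report exhaustion as soon as the task has completed or been interrupted. A hedged sketch close to the idea described above (not necessarily the exact class added by the PR):

```scala
import org.apache.spark.TaskContext

// Wraps the parent iterator so that a consumer running in another thread
// (e.g. the Python/Pandas UDF writer thread) stops pulling rows once the
// task is done and upstream resources (like off-heap column vectors) may be freed.
class ContextAwareIterator[+T](context: TaskContext, delegate: Iterator[T]) extends Iterator[T] {
  override def hasNext: Boolean =
    !context.isCompleted() && !context.isInterrupted() && delegate.hasNext

  override def next(): T = delegate.next()
}
```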
b174ac7 [SPARK-32916][SHUFFLE][TEST-MAVEN][TEST-HADOOP2.7] Ensure the number of chunks in meta file and index file are equal ### What changes were proposed in this pull request? 1. Fixes for bugs in `RemoteBlockPushResolver` where the number of chunks in meta file and index file are inconsistent due to exceptions while writing to either index file or meta file. This java class was introduced in https://github.com/apache/spark/pull/30062. - If the writing to index file fails, the position of meta file is not reset. This means that the number of chunks in meta file is inconsistent with index file. - During the exception handling while writing to index/meta file, we just set the pointer to the start position. If the files are closed just after this then it doesn't get rid of any the extra bytes written to it. 2. Adds an IOException threshold. If the `RemoteBlockPushResolver` encounters IOExceptions greater than this threshold while updating data/meta/index file of a shuffle partition, then it responds to the client with exception- `IOExceptions exceeded the threshold` so that client can stop pushing data for this shuffle partition. 3. When the update to metadata fails, exception is not propagated back to the client. This results in the increased size of the current chunk. However, with (2) in place, the current chunk will still be of a manageable size. ### Why are the changes needed? This fix is needed for the bugs mentioned above. 1. Moved writing to meta file after index file. This fixes the issue because if there is an exception writing to meta file, then the index file position is not updated. With this change, if there is an exception writing to index file, then none of the files are effectively updated and the same is true vice-versa. 2. Truncating the lengths of data/index/meta files when the partition is finalized. 3. When the IOExceptions have reached the threshold, it is most likely that future blocks will also face the issue. So, it is better to let the clients know so that they can stop pushing the blocks for that partition. 4. When just the meta update fails, client retries pushing the block which was successfully merged to data file. This can be avoided by letting the chunk grow slightly. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests for all the bugs and threshold. Closes #30433 from otterc/SPARK-32916-followup. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 0677c39009de0830d995da77332f0756c76d6b56) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 23 December 2020, 18:42:52 UTC
0b746ea [SPARK-33889][SQL][3.1] Fix NPE from `SHOW PARTITIONS` on V2 tables ### What changes were proposed in this pull request? At `ShowPartitionsExec.run()`, check that a row returned by `listPartitionIdentifiers()` contains a `null` field, and convert it to `"null"`. ### Why are the changes needed? Because `SHOW PARTITIONS` throws NPE on V2 table with `null` partition values. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added new UT to `v2.ShowPartitionsSuite`. Closes #30907 from MaxGekk/fix-npe-show-partitions-3.1. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 December 2020, 15:51:21 UTC
6a04775 [SPARK-33891][DOCS][CORE] Update dynamic allocation related documents ### What changes were proposed in this pull request? This PR aims to update the followings. - Remove the outdated requirement for `spark.shuffle.service.enabled` in `configuration.md` - Dynamic allocation section in `job-scheduling.md` ### Why are the changes needed? To make the document up-to-date. ### Does this PR introduce _any_ user-facing change? No, it's a documentation update. ### How was this patch tested? Manual. **BEFORE** ![Screen Shot 2020-12-23 at 2 22 04 AM](https://user-images.githubusercontent.com/9700541/102986441-ae647f80-44c5-11eb-97a3-87c2d368952a.png) ![Screen Shot 2020-12-23 at 2 22 34 AM](https://user-images.githubusercontent.com/9700541/102986473-bcb29b80-44c5-11eb-8eae-6802001c6dfa.png) **AFTER** ![Screen Shot 2020-12-23 at 2 25 36 AM](https://user-images.githubusercontent.com/9700541/102986767-2df24e80-44c6-11eb-8540-e74856a4c313.png) ![Screen Shot 2020-12-23 at 2 21 13 AM](https://user-images.githubusercontent.com/9700541/102986366-8e34c080-44c5-11eb-8054-1efd07c9458c.png) Closes #30906 from dongjoon-hyun/SPARK-33891. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 47d1aa4e93f668774fd0b16c780d3b1f6200bcd8) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 23 December 2020, 14:43:39 UTC
a0d51ec [SPARK-31960][YARN][DOCS][FOLLOW-UP] Document the behaviour change of Hadoop's classpath propagation in migration guide ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/28788, and proposes to update migration guide. ### Why are the changes needed? To tell users about the behaviour change. ### Does this PR introduce _any_ user-facing change? Yes, it updates migration guides for users. ### How was this patch tested? GitHub Actions' documentation build should test it. Closes #30903 from HyukjinKwon/SPARK-31960-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit d98c216e1959e276877c3d0a9562cc4cdd8b41bb) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 23 December 2020, 09:04:40 UTC
7ebb6c7 [SPARK-33879][SQL] Char Varchar values fail w/ match error as partition columns ### What changes were proposed in this pull request? ```sql spark-sql> select * from t10 where c0='abcd'; 20/12/22 15:43:38 ERROR SparkSQLDriver: Failed in [select * from t10 where c0='abcd'] scala.MatchError: CharType(10) (of class org.apache.spark.sql.types.CharType) at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:815) at org.apache.spark.sql.catalyst.expressions.CastBase.cast$lzycompute(Cast.scala:842) at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:842) at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:844) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476) at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.$anonfun$toRow$2(interface.scala:164) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at org.apache.spark.sql.types.StructType.map(StructType.scala:102) at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.toRow(interface.scala:158) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3(ExternalCatalogUtils.scala:157) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3$adapted(ExternalCatalogUtils.scala:156) ``` c0 is a partition column, and the query fails in the partition pruning rule. In this PR, we replace char/varchar with string type before the CAST happens. ### Why are the changes needed? Bugfix; see the case above. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Yes, new tests. Closes #30887 from yaooqinn/SPARK-33879. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 2287f56a3e105e04cf4e86283eaee12f270c09a7) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 23 December 2020, 07:14:44 UTC
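The substitution mentioned above can be sketched as a small type rewrite, assuming Spark 3.1's `CharType`/`VarcharType`; this is an illustration of the idea, not the exact code path in the partition-pruning rule:

```scala
import org.apache.spark.sql.types._

// Illustrative only: treat char/varchar partition columns as plain strings before
// casting, so Cast never sees an unhandled CharType/VarcharType.
def replaceCharVarcharWithString(dt: DataType): DataType = dt match {
  case CharType(_) | VarcharType(_) => StringType
  case other => other
}
```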
733b018 fix build 23 December 2020, 05:29:01 UTC
b33363b [SPARK-33877][SQL] SQL reference documents for INSERT w/ a column list We support a column list of INSERT for Spark v3.1.0 (See: SPARK-32976 (https://github.com/apache/spark/pull/29893)). So, this PR targets at documenting it in the SQL documents. ### What changes were proposed in this pull request? improve doc ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? doc ### How was this patch tested? passing GA doc gen. ![image](https://user-images.githubusercontent.com/8326978/102954876-8994fa00-450f-11eb-81f9-931af6d1f69b.png) ![image](https://user-images.githubusercontent.com/8326978/102954900-99acd980-450f-11eb-9733-115ad37d2319.png) ![image](https://user-images.githubusercontent.com/8326978/102954935-af220380-450f-11eb-9aaa-fdae0725d41e.png) ![image](https://user-images.githubusercontent.com/8326978/102954949-bc3ef280-450f-11eb-8a0d-d7b688efa7bb.png) Closes #30888 from yaooqinn/SPARK-33877. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit a3dd8dacee8f6b316be90500f9fd8ec8997a5784) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 December 2020, 03:46:48 UTC
60a1e59 [SPARK-33364][SQL][FOLLOWUP] Refine the catalog v2 API to purge a table This is a followup of https://github.com/apache/spark/pull/30267 Inspired by https://github.com/apache/spark/pull/30886, it's better to have 2 methods `def dropTable` and `def purgeTable`, than `def dropTable(ident)` and `def dropTable(ident, purge)`. 1. make the APIs orthogonal. Previously, `def dropTable(ident, purge)` calls `def dropTable(ident)` and is a superset. 2. simplifies the catalog implementation a little bit. Now the `if (purge) ... else ...` check is done at the Spark side. No. existing tests Closes #30890 from cloud-fan/purgeTable. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit ec1560af251d2c3580f5bccfabc750f1c7af09df) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 23 December 2020, 02:48:05 UTC
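The orthogonal shape described above can be sketched roughly as follows — a simplified Scala trait for illustration, not the actual Java `TableCatalog` interface:

```scala
import org.apache.spark.sql.connector.catalog.Identifier

// Illustrative only: two independent methods instead of an overloaded dropTable(ident, purge).
trait TableCatalogSketch {
  // Drop the table's metadata (and data, if the catalog chooses to).
  def dropTable(ident: Identifier): Boolean

  // Drop and skip the trash; catalogs that cannot purge simply refuse.
  // Spark decides which method to call based on the PURGE option.
  def purgeTable(ident: Identifier): Boolean =
    throw new UnsupportedOperationException("Purge option is not supported.")
}
```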
154917c [MINOR] Spelling sql/core ### What changes were proposed in this pull request? This PR intends to fix typos in the sub-modules: * `sql/core` Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618 NOTE: The misspellings have been reported at https://github.com/jsoref/spark/commit/706a726f87a0bbf5e31467fae9015218773db85b#commitcomment-44064356 ### Why are the changes needed? Misspelled words make it harder to read / understand content. ### Does this PR introduce _any_ user-facing change? There are various fixes to documentation, etc... ### How was this patch tested? No testing was performed Closes #30531 from jsoref/spelling-sql-core. Authored-by: Josh Soref <jsoref@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit a093d6feefb0e086d19c86ae53bf92df12ccf2fa) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 December 2020, 19:38:11 UTC
999bbb6 [BUILD][MINOR] Do not publish snapshots from forks ### What changes were proposed in this pull request? The GitHub workflow `Publish Snapshot` publishes master and 3.1 branch via Nexus. For this, the workflow uses `secrets.NEXUS_USER` and `secrets.NEXUS_PW` secrets. These are not available in forks where this workflow fails every day: - https://github.com/G-Research/spark/actions/runs/431626797 - https://github.com/G-Research/spark/actions/runs/433153049 - https://github.com/G-Research/spark/actions/runs/434680048 - https://github.com/G-Research/spark/actions/runs/436958780 ### Why are the changes needed? Avoid attempting to publish snapshots from forked repositories. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Code review only. Closes #30884 from EnricoMi/branch-do-not-publish-snapshots-from-forks. Authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 1d450250eb1db7e4f40451f369db830a8f01ec15) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 22 December 2020, 15:23:01 UTC
17fe038 [SPARK-33876][SQL] Add length-check for reading char/varchar from tables w/ an external location ### What changes were proposed in this pull request? This PR adds the length check to the existing ApplyCharPadding rule. Tables will have external locations when users execute SET LOCATION or CREATE TABLE ... LOCATION. If the location contains over-length values, we should FAIL ON READ. ### Why are the changes needed? ```sql spark-sql> INSERT INTO t2 VALUES ('1', 'b12345'); Time taken: 0.141 seconds spark-sql> alter table t set location '/tmp/hive_one/t2'; Time taken: 0.095 seconds spark-sql> select * from t; 1 b1234 ``` The above case should fail rather than implicitly applying truncation. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #30882 from yaooqinn/SPARK-33876. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6da5cdf1dbfc35cee0ce32aa9e44c0b4187373d9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 December 2020, 14:24:22 UTC
685178e [SPARK-33846][SQL] Include Comments for a nested schema in StructType.toDDL ### What changes were proposed in this pull request? ```scala val nestedStruct = new StructType() .add(StructField("b", StringType).withComment("Nested comment")) val struct = new StructType() .add(StructField("a", nestedStruct).withComment("comment")) struct.toDDL ``` Currently, this returns: ``` `a` STRUCT<`b`: STRING> COMMENT 'comment'` ``` With this PR, the code above returns: ``` `a` STRUCT<`b`: STRING COMMENT 'Nested comment'> COMMENT 'comment'` ``` ### Why are the changes needed? My team is using nested columns as first-class citizens, and I thought it would be nice to have comments for nested columns. ### Does this PR introduce _any_ user-facing change? Now, when users call something like this, ```scala spark.table("foo.bar").schema.fields.map(_.toDDL).mkString(", ") ``` they will get comments for the nested columns. ### How was this patch tested? I added unit tests under `org.apache.spark.sql.types.StructTypeSuite`. They test whether a nested StructType's comment is included in the DDL string. Closes #30851 from jacobhjkim/structtype-toddl. Authored-by: Jacob Kim <me@jacobkim.io> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 43a562035cd79083d06d9422a66488dba801066a) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 22 December 2020, 08:55:33 UTC
e09c59f [SPARK-33860][SQL] Make CatalystTypeConverters.convertToCatalyst match special Array value ### What changes were proposed in this pull request? Add cases to match Arrays whose element types are primitive. ### Why are the changes needed? We get an exception when using `Literal.create(Array(1, 2, 3), ArrayType(IntegerType))`. ``` Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to array<int>, but class int[] found. at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:215) at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:292) at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:140) ``` The same problem occurs with other arrays whose element types are primitive. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added a test. Closes #30868 from ulysses-you/SPARK-33860. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 1dd63dccd893162f8ef969e42273a794ad73e49c) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 22 December 2020, 06:11:04 UTC
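The kind of case being added can be sketched like this. It is an illustrative helper, not the actual `CatalystTypeConverters` code; `GenericArrayData` is the real catalyst class, and only a few primitive element types are shown:

```scala
import org.apache.spark.sql.catalyst.util.GenericArrayData

// Illustrative only: convert primitive Java arrays explicitly so that values such as
// Array(1, 2, 3) become catalyst array data instead of failing Literal's validation.
def toCatalystArray(value: Any): GenericArrayData = value match {
  case arr: Array[Int]    => new GenericArrayData(arr.toSeq)
  case arr: Array[Long]   => new GenericArrayData(arr.toSeq)
  case arr: Array[Double] => new GenericArrayData(arr.toSeq)
  case arr: Array[Any]    => new GenericArrayData(arr)
  case other => throw new IllegalArgumentException(s"Unsupported array value: $other")
}
```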
e705760 [MINOR][CORE] Remove unused variable CompressionCodec.DEFAULT_COMPRESSION_CODEC ### What changes were proposed in this pull request? This PR removed an unused variable `CompressionCodec.DEFAULT_COMPRESSION_CODEC`. ### Why are the changes needed? Apache Spark 3.0.0 centralized this default value to `IO_COMPRESSION_CODEC.defaultValue` via [SPARK-26462](https://github.com/apache/spark/pull/23447). We had better remove this variable to avoid any potential confusion in the future. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CI compilation. Closes #30880 from dongjoon-hyun/minor. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 16ae3a5c12f1bbd6c9f5f735bfad0cf51fdf2182) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 December 2020, 03:49:08 UTC
33c7049 [SPARK-33834][SQL] Verify ALTER TABLE CHANGE COLUMN with Char and Varchar ### What changes were proposed in this pull request? Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected changes. For v1 tables, changing the type is not allowed; we fix a regression that used the replaced string type instead of the original char/varchar type when altering char/varchar columns. For v2 tables, char/varchar to string, char(x) to char(x), and char(x)/varchar(x) to varchar(y) if x <= y are valid cases; other changes are invalid. ### Why are the changes needed? Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected changes. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test Closes #30833 from yaooqinn/SPARK-33834. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f5fd10b1bc519cc05c98f5235fda3d59155cda9d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 December 2020, 03:07:37 UTC
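The v2 rules above can be summarized as a small predicate. This is an illustrative sketch assuming Spark 3.1's `CharType`/`VarcharType`, not the actual check in the analyzer:

```scala
import org.apache.spark.sql.types._

// Illustrative only: which char/varchar type changes are considered valid for v2 tables.
def isValidCharVarcharChange(from: DataType, to: DataType): Boolean = (from, to) match {
  case (CharType(_) | VarcharType(_), StringType) => true   // widen to string
  case (CharType(x), CharType(y))                 => x == y // same char length only
  case (CharType(x), VarcharType(y))              => x <= y // char(x) -> varchar(y), x <= y
  case (VarcharType(x), VarcharType(y))           => x <= y // varchar(x) -> varchar(y), x <= y
  case _                                          => false
}
```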
bfa4f18 [SPARK-33873][CORE][TESTS] Test all compression codecs with encrypted spilling ### What changes were proposed in this pull request? This PR aims to test all compression codecs for encrypted spilling. ### Why are the changes needed? To improve test coverage. Currently, only `CompressionCodec.DEFAULT_COMPRESSION_CODEC` is under testing. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the updated test cases. Closes #30879 from dongjoon-hyun/SPARK-33873. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f62e957b31a281c542514c27da32ccda8e4bda46) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 December 2020, 00:35:13 UTC
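One way to read "test all compression codecs" is a parameterized loop over codec names, roughly like the sketch below. The codec list is hardcoded for illustration, the config keys are real Spark settings, and `runEncryptedSpillWorkload` is a stub standing in for the actual spilling test body:

```scala
import org.apache.spark.SparkConf

// Illustrative only: exercise every codec with encrypted spilling instead of only the default.
// The spill workload itself is elided; this shows the parameterization pattern.
def runEncryptedSpillWorkload(conf: SparkConf): Unit = { /* spill-heavy job goes here */ }

Seq("lz4", "lzf", "snappy", "zstd").foreach { codec =>
  val conf = new SparkConf(false)
    .set("spark.io.compression.codec", codec)
    .set("spark.io.encryption.enabled", "true")
  runEncryptedSpillWorkload(conf)
}
```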
d5231f5 [SPARK-33836][SS][PYTHON][FOLLOW-UP] Use test utils and clean up doctests in table and toTable ### What changes were proposed in this pull request? This PR proposes to: - Make doctests simpler to show the usage (since we're not running them now). - Use the test utils to drop the tables if exists. ### Why are the changes needed? Better docs and code readability. ### Does this PR introduce _any_ user-facing change? No, dev-only. It includes some doc changes in unreleased branches. ### How was this patch tested? Manually tested. ```bash cd python ./run-tests --python-executable=python3.9,python3.8 --testnames "pyspark.sql.tests.test_streaming StreamingTests" ``` Closes #30873 from HyukjinKwon/SPARK-33836. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 4106731fdd508c1af6e15b4f9dc2bb139e047174) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 21 December 2020, 21:27:46 UTC
e6d7fac [SPARK-33869][PYTHON][SQL][TESTS] Have a separate metastore directory for each PySpark test job ### What changes were proposed in this pull request? This PR proposes to have its own metastore directory to avoid potential conflict in catalog operations. ### Why are the changes needed? To make PySpark tests less flaky. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested by trying some sleeps in https://github.com/apache/spark/pull/30873. Closes #30875 from HyukjinKwon/SPARK-33869. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 38bbccab7560f2cfd00f9f85ca800434efe950b4) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 December 2020, 19:11:36 UTC
1171904 [SPARK-28863][SQL][FOLLOWUP] Make sure optimized plan will not be re-analyzed ### What changes were proposed in this pull request? It's a known issue that re-analyzing an optimized plan can lead to various issues. We made several attempts to avoid it from happening, but the current solution `AlreadyOptimized` is still not 100% safe, as people can inject catalyst rules to call analyzer directly. This PR proposes a simpler and safer idea: we set the `analyzed` flag to true after optimization, and analyzer will skip processing plans whose `analyzed` flag is true. ### Why are the changes needed? make the code simpler and safer ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests. Closes #30777 from cloud-fan/ds. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b4bea1aa8972cdfd8901757a0ed990a20fca620f) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 December 2020, 11:59:47 UTC
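Conceptually, the change looks like the sketch below. The class and method names are illustrative; in Spark the flag lives on the logical plan classes rather than a standalone type:

```scala
// Illustrative only: a plan carries an "analyzed" flag; the analyzer short-circuits on it,
// and the optimizer sets it on its result so an optimized plan is never re-analyzed.
final class PlanSketch(var analyzed: Boolean = false)

def analyze(plan: PlanSketch)(runAnalyzerRules: PlanSketch => PlanSketch): PlanSketch =
  if (plan.analyzed) plan            // skip plans already marked as analyzed
  else {
    val resolved = runAnalyzerRules(plan)
    resolved.analyzed = true
    resolved
  }

def optimize(plan: PlanSketch)(runOptimizerRules: PlanSketch => PlanSketch): PlanSketch = {
  val optimized = runOptimizerRules(plan)
  optimized.analyzed = true          // mark it so a later analyze() call is a no-op
  optimized
}
```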
a1eea37 [SPARK-33850][SQL][FOLLOWUP] Improve and cleanup the test code ### What changes were proposed in this pull request? This PR mainly improves and cleans up the test code introduced in #30855 based on the comment. The test code is actually taken from another test `explain formatted - check presence of subquery in case of DPP` so this PR cleans the code too ( removed unnecessary `withTable`). ### Why are the changes needed? To keep the test code clean. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `ExplainSuite` passes. Closes #30861 from sarutak/followup-SPARK-33850. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 3c8be3983cd390306e9abbfe078536a08881a5d6) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 December 2020, 11:50:52 UTC
9c946c3 [SPARK-33853][SQL] EXPLAIN CODEGEN and BenchmarkQueryTest don't show subquery code ### What changes were proposed in this pull request? This PR fixes an issue that `EXPLAIN CODEGEN` and `BenchmarkQueryTest` don't show the corresponding code for subqueries. The following example is about `EXPLAIN CODEGEN`. ``` spark.conf.set("spark.sql.adaptive.enabled", "false") val df = spark.range(1, 100) df.createTempView("df") spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN") scala> spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN") Found 1 WholeStageCodegen subtrees. == Subtree 1 / 1 (maxMethodCodeSize:55; maxConstantPoolSize:97(0.15% used); numInnerClasses:0) == *(1) Project [Subquery scalar-subquery#3, [id=#24] AS scalarsubquery()#5L] : +- Subquery scalar-subquery#3, [id=#24] : +- *(2) HashAggregate(keys=[], functions=[min(id#0L)], output=[v#2L]) : +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#20] : +- *(1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L]) : +- *(1) Range (1, 100, step=1, splits=12) +- *(1) Scan OneRowRelation[] Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=1 /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private scala.collection.Iterator rdd_input_0; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] project_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1]; /* 011 */ /* 012 */ public GeneratedIteratorForCodegenStage1(Object[] references) { /* 013 */ this.references = references; /* 014 */ } /* 015 */ /* 016 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 017 */ partitionIndex = index; /* 018 */ this.inputs = inputs; /* 019 */ rdd_input_0 = inputs[0]; /* 020 */ project_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 021 */ /* 022 */ } /* 023 */ /* 024 */ private void project_doConsume_0() throws java.io.IOException { /* 025 */ // common sub-expressions /* 026 */ /* 027 */ project_mutableStateArray_0[0].reset(); /* 028 */ /* 029 */ if (false) { /* 030 */ project_mutableStateArray_0[0].setNullAt(0); /* 031 */ } else { /* 032 */ project_mutableStateArray_0[0].write(0, 1L); /* 033 */ } /* 034 */ append((project_mutableStateArray_0[0].getRow())); /* 035 */ /* 036 */ } /* 037 */ /* 038 */ protected void processNext() throws java.io.IOException { /* 039 */ while ( rdd_input_0.hasNext()) { /* 040 */ InternalRow rdd_row_0 = (InternalRow) rdd_input_0.next(); /* 041 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 042 */ project_doConsume_0(); /* 043 */ if (shouldStop()) return; /* 044 */ } /* 045 */ } /* 046 */ /* 047 */ } ``` After this change, the corresponding code for subqueries are shown. ``` Found 3 WholeStageCodegen subtrees. 
== Subtree 1 / 3 (maxMethodCodeSize:282; maxConstantPoolSize:206(0.31% used); numInnerClasses:0) == *(1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L]) +- *(1) Range (1, 100, step=1, splits=12) Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=1 /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private boolean agg_initAgg_0; /* 010 */ private boolean agg_bufIsNull_0; /* 011 */ private long agg_bufValue_0; /* 012 */ private boolean range_initRange_0; /* 013 */ private long range_nextIndex_0; /* 014 */ private TaskContext range_taskContext_0; /* 015 */ private InputMetrics range_inputMetrics_0; /* 016 */ private long range_batchEnd_0; /* 017 */ private long range_numElementsTodo_0; /* 018 */ private boolean agg_agg_isNull_2_0; /* 019 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[3]; /* 020 */ /* 021 */ public GeneratedIteratorForCodegenStage1(Object[] references) { /* 022 */ this.references = references; /* 023 */ } /* 024 */ /* 025 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 026 */ partitionIndex = index; /* 027 */ this.inputs = inputs; /* 028 */ /* 029 */ range_taskContext_0 = TaskContext.get(); /* 030 */ range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics(); /* 031 */ range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 032 */ range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 033 */ range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0); /* 034 */ /* 035 */ } /* 036 */ /* 037 */ private void agg_doAggregateWithoutKey_0() throws java.io.IOException { /* 038 */ // initialize aggregation buffer /* 039 */ agg_bufIsNull_0 = true; /* 040 */ agg_bufValue_0 = -1L; /* 041 */ /* 042 */ // initialize Range /* 043 */ if (!range_initRange_0) { /* 044 */ range_initRange_0 = true; /* 045 */ initRange(partitionIndex); /* 046 */ } /* 047 */ /* 048 */ while (true) { /* 049 */ if (range_nextIndex_0 == range_batchEnd_0) { /* 050 */ long range_nextBatchTodo_0; /* 051 */ if (range_numElementsTodo_0 > 1000L) { /* 052 */ range_nextBatchTodo_0 = 1000L; /* 053 */ range_numElementsTodo_0 -= 1000L; /* 054 */ } else { /* 055 */ range_nextBatchTodo_0 = range_numElementsTodo_0; /* 056 */ range_numElementsTodo_0 = 0; /* 057 */ if (range_nextBatchTodo_0 == 0) break; /* 058 */ } /* 059 */ range_batchEnd_0 += range_nextBatchTodo_0 * 1L; /* 060 */ } /* 061 */ /* 062 */ int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L); /* 063 */ for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) { /* 064 */ long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0; /* 065 */ /* 066 */ agg_doConsume_0(range_value_0); /* 067 */ /* 068 */ // shouldStop check is eliminated /* 069 */ } /* 070 */ range_nextIndex_0 = range_batchEnd_0; /* 071 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0); /* 072 */ 
range_inputMetrics_0.incRecordsRead(range_localEnd_0); /* 073 */ range_taskContext_0.killTaskIfInterrupted(); /* 074 */ } /* 075 */ /* 076 */ } /* 077 */ /* 078 */ private void initRange(int idx) { /* 079 */ java.math.BigInteger index = java.math.BigInteger.valueOf(idx); /* 080 */ java.math.BigInteger numSlice = java.math.BigInteger.valueOf(12L); /* 081 */ java.math.BigInteger numElement = java.math.BigInteger.valueOf(99L); /* 082 */ java.math.BigInteger step = java.math.BigInteger.valueOf(1L); /* 083 */ java.math.BigInteger start = java.math.BigInteger.valueOf(1L); /* 084 */ long partitionEnd; /* 085 */ /* 086 */ java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start); /* 087 */ if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) { /* 088 */ range_nextIndex_0 = Long.MAX_VALUE; /* 089 */ } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) { /* 090 */ range_nextIndex_0 = Long.MIN_VALUE; /* 091 */ } else { /* 092 */ range_nextIndex_0 = st.longValue(); /* 093 */ } /* 094 */ range_batchEnd_0 = range_nextIndex_0; /* 095 */ /* 096 */ java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice) /* 097 */ .multiply(step).add(start); /* 098 */ if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) { /* 099 */ partitionEnd = Long.MAX_VALUE; /* 100 */ } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) { /* 101 */ partitionEnd = Long.MIN_VALUE; /* 102 */ } else { /* 103 */ partitionEnd = end.longValue(); /* 104 */ } /* 105 */ /* 106 */ java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract( /* 107 */ java.math.BigInteger.valueOf(range_nextIndex_0)); /* 108 */ range_numElementsTodo_0 = startToEnd.divide(step).longValue(); /* 109 */ if (range_numElementsTodo_0 < 0) { /* 110 */ range_numElementsTodo_0 = 0; /* 111 */ } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) { /* 112 */ range_numElementsTodo_0++; /* 113 */ } /* 114 */ } /* 115 */ /* 116 */ private void agg_doConsume_0(long agg_expr_0_0) throws java.io.IOException { /* 117 */ // do aggregate /* 118 */ // common sub-expressions /* 119 */ /* 120 */ // evaluate aggregate functions and update aggregation buffers /* 121 */ /* 122 */ agg_agg_isNull_2_0 = true; /* 123 */ long agg_value_2 = -1L; /* 124 */ /* 125 */ if (!agg_bufIsNull_0 && (agg_agg_isNull_2_0 || /* 126 */ agg_value_2 > agg_bufValue_0)) { /* 127 */ agg_agg_isNull_2_0 = false; /* 128 */ agg_value_2 = agg_bufValue_0; /* 129 */ } /* 130 */ /* 131 */ if (!false && (agg_agg_isNull_2_0 || /* 132 */ agg_value_2 > agg_expr_0_0)) { /* 133 */ agg_agg_isNull_2_0 = false; /* 134 */ agg_value_2 = agg_expr_0_0; /* 135 */ } /* 136 */ /* 137 */ agg_bufIsNull_0 = agg_agg_isNull_2_0; /* 138 */ agg_bufValue_0 = agg_value_2; /* 139 */ /* 140 */ } /* 141 */ /* 142 */ protected void processNext() throws java.io.IOException { /* 143 */ while (!agg_initAgg_0) { /* 144 */ agg_initAgg_0 = true; /* 145 */ long agg_beforeAgg_0 = System.nanoTime(); /* 146 */ agg_doAggregateWithoutKey_0(); /* 147 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* aggTime */).add((System.nanoTime() - agg_beforeAgg_0) / 1000000); /* 148 */ /* 149 */ // output the result /* 150 */ /* 151 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[1] /* numOutputRows */).add(1); /* 152 */ range_mutableStateArray_0[2].reset(); /* 153 */ /* 154 */ range_mutableStateArray_0[2].zeroOutNullBytes(); 
/* 155 */ /* 156 */ if (agg_bufIsNull_0) { /* 157 */ range_mutableStateArray_0[2].setNullAt(0); /* 158 */ } else { /* 159 */ range_mutableStateArray_0[2].write(0, agg_bufValue_0); /* 160 */ } /* 161 */ append((range_mutableStateArray_0[2].getRow())); /* 162 */ } /* 163 */ } /* 164 */ /* 165 */ } ``` ### Why are the changes needed? For better debuggability. ### Does this PR introduce _any_ user-facing change? Yes. After this change, users can see subquery code by `EXPLAIN CODEGEN`. ### How was this patch tested? New test. Closes #30859 from sarutak/explain-codegen-subqueries. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit f4e1069bb835e3e132f7758e5842af79f26cd162) Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 21 December 2020, 11:29:31 UTC
ce75ce0 [SPARK-33836][SS][PYTHON] Expose DataStreamReader.table and DataStreamWriter.toTable ### What changes were proposed in this pull request? This PR proposes to expose `DataStreamReader.table` (SPARK-32885) and `DataStreamWriter.toTable` (SPARK-32896) to PySpark, which are the only way to read and write with table in Structured Streaming. ### Why are the changes needed? Please refer SPARK-32885 and SPARK-32896 for rationalizations of these public APIs. This PR only exposes them to PySpark. ### Does this PR introduce _any_ user-facing change? Yes, PySpark users will be able to read and write with table in Structured Streaming query. ### How was this patch tested? Manually tested. > v1 table >> create table A and ingest to the table A ``` spark.sql(""" create table table_pyspark_parquet ( value long, `timestamp` timestamp ) USING parquet """) df = spark.readStream.format('rate').option('rowsPerSecond', 100).load() query = df.writeStream.toTable('table_pyspark_parquet', checkpointLocation='/tmp/checkpoint5') query.lastProgress query.stop() ``` >> read table A and ingest to the table B which doesn't exist ``` df2 = spark.readStream.table('table_pyspark_parquet') query2 = df2.writeStream.toTable('table_pyspark_parquet_nonexist', format='parquet', checkpointLocation='/tmp/checkpoint2') query2.lastProgress query2.stop() ``` >> select tables ``` spark.sql("DESCRIBE TABLE table_pyspark_parquet").show() spark.sql("SELECT * FROM table_pyspark_parquet").show() spark.sql("DESCRIBE TABLE table_pyspark_parquet_nonexist").show() spark.sql("SELECT * FROM table_pyspark_parquet_nonexist").show() ``` > v2 table (leveraging Apache Iceberg as it provides V2 table and custom catalog as well) >> create table A and ingest to the table A ``` spark.sql(""" create table iceberg_catalog.default.table_pyspark_v2table ( value long, `timestamp` timestamp ) USING iceberg """) df = spark.readStream.format('rate').option('rowsPerSecond', 100).load() query = df.select('value', 'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table', checkpointLocation='/tmp/checkpoint_v2table_1') query.lastProgress query.stop() ``` >> ingest to the non-exist table B ``` df2 = spark.readStream.format('rate').option('rowsPerSecond', 100).load() query2 = df2.select('value', 'timestamp').writeStream.toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist', checkpointLocation='/tmp/checkpoint_v2table_2') query2.lastProgress query2.stop() ``` >> ingest to the non-exist table C partitioned by `value % 10` ``` df3 = spark.readStream.format('rate').option('rowsPerSecond', 100).load() df3a = df3.selectExpr('value', 'timestamp', 'value % 10 AS partition').repartition('partition') query3 = df3a.writeStream.partitionBy('partition').toTable('iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned', checkpointLocation='/tmp/checkpoint_v2table_3') query3.lastProgress query3.stop() ``` >> select tables ``` spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table").show() spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table").show() spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist").show() spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist").show() spark.sql("DESCRIBE TABLE iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show() spark.sql("SELECT * FROM iceberg_catalog.default.table_pyspark_v2table_nonexist_partitioned").show() ``` Closes #30835 from HeartSaVioR/SPARK-33836. 
Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 8d4d43319191ada0e07e3b27abe41929aa3eefe5) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 December 2020, 10:43:20 UTC
b9282f7 [SPARK-33829][SQL][3.1] Renaming v2 tables should recreate the cache ### What changes were proposed in this pull request? Backport of #30825 Currently, renaming v2 tables does not invalidate/recreate the cache, leading to an incorrect behavior (cache not being used) when v2 tables are renamed. This PR fixes the behavior. ### Why are the changes needed? Fixing a bug since the cache associated with the renamed table is not being cleaned up/recreated. ### Does this PR introduce _any_ user-facing change? Yes, now when a v2 table is renamed, cache is correctly updated. ### How was this patch tested? Added a new test Closes #30856 from imback82/backport_30825. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 December 2020, 06:41:09 UTC
f544848 [SPARK-26341][WEBUI][FOLLOWUP] Update stage memory metrics on stage end ### What changes were proposed in this pull request? This is a followup PR for #30573. After this change is applied, stage memory metrics will be updated on stage end. ### Why are the changes needed? After #30573, executor memory metrics are updated on stage end but stage memory metrics are not. It's better to update both metrics like `updateStageLevelPeakExecutorMetrics` does. ### Does this PR introduce _any_ user-facing change? Yes. Stage memory metrics are updated more accurately. ### How was this patch tested? After running a job and visiting `/api/v1/<appid>/stages`, I confirmed the `peakExecutorMemory` metrics are shown even though the lifetime of each stage is very short. I also modified the JSON files for `HistoryServerSuite`. Closes #30858 from sarutak/followup-SPARK-26341. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 8e2633962f789a6ba5eb9448596f6ac4b7b1c2ff) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 21 December 2020, 01:38:59 UTC
5124558 [SPARK-33756][SQL] Make BytesToBytesMap's MapIterator idempotent ### What changes were proposed in this pull request? Make the `hasNext` method of BytesToBytesMap's MapIterator idempotent. ### Why are the changes needed? `hasNext` may be called multiple times; if not guarded, a second call after reaching the end of the iterator will throw a NoSuchElementException. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? Updated a unit test to cover this case. Closes #30728 from advancedxy/SPARK-33756. Authored-by: Xianjin YE <advancedxy@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 13391683e7a863671d3d719dc81e20ec2a870725) Signed-off-by: Sean Owen <srowen@gmail.com> 20 December 2020, 14:51:27 UTC
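The guard can be illustrated with a generic wrapper. This is a sketch only; the real `BytesToBytesMap.MapIterator` is Java code and also has to free its memory pages exactly once:

```scala
import java.util.NoSuchElementException

// Illustrative only: cache the "exhausted" state so repeated hasNext calls are safe.
class IdempotentIterator[T](underlying: Iterator[T]) extends Iterator[T] {
  private var exhausted = false

  override def hasNext: Boolean = {
    if (!exhausted && !underlying.hasNext) {
      exhausted = true // any one-time cleanup would happen here, exactly once
    }
    !exhausted
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("End of iterator")
    underlying.next()
  }
}
```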
f931b13 [SPARK-33854][BUILD] Use ListBuffer instead of Stack in SparkBuild.scala ### What changes were proposed in this pull request? This PR aims to use ListBuffer instead of Stack in SparkBuild.scala to remove deprecation warning. ### Why are the changes needed? Stack is deprecated in Scala 2.12.0. ```scala % build/sbt compile ... [warn] /Users/william/spark/project/SparkBuild.scala:1112:25: class Stack in package mutable is deprecated (since 2.12.0): Stack is an inelegant and potentially poorly-performing wrapper around List. Use a List assigned to a var instead. [warn] val stack = new Stack[File]() ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual. Closes #30860 from williamhyun/SPARK-33854. Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 2b6ef5606bec1a4547c8e850440bf12cc3422e1d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 December 2020, 22:19:57 UTC
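A sketch of the replacement pattern, using an illustrative directory walk rather than the exact `SparkBuild.scala` code: a `ListBuffer` driven LIFO-style stands in for the deprecated `mutable.Stack`:

```scala
import java.io.File
import scala.collection.mutable.ListBuffer

// Illustrative only: depth-first walk using ListBuffer prepend/remove(0) as push/pop.
def listDirectories(root: File): Seq[File] = {
  val stack = new ListBuffer[File]()
  val seen = new ListBuffer[File]()
  stack.prepend(root)                        // push
  while (stack.nonEmpty) {
    val dir = stack.remove(0)                // pop
    seen += dir
    Option(dir.listFiles()).getOrElse(Array.empty[File])
      .filter(_.isDirectory)
      .foreach(stack.prepend(_))
  }
  seen.toSeq
}
```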
c0ac578 [SPARK-33850][SQL] EXPLAIN FORMATTED doesn't show the plan for subqueries if AQE is enabled ### What changes were proposed in this pull request? This PR fixes an issue that when AQE is enabled, EXPLAIN FORMATTED doesn't show the plan for subqueries. ```scala val df = spark.range(1, 100) df.createTempView("df") spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("FORMATTED") == Physical Plan == AdaptiveSparkPlan (3) +- Project (2) +- Scan OneRowRelation (1) (1) Scan OneRowRelation Output: [] Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0) (2) Project Output [1]: [Subquery subquery#3, [id=#20] AS scalarsubquery()#5L] Input: [] (3) AdaptiveSparkPlan Output [1]: [scalarsubquery()#5L] Arguments: isFinalPlan=false ``` After this change, the plan for the subquerie is shown. ```scala == Physical Plan == * Project (2) +- * Scan OneRowRelation (1) (1) Scan OneRowRelation [codegen id : 1] Output: [] Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0) (2) Project [codegen id : 1] Output [1]: [Subquery scalar-subquery#3, [id=#24] AS scalarsubquery()#5L] Input: [] ===== Subqueries ===== Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#3, [id=#24] * HashAggregate (6) +- Exchange (5) +- * HashAggregate (4) +- * Range (3) (3) Range [codegen id : 1] Output [1]: [id#0L] Arguments: Range (1, 100, step=1, splits=Some(12)) (4) HashAggregate [codegen id : 1] Input [1]: [id#0L] Keys: [] Functions [1]: [partial_min(id#0L)] Aggregate Attributes [1]: [min#7L] Results [1]: [min#8L] (5) Exchange Input [1]: [min#8L] Arguments: SinglePartition, ENSURE_REQUIREMENTS, [id=#20] (6) HashAggregate [codegen id : 2] Input [1]: [min#8L] Keys: [] Functions [1]: [min(id#0L)] Aggregate Attributes [1]: [min(id#0L)#4L] Results [1]: [min(id#0L)#4L AS v#2L] ``` ### Why are the changes needed? For better debuggability. ### Does this PR introduce _any_ user-facing change? Yes. Users can see the formatted plan for subqueries. ### How was this patch tested? New test. Closes #30855 from sarutak/fix-aqe-explain. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 70da86a085b61a0981c3f9fc6dbd897716472642) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 December 2020, 22:10:44 UTC
1ce8000 [SPARK-32976][SQL][FOLLOWUP] SET and RESTORE hive.exec.dynamic.partition.mode for HiveSQLInsertTestSuite to avoid flakiness ### What changes were proposed in this pull request? As https://github.com/apache/spark/pull/29893#discussion_r545303780 mentioned: > We need to set spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") before executing this suite; otherwise, test("insert with column list - follow table output order + partitioned table") will fail. The reason why it does not fail because some test cases [running before this suite] do not change the default value of hive.exec.dynamic.partition.mode back to strict. However, the order of test suite execution is not deterministic. ### Why are the changes needed? avoid flakiness in tests ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #30843 from yaooqinn/SPARK-32976-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit dd44ba5460c3850c87e93c2c126d980cb1b3a8b4) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 December 2020, 16:00:18 UTC
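The set-and-restore idea can be sketched as a small helper. It assumes the public `SparkSession.conf` API; it is not the actual `withSQLConf` test utility, and `runInsertTests()` in the usage comment is a hypothetical placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: set a config for the duration of a block, then restore the previous value.
def withConf[T](spark: SparkSession, key: String, value: String)(body: => T): T = {
  val previous = spark.conf.getOption(key)
  spark.conf.set(key, value)
  try body finally {
    previous match {
      case Some(v) => spark.conf.set(key, v)
      case None    => spark.conf.unset(key)
    }
  }
}

// e.g. withConf(spark, "hive.exec.dynamic.partition.mode", "nonstrict") { runInsertTests() }
```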
ad2718d [SPARK-33843][BUILD] Upgrade to Zstd 1.4.8 ### What changes were proposed in this pull request? This PR aims to upgrade Zstd library to 1.4.8. ### Why are the changes needed? This will bring Zstd 1.4.7 and 1.4.8 improvement and bug fixes and the following from `zstd-jni`. - https://github.com/facebook/zstd/releases/tag/v1.4.7 - https://github.com/facebook/zstd/releases/tag/v1.4.8 - https://github.com/luben/zstd-jni/issues/153 (Apple M1 architecture) ### Does this PR introduce _any_ user-facing change? This will unblock Apple Silicon usage. ### How was this patch tested? Pass the CIs. Closes #30848 from dongjoon-hyun/SPARK-33843. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 00642ee19e6969ca7996fb44d16d001fcf17b407) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 December 2020, 14:59:57 UTC
50d5c23 [SPARK-33841][CORE][3.1] Fix issue with jobs disappearing intermittently from the SHS under high load ### What changes were proposed in this pull request? Mark SHS event log entries that were `processing` at the beginning of the `checkForLogs` run as not stale and check for this mark before deleting an event log. This fixes the issue when a particular job was displayed in the SHS and disappeared after some time, but then, in several minutes showed up again. ### Why are the changes needed? The issue is caused by [SPARK-29043](https://issues.apache.org/jira/browse/SPARK-29043), which is designated to improve the concurrent performance of the History Server. The [change](https://github.com/apache/spark/pull/25797/files#) breaks the ["app deletion" logic](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R563) because of missing proper synchronization for `processing` event log entries. Since SHS now [filters out](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) all `processing` event log entries, such entries do not have a chance to be [updated with the new `lastProcessed`](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R472) time and thus any entity that completes processing right after [filtering](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) and before [the check for stale entities](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R560) will be identified as stale and will be deleted from the UI until the next `checkForLogs` run. This is because [updated `lastProcessed` time is used as criteria](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R557), and event log entries that missed to be updated with a new time, will match that criteria. The issue can be reproduced by generating a big number of event logs and uploading them to the SHS event log directory on S3. Essentially, around 439(49.6 MB) copies of an event log file were created using [shs-monitor](https://github.com/vladhlinsky/shs-monitor/tree/branch-3.1) script. Strange behavior of SHS counting the total number of applications was noticed - at first, the number was increasing as expected, but with the next page refresh, the total number of applications decreased. No errors were logged by SHS. 252 entities are displayed at `21:20:23`: ![1-252-entries-at-21-20](https://user-images.githubusercontent.com/61428392/102653857-40901f00-4178-11eb-9d61-6a20e359abb2.png) 178 entities are displayed at `21:22:15`: ![2-178-at-21-22](https://user-images.githubusercontent.com/61428392/102653900-530a5880-4178-11eb-94fb-3f28b082b25a.png) ### Does this PR introduce _any_ user-facing change? Yes, SHS users won't face the behavior when the number of displayed applications decreases periodically. ### How was this patch tested? Tested using [shs-monitor](https://github.com/vladhlinsky/shs-monitor/tree/branch-3.1) script: * Build SHS with the proposed change * Download Hadoop AWS and AWS Java SDK * Prepare S3 bucket and user for programmatic access, grant required roles to the user. 
Get access key and secret key * Configure SHS to read event logs from S3 * Start [monitor](https://github.com/vladhlinsky/shs-monitor/blob/branch-3.1/monitor.sh) script to query SHS API * Run [producers](https://github.com/vladhlinsky/shs-monitor/blob/branch-3.1/producer.sh) * Wait for SHS to load all the applications * Verify that the number of loaded applications increases continuously over time For more details, please refer to the [shs-monitor](https://github.com/vladhlinsky/shs-monitor/tree/branch-3.1) repository. Closes #30847 from vladhlinsky/SPARK-33841-branch-3.1. Authored-by: Vlad Glinsky <vladhlinsky@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 December 2020, 23:16:45 UTC
d89c87e [SPARK-33840][DOCS] Add spark.sql.files.minPartitionNum to performance tuning doc ### What changes were proposed in this pull request? Add `spark.sql.files.minPartitionNum` and its description to sql-performance-tuning.md. ### Why are the changes needed? Help users find it. ### Does this PR introduce _any_ user-facing change? Yes, it's a doc change. ### How was this patch tested? Pass CI. Closes #30838 from ulysses-you/SPARK-33840. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit bc46d273e0ae0d13d0e31e30e39198ac19dcd27b) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 18 December 2020, 11:27:51 UTC
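For reference, a minimal usage example of the documented option, assuming an active `spark` session (e.g. in spark-shell); the value 16 and the input path are arbitrary:

```scala
// spark.sql.files.minPartitionNum is a suggested (not guaranteed) minimum number of
// partitions when reading file-based sources; if unset, Spark falls back to the
// cluster's default parallelism.
spark.conf.set("spark.sql.files.minPartitionNum", "16")
val df = spark.read.parquet("/path/to/data")
println(df.rdd.getNumPartitions) // generally at least 16, given enough input splits
```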
8a269c7 [SPARK-33593][SQL] Vector reader got incorrect data with binary partition value Currently, when the parquet vectorized reader is enabled, using a binary type as a partition column returns an incorrect value, as shown in the UT below: ```scala test("Parquet vector reader incorrect with binary partition value") { Seq(false, true).foreach(tag => { withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) { withTable("t1") { sql( """CREATE TABLE t1(name STRING, id BINARY, part BINARY) | USING PARQUET PARTITIONED BY (part)""".stripMargin) sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')") if (tag) { checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"), Row("a", "Spark SQL", "")) } else { checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"), Row("a", "Spark SQL", "Spark SQL")) } } } }) } ``` Fixes the incorrect-data issue. No user-facing change. Added a UT. Closes #30824 from AngersZhuuuu/SPARK-33593. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 0603913c666bae1a9640f2f1469fe50bc59e461d) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 December 2020, 08:02:09 UTC
0f9f950 [MINOR][INFRA] Add -Pspark-ganglia-lgpl to the build definition with Scala 2.13 on GitHub Actions ### What changes were proposed in this pull request? This PR adds `-Pspark-ganglia-lgpl` to the build definition with Scala 2.13 on GitHub Actions. ### Why are the changes needed? Keep the code build-able with Scala 2.13. With this change, all the sub-modules seems to be built-able with Scala 2.13. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I confirmed Scala 2.13 build pass with the following command. ``` $ ./dev/change-scala-version.sh 2.13 $ build/sbt -Pspark-ganglia-lgpl -Pscala-2.13 compile test:compile ``` Closes #30834 from sarutak/ganglia-scala-2.13. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b0da2bcd464b24d58e2ce56d4f93f1f9527839ff) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 18 December 2020, 06:10:27 UTC
df68376 [SPARK-26341][WEBUI] Expose executor memory metrics at the stage level, in the Stages tab ### What changes were proposed in this pull request? Expose executor memory metrics at the stage level, in the Stages tab. Currently it looks like below, and I am not sure which columns we will truly need. ![image](https://user-images.githubusercontent.com/46485123/101170248-2256f900-3679-11eb-8c34-794fcf8e94a8.png) ![image](https://user-images.githubusercontent.com/46485123/101170359-4dd9e380-3679-11eb-984b-b0430f236160.png) ![image](https://user-images.githubusercontent.com/46485123/101314915-86a1d480-3894-11eb-9b6f-8050d326e11f.png) ### Why are the changes needed? Users can see executor JVM usage more directly in the Spark UI. ### Does this PR introduce any user-facing change? Users can see executor JVM usage more directly in the Spark UI. ### How was this patch tested? Manually tested. Closes #30573 from AngersZhuuuu/SPARK-26341. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit 25c6cc25f74e8a24aa424f6596a574f26ae80e1d) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 18 December 2020, 05:27:35 UTC
1a0c291 [SPARK-33831][UI] Update to jetty 9.4.34 Update Jetty to 9.4.34 Picks up fixes and improvements, including a possible CVE fix. https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020 https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102 No. Existing tests. Closes #30828 from srowen/SPARK-33831. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 131a23d88a56280d47584aed93bc8fb617550717) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 December 2020, 03:10:52 UTC
16739f3 [SPARK-33824][PYTHON][DOCS] Restructure and improve Python package management page ### What changes were proposed in this pull request? This PR proposes to restructure and refine the Python dependency management page. I recently wrote a blog post which will be published soon, and decided to contribute some of the contents back to the PySpark documentation. FWIW, it has been reviewed by some tech writers and engineers. I built the site to make the review easier: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html ### Why are the changes needed? For better documentation. ### Does this PR introduce _any_ user-facing change? It's a doc change, but only in unreleased branches for now. ### How was this patch tested? I manually built the docs as: ```bash cd python/docs make clean html open ``` Closes #30822 from HyukjinKwon/SPARK-33824. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 6315118676c99ccef2566c50ab9873de8876e468) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 18 December 2020, 01:03:22 UTC
258ed8b [SPARK-33822][SQL] Use the `CastSupport.cast` method in HashJoin ### What changes were proposed in this pull request? This PR intends to fix the bug that throws a unsupported exception when running [the TPCDS q5](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q5.sql) with AQE enabled ([this option is enabled by default now via SPARK-33679](https://github.com/apache/spark/commit/031c5ef280e0cba8c4718a6457a44b6cccb17f46)): ``` java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path. at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321) at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ... ``` I've checked the AQE code and I found `EnsureRequirements` wrongly puts `BroadcastExchange` on a top of `BroadcastQueryStage` in the `reOptimize` phase as follows: ``` +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#2183] +- BroadcastQueryStage 2 +- ReusedExchange [d_date_sk#1086], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#1963] ``` A root cause is that a `Cast` class in a required child's distribution does not have a `timeZoneId` field (`timeZoneId=None`), and a `Cast` class in `child.outputPartitioning` has it. 
So, this difference can make the distribution requirement check fail in `EnsureRequirements`: https://github.com/apache/spark/blob/1e85707738a830d33598ca267a6740b3f06b1861/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L47-L50 The `Cast` class that does not have a `timeZoneId` field is generated in the `HashJoin` object. To fix this issue, this PR proposes to use the `CastSupport.cast` method there. ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually checked that q5 passed. Closes #30818 from maropu/BugfixInAQE. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 51ef4430dcbc934d43315ee6bdc851c9be84a1f2) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 December 2020, 00:16:20 UTC
d5265ee [SPARK-33173][CORE][TESTS][FOLLOWUP] Use `local[2]` and AtomicInteger ### What changes were proposed in this pull request? Use `local[2]` to let tasks launch at the same time. And change counters (`numOnTaskXXX`) to `AtomicInteger` type to ensure thread safe. ### Why are the changes needed? The test is still flaky after the fix https://github.com/apache/spark/pull/30072. See: https://github.com/apache/spark/pull/30728/checks?check_run_id=1557987642 And it's easy to reproduce if you test it multiple times (e.g. 100) locally. The test sets up a stage with 2 tasks to run on an executor with 1 core. So these 2 tasks have to be launched one by one. The task-2 will be launched after task-1 fails. However, since we don't retry failed task in local mode (MAX_LOCAL_TASK_FAILURES = 1), the stage will abort right away after task-1 fail and cancels the running task-2 at the same time. There's a chance that task-2 gets canceled before calling `PluginContainer.onTaskStart`, which leads to the test failure. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested manually after the fix and the test is no longer flaky. Closes #30823 from Ngone51/debug-flaky-spark-33088. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 15616f499aca93c98a71732add2a80de863d3d5f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 December 2020, 17:28:43 UTC
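The shape of the fix can be sketched as follows. This is illustrative, not the actual plugin test suite code; the plugin callback wiring and the job itself are elided:

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.SparkConf

// Illustrative only: two local cores so both tasks can be launched at the same time,
// and AtomicInteger counters so increments from different task threads are not lost.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("plugin-callback-counting")

val numOnTaskStart = new AtomicInteger(0)
// inside the plugin's onTaskStart callback: numOnTaskStart.incrementAndGet()
// in the test assertion:                    assert(numOnTaskStart.get() == 2)
```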
919f964 [SPARK-33774][UI][CORE] "Back to Master" returns 500 error in Standalone cluster ### What changes were proposed in this pull request? Initialize `masterWebUiUrl` with `webUi.webUrl` instead of the `masterPublicAddress`. ### Why are the changes needed? Since [SPARK-21642](https://issues.apache.org/jira/browse/SPARK-21642), `WebUI` has changed from `localHostName` to `localCanonicalHostName` as the hostname used to set up the web UI. However, the `masterPublicAddress` comes from `RpcEnv`'s host address, which still uses `localHostName`. As a result, the wrong Master web URL is returned to the Worker. ### Does this PR introduce _any_ user-facing change? Yes, when users click "Back to Master" in the Worker page: Before this PR: <img width="3258" alt="WeChat4acbfd163f51c76a5f9bc388c7479785" src="https://user-images.githubusercontent.com/16397174/102057951-b9664280-3e29-11eb-8749-5ee293902bdf.png"> After this PR: ![image](https://user-images.githubusercontent.com/16397174/102058016-d438b700-3e29-11eb-8641-a23a6b2f542e.png) (Return to the Master page successfully.) ### How was this patch tested? Tested manually. Closes #30759 from Ngone51/fix-back-to-master. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 34e4d87023535c086a0aa43fe194f794b41e09b7) Signed-off-by: Sean Owen <srowen@gmail.com> 17 December 2020, 14:52:20 UTC
b557147 [SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in mutate ### What changes were proposed in this pull request? Change the strategy for how the varargs are handled in the default `mutate` method ### Why are the changes needed? Bugfix -- `deparse` + `sapply` not working as intended due to `width.cutoff` ### Does this PR introduce any user-facing change? Yes, bugfix. Shouldn't change any working code. ### How was this patch tested? None! yet. Closes #28386 from MichaelChirico/r-mutate-deparse. Lead-authored-by: Michael Chirico <michael.chirico@grabtaxi.com> Co-authored-by: Michael Chirico <michaelchirico4@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 17 December 2020, 11:40:26 UTC
799ebd1 [SPARK-33819][CORE][3.1] SingleFileEventLogFileReader/RollingEventLogFilesFileReader should be `package private` ### What changes were proposed in this pull request? This PR aims to make `EventLogFileReader`'s derived classes `package private`: - SingleFileEventLogFileReader - RollingEventLogFilesFileReader `EventLogFileReader` itself is used in the `scheduler` module during tests. ### Why are the changes needed? These classes were designed to be internal. This PR hides them explicitly to reduce the maintenance burden. ### Does this PR introduce _any_ user-facing change? Yes, but these classes were exposed accidentally. ### How was this patch tested? Pass CIs. Closes #30819 from dongjoon-hyun/SPARK-33819-3.1. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 17 December 2020, 10:07:11 UTC
95cceda [SPARK-33821][BUILD] Upgrade SBT to 1.4.5 ### What changes were proposed in this pull request? This PR aims to upgrade SBT to 1.4.5 to support Apple Silicon. ### Why are the changes needed? The following is the release note including `sbt 1.4.5 adds support for Apple silicon (AArch64 also called ARM64)`. - https://github.com/sbt/sbt/releases/tag/v1.4.5 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #30817 from dongjoon-hyun/SPARK-33821. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b1950cc9162999c2200a0a988fa28aee640fb459) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 17 December 2020, 06:50:03 UTC
6b0a48e [SPARK-33697][SQL] RemoveRedundantProjects should require column ordering by default ### What changes were proposed in this pull request? This PR changes the `RemoveRedundantProjects` rule: instead of passing column-ordering requirements down from parent nodes by default, it now always requires column ordering regardless of the parents' requirements unless otherwise specified. More specifically, instead of excluding a few nodes such as GenerateExec and UnionExec that are known to require their children's columns to be ordered, the rule now uses a whitelist of nodes that are allowed to pass the ordering requirements through from their parents. ### Why are the changes needed? Currently, this rule passes ordering requirements from parents directly through to children except for a few excluded nodes. This incorrectly removes the necessary project nodes below a UnionExec, since UnionExec is not excluded. An earlier PR fixed a similar issue for GenerateExec (SPARK-32861). To prevent similar issues, the rule should be changed to always require column ordering except for a few specific nodes that we know for sure can pass the requirements through. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #30659 from allisonwang-db/spark-33697-remove-project-union. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1e85707738a830d33598ca267a6740b3f06b1861) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 December 2020, 05:47:53 UTC
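A standalone sketch of the inverted default described above (the operator names are illustrative strings, not the real `SparkPlan` classes): ordering is required unless a node is on an explicit allowlist that may pass its parent's requirement through.

```scala
object OrderingRequirementSketch {
  // Nodes known to tolerate a relaxed ordering; everything else requires it.
  private val passThroughAllowed = Set("ProjectExec", "FilterExec")

  def requiresOrdering(node: String, parentRequiresOrdering: Boolean): Boolean =
    if (passThroughAllowed.contains(node)) parentRequiresOrdering else true

  def main(args: Array[String]): Unit = {
    println(requiresOrdering("UnionExec", parentRequiresOrdering = false))   // true: never relaxed
    println(requiresOrdering("ProjectExec", parentRequiresOrdering = false)) // false: may pass through
  }
}
```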
af91f87 [SPARK-33775][BUILD] Suppress sbt compilation warnings in Scala 2.13 ### What changes were proposed in this pull request? There are too many compilation warnings in Scala 2.13, so this PR adds some `-Wconf:msg=<regex>` rules to `SparkBuild.scala` to suppress them; the suppressed warnings will not be printed to the console. The suppressed compilation warnings include: - All warnings related to `method\value\type\object\trait\inheritance` deprecated since 2.13 - All warnings related to `Widening conversion from XXX to YYY is deprecated because it loses precision` - Auto-application to `()` is deprecated. Supply the empty argument list `()` explicitly to invoke method `methodName`, or remove the empty argument list from its definition (Java-defined methods are exempt). In Scala 3, an unapplied method like this will be eta-expanded into a function. - method with a single empty parameter list overrides method without any parameter list - method without a parameter list overrides a method with a single empty one The compilation warnings that are not suppressed include: - Unicode escapes in triple quoted strings are deprecated, use the literal character instead. - view bounds are deprecated - symbol literal is deprecated ### Why are the changes needed? Suppress unimportant compilation warnings in Scala 2.13. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30760 from LuciferYang/SPARK-33775. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 477046c63fab281570d26a183be4b0b8b77ac41a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 December 2020, 23:06:39 UTC
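A hedged sketch of what such suppression rules look like in an sbt build definition; the regexes below are illustrative and do not claim to match the exact rules added to `SparkBuild.scala`.

```scala
// In an sbt build definition (build.sbt or project/*.scala). The trailing ":s"
// action silences matching warnings; anything not matched is still reported.
Compile / scalacOptions ++= Seq(
  "-Wconf:msg=method .* in .* is deprecated:s",
  "-Wconf:msg=Widening conversion from .* to .* is deprecated:s",
  "-Wconf:msg=Auto-application to \\(\\) is deprecated:s"
)
```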
7269867 [SPARK-33514][SQL][FOLLOW-UP] Remove unused TruncateTableStatement case class ### What changes were proposed in this pull request? This PR removes the unused `TruncateTableStatement`: https://github.com/apache/spark/pull/30457#discussion_r544433820 ### Why are the changes needed? To remove the unused `TruncateTableStatement` left over from #30457. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed. Closes #30811 from imback82/remove_truncate_table_stmt. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit e7e29fd0affe81a24959ecc0286ec4c85f319722) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 December 2020, 22:13:59 UTC
fea2a8d [SPARK-33810][TESTS] Reenable test cases disabled in SPARK-31732 ### What changes were proposed in this pull request? The test failures were due to the Jenkins machines being slow. We switched to Ubuntu 20, if I am not wrong. Unlike in the past, all machines now appear to be functioning properly and the tests pass without problems. This PR proposes to re-enable the tests. ### Why are the changes needed? To restore test coverage. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Jenkins jobs in this PR will show whether the tests are still flaky. Closes #30798 from HyukjinKwon/do-not-merge-test. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3d0323401f7a3e4369a3d3f4ff98f15d19e8a643) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 December 2020, 16:34:36 UTC
0c3efb6 [SPARK-32991][SQL][FOLLOWUP] Reset command relies on session initials first ### What changes were proposed in this pull request? As a follow-up of https://github.com/apache/spark/pull/30045, we modify the RESET command to respect each session's initial configs first and then fall back to the `SharedState` conf, so that each session can maintain its own copy of initial configs for resetting. ### Why are the changes needed? To make the RESET command behave more sanely. ### Does this PR introduce _any_ user-facing change? Yes, RESET respects the session initial configs first instead of always going back to the system defaults. ### How was this patch tested? Added new tests. Closes #30642 from yaooqinn/SPARK-32991-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 205d8e40bc8446c5953c9a082ffaede3029d1d53) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 December 2020, 14:36:50 UTC
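A minimal sketch against the public API of the behaviour described above; the config key and values are illustrative, and the "after the fix" outcome is taken from the PR description rather than verified.

```scala
import org.apache.spark.sql.SparkSession

object ResetSketch {
  def main(args: Array[String]): Unit = {
    // A config supplied at session-creation time acts as a "session initial".
    val spark = SparkSession.builder()
      .master("local[1]")
      .config("spark.sql.shuffle.partitions", "10")
      .getOrCreate()

    spark.sql("SET spark.sql.shuffle.partitions = 99")
    spark.sql("RESET")

    // After the fix, RESET falls back to the session-initial "10";
    // previously it could fall back to the system-wide default instead.
    println(spark.conf.get("spark.sql.shuffle.partitions"))
    spark.stop()
  }
}
```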
dab9664 [SPARK-33752][SQL][3.1] Avoid getSimpleMessage of AnalysisException adding a semicolon repeatedly ### What changes were proposed in this pull request? This PR is related to #30724 and backports it to branch-3.1. ### Why are the changes needed? Fixes a bug where a semicolon is appended repeatedly. ### Does this PR introduce _any_ user-facing change? Yes, the message of AnalysisException will be correct. ### How was this patch tested? Jenkins tests. Closes #30792 from beliefer/SPARK-33752-backport. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 16 December 2020, 14:34:49 UTC
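A standalone sketch of the kind of guard that prevents stacking semicolons when a message is rebuilt; this is illustrative only, not the actual `getSimpleMessage` implementation.

```scala
object SemicolonSketch {
  // Append the trailing semicolon only if it is not already there, so calling
  // the formatter twice does not produce "...;;".
  def withTrailingSemicolon(msg: String): String =
    if (msg.endsWith(";")) msg else msg + ";"

  def main(args: Array[String]): Unit = {
    val once  = withTrailingSemicolon("cannot resolve 'x'")
    val twice = withTrailingSemicolon(once)
    println(once)  // cannot resolve 'x';
    println(twice) // unchanged: cannot resolve 'x';
  }
}
```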
2e55fc8 [SPARK-33788][SQL][3.1][3.0][2.4] Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions() ### What changes were proposed in this pull request? Throw `NoSuchPartitionsException` from `ALTER TABLE .. DROP PARTITION` for non-existent partitions of a table in the V1 Hive external catalog. ### Why are the changes needed? The behaviour of the Hive external catalog deviates from the V1/V2 in-memory catalogs, which throw `NoSuchPartitionsException`. To improve the user experience with Spark SQL, it would be better to throw the same exception. ### Does this PR introduce _any_ user-facing change? Yes, the command throws `NoSuchPartitionsException` instead of the general `AnalysisException`. ### How was this patch tested? By running the new UT via: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveDDLSuite" ``` Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 3dfdcf4f92ef5e739f15c22c93d673bb2233e617) Signed-off-by: Max Gekk <max.gekk@gmail.com> Closes #30802 from MaxGekk/hive-drop-partition-exception-3.1. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 16 December 2020, 14:30:47 UTC
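A hedged sketch of the user-visible difference; it assumes a Hive-enabled `spark` session, the table DDL is illustrative, and the exception name comes from the PR description.

```scala
// Assumes `spark` is a Hive-enabled SparkSession.
spark.sql("CREATE TABLE part_tbl (id INT) PARTITIONED BY (p INT)")

try {
  spark.sql("ALTER TABLE part_tbl DROP PARTITION (p = 999)") // partition does not exist
} catch {
  // Before the change: a generic AnalysisException from the Hive catalog.
  // After the change:  NoSuchPartitionsException, matching the in-memory catalogs.
  case e: Exception => println(e.getClass.getName)
}
```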
22ef681 [SPARK-33803][SQL] Sort table properties by key in DESCRIBE TABLE command ### What changes were proposed in this pull request? This PR proposes to sort table properties in the DESCRIBE TABLE command. This is consistent with the DSv2 command: https://github.com/apache/spark/blob/e3058ba17cb4512537953eb4ded884e24ee93ba2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala#L63 This PR also fixes the test case in the Scala 2.13 build, where the table properties have a different order in the map. ### Why are the changes needed? To keep the output deterministic and pretty, and to fix the tests in the Scala 2.13 build. See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/49/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/describe_sql/ ``` describe.sql Expected "...spark_catalog, view.[query.out.col.2=c, view.referredTempFunctionsNames=[], view.catalogAndNamespace.part.1=default]]", but got "...spark_catalog, view.[catalogAndNamespace.part.1=default, view.query.out.col.2=c, view.referredTempFunctionsNames=[]]]" Result did not match for query #29 DESC FORMATTED v ``` ### Does this PR introduce _any_ user-facing change? Yes, it changes the text output of `DESCRIBE [EXTENDED|FORMATTED] table_name`: the table properties are now sorted by key. ### How was this patch tested? Related unit tests were fixed accordingly. Closes #30799 from HyukjinKwon/SPARK-33803. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 7845865b8d5c03a4daf82588be0ff2ebb90152a7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 December 2020, 13:42:38 UTC
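A small standalone sketch of the fix's idea, reusing the property keys from the test output above: sort by key before rendering so the text is deterministic regardless of the underlying map's iteration order.

```scala
object SortedPropertiesSketch {
  def main(args: Array[String]): Unit = {
    val props = Map(
      "view.query.out.col.2"            -> "c",
      "view.referredTempFunctionsNames" -> "[]",
      "view.catalogAndNamespace.part.1" -> "default"
    )

    // Sorting on the key gives the same rendering on Scala 2.12 and 2.13.
    val rendered = props.toSeq.sortBy(_._1)
      .map { case (k, v) => s"$k=$v" }
      .mkString("[", ", ", "]")

    println(rendered)
  }
}
```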
f3709a0 [SPARK-33786][SQL] The storage level for a cache should be respected when a table name is altered ### What changes were proposed in this pull request? This PR proposes to retain the cache's storage level when a table name is altered by `ALTER TABLE ... RENAME TO ...`. ### Why are the changes needed? Currently, when a table name is altered, the table's cache is refreshed (if exists), but the storage level is not retained. For example: ```scala def getStorageLevel(tableName: String): StorageLevel = { val table = spark.table(tableName) val cachedData = spark.sharedState.cacheManager.lookupCachedData(table).get cachedData.cachedRepresentation.cacheBuilder.storageLevel } Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath) sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'") sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')") val oldStorageLevel = getStorageLevel("old") sql("ALTER TABLE old RENAME TO new") val newStorageLevel = getStorageLevel("new") ``` `oldStorageLevel` will be `StorageLevel(memory, deserialized, 1 replicas)` whereas `newStorageLevel` will be `StorageLevel(disk, memory, deserialized, 1 replicas)`, which is the default storage level. ### Does this PR introduce _any_ user-facing change? Yes, now the storage level for the cache will be retained. ### How was this patch tested? Added a unit test. Closes #30774 from imback82/alter_table_rename_cache_fix. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ef7f6903b4fa28c554a1f0b58b9da194979b61ee) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 December 2020, 05:45:55 UTC
50e72dc [SPARK-33793][TESTS] Introduce withExecutor to ensure proper cleanup in tests ### What changes were proposed in this pull request? This PR introduces a helper method `withExecutor` that handles the creation of an Executor object and ensures that it is always stopped in a finally block. The tests in ExecutorSuite have been refactored to use this method. ### Why are the changes needed? Recently it was discovered that leaked Executors (ones not explicitly stopped after a test) can cause other tests to fail because the JVM is killed after 10 minutes. It is therefore crucial that tests always stop the Executor. By introducing this helper method, a simple pattern is established that can be easily adopted in new tests, which reduces the risk of regressions. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran ExecutorSuite locally. Closes #30783 from sander-goos/SPARK-33793-close-executors. Authored-by: Sander Goos <sander.goos@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit ddff94fd32f85072cbc5c752c337f3b89ae00bed) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 16 December 2020, 02:27:08 UTC
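A hedged sketch of the loan pattern the helper follows; the real `withExecutor` in ExecutorSuite takes the Executor's constructor arguments, and the stand-in class below exists only to keep the sketch self-contained.

```scala
// Stand-in for the real Executor so the sketch compiles on its own.
class FakeExecutor { def stop(): Unit = println("executor stopped") }

object WithExecutorSketch {
  // Create the executor, hand it to the test body, and always stop it,
  // even when the body throws.
  def withExecutor[T](create: => FakeExecutor)(body: FakeExecutor => T): T = {
    val executor = create
    try body(executor) finally executor.stop()
  }

  def main(args: Array[String]): Unit = {
    withExecutor(new FakeExecutor) { exec =>
      // ... run assertions against `exec` here ...
      println("test body ran")
    }
  }
}
```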
e30118c [SPARK-33796][DOCS] Show hidden text from the left menu of Spark Doc ### What changes were proposed in this pull request? If the text in the left menu of Spark is too long, it will be hidden. ![sql1](https://user-images.githubusercontent.com/1097932/102249583-5ae7a580-3eb7-11eb-813c-f2e2fe019d28.jpeg) This PR is to fix the style issue. ### Why are the changes needed? Improve the UI of Spark documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test After changes: ![sql2](https://user-images.githubusercontent.com/1097932/102249603-5fac5980-3eb7-11eb-806d-4e7b8248e6b6.jpeg) Closes #30786 from gengliangwang/fixDocStyle. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit dd042f58e7a0fd2289f6889c324c0d5e4c18ad7f) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 16 December 2020, 01:07:50 UTC