https://github.com/apache/spark

4f25b3f Preparing Spark release v3.2.1-rc2 20 January 2022, 05:03:12 UTC
3860ac5 [SPARK-37957][SQL] Correctly pass deterministic flag for V2 scalar functions ### What changes were proposed in this pull request? Pass `isDeterministic` flag to `ApplyFunctionExpression`, `Invoke` and `StaticInvoke` when processing V2 scalar functions. ### Why are the changes needed? A V2 scalar function can be declared as non-deterministic. However, currently Spark doesn't pass the flag when converting the V2 function to a catalyst expression, which could lead to incorrect results if being applied with certain optimizations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a unit test. Closes #35243 from sunchao/SPARK-37957. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Chao Sun <sunchao@apple.com> 19 January 2022, 22:09:27 UTC
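For context, a rough sketch of a user-defined V2 scalar function that declares itself non-deterministic (class and function names are hypothetical; with the fix above, the overridden flag is propagated into the generated catalyst expression so determinism-based optimizations are not applied to it):
```scala
import java.util.concurrent.ThreadLocalRandom

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, LongType}

// Hypothetical non-deterministic V2 scalar function.
class RandLong extends ScalarFunction[java.lang.Long] {
  override def inputTypes(): Array[DataType] = Array.empty
  override def resultType(): DataType = LongType
  override def name(): String = "rand_long"
  // The flag that SPARK-37957 now carries into the catalyst expression.
  override def isDeterministic(): Boolean = false
  override def produceResult(input: InternalRow): java.lang.Long =
    java.lang.Long.valueOf(ThreadLocalRandom.current().nextLong())
}
```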
5cf8108 [SPARK-37959][ML] Fix the UT of checking norm in KMeans & BiKMeans ### What changes were proposed in this pull request? In `KMeansSuite` and `BisectingKMeansSuite`, there are some checks whose results are unused: ``` model1.clusterCenters.forall(Vectors.norm(_, 2) == 1.0 ``` For cosine distance, the norm of each center vector should be 1, so the norm check is meaningful; for euclidean distance, the norm check is meaningless. ### Why are the changes needed? To enable the norm check for cosine distance, and disable it for euclidean distance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated test suites Closes #35247 from zhengruifeng/fix_kmeans_ut. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: huaxingao <huaxin.gao11@gmail.com> (cherry picked from commit 789fce8c8b200eba5f94c2d83b4b83e3bfb9a2b1) Signed-off-by: huaxingao <huaxin.gao11@gmail.com> 19 January 2022, 17:17:47 UTC
75ea645 Preparing development version 3.2.2-SNAPSHOT 15 January 2022, 01:24:15 UTC
ea8ce99 Preparing Spark release v3.2.1-rc2 15 January 2022, 01:24:08 UTC
66e1231 [SPARK-37859][SQL] Do not check for metadata during schema comparison ### What changes were proposed in this pull request? Ignores the metadata when comparing the user-provided schema and the actual schema during BaseRelation resolution. ### Why are the changes needed? Makes it possible to read tables with Spark 3.2 that were written with Spark 3.1, as https://github.com/apache/spark/blob/bd24b4884b804fc85a083f82b864823851d5980c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L312 added a new metadata field that broke compatibility. ### Does this PR introduce _any_ user-facing change? Yes. Previously, an error was thrown when a SQL table written with JDBC in Spark 3.1 was read in Spark 3.2. Now, no error is thrown. ### How was this patch tested? Unit test and manual test with a SQL table written with Spark 3.1. Query: ``` select * from jdbc_table ``` Before: ``` org.apache.spark.sql.AnalysisException: The user-specified schema doesn't match the actual schema: ``` After: no error Closes #35158 from karenfeng/SPARK-37859. Authored-by: Karen Feng <karen.feng@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a1e86373253d77329b2b252c653a69ae8ac0bd6c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 14 January 2022, 02:55:08 UTC
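An illustrative sketch of the comparison change (not the exact Spark code): strip per-field metadata from both schemas before comparing, so the extra JDBC metadata field written by Spark 3.2 no longer breaks resolution of tables written by Spark 3.1.
```scala
import org.apache.spark.sql.types.{Metadata, StructType}

// Compare two schemas while ignoring per-field metadata.
def sameSchemaIgnoringMetadata(expected: StructType, actual: StructType): Boolean = {
  def strip(schema: StructType): StructType =
    StructType(schema.fields.map(_.copy(metadata = Metadata.empty)))
  strip(expected) == strip(actual)
}
```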
a58b8a8 [SPARK-37855][SQL][3.2] IllegalStateException when transforming an array inside a nested struct This is a backport of #35170 for branch-3.2. ### What changes were proposed in this pull request? Skip alias the `ExtractValue` whose children contains `NamedLambdaVariable`. ### Why are the changes needed? Since #32773, the `NamedLambdaVariable` can produce the references, however it cause the rule `NestedColumnAliasing` alias the `ExtractValue` which contains `NamedLambdaVariable`. It fails since we can not match a `NamedLambdaVariable` to an actual attribute. Talk more: During `NamedLambdaVariable#replaceWithAliases`, it uses the references of nestedField to match the output attributes of grandchildren. However `NamedLambdaVariable` is created at analyzer as a virtual attribute, and it is not resolved from the output of children. So we can not get any attribute when use the references of `NamedLambdaVariable` to match the grandchildren's output. ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? Add new test Closes #35175 from ulysses-you/SPARK-37855-branch-3.2. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 12 January 2022, 19:44:04 UTC
31dfbde [SPARK-37871][TESTS] Use `python3` instead of `python` in BaseScriptTransformation tests ### What changes were proposed in this pull request? This PR aims to use `python3` instead of `python` in `BaseScriptTransformation` tests. ### Why are the changes needed? Since Apache Spark deprecated `Python 2`, this PR makes that explicit in the tests. In addition, on some systems only the `python3` or `python3.x` command exists, so invoking `python` fails: ``` [info] - SPARK-25158: Executor accidentally exit because ScriptTransformationWriterThread throw Exception *** FAILED *** (248 milliseconds) [info] "Job aborted due to stage failure: Task 0 in stage 2162.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2162.0 (TID 2627) (0ac7628d09c6 executor driver): org.apache.spark.SparkException: Subprocess exited with status 127. Error: /bin/bash: python: command not found ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #35171 from dongjoon-hyun/SPARK-37871. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 922723189e61d7228d6b5669d836ed86ef95e3ef) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 January 2022, 23:36:14 UTC
deb6776 [SPARK-37866][TESTS] Set `file.encoding` to UTF-8 for SBT tests This PR aims to set `-Dfile.encoding=UTF-8` for SBT tests. To make the tests robust on various OSs with different locales. **BEFORE** ``` $ LANG=C.UTF-8 build/sbt "sql/testOnly *.JDBCV2Suite -- -z non-ascii" ... [info] - column name with non-ascii *** FAILED *** (2 seconds, 668 milliseconds) [info] "== Parsed Logical Plan == [info] 'Project [unresolvedalias('COUNT('?), None)] [info] +- 'UnresolvedRelation [h2, test, person], [], false [info] [info] == Analyzed Logical Plan == [info] count(?): bigint [info] Aggregate [count(?#x) AS count(?)#xL] [info] +- SubqueryAlias h2.test.person [info] +- RelationV2[?#x] test.person [info] [info] == Optimized Logical Plan == [info] Project [COUNT(U&"\540d")#xL AS count(?#x)#xL AS count(?)#xL] [info] +- RelationV2[COUNT(U&"\540d")#xL] test.person [info] [info] == Physical Plan == [info] *(1) Project [COUNT(U&"\540d")#xL AS count(?)#xL] [info] +- *(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1df881a4 [COUNT(U&"\540d")#xL] PushedAggregates: [COUNT(`?`)], PushedFilters: [], PushedGroupByColumns: [], ReadSchema: struct<COUNT(U&"\540d"):bigint> [info] [info] " did not contain "PushedAggregates: [COUNT(`名`)]" (ExplainSuite.scala:66) ``` **AFTER** ``` [info] JDBCV2Suite: ... [info] - column name with non-ascii (2 seconds, 950 milliseconds) ... [info] All tests passed. [success] Total time: 21 s, completed Jan 11, 2022, 2:18:29 AM ``` No. Pass the CIs. Closes #35165 from dongjoon-hyun/SPARK-37866. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 122a59986f6ab358452a293256ad45be0fe96a1c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 January 2022, 19:49:36 UTC
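A minimal sketch of the same idea expressed as sbt settings (the settings Spark actually uses live in project/SparkBuild.scala and may differ): force UTF-8 for forked test JVMs so suites that compare non-ASCII identifiers do not depend on the host locale.
```scala
// In a build.sbt-style definition: fork test JVMs and pin file.encoding.
Test / fork := true
Test / javaOptions += "-Dfile.encoding=UTF-8"
```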
db1023c [SPARK-37860][UI] Fix taskindex in the stage page task event timeline ### What changes were proposed in this pull request? This reverts commit 450b415028c3b00f3a002126cd11318d3932e28f. ### Why are the changes needed? In #32888, shahidki31 changed taskInfo.index to taskInfo.taskId. However, we generally use `index.attempt` or `taskId` to distinguish tasks within a stage, not `taskId.attempt`. Thus #32888 was an incorrect fix and should be reverted. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suites Closes #35160 from stczwd/SPARK-37860. Authored-by: stczwd <qcsd2011@163.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit 3d2fde5242c8989688c176b8ed5eb0bff5e1f17f) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 11 January 2022, 06:23:31 UTC
385b34d [SPARK-37818][DOCS] Add option in the description document for show create table ### What changes were proposed in this pull request? Add options in the description document for `SHOW CREATE TABLE ` command. <img width="767" alt="1" src="https://user-images.githubusercontent.com/41178002/148747443-ecd6586f-e4c4-4ae4-8ea5-969896b7d416.png"> <img width="758" alt="2" src="https://user-images.githubusercontent.com/41178002/148747457-873bc0c3-08fa-4d31-89e7-b44440372462.png"> ### Why are the changes needed? [#discussion](https://github.com/apache/spark/pull/34719#discussion_r758189709) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? SKIP_API=1 SKIP_RDOC=1 SKIP_PYTHONDOC=1 SKIP_SCALADOC=1 bundle exec jekyll build Closes #35107 from Peng-Lei/SPARK-37818. Authored-by: PengLei <peng.8lei@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 2f70e4f84073ac75457263a2b3f3cb835ee63d49) Signed-off-by: Gengliang Wang <gengliang@apache.org> 10 January 2022, 11:40:46 UTC
ae309f0 Preparing development version 3.2.2-SNAPSHOT 07 January 2022, 17:38:42 UTC
2b0ee22 Preparing Spark release v3.2.1-rc1 07 January 2022, 17:38:35 UTC
4b5d2d7 [SPARK-37802][SQL][3.2] Composite field name should work with Aggregate push down ### What changes were proposed in this pull request? Currently, a composite field name such as `dept id` doesn't work with aggregate push down: ``` sql("SELECT COUNT(`dept id`) FROM h2.test.dept") org.apache.spark.sql.catalyst.parser.ParseException: extraneous input 'id' expecting <EOF>(line 1, pos 5) == SQL == dept id -----^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:271) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:132) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseMultipartIdentifier(ParseDriver.scala:63) at org.apache.spark.sql.connector.expressions.LogicalExpressions$.parseReference(expressions.scala:39) at org.apache.spark.sql.connector.expressions.FieldReference$.apply(expressions.scala:365) at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.translateAggregate(DataSourceStrategy.scala:717) at org.apache.spark.sql.execution.datasources.v2.PushDownUtils$.$anonfun$pushAggregates$1(PushDownUtils.scala:125) at scala.collection.immutable.List.flatMap(List.scala:366) at org.apache.spark.sql.execution.datasources.v2.PushDownUtils$.pushAggregates(PushDownUtils.scala:125) ``` ### Why are the changes needed? Bug fix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New test Closes #35125 from huaxingao/backport. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Huaxin Gao <huaxin_gao@apple.com> 07 January 2022, 07:43:31 UTC
2470640 [SPARK-37800][SQL][FOLLOW-UP] Remove duplicated LogicalPlan inheritance ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/35084 that removes duplicated `LogicalPlan` inheritance because `LeafNode` always inherits `LogicalPlan`. ### Why are the changes needed? To make the codes easier to read and less error-prone. e.g., if we switch `LeafNode` to an abstract class like `LogicalPlan` the complication fails. To clarify, this is the only `LeafNode` instance in the current codebase that duplicately inherits `LogicalPlan`. ### Does this PR introduce _any_ user-facing change? No, virtually no-op for now. ### How was this patch tested? Existing test cases should cover. Closes #35105 from HyukjinKwon/SPARK-37800. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit b902ecc21c29e62e729830c22d5e81ed17ab994f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 05 January 2022, 11:08:42 UTC
731e056 [SPARK-37807][SQL] Fix a typo in HttpAuthenticationException message ### What changes were proposed in this pull request? The error message is not correct, so this PR updates it. ### Why are the changes needed? The exception message shown when the password is left empty in HTTP mode of the Hive thrift server is not correct. The text is updated to reflect the actual problem. See the JIRA issue: https://issues.apache.org/jira/browse/SPARK-37807 ### Does this PR introduce _any_ user-facing change? Yes, the exception message in HiveServer2 is changed. ### How was this patch tested? Tested manually Closes #35097 from RamakrishnaChilaka/feature/error_string_fix. Authored-by: Chilaka Ramakrishna <ramakrishna@nference.net> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 27d5575f13fe69459d7fa72cee11d4166c9e1a10) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 05 January 2022, 07:30:30 UTC
9b6be1a [SPARK-30789][SQL][DOCS][FOLLOWUP] Add document for syntax `(IGNORE | RESPECT) NULLS` ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/30943 added support for the syntax `(IGNORE | RESPECT) NULLS` for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE, but did not update the documentation. The screenshot before this PR: ![screenshot-20211231-174803](https://user-images.githubusercontent.com/8486025/147816336-debca074-0b84-48e8-9ed2-cb13f562cf12.png) This PR adds documentation for the syntax `(IGNORE | RESPECT) NULLS`. The screenshots after this PR: ![image](https://user-images.githubusercontent.com/8486025/148141568-506e9232-a3c4-4a25-a5c6-65a5d5a2e066.png) ![image](https://user-images.githubusercontent.com/8486025/148061495-b7198417-9d4c-4c03-9060-385271ea9a46.png) ### Why are the changes needed? Add documentation for the syntax `(IGNORE | RESPECT) NULLS` ### Does this PR introduce _any_ user-facing change? No. Just documentation updates. ### How was this patch tested? Manual check. Closes #35079 from beliefer/SPARK-30789-docs. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 93c614bf1e6aba092d82bcd8616b5ea31eb191a2) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 January 2022, 04:57:42 UTC
537de84 [SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv ### What changes were proposed in this pull request? This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`. ### Why are the changes needed? `System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3 recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested manually. Closes #35069 from Ngone51/fix-workerwatcher-exit. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: yi.wu <yi.wu@databricks.com> (cherry picked from commit 639d6f40e597d79c680084376ece87e40f4d2366) Signed-off-by: yi.wu <yi.wu@databricks.com> 05 January 2022, 02:48:50 UTC
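A minimal sketch of the pattern described above (object, method, and thread names are hypothetical): trigger System.exit from a dedicated thread rather than from an RpcEnv dispatcher thread, so shutdown hooks that stop RPC-related components cannot deadlock with the thread that requested the exit.
```scala
object ExitHelper {
  // Run System.exit outside the caller's (RpcEnv) thread.
  def exitInSeparateThread(code: Int): Unit = {
    val exiter = new Thread(new Runnable {
      override def run(): Unit = System.exit(code)
    }, "WorkerWatcher-exit")
    exiter.setDaemon(true)
    exiter.start()
  }
}
```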
45b7b7e [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState() ### What changes were proposed in this pull request? This PR fixes a correctness issue in the CodeGenerator.addBufferedState() helper method (which is used by the SortMergeJoinExec operator). The addBufferedState() method generates code for buffering values that come from a row in an operator's input iterator, performing any necessary copying so that the buffered values remain correct after the input iterator advances to the next row. The current logic does not correctly handle UDTs: these fall through to the match statement's default branch, causing UDT values to be buffered without copying. This is problematic if the UDT's underlying SQL type is an array, map, struct, or string type (since those types require copying). Failing to copy values can lead to correctness issues or crashes. This patch's fix is simple: when the dataType is a UDT, use its underlying sqlType for determining whether values need to be copied. I used an existing helper function to perform this type unwrapping. ### Why are the changes needed? Fix a correctness issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I manually tested this change by re-running a workload which failed with a segfault prior to this patch. See JIRA for more details: https://issues.apache.org/jira/browse/SPARK-37784 So far I have been unable to come up with a CI-runnable regression test which would have failed prior to this change (my only working reproduction runs in a pre-production environment and does not fail in my development environment). Closes #35066 from JoshRosen/SPARK-37784. Authored-by: Josh Rosen <joshrosen@databricks.com> Signed-off-by: Josh Rosen <joshrosen@databricks.com> (cherry picked from commit eeef48fac412a57382b02ba3f39456d96379b5f5) Signed-off-by: Josh Rosen <joshrosen@databricks.com> 04 January 2022, 19:00:51 UTC
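An illustrative sketch of the fix's idea (not the CodeGenerator code itself): when deciding whether a buffered value must be deep-copied, look through a UDT to its underlying SQL type, because a UDT backed by an array, map, struct, or string holds reused memory that must be copied before the input iterator advances.
```scala
import org.apache.spark.sql.types._

// Unwrap UDTs before deciding whether the buffered value needs a copy.
def requiresCopy(dataType: DataType): Boolean = dataType match {
  case udt: UserDefinedType[_]                   => requiresCopy(udt.sqlType)
  case _: ArrayType | _: MapType | _: StructType => true
  case StringType                                => true
  case _                                         => false
}
```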
5f9b92c [SPARK-37728][SQL][3.2] Reading nested columns with ORC vectorized reader can cause ArrayIndexOutOfBoundsException ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/35002 . When an OrcColumnarBatchReader is created, method initBatch will be called only once. In method initBatch: `orcVectorWrappers[i] = OrcColumnVectorUtils.toOrcColumnVector(dt, wrap.batch().cols[colId]);` When the second argument of toOrcColumnVector is a ListColumnVector/MapColumnVector, orcVectorWrappers[i] is initialized with the ListColumnVector or MapColumnVector's offsets and lengths. However, when method nextBatch of OrcColumnarBatchReader is called, method ensureSize of ColumnVector (and its subclasses, like MultiValuedColumnVector) could be called, then the ListColumnVector/MapColumnVector's offsets and lengths could refer to new array objects. This could result in the ArrayIndexOutOfBoundsException. This PR makes OrcArrayColumnVector.getArray and OrcMapColumnVector.getMap always get offsets and lengths from the underlying ColumnVector, which can resolve this issue. ### Why are the changes needed? Bugfix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the CIs with the newly added test case. Closes #35038 from yym1995/branch-3.2. Lead-authored-by: Yimin <yimin.y@outlook.com> Co-authored-by: Yimin Yang <26797163+yym1995@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 04 January 2022, 17:26:34 UTC
c280f08 [SPARK-37800][SQL] TreeNode.argString incorrectly formats arguments of type Set[_] ### What changes were proposed in this pull request? Fixing [SPARK-37800: TreeNode.argString incorrectly formats arguments of type Set\[_\]](https://issues.apache.org/jira/browse/SPARK-37800). ### Why are the changes needed? On May 21, 2021, 746d80d87a480344d4cc0d30b481047cf85c0aa9 by maropu introduced this bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test added to `TreeNodeSuite`. Closes #35084 from ssimeonov/ss_spark37800. Authored-by: Simeon Simeonov <sim@fastignite.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a8addd487bcd4118fd87dabe686d07f658216eb7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 January 2022, 04:39:03 UTC
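A hypothetical illustration of the formatting concern (not TreeNode's actual code): a Set[_] constructor argument needs its own rendering branch so its elements appear in explain output instead of being handled by a branch intended for other argument types.
```scala
// Render a node argument; sets get an explicit, order-insensitive rendering.
def formatArg(arg: Any): String = arg match {
  case set: Set[_] => set.toSeq.map(String.valueOf).sorted.mkString("{", ", ", "}")
  case other       => String.valueOf(other)
}
```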
edf0f6b [SPARK-37779][SQL] Make ColumnarToRowExec plan canonicalizable after (de)serialization This PR proposes to add a driver-side check on `supportsColumnar` sanity check at `ColumnarToRowExec`. SPARK-23731 fixed the plans to be serializable by leveraging lazy but SPARK-28213 happened to refer to the lazy variable at: https://github.com/apache/spark/blob/77b164aac9764049a4820064421ef82ec0bc14fb/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L68 This can fail during canonicalization during, for example, eliminating sub common expressions (on executor side): ``` java.lang.NullPointerException at org.apache.spark.sql.execution.FileSourceScanExec.supportsColumnar$lzycompute(DataSourceScanExec.scala:280) at org.apache.spark.sql.execution.FileSourceScanExec.supportsColumnar(DataSourceScanExec.scala:279) at org.apache.spark.sql.execution.InputAdapter.supportsColumnar(WholeStageCodegenExec.scala:509) at org.apache.spark.sql.execution.ColumnarToRowExec.<init>(Columnar.scala:67) ... at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:581) at org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:580) at org.apache.spark.sql.execution.ScalarSubquery.canonicalized$lzycompute(subquery.scala:110) ... at org.apache.spark.sql.catalyst.expressions.ExpressionEquals.hashCode(EquivalentExpressions.scala:275) ... at scala.collection.mutable.HashTable.findEntry$(HashTable.scala:135) at scala.collection.mutable.HashMap.findEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.get(HashMap.scala:74) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:46) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTreeHelper$1(EquivalentExpressions.scala:147) at org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:170) at org.apache.spark.sql.catalyst.expressions.SubExprEvaluationRuntime.$anonfun$proxyExpressions$1(SubExprEvaluationRuntime.scala:89) at org.apache.spark.sql.catalyst.expressions.SubExprEvaluationRuntime.$anonfun$proxyExpressions$1$adapted(SubExprEvaluationRuntime.scala:89) at scala.collection.immutable.List.foreach(List.scala:392) ``` This fix is still a bandaid fix but at least addresses the issue with minimized change - this fix should ideally be backported too. Pretty unlikely - when `ColumnarToRowExec` has to be canonicalized on the executor side (see the stacktrace), but yes. it would fix a bug. Unittest was added. Closes #35058 from HyukjinKwon/SPARK-37779. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 195f1aaf4361fb8f5f31ef7f5c63464767ad88bd) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 30 December 2021, 03:40:07 UTC
906994f [SPARK-37705][SQL][3.2] Rebase timestamps in the session time zone saved in Parquet/Avro metadata ### What changes were proposed in this pull request? In the PR, I propose to add new metadata key `org.apache.spark.timeZone` which Spark writes to Parquet/Avro matadata while performing of datetimes rebase in the `LEGACY` mode (see the SQL configs: - `spark.sql.parquet.datetimeRebaseModeInWrite`, - `spark.sql.parquet.int96RebaseModeInWrite` and - `spark.sql.avro.datetimeRebaseModeInWrite`). The writers uses the current session time zone (see the SQL config `spark.sql.session.timeZone`) in rebasing of Parquet/Avro timestamp columns. At the reader side, Spark tries to get info about the writer's time zone from the new metadata property: ``` $ java -jar ~parquet-tools-1.12.0.jar meta ./part-00000-b0d90bf0-ce60-4b4f-b453-b33f61ab2b2a-c000.snappy.parquet ... extra: org.apache.spark.timeZone = America/Los_Angeles extra: org.apache.spark.legacyDateTime = ``` and use it in rebasing timestamps to the Proleptic Gregorian calendar. In the case when the reader cannot retrieve the original time zone from Parquet/Avro metadata, it uses the default JVM time zone for backward compatibility. ### Why are the changes needed? Before the changes, Spark assumes that a writer uses the default JVM time zone while rebasing of dates/timestamps. And if a reader and the writer have different JVM time zone settings, the reader cannot load such columns in the `LEGACY` mode correctly. So, the reader will have full info about writer settings after the changes. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, Parquet/Avro writers use the session time zone while timestamp rebasing in the `LEGACY` mode instead of the default JVM time zone. Need to highlight that the session time zone is set to the JVM time zone by default. ### How was this patch tested? 1. By running new tests: ``` $ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite" $ build/sbt "test:testOnly *ParquetRebaseDatetimeV2Suite" $ build/sbt "test:testOnly *AvroV1Suite" $ build/sbt "test:testOnly *AvroV2Suite" ``` 2. And related existing test suites: ``` $ build/sbt "test:testOnly *DateTimeUtilsSuite" $ build/sbt "test:testOnly *RebaseDateTimeSuite" $ build/sbt "test:testOnly *TimestampFormatterSuite" $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroCatalystDataConversionSuite" $ build/sbt "test:testOnly *AvroRowReaderSuite" $ build/sbt "test:testOnly *AvroSerdeSuite" $ build/sbt "test:testOnly *ParquetVectorizedSuite" ``` 3. Also modified the test `SPARK-31159: rebasing timestamps in write` to check loading timestamps in the LEGACY mode when the session time zone and JVM time zone are different. 4. Generated parquet files by Spark 3.2.0 (the commit https://github.com/apache/spark/commit/5d45a415f3a29898d92380380cfd82bfc7f579ea) using the test `"SPARK-31806: generate test files for checking compatibility with Spark 2.4"`. 
The parquet files don't contain info about the original time zone: ``` $ java -jar ~/Downloads/parquet-tools-1.12.0.jar meta sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v3_2_0.snappy.parquet file: file:/Users/maximgekk/proj/parquet-rebase-save-tz/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v3_2_0.snappy.parquet creator: parquet-mr version 1.12.1 (build 2a5c06c58fa987f85aa22170be14d927d5ff6e7d) extra: org.apache.spark.version = 3.2.0 extra: org.apache.spark.legacyINT96 = extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"dict","type":"timestamp","nullable":true,"metadata":{}},{"name":"plain","type":"timestamp","nullable":true,"metadata":{}}]} extra: org.apache.spark.legacyDateTime = file schema: spark_schema -------------------------------------------------------------------------------- dict: OPTIONAL INT64 L:TIMESTAMP(MICROS,true) R:0 D:1 plain: OPTIONAL INT64 L:TIMESTAMP(MICROS,true) R:0 D:1 ``` By running the test `"SPARK-31159, SPARK-37705: compatibility with Spark 2.4/3.2 in reading dates/timestamps"`, check loading of mixed parquet files generated by Spark 2.4.5/2.4.6 and 3.2.0/master. 5. Generated avro files by Spark 3.2.0 (the commit https://github.com/apache/spark/commit/5d45a415f3a29898d92380380cfd82bfc7f579ea) using the test `"SPARK-31855: generate test files for checking compatibility with Spark 2.4"`. The avro files don't contain info about the original time zone: ``` $ java -jar ~/Downloads/avro-tools-1.9.2.jar getmeta external/avro/src/test/resources/before_1582_timestamp_micros_v3_2_0.avro avro.schema {"type":"record","name":"topLevelRecord","fields":[{"name":"dt","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]}]} org.apache.spark.version 3.2.0 avro.codec snappy org.apache.spark.legacyDateTime ``` By running the test `"SPARK-31159, SPARK-37705: compatibility with Spark 2.4/3.2 in reading dates/timestamps"`, check loading of mixed avro files generated by Spark 2.4.5/2.4.6 and 3.2.0/master. Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Wenchen Fan <wenchendatabricks.com> (cherry picked from commit ef3a47038606ea426c15844b0400f5141acd5108) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #35042 from MaxGekk/parquet-rebase-save-tz-3.2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 29 December 2021, 00:52:05 UTC
01fff64 [SPARK-37659][UI][3.2] Fix FsHistoryProvider race condition between list and delet log info backport https://github.com/apache/spark/pull/34919 to branch-3.2 ### What changes were proposed in this pull request? Add lock between list and delet log info in `FsHistoryProvider.checkForLogs`. ### Why are the changes needed? After [SPARK-29043](https://issues.apache.org/jira/browse/SPARK-29043), `FsHistoryProvider` will list the log info without waitting all `mergeApplicationListing` task finished. However the `LevelDBIterator` of list log info is not thread safe if some other threads delete the related log info at same time. There is the error msg: ``` 21/12/15 14:12:02 ERROR FsHistoryProvider: Exception in checking for event log updates java.util.NoSuchElementException: 1^__main__^+hdfs://xxx/application_xxx.inprogress at org.apache.spark.util.kvstore.LevelDB.get(LevelDB.java:132) at org.apache.spark.util.kvstore.LevelDBIterator.next(LevelDBIterator.java:137) at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184) at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47) at scala.collection.TraversableLike.to(TraversableLike.scala:678) at scala.collection.TraversableLike.to$(TraversableLike.scala:675) at scala.collection.AbstractTraversable.to(Traversable.scala:108) at scala.collection.TraversableOnce.toList(TraversableOnce.scala:299) at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:299) at scala.collection.AbstractTraversable.toList(Traversable.scala:108) at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:588) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:299) ``` ### Does this PR introduce _any_ user-facing change? yes, bug fix ### How was this patch tested? manual test, after this patch the exception go away Closes #35003 from ulysses-you/SPARK-37659-3.2. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 December 2021, 01:53:52 UTC
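A minimal, self-contained sketch of the locking idea (the collection below stands in for the history server's KV store; these are not the real classes): listing log infos and deleting a log info take the same lock, so an in-progress listing never trips over an entry removed concurrently by the cleaner.
```scala
import scala.collection.mutable

object LogInfoRegistry {
  private val lock = new Object
  private val logInfos = mutable.LinkedHashMap[String, Long]() // path -> size

  def list(): Seq[(String, Long)] = lock.synchronized { logInfos.toSeq }
  def delete(path: String): Unit = lock.synchronized { logInfos.remove(path) }
}
```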
35d7c4a [MINOR][PYTHON][DOCS] Update pandas_pyspark.rst ### What changes were proposed in this pull request? * 'to'-> 'the' * 'it requires to create new default index in case' -> 'a new default index is created when' ### Why are the changes needed? Grammar fix ### Does this PR introduce _any_ user-facing change? Documentation fix ### How was this patch tested? No test needed Closes #34998 from kamelCased/patch-1. Authored-by: Kamel Gazzaz <60620606+kamelCased@users.noreply.github.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 072e7ca76f6a6145cf5af9a4435d7c4a841ee19d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 December 2021, 08:35:35 UTC
0888622 [SPARK-37302][BUILD][FOLLOWUP] Extract the versions of dependencies accurately ### What changes were proposed in this pull request? This PR changes `dev/test-dependencies.sh` to extract the versions of dependencies accurately. In the current implementation, the versions are extracted like as follows. ``` GUAVA_VERSION=`build/mvn help:evaluate -Dexpression=guava.version -q -DforceStdout` ``` But, if the output of the `mvn` command includes not only the version but also other messages like warnings, a following command referring the version will fail. ``` build/mvn dependency:get -Dartifact=com.google.guava:guava:${GUAVA_VERSION} -q ... [ERROR] Failed to execute goal org.apache.maven.plugins:maven-dependency-plugin:3.1.1:get (default-cli) on project spark-parent_2.12: Couldn't download artifact: org.eclipse.aether.resolution.DependencyResolutionException: com.google.guava:guava:jar:Falling was not found in https://maven-central.storage-download.googleapis.com/maven2/ during a previous attempt. This failure was cached in the local repository and resolution is not reattempted until the update interval of gcs-maven-central-mirror has elapsed or updates are forced -> [Help 1] ``` Actually, this causes the recent linter failure. https://github.com/apache/spark/runs/4623297663?check_suite_focus=true ### Why are the changes needed? To recover the CI. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually run `dev/test-dependencies.sh`. Closes #35006 from sarutak/followup-SPARK-37302. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit dd0decff5f1e95cedd8fe83de7e4449be57cb31c) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 24 December 2021, 02:30:05 UTC
662c6e8 [SPARK-37391][SQL] JdbcConnectionProvider tells if it modifies security context # branch-3.2 version! For master version see : https://github.com/apache/spark/pull/34745 Augments the JdbcConnectionProvider API such that a provider can indicate that it will need to modify the global security configuration when establishing a connection, and as such, if access to the global security configuration should be synchronized to prevent races. ### What changes were proposed in this pull request? As suggested by gaborgsomogyi [here](https://github.com/apache/spark/pull/29024/files#r755788709), augments the `JdbcConnectionProvider` API to include a `modifiesSecurityContext` method that can be used by `ConnectionProvider` to determine when `SecurityConfigurationLock.synchronized` is required to avoid race conditions when establishing a JDBC connection. ### Why are the changes needed? Provides a path forward for working around a significant bottleneck introduced by synchronizing `SecurityConfigurationLock` every time a connection is established. The synchronization isn't always needed and it should be at the discretion of the `JdbcConnectionProvider` to determine when locking is necessary. See [SPARK-37391](https://issues.apache.org/jira/browse/SPARK-37391) or [this thread](https://github.com/apache/spark/pull/29024/files#r754441783). ### Does this PR introduce _any_ user-facing change? Any existing implementations of `JdbcConnectionProvider` will need to add a definition of `modifiesSecurityContext`. I'm also open to adding a default implementation, but it seemed to me that requiring an explicit implementation of the method was preferable. A drop-in implementation that would continue the existing behavior is: ```scala override def modifiesSecurityContext( driver: Driver, options: Map[String, String] ): Boolean = true ``` ### How was this patch tested? Unit tests. Also ran a real workflow by swapping in a locally published version of `spark-sql` into my local spark 3.2.0 installation's jars. Closes #34989 from tdg5/SPARK-37391-opt-in-security-configuration-sync-branch-3.2. Authored-by: Danny Guinther <dguinther@seismic.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 24 December 2021, 01:47:24 UTC
d1cd110 [SPARK-37695][CORE][SHUFFLE] Skip diagnosis ob merged blocks from push-based shuffle ### What changes were proposed in this pull request? Skip diagnosis ob merged blocks from push-based shuffle ### Why are the changes needed? Because SPARK-36284 has not been addressed yet, skip it to suppress exceptions. ``` 21/12/19 18:46:37 WARN TaskSetManager: Lost task 166.0 in stage 1921.0 (TID 138855) (beta-spark5 executor 218): java.lang.AssertionError: assertion failed: Expected ShuffleBlockId, but got shuffleChunk_464_0_5645_0 at scala.Predef$.assert(Predef.scala:223) at org.apache.spark.storage.ShuffleBlockFetcherIterator.diagnoseCorruption(ShuffleBlockFetcherIterator.scala:1043) at org.apache.spark.storage.BufferReleasingInputStream.$anonfun$tryOrFetchFailedException$1(ShuffleBlockFetcherIterator.scala:1308) at scala.Option.map(Option.scala:230) at org.apache.spark.storage.BufferReleasingInputStream.tryOrFetchFailedException(ShuffleBlockFetcherIterator.scala:1307) at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:1293) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.<init>(UnsafeRowSerializer.scala:120) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.asKeyValueIterator(UnsafeRowSerializer.scala:110) at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$2(BlockStoreShuffleReader.scala:98) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.sort_addToSorter_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.smj_findNextJoinRows_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:778) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:136) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` ### Does this PR introduce _any_ user-facing change? Yes, suppress the exceptions. ### How was this patch tested? Run 1T TPCDS manually. Closes #34961 from pan3793/SPARK-37695. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: yi.wu <yi.wu@databricks.com> (cherry picked from commit 57ca75f3f0b2695434e47464b2201210edd58fde) Signed-off-by: yi.wu <yi.wu@databricks.com> 23 December 2021, 04:02:19 UTC
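A hypothetical sketch of the guard described above (`diagnose` stands in for the real diagnosis routine): only regular shuffle blocks are diagnosed; push-merged shuffle chunks are skipped until SPARK-36284 adds support for them.
```scala
import org.apache.spark.storage.{BlockId, ShuffleBlockId}

// Diagnose only ShuffleBlockIds; skip e.g. shuffleChunk_* ids from push-based shuffle.
def maybeDiagnoseCorruption(blockId: BlockId, diagnose: ShuffleBlockId => String): Option[String] =
  blockId match {
    case id: ShuffleBlockId => Some(diagnose(id))
    case _                  => None
  }
```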
3768734 [MINOR][DOCS][PYTHON] Fix docstring of SeriesGroupBy.nsmallest ### What changes were proposed in this pull request? Changes the `SeriesGroupBy.nsmallest` docstring to match the actual behavior. ### Why are the changes needed? The current description mentions columns, which are not applicable to `Series` objects and are not an argument of this variant. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing linter. Closes #34975 from zero323/SERIES-GROUP-BY-NSMALLEST-DOC. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 6278c367f219296ab658b0c3baefbebdff73f420) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 December 2021, 00:31:08 UTC
9249c56 [MINOR][SQL] Fix the typo in function names: crete ### What changes were proposed in this pull request? Fix the typo: crete -> create. ### Why are the changes needed? To improve code maintenance. Finding the functions by name should be easier after the changes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By compiling and running related test suites: ``` $ build/sbt "test:testOnly *ParquetRebaseDatetimeV2Suite" $ build/sbt "test:testOnly *AvroV1Suite" ``` Closes #34978 from MaxGekk/fix-typo-crete. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 72c278a4bb906cd7c500d223f80bc83e0f5c1ef0) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 22 December 2021, 00:29:14 UTC
669f6df [SPARK-37692][DOCS] Fix an example of mixed interval fields in the SQL migration guide ### What changes were proposed in this pull request? Change a wrong example in the SQL migration guide, and align it to the description of behavior change. <img width="1022" alt="Screenshot 2021-12-21 at 12 00 53" src="https://user-images.githubusercontent.com/1580697/146901704-c635720b-a8ca-4ea0-bc79-0220b83f544f.png"> ### Why are the changes needed? Current example is incorrect, and can confuse users. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the example: ```sql spark-sql> select INTERVAL 1 day 1 hour; 1 01:00:00.000000000 spark-sql> select INTERVAL 1 month 1 hour; Error in query: Cannot mix year-month and day-time fields: INTERVAL 1 month 1 hour(line 1, pos 7) == SQL == select INTERVAL 1 month 1 hour -------^^^ spark-sql> set spark.sql.legacy.interval.enabled=true; spark.sql.legacy.interval.enabled true spark-sql> select INTERVAL 1 month 1 hour; 1 months 1 hours ``` Closes #34969 from MaxGekk/fix-migr-guide-for-mixed-intervals. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit e2eaffec48e30ba95b7663c04a00970ae1b3941d) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 21 December 2021, 10:09:39 UTC
ebf7e61 [SPARK-37678][PYTHON] Fix _cleanup_and_return signature ### What changes were proposed in this pull request? This PR fixes return type annotation for `pandas.groupby.SeriesGroupBy._cleanup_and_return`. ### Why are the changes needed? Current annotation is incorrect (mixes pandas an pyspark.pandas types). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #34950 from zero323/SPARK-37678. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit 012939077627e4f35d9585c5a46281776b770190) Signed-off-by: Takuya UESHIN <ueshin@databricks.com> 20 December 2021, 22:15:17 UTC
63f09c1 [MINOR][PYTHON][DOCS] Fix documentation for Python's recentProgress & lastProgress ### What changes were proposed in this pull request? This small PR fixes incorrect documentation in the Structured Streaming Guide where Python's `recentProgress` & `lastProgress` were shown as functions although they are [properties](https://github.com/apache/spark/blob/master/python/pyspark/sql/streaming.py#L117), so calling them as functions raises an error: ``` >>> query.lastProgress() Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'dict' object is not callable ``` ### Why are the changes needed? The documentation was erroneous and needed to be fixed to avoid confusing readers. ### Does this PR introduce _any_ user-facing change? Yes, it is a documentation fix. ### How was this patch tested? Not necessary Closes #34947 from alexott/fix-python-recent-progress-docs. Authored-by: Alex Ott <alexott@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit bad96e6d029dff5be9efaf99f388cd9436741b6f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 19 December 2021, 08:02:52 UTC
255e82e [SPARK-37654][SQL] Fix NPE in Row.getSeq when field is Null ### What changes were proposed in this pull request? Fix NPE ``` scala> Row(null).getSeq(0) java.lang.NullPointerException at org.apache.spark.sql.Row.getSeq(Row.scala:319) at org.apache.spark.sql.Row.getSeq$(Row.scala:319) at org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:166) ``` ### Why are the changes needed? bug fixing ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new UT Closes #34928 from huaxingao/npe. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit fcf636d9eb8d645c24be3db2d599aba2d7e2955a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 17 December 2021, 04:43:34 UTC
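A small sketch of the intended behavior after the fix (the helper name is hypothetical; the real change is inside Row.getSeq itself): a null field yields null instead of a NullPointerException.
```scala
import org.apache.spark.sql.Row

// Null-safe access to a Seq-typed field.
def getSeqOrNull[T](row: Row, i: Int): Seq[T] =
  if (row.isNullAt(i)) null else row.getSeq[T](i)

// e.g. getSeqOrNull[Int](Row(null), 0) returns null rather than throwing.
```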
22eb1eb [SPARK-37060][CORE][3.2] Handle driver status response from backup masters ### What changes were proposed in this pull request? After an improvement in SPARK-31486, the client uses the 'asyncSendToMasterAndForwardReply' method instead of 'activeMasterEndpoint.askSync' to get the status of a driver. Since the driver's status is only available on the active master and the 'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we have to handle the responses from the backup masters in the client, which the SPARK-31486 change did not consider. So drivers running in cluster mode on a cluster with multiple masters are affected by this bug. ### Why are the changes needed? The client needs to detect responses that come from a backup master and ignore them. ### Does this PR introduce _any_ user-facing change? No, it only fixes a bug and brings back the ability to deploy in cluster mode on multi-master clusters. ### How was this patch tested? Closes #34912 from mohamadrezarostami/fix-bug-in-report-driver-status-in-multi-master. Authored-by: Mohamadreza Rostami <mohamadrezarostami2@gmail.com> Signed-off-by: yi.wu <yi.wu@databricks.com> 16 December 2021, 07:19:39 UTC
6b2b6ea [SPARK-37656][BUILD] Upgrade SBT to 1.5.7 This PR upgrades SBT to `1.5.7`. SBT 1.5.7 was released a few hours ago, which includes a fix for CVE-2021-45046. Spark doesn't use Log4J 2.x for now, but it's to avoid potential vulnerability just in case. No. GA and AppVayor. Closes #34915 from sarutak/upgrade-sbt-1.5.7. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 972489a5708ee4ff89bd2e93aed80020cc2fb8d9) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 December 2021, 06:25:26 UTC
49b830c [SPARK-37633][SQL] Unwrap cast should skip if downcast failed with an… ### What changes were proposed in this pull request? Use a non-ANSI cast when applying the `UnwrapCastInBinaryComparison` rule. ### Why are the changes needed? Since `UnwrapCastInBinaryComparison` is an optimizer rule, it should not fail the query when a cast fails. ### Does this PR introduce _any_ user-facing change? With `spark.sql.ansi.enabled=true`, queries no longer fail when the downcast fails while applying the `UnwrapCastInBinaryComparison` rule. ### How was this patch tested? Updated UT. Closes #34888 from manuzhang/spark-37633. Authored-by: manuzhang <tianlzhang@ebay.com> Signed-off-by: Chao Sun <sunchao@apple.com> (cherry picked from commit 71d4b277f742ba0d487a26082f2ea4b342c498cd) Signed-off-by: Chao Sun <sunchao@apple.com> 15 December 2021, 19:32:16 UTC
50af717 Revert "[SPARK-37575][SQL] null values should be saved as nothing rather than quoted empty Strings "" by default settings" This reverts commit 62e4202b65d76b05f9f9a15819a631524c6e7985. 15 December 2021, 03:54:28 UTC
9cd64f8 [SPARK-37217][SQL][3.2] The number of dynamic partitions should early check when writing to external tables ### What changes were proposed in this pull request? SPARK-29295 introduced a mechanism where writes to external tables with dynamic partitions delete the data in the target partitions first. Assuming that 1001 partitions are written, the data of those 1001 partitions will be deleted first, but because `hive.exec.max.dynamic.partitions` is 1000 by default, loadDynamicPartitions will then fail, even though the data of the 1001 partitions has already been deleted. So we can check whether the number of dynamic partitions is greater than `hive.exec.max.dynamic.partitions` before deleting, so that the write fails fast; a sketch of the check follows below. ### Why are the changes needed? Avoid losing data that cannot be recovered when the job fails. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #34889 from cxzl25/SPARK-37217-3.2. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Chao Sun <sunchao@apple.com> 14 December 2021, 18:18:53 UTC
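An illustrative sketch of the early validation (the method name, message wording, and exception type are hypothetical): fail before any target-partition data is deleted when the number of dynamic partitions to write exceeds the Hive limit.
```scala
// Validate the dynamic-partition count before any destructive work happens.
def assertWithinDynamicPartitionLimit(numParts: Int, maxDynamicPartitions: Int): Unit = {
  if (numParts > maxDynamicPartitions) {
    throw new IllegalArgumentException(
      s"Number of dynamic partitions created is $numParts, which is more than " +
        s"hive.exec.max.dynamic.partitions=$maxDynamicPartitions; " +
        "raise the limit or reduce the number of partitions written")
  }
}
```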
62e4202 [SPARK-37575][SQL] null values should be saved as nothing rather than quoted empty Strings "" by default settings ### What changes were proposed in this pull request? Fix the bug that null values are saved as quoted empty strings "" (as the same as empty strings) rather than nothing by default csv settings since Spark 2.4. ### Why are the changes needed? This is an unexpected bug, if don't fix it, we still can't distinguish null values and empty strings in saved csv files. As mentioned in [spark sql migration guide](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24)(2.3=>2.4), empty strings are saved as quoted empty string "", null values as saved as nothing since Spark 2.4. > Since Spark 2.4, empty strings are saved as quoted empty strings "". In version 2.3 and earlier, empty strings are equal to null values and do not reflect to any characters in saved CSV files. For example, the row of "a", null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as a,,"",1. To restore the previous behavior, set the CSV option emptyValue to empty (not quoted) string. But actually, we found that null values are also saved as quoted empty strings "" as the same as empty strings. For codes follows: ```scala Seq(("Tesla", null.asInstanceOf[String], "")) .toDF("make", "comment", "blank") .coalesce(1) .write.csv(path) ``` actual results: >Tesla,"","" expected results: >Tesla,,"" ### Does this PR introduce _any_ user-facing change? Yes, if this bug has been fixed, the output of null values would been changed to nothing rather than quoted empty strings "". But, users can set nullValue to "\\"\\""(same as emptyValueInWrite's default value) to restore the previous behavior since 2.4. ### How was this patch tested? Adding a test case. Closes #34853 from wayneguow/SPARK-37575. Lead-authored-by: wayneguow <guow93@gmail.com> Co-authored-by: Wayne Guo <guow93@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 6a59fba248359fb2614837fe8781dc63ac8fdc4c) Signed-off-by: Max Gekk <max.gekk@gmail.com> 14 December 2021, 08:26:47 UTC
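A small spark-shell sketch of the write-side behavior discussed above (the output path is hypothetical): with the fix, nulls are written as nothing and empty strings as "", and both renderings remain configurable through the CSV options.
```scala
Seq(("Tesla", null.asInstanceOf[String], ""))
  .toDF("make", "comment", "blank")
  .coalesce(1)
  .write
  .option("nullValue", "")        // nulls -> nothing (mirrors the fixed default)
  .option("emptyValue", "\"\"")   // empty strings -> quoted ""
  .csv("/tmp/csv-null-vs-empty")
// expected single data row: Tesla,,""
```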
e16d3f9 [SPARK-37577][SQL][3.2] Fix ClassCastException: ArrayType cannot be cast to StructType for Generate Pruning ### What changes were proposed in this pull request? This patch fixes a bug in nested column pruning for Generate. The bug happens after we replace the extractor's reference attribute with the generator's input: when we then try to replace a top extractor (e.g. `GetStructField`) whose replaced child expression is of array type, `extractFieldName` fails by throwing a `ClassCastException`. This is a backport of #34845 to branch-3.2. ### Why are the changes needed? In particular, when we transform the extractor and replace it with the generator's input, we cannot simply do a transform up because if we replace the attribute, `extractFieldName` could throw a `ClassCastException`. We need to get the field name before replacing down the attribute/other extractor. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added test. Closes #34885 from viirya/SPARK-37577-3.2. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 December 2021, 04:23:27 UTC
35942fc [SPARK-37624][PYTHON][DOCS] Suppress warnings for live pandas-on-Spark quickstart notebooks This PR proposes to suppress warnings, in live pandas-on-Spark quickstart notebooks, as below: <img width="1109" alt="Screen Shot 2021-12-13 at 2 02 45 PM" src="https://user-images.githubusercontent.com/6477701/145756407-03b9d94e-a082-42a1-8052-d14b4989ae86.png"> <img width="1103" alt="Screen Shot 2021-12-13 at 2 02 25 PM" src="https://user-images.githubusercontent.com/6477701/145756419-b5dfa729-96fa-4646-bb26-40302d33632b.png"> <img width="1100" alt="Screen Shot 2021-12-13 at 2 02 20 PM" src="https://user-images.githubusercontent.com/6477701/145756420-b8e1b105-495b-4b1c-ab3c-4d47793ba80e.png"> <img width="1027" alt="Screen Shot 2021-12-13 at 1 32 05 PM" src="https://user-images.githubusercontent.com/6477701/145756424-bf93a4a1-2587-49fb-9abb-e2f6de032e48.png"> This is a user-facing quickstart that is interactive shall, and showing a lot of warnings makes difficult to follow the output. Note that we also set a lower log4j level to interpreters. Users will see clean output in live quickstart notebook: https://mybinder.org/v2/gh/apache/spark/9e614e265f?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb I manually tested at https://mybinder.org/v2/gh/HyukjinKwon/spark/cleanup-ps-quickstart?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb Closes #34875 from HyukjinKwon/cleanup-ps-quickstart. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 195da5623c36ce54e85bc6584f7b49107899ffff) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 December 2021, 10:22:00 UTC
f20c408 [SPARK-37481][CORE][WEBUI] Fix disappearance of skipped stages after they retry ### What changes were proposed in this pull request? When skipped stages retry, their skipped info will be lost on the UI, and then we may see a stage with 200 tasks indeed, shows that it only has 3 tasks but its `retry 1` has 15 tasks and completely different inputs/outputs. A simple way to reproduce, ``` bin/spark-sql --packages com.github.yaooqinn:itachi_2.12:0.3.0 ``` and run ``` select * from (select v from (values (1), (2), (3) t(v))) t1 join (select stage_id_with_retry(3) from (select v from values (1), (2), (3) t(v) group by v)) t2; ``` Also, Detailed in the Gist here - https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c In this PR, we increase the nextAttempIds of these skipped stages once they get visited. ### Why are the changes needed? fix problems when we have skipped stage retries. ### Does this PR introduce _any_ user-facing change? Yes, the UI will keep the skipped stages info ![image](https://user-images.githubusercontent.com/8326978/144010378-02a688ce-0ead-4c41-ab9b-bc5fce4f8b90.png) ### How was this patch tested? manually as recorded in https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c existing tests Closes #34735 from yaooqinn/SPARK-37481. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: yi.wu <yi.wu@databricks.com> (cherry picked from commit 5880df41f50210f2df44f25640437d99b8979d70) Signed-off-by: yi.wu <yi.wu@databricks.com> 13 December 2021, 02:41:17 UTC
6b5ae04 [SPARK-37615][BUILD] Upgrade SBT to 1.5.6 This PR aims to upgrade SBT to 1.5.6. - https://github.com/sbt/sbt/releases/tag/v1.5.6 SBT 1.5.6 updates log4j 2 to 2.15.0, which fixes remote code execution vulnerability (CVE-2021-44228) No. This only affects build servers. Pass the CIs. Closes #34869 from williamhyun/sbt156. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 119da4e6185f234cce00ccfd1b27c6d28e9b249a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 December 2021, 05:27:52 UTC
0bd2dab [SPARK-37594][SPARK-34399][SQL][TESTS] Make UT more stable ### What changes were proposed in this pull request? There are some GA test failed caused by UT ` test("SPARK-34399: Add job commit duration metrics for DataWritingCommand") ` such as ``` sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 0 was not greater than 0 at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) at org.apache.spark.sql.execution.metric.SQLMetricsSuite.$anonfun$new$87(SQLMetricsSuite.scala:810) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) at org.apache.spark.sql.test.SQLTestUtilsBase.withTable(SQLTestUtils.scala:305) at org.apache.spark.sql.test.SQLTestUtilsBase.withTable$(SQLTestUtils.scala:303) at org.apache.spark.sql.execution.metric.SQLMetricsSuite.withTable(SQLMetricsSuite.scala:44) at org.apache.spark.sql.execution.metric.SQLMetricsSuite.$anonfun$new$86(SQLMetricsSuite.scala:800) at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) at org.apache.spark.sql.execution.metric.SQLMetricsSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(SQLMetricsSuite.scala:44) at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:246) at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:244) at org.apache.spark.sql.execution.metric.SQLMetricsSuite.withSQLConf(SQLMetricsSuite.scala:44) at org.apache.spark.sql.execution.metric.SQLMetricsSuite.$anonfun$new$85(SQLMetricsSuite.scala:800) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) at org.apache.spark.sql.execution.adaptive.DisableAdaptiveExecutionSuite.$anonfun$test$5(AdaptiveTestUtils.scala:65) at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf(SQLHelper.scala:54) at org.apache.spark.sql.catalyst.plans.SQLHelper.withSQLConf$(SQLHelper.scala:38) at org.apache.spark.sql.execution.metric.SQLMetricsSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(SQLMetricsSuite.scala:44) at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:246) at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:244) at org.apache.spark.sql.execution.metric.SQLMetricsSuite.withSQLConf(SQLMetricsSuite.scala:44) at org.apache.spark.sql.execution.adaptive.DisableAdaptiveExecutionSuite.$anonfun$test$4(AdaptiveTestUtils.scala:65) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) at 
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) ``` This PR adds a small job commit delay and task commit delay to make the test more stable. ### Why are the changes needed? Make the unit test more stable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs Closes #34847 from AngersZhuuuu/SPARK-37594. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 471a5b55b80256ccd253c93623ff363add5f1985) Signed-off-by: Sean Owen <srowen@gmail.com> 10 December 2021, 16:53:45 UTC
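A minimal sketch of how a commit delay can keep such a metric assertion above zero, assuming a test-only commit protocol plugged in through `spark.sql.sources.commitProtocolClass`; the class name, sleep durations, and config usage are illustrative and not necessarily how the patch implements the delay.

```scala
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Hypothetical test-only protocol: slow down task/job commit so the
// recorded commit duration metrics are guaranteed to be greater than zero.
class SlowCommitProtocol(jobId: String, path: String)
  extends HadoopMapReduceCommitProtocol(jobId, path) {

  override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage = {
    Thread.sleep(100) // make the task-commit duration measurable
    super.commitTask(taskContext)
  }

  override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
    Thread.sleep(100) // make the job-commit duration measurable
    super.commitJob(jobContext, taskCommits)
  }
}

// The test could then enable it for the write under inspection, e.g.:
// spark.conf.set("spark.sql.sources.commitProtocolClass", classOf[SlowCommitProtocol].getName)
```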
1890348 [SPARK-37451][SQL] Fix cast string type to decimal type if spark.sql.legacy.allowNegativeScaleOfDecimal is enabled ### What changes were proposed in this pull request? Fix casting string type to decimal type when `spark.sql.legacy.allowNegativeScaleOfDecimal` is enabled. For example: ```scala import org.apache.spark.sql.types._ import org.apache.spark.sql.Row import org.apache.spark.sql.functions.col spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", true) val data = Seq(Row("7.836725755512218E38")) val schema = StructType(Array(StructField("a", StringType, false))) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.select(col("a").cast(DecimalType(37, -17))).show ``` The result has been null since [SPARK-32706](https://issues.apache.org/jira/browse/SPARK-32706). ### Why are the changes needed? Fixes a regression bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #34811 from wangyum/SPARK-37451. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a1214a98f4b3f7715fb984ad3df514471b1e33c7) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 December 2021, 09:08:46 UTC
59380fe [SPARK-37392][SQL] Fix the performance bug when inferring constraints for Generate ### What changes were proposed in this pull request? This is a performance regression since Spark 3.1, caused by https://issues.apache.org/jira/browse/SPARK-32295 If you run the query in the JIRA ticket ``` Seq( (1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x") ).toDF() .checkpoint() // or save and reload to truncate lineage .createOrReplaceTempView("sub") session.sql(""" SELECT * FROM ( SELECT EXPLODE( ARRAY( * ) ) result FROM ( SELECT _1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u FROM sub ) ) WHERE result != '' """).show() ``` You will hit OOM. The reason is that: 1. We infer additional predicates with `Generate`. In this case, it's `size(array(cast(_1#21 as string), _2#22, _3#23, ...) > 0` 2. Because of the cast, the `ConstantFolding` rule can't optimize this `size(array(...))`. 3. We end up with a plan containing this part ``` +- Project [_1#21 AS a#106, _2#22 AS b#107, _3#23 AS c#108, _4#24 AS d#109, _5#25 AS e#110, _6#26 AS f#111, _7#27 AS g#112, _8#28 AS h#113, _9#29 AS i#114, _10#30 AS j#115, _11#31 AS k#116, _12#32 AS l#117, _13#33 AS m#118, _14#34 AS n#119, _15#35 AS o#120, _16#36 AS p#121, _17#37 AS q#122, _18#38 AS r#123, _19#39 AS s#124, _20#40 AS t#125, _21#41 AS u#126] +- Filter (size(array(cast(_1#21 as string), _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41), true) > 0) +- LogicalRDD [_1#21, _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41] ``` When calculating the constraints of the `Project`, we generate around 2^20 expressions, due to this code ``` var allConstraints = child.constraints projectList.foreach { case a Alias(l: Literal, _) => allConstraints += EqualNullSafe(a.toAttribute, l) case a Alias(e, _) => // For every alias in `projectList`, replace the reference in constraints by its attribute. allConstraints ++= allConstraints.map(_ transform { case expr: Expression if expr.semanticEquals(e) => a.toAttribute }) allConstraints += EqualNullSafe(e, a.toAttribute) case _ => // Don't change. } ``` There are 3 issues here: 1. We may infer complicated predicates from `Generate` 2. `ConstanFolding` rule is too conservative. At least `Cast` has no side effect with ANSI-off. 3. When calculating constraints, we should have a upper bound to avoid generating too many expressions. This fixes the first 2 issues, and leaves the third one for the future. ### Why are the changes needed? fix a performance issue ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests, and run the query in JIRA ticket locally. Closes #34823 from cloud-fan/perf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1fac7a9d9992b7c120f325cdfa6a935b52c7f3bc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 December 2021, 05:04:56 UTC
fd7bd3e Preparing development version 3.2.2-SNAPSHOT 07 December 2021, 22:29:06 UTC
e9d2a27 Preparing Spark release v3.2.1-rc1 07 December 2021, 22:28:57 UTC
ce414f8 [SPARK-37556][SQL] Deserializing the void class fails with Java serialization **What changes were proposed in this pull request?** Change the deserialization mapping for the primitive type void. **Why are the changes needed?** The void primitive type in Scala should be classOf[Unit], not classOf[Void]. Spark erroneously [maps it](https://github.com/apache/spark/blob/v3.2.0/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala#L80) differently from all other primitive types. Here is the code: ``` private object JavaDeserializationStream { val primitiveMappings = Map[String, Class[_]]( "boolean" -> classOf[Boolean], "byte" -> classOf[Byte], "char" -> classOf[Char], "short" -> classOf[Short], "int" -> classOf[Int], "long" -> classOf[Long], "float" -> classOf[Float], "double" -> classOf[Double], "void" -> classOf[Void] ) } ``` Here is a demonstration in the Scala REPL: ``` scala> classOf[Long] val res0: Class[Long] = long scala> classOf[Double] val res1: Class[Double] = double scala> classOf[Byte] val res2: Class[Byte] = byte scala> classOf[Void] val res3: Class[Void] = class java.lang.Void <--- this is wrong scala> classOf[Unit] val res4: Class[Unit] = void <---- this is right ``` It results in a Spark deserialization error if the Spark code contains the void primitive type: `java.io.InvalidClassException: java.lang.Void; local class name incompatible with stream class name "void"` **Does this PR introduce any user-facing change?** no **How was this patch tested?** Changed a test, and also tested end-to-end with code that previously hit the deserialization error; it passes now. Closes #34816 from daijyc/voidtype. Authored-by: Daniel Dai <jdai@pinterest.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit fb40c0e19f84f2de9a3d69d809e9e4031f76ef90) Signed-off-by: Sean Owen <srowen@gmail.com> 07 December 2021, 14:51:59 UTC
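Based on the snippet quoted above, the corrected mapping would look like the sketch below, with only the `void` entry changing; this mirrors the description rather than quoting the patch verbatim.

```scala
// Scala's primitive "void" maps to classOf[Unit] (the primitive void class),
// not classOf[Void] (java.lang.Void), matching how the other primitives map.
private object JavaDeserializationStream {
  val primitiveMappings = Map[String, Class[_]](
    "boolean" -> classOf[Boolean],
    "byte" -> classOf[Byte],
    "char" -> classOf[Char],
    "short" -> classOf[Short],
    "int" -> classOf[Int],
    "long" -> classOf[Long],
    "float" -> classOf[Float],
    "double" -> classOf[Double],
    "void" -> classOf[Unit])
}
```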
9a17d8b [SPARK-37004][PYTHON] Upgrade to Py4J 0.10.9.3 This PR upgrades Py4J from 0.10.9.2 to 0.10.9.3 which contains the bug fix (https://github.com/bartdag/py4j/pull/440) that directly affected us. For example, once you cancel a cell in Jupyter, all following cells simply fail. This PR fixes the bug by upgrading Py4J. To fix a regression in Spark 3.2.0 in notebooks like Jupyter. Fixes a regression described in SPARK-37004 Manually tested the fix when I land https://github.com/bartdag/py4j/pull/440 to Py4J. Closes #34814 from HyukjinKwon/SPARK-37004. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 72669b574ecbcfd35873aaf751807c90bb415c8f) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 06 December 2021, 08:35:33 UTC
449c330 [SPARK-37534][BUILD][3.2] Bump dev.ludovic.netlib to 2.2.1 ### What changes were proposed in this pull request? Bump the version of dev.ludovic.netlib from 2.2.0 to 2.2.1. ### Why are the changes needed? This fixes a computation bug in sgemm. See [the issue](https://github.com/luhenry/netlib/issues/7) and [the diff](https://github.com/luhenry/netlib/compare/v2.2.0...v2.2.1). ### Does this PR introduce _any_ user-facing change? No. It fixes the underlying computation of a single floating point matrix-matrix multiplication. ### How was this patch tested? Additional tests were added to the dev.ludovic.netlib project, and a CI run has validated the change. Closes #34797 from luhenry/branch-3.2-bump-netlib. Authored-by: Ludovic Henry <git@ludovic.dev> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 03 December 2021, 16:29:56 UTC
805db87 [SPARK-37471][SQL] spark-sql support `;` in nested bracketed comment ### What changes were proposed in this pull request? In current spark-sql, when use -e and -f, it can't support nested bracketed comment such as ``` /* SELECT /*+ BROADCAST(b) */ 4; */ SELECT 1 ; ``` When run `spark-sql -f` with `--verbose` got below error ``` park master: yarn, Application Id: application_1632999510150_6968442 /* sielect /* BROADCAST(b) */ 4 Error in query: mismatched input '4' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 30) == SQL == /* sielect /* BROADCAST(b) */ 4 ------------------------------^^^ ``` In current code ``` else if (line.charAt(index) == '/' && !insideSimpleComment) { val hasNext = index + 1 < line.length if (insideSingleQuote || insideDoubleQuote) { // Ignores '/' in any case of quotes } else if (insideBracketedComment && line.charAt(index - 1) == '*' ) { // Decrements `bracketedCommentLevel` at the beginning of the next loop leavingBracketedComment = true } else if (hasNext && !insideBracketedComment && line.charAt(index + 1) == '*') { bracketedCommentLevel += 1 } } ``` If it meet an `*/`, it will mark `leavingBracketedComment` as true, then when call next char, bracketed comment level -1. ``` if (leavingBracketedComment) { bracketedCommentLevel -= 1 leavingBracketedComment = false } ``` But when meet `/*`, it need `!insideBracketedComment`, then means if we have a case ``` /* aaa /* bbb */ ; ccc */ select 1; ``` when meet second `/*` , `insideBracketedComment` is true, so this `/*` won't be treat as a start of bracket comment. Then meet the first `*/`, bracketed comment end, this query is split as ``` /* aaa /* bbb */; => comment ccc */ select 1; => query ``` Then query failed. So here we remove the condition of `!insideBracketedComment`, then we can have `bracketedCommentLevel > 1` and since ``` def insideBracketedComment: Boolean = bracketedCommentLevel > 0 ``` So chars between all level of bracket are treated as comment. ### Why are the changes needed? In spark #37389 we support nested bracketed comment in SQL, here for spark-sql we should support too. ### Does this PR introduce _any_ user-facing change? User can use nested bracketed comment in spark-sql ### How was this patch tested? Since spark-sql console mode have special logic about handle `;` ``` while (line != null) { if (!line.startsWith("--")) { if (prefix.nonEmpty) { prefix += '\n' } if (line.trim().endsWith(";") && !line.trim().endsWith("\\;")) { line = prefix + line ret = cli.processLine(line, true) prefix = "" currentPrompt = promptWithCurrentDB } else { prefix = prefix + line currentPrompt = continuedPromptWithDBSpaces } } line = reader.readLine(currentPrompt + "> ") } ``` If we write sql as below ``` /* SELECT /*+ BROADCAST(b) */ 4\\; */ SELECT 1 ; ``` the `\\;` is escaped. Manuel test wit spark-sql -f ``` (spark.submit.pyFiles,) (spark.submit.deployMode,client) (spark.master,local[*]) Classpath elements: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 
21/11/26 16:32:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 21/11/26 16:32:10 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 21/11/26 16:32:10 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist 21/11/26 16:32:13 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0 21/11/26 16:32:13 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore yi.zhu10.12.189.175 Spark master: local[*], Application Id: local-1637915529831 /* select /* BROADCAST(b) */ 4; */ select 1 1 Time taken: 3.851 seconds, Fetched 1 row(s) C02D45VVMD6T:spark yi.zhu$ ``` With the current PR, an uncompleted bracketed comment is no longer executed. For the SQL file ``` /* select /* BROADCAST(b) */ 4; */ select 1 ; /* select /* braoad */ ; select 1; ``` it only executes ``` /* select /* BROADCAST(b) */ 4; */ select 1 ; ``` The next part ``` /* select /* braoad */ ; select 1; ``` is still treated as in-progress SQL. Closes #34721 from AngersZhuuuu/SPARK-37471. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6e1912590dcdad56d82b4fe1a5ae0b62560a1a08) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 December 2021, 16:20:07 UTC
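A minimal, self-contained sketch of the nesting-level idea described above; it ignores quotes and simple (`--`) comments, and it is an illustration rather than the actual SparkSQLCLIDriver code.

```scala
// Track the nesting level of bracketed comments so that a ';' inside any level
// of /* ... */ is not treated as a statement terminator.
def insideBracketedComment(sql: String, pos: Int): Boolean = {
  var level = 0
  var i = 0
  while (i < pos) {
    if (i + 1 < sql.length && sql.charAt(i) == '/' && sql.charAt(i + 1) == '*') {
      level += 1; i += 2          // every "/*" opens a level, even when nested
    } else if (i + 1 < sql.length && sql.charAt(i) == '*' && sql.charAt(i + 1) == '/') {
      if (level > 0) level -= 1   // every "*/" closes one level
      i += 2
    } else {
      i += 1
    }
  }
  level > 0
}

val sql = "/* aaa /* bbb */ ; ccc */ select 1;"
// The first ';' sits inside the outer comment, so it must not split the statement.
println(insideBracketedComment(sql, sql.indexOf(';')))  // true
```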
170943b [SPARK-37524][SQL] We should drop all tables after testing dynamic partition pruning ### What changes were proposed in this pull request? Drop all tables after testing dynamic partition pruning. ### Why are the changes needed? We should drop all tables after testing dynamic partition pruning. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests Closes #34768 from weixiuli/SPARK-11150-fix. Authored-by: weixiuli <weixiuli@jd.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 2433c942ca39b948efe804aeab0185a3f37f3eea) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 03 December 2021, 03:34:53 UTC
753e7c2 [SPARK-37522][PYTHON][TESTS] Fix MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction This PR aims to update a PySpark unit test case by increasing the tolerance by `10%` from `0.1` to `0.11`. ``` $ java -version openjdk version "17.0.1" 2021-10-19 LTS OpenJDK Runtime Environment Zulu17.30+15-CA (build 17.0.1+12-LTS) OpenJDK 64-Bit Server VM Zulu17.30+15-CA (build 17.0.1+12-LTS, mixed mode, sharing) $ build/sbt test:package $ python/run-tests --testname 'pyspark.ml.tests.test_algorithms MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction' --python-executables=python3 ... ====================================================================== FAIL: test_raw_and_probability_prediction (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/ml/tests/test_algorithms.py", line 104, in test_raw_and_probability_prediction self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, rtol=0.102)) AssertionError: False is not true ---------------------------------------------------------------------- Ran 1 test in 7.385s FAILED (failures=1) ``` No. Manually on native AppleSilicon Java 17. Closes #34784 from dongjoon-hyun/SPARK-37522. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit eec9fecf5e3c1d0631caed427d4d468f44f98de9) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 02 December 2021, 18:31:31 UTC
fa67b5b [SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast join failures with AQE enabled Immediately materialize underlying rdd cache (using .count) for an InMemoryRelation when `buildBuffers` is called. Currently, when `CachedRDDBuilder.buildBuffers` is called, `InMemoryRelation.computeStats` will try to read the accumulators to determine what the relation size is. However, the accumulators are not actually accurate until the cachedRDD is executed and finishes. While this has not happened, the accumulators will report a range from 0 bytes to the accumulator value when the cachedRDD finishes. In AQE, join planning can happen during this time and, if it reads the size as 0 bytes, will likely plan a broadcast join mistakenly believing the build side is very small. If the InMemoryRelation is actually very large in size, then this will cause many issues during execution such as job failure due to broadcasting over 8GB. Yes. Before, cache materialization doesn't happen until the job starts to run. Now, it happens when trying to get the rdd representing an InMemoryRelation. Tests added Closes #34684 from ChenMichael/SPARK-37442-InMemoryRelation-statistics-inaccurate-during-join-planning. Authored-by: Michael Chen <mike.chen@workday.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c37b726bd09d34e1115a8af1969485e60dc02592) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 December 2021, 14:34:51 UTC
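A hedged illustration of the failure mode (table sizes and names are made up): before the fix, planning a join against a freshly cached relation under AQE could read its size as roughly 0 bytes and broadcast the wrong side. Forcing materialization, as the fix now does internally, makes the statistics accurate; an explicit `count()` is also a user-level workaround on older releases.

```scala
val big = spark.range(0, 100000000L).toDF("id")  // hypothetical large build side
big.cache()
big.count()   // materialize the cache so computeStats reflects the real size
val small = spark.range(0, 10L).toDF("id")
// With accurate statistics, AQE no longer plans a broadcast of the large,
// cached relation based on a premature 0-byte size estimate.
small.join(big, "id").explain()
```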
87af2fd [SPARK-37480][K8S][DOC][3.2] Sync Kubernetes configuration to latest in running-on-k8s.md ### What changes were proposed in this pull request? Sync Kubernetes configurations to 3.2.0 in doc ### Why are the changes needed? Configurations in docs/running-on-kubernetes.md are not uptodate ### Does this PR introduce _any_ user-facing change? No, docs only ### How was this patch tested? CI passed Closes #34770 from Yikun/k8s-doc-32. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 December 2021, 17:01:05 UTC
46fc98c [SPARK-37460][DOCS] Add the description of ALTER DATABASE SET LOCATION ### What changes were proposed in this pull request? Added the description of `ALTER DATABASE SET LOCATION` command in `sql-ref-syntax-ddl-alter-database.md` ### Why are the changes needed? This command can be used but not documented anywhere. ### Does this PR introduce _any_ user-facing change? Yes, docs changes. ### How was this patch tested? `SKIP_API=1 bundle exec jekyll build` <img width="395" alt="alterdb1" src="https://user-images.githubusercontent.com/87687356/143523751-4cc32ad1-390c-491b-a6ea-bf1664535c28.png"> <img width="396" alt="alterdb2" src="https://user-images.githubusercontent.com/87687356/143523757-aa741d74-0e51-4e17-8768-9da7eb86a7d8.png"> Closes #34718 from yutoacts/SPARK-37460. Authored-by: Yuto Akutsu <yuto.akutsu@oss.nttdata.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5924961e4ac308d3910b8430c1e72dea275073a9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 December 2021, 12:14:09 UTC
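A hedged usage sketch of the newly documented command; the database name and location are placeholders.

```scala
// ALTER DATABASE ... SET LOCATION changes the default parent directory for
// tables created in the database afterwards (existing tables are not moved).
spark.sql("CREATE DATABASE IF NOT EXISTS inventory")
spark.sql("ALTER DATABASE inventory SET LOCATION 'hdfs://nn/warehouse/new_inventory_location'")
spark.sql("DESCRIBE DATABASE EXTENDED inventory").show(truncate = false)
```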
ad5ac3a [SPARK-37389][SQL][FOLLOWUP] SET command should not parse comments ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/34668 , to fix a breaking change. The SET command uses wildcards whose values may contain what looks like an unclosed comment, e.g. `/path/to/*`, and we shouldn't fail it. This PR fixes it by skipping the unclosed comment check when parsing a SET command. ### Why are the changes needed? Fix a breaking change. ### Does this PR introduce _any_ user-facing change? no, the breaking change is not released yet. ### How was this patch tested? new tests Closes #34763 from cloud-fan/set. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit eaa135870a30fb89c2f1087991328a6f72a1860c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 December 2021, 08:34:06 UTC
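An illustration of the case being protected (the config key and path are made up): a SET value containing `/*` as part of a path wildcard must not be rejected as an unclosed bracketed comment.

```scala
spark.sql("SET myapp.input.path=/data/2021/*")          // value contains "/*" but is not a comment
spark.sql("SET myapp.input.path").show(truncate = false) // reads the value back
```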
c836e93 [SPARK-37513][SQL][DOC] date +/- interval with only day-time fields returns different data type between Spark 3.2 and Spark 3.1 ### What changes were proposed in this pull request? The SQL shown below previously returned the date type, and now returns the timestamp type. `select date '2011-11-11' + interval 12 hours;` `select date '2011-11-11' - interval 12 hours;` The basic reason is: In Spark 3.1 https://github.com/apache/spark/blob/75cac1fe0a46dbdf2ad5b741a3a49c9ab618cdce/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L338 In Spark 3.2 https://github.com/apache/spark/blob/ceae41ba5cafb479cdcfc9a6a162945646a68f05/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L376 Because Spark 3.2 has already been released, we add a migration guide entry for it. ### Why are the changes needed? Document the behavior difference between Spark 3.1 and Spark 3.2 in the migration guide. ### Does this PR introduce _any_ user-facing change? No. Just modifies the docs. ### How was this patch tested? No need. Closes #34766 from beliefer/SPARK-37513. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ec47c3c4394b2410a277e7f7105cf896c28b2ed4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 December 2021, 08:20:07 UTC
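A small check that makes the documented difference visible; the output comments reflect the description above.

```scala
val df = spark.sql("select date '2011-11-11' + interval 12 hours as ts")
df.printSchema()  // Spark 3.2: ts is timestamp; Spark 3.1 returned date
df.show(truncate = false)
```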
ceae41b [SPARK-37497][K8S] Promote `ExecutorPods[PollingSnapshot|WatchSnapshot]Source` to DeveloperApi ### What changes were proposed in this pull request? This PR aims to promote `ExecutorPodsWatchSnapshotSource` and `ExecutorPodsPollingSnapshotSource` as **stable** `DeveloperApi` in order to maintain it officially in a backward compatible way at Apache Spark 3.3.0. ### Why are the changes needed? - Since SPARK-24248 at Apache Spark 2.4.0, `ExecutorPodsWatchSnapshotSource` and `ExecutorPodsPollingSnapshotSource` have been used to monitor executor pods without any interface changes for over 3 years. - Apache Spark 3.1.1 makes `Kubernetes` module GA and provides an extensible external cluster manager framework. New `ExternalClusterManager` for K8s environment need to depend on this to monitor pods. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. Closes #34751 from dongjoon-hyun/SPARK-37497. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 2b044962cd6eff5a3a76f2808ee93b40bdf931df) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 December 2021, 02:41:36 UTC
5f1ad08 [SPARK-37505][MESOS][TESTS] Add a log4j.properties for `mesos` module UT ### What changes were proposed in this pull request? `scalatest-maven-plugin` configure `file:src/test/resources/log4j.properties` as the UT log configuration, so this PR adds this `log4j.properties` file to the mesos module for UT. ### Why are the changes needed? Supplement missing log4j configuration file for mesos module . ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test **Before** Run ``` mvn clean install -pl resource-managers/mesos -Pmesos -am -DskipTests mvn test -pl resource-managers/mesos -Pmesos ``` will print the following log: ``` log4j:ERROR Could not read configuration file from URL [file:src/test/resources/log4j.properties]. java.io.FileNotFoundException: src/test/resources/log4j.properties (No such file or directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at java.io.FileInputStream.<init>(FileInputStream.java:93) at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90) at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188) at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557) at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526) at org.apache.log4j.LogManager.<clinit>(LogManager.java:127) at org.slf4j.impl.Log4jLoggerFactory.<init>(Log4jLoggerFactory.java:66) at org.slf4j.impl.StaticLoggerBinder.<init>(StaticLoggerBinder.java:72) at org.slf4j.impl.StaticLoggerBinder.<clinit>(StaticLoggerBinder.java:45) at org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j12(Logging.scala:222) at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:127) at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:111) at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105) at org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62) at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:102) at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:101) at org.apache.spark.SparkFunSuite.initializeLogIfNecessary(SparkFunSuite.scala:62) at org.apache.spark.internal.Logging.log(Logging.scala:49) at org.apache.spark.internal.Logging.log$(Logging.scala:47) at org.apache.spark.SparkFunSuite.log(SparkFunSuite.scala:62) at org.apache.spark.SparkFunSuite.<init>(SparkFunSuite.scala:74) at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackendSuite.<init>(MesosCoarseGrainedSchedulerBackendSuite.scala:43) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:66) at org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.Iterator.foreach(Iterator.scala:943) at 
scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.scalatest.tools.DiscoverySuite.<init>(DiscoverySuite.scala:37) at org.scalatest.tools.Runner$.genDiscoSuites$1(Runner.scala:1132) at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1226) at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993) at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971) at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1482) at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:971) at org.scalatest.tools.Runner$.main(Runner.scala:775) at org.scalatest.tools.Runner.main(Runner.scala) log4j:ERROR Ignoring configuration file [file:src/test/resources/log4j.properties]. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties ``` and test log will print to console. **After** No above log in console and test log will print to `resource-managers/mesos/target/unit-tests.log` as other module. Closes #34759 from LuciferYang/SPARK-37505. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit fdb33dd9e27ac5d69ea875ca5bb85dfd369e71f1) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 30 November 2021, 22:41:05 UTC
d1ea322 [SPARK-37452][SQL] Char and Varchar break backward compatibility between v3.1 and v2 ### What changes were proposed in this pull request? We store the table schema in table properties for the read side to restore. In Spark 3.1, we added char/varchar support natively. In some commands like `create table` and `alter table` with these types, the `char(x)` or `varchar(x)` is stored directly in those properties. If a user uses Spark 2 to read such a table, it will fail to parse the schema. FYI, https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L136 A table can be a newly created one by Spark 3.1 and later or an existing one modified by Spark 3.1 and on. ### Why are the changes needed? backward compatibility ### Does this PR introduce _any_ user-facing change? That's not necessarily user-facing as a bugfix and only related to internal table properties. ### How was this patch tested? Manually Closes #34697 from yaooqinn/SPARK-37452. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 0c3c4e2fd06629b77b86eeb36490ecf07d5283fc) Signed-off-by: Kent Yao <yao@apache.org> 29 November 2021, 09:06:15 UTC
a51ad2a [SPARK-37446][SQL] Use reflection for getWithoutRegisterFns to allow different Hive versions for building ### What changes were proposed in this pull request? Hive 2.3.9 introduced the function `getWithoutRegisterFns`, but users may build against Hive 2.3.8 or a lower version. We should use reflection so that users can still build with Hive 2.3.8 or lower. ### Why are the changes needed? Support building with Hive versions lower than 2.3.9, since many users build Spark with their own Hive code and their own jars (they may add optimizations or other changes in their own code). This PR makes integration easier and does not hurt the current logic. ### Does this PR introduce _any_ user-facing change? Users can build Spark with a Hive version lower than 2.3.9. ### How was this patch tested? Built with the command ``` ./dev/make-distribution.sh --tgz -Pyarn -Phive -Phive-thriftserver -Dhive.version=2.3.8 ``` Jars under dist ![image](https://user-images.githubusercontent.com/46485123/143162194-d505b151-f23d-4268-af19-6dfeccea4a74.png) Closes #34690 from AngersZhuuuu/SPARK-37446. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 04671bd04d3d5688bd0d3fa616365ee32cc1ff41) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 25 November 2021, 01:06:54 UTC
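A sketch of the reflection pattern only; the method names come from the description above, but the exact signatures and the surrounding Spark code are simplified assumptions.

```scala
import java.lang.reflect.Method
import scala.util.Try

// Look up Hive.getWithoutRegisterFns at runtime (available from Hive 2.3.9)
// and fall back to Hive.get when building against an older Hive.
def getHiveClient(hiveClass: Class[_], confClass: Class[_], conf: AnyRef): AnyRef = {
  val newMethod: Option[Method] =
    Try(hiveClass.getMethod("getWithoutRegisterFns", confClass)).toOption
  newMethod match {
    case Some(m) => m.invoke(null, conf)                                      // Hive 2.3.9+
    case None    => hiveClass.getMethod("get", confClass).invoke(null, conf)  // Hive <= 2.3.8
  }
}
```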
a7184d5 [SPARK-37389][SQL] Check unclosed bracketed comments The SQL below has an unclosed bracketed comment. ``` /*abc*/ select 1 as a /* 2 as b /*abc*/ , 3 as c /**/ ; ``` But Spark outputs a result table with a single column `a` containing the value 1, instead of failing. PostgreSQL also supports the feature, but it outputs: ``` SQL Error [42601]: Unterminated block comment started at position 22 in SQL /*abc*/ select 1 as a /* 2 as b /*abc*/ , 3 as c /**/ ;. Expected */ sequence ``` The execution plan is not what is expected if we don't check for unclosed bracketed comments. 'Yes'. The behavior of bracketed comments is now handled more correctly. New tests. Closes #34668 from beliefer/SPARK-37389. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ff249c3bad2d9a42e42934ebb44d9d6132f49fb8) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 November 2021, 08:18:46 UTC
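After this change, the statement from the description fails at parse time instead of silently discarding the trailing columns; the exact exception message below is illustrative.

```scala
spark.sql("/*abc*/ select 1 as a /* 2 as b /*abc*/ , 3 as c /**/")
// throws org.apache.spark.sql.catalyst.parser.ParseException:
//   unclosed bracketed comment (wording may differ)
```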
344658a [SPARK-37388][SQL] Fix NPE in WidthBucket in WholeStageCodegenExec This PR fixes a `NullPointerException` in `WholeStageCodegenExec` caused by the expression `WidthBucket`. The cause of this NPE is that `WidthBucket` calls `WidthBucket.computeBucketNumber`, which can return `null`, but the generated code cannot deal with `null`s. This fixes a `NullPointerException` in Spark SQL. No. Added tests to `MathExpressionsSuite`. This suite already had tests for `WidthBucket` with interval inputs, but lacked tests with double inputs. I checked that the tests failed without the fix, and succeeded with the fix. Closes #34670 from tomvanbussel/SPARK-37388. Authored-by: Tom van Bussel <tom.vanbussel@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 77f397464823a547c27d98ed306703cb9c73cec3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 22 November 2021, 08:20:01 UTC
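An illustration with made-up values: `width_bucket` can legitimately evaluate to NULL (for example with a non-positive bucket count), and it is that NULL which previously tripped the generated whole-stage code for double inputs.

```scala
spark.range(1)
  .selectExpr(
    "width_bucket(5.3, 0.2, 10.6, 5) AS bucket",        // normal case
    "width_bucket(5.3, 0.2, 10.6, -1) AS null_bucket")  // evaluates to NULL
  .show()
```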
b27be1f [SPARK-37209][YARN][TESTS] Fix `YarnShuffleIntegrationSuite` releated UTs when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` ### What changes were proposed in this pull request? `YarnShuffleIntegrationSuite`, `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` will failed when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars`, the fail reason is `java.lang.NoClassDefFoundError: breeze/linalg/Matrix`. The above UTS can succeed when using `hadoop-2.7` profile without `assembly/target/scala-%s/jars` because `KryoSerializer.loadableSparkClasses` can workaroud when `Utils.isTesting` is true, but `Utils.isTesting` is false when using `hadoop-3.2` profile. After investigated, I found that when `hadoop-2.7` profile is used, `SPARK_TESTING` will be propagated to AM and Executor, but when `hadoop-3.2` profile is used, `SPARK_TESTING` will not be propagated to AM and Executor. In order to ensure the consistent behavior of using `hadoop-2.7` and ``hadoop-3.2``, this pr change to manually propagate `SPARK_TESTING` environment variable if it exists to ensure `Utils.isTesting` is true in above test scenario. ### Why are the changes needed? Ensure `YarnShuffleIntegrationSuite` releated UTs can succeed when using `hadoop-3.2` profile without `assembly/target/scala-%s/jars` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test `YarnShuffleIntegrationSuite`. `YarnShuffleAuthSuite` and `YarnShuffleAlternateNameConfigSuite` can be verified in the same way. Please ensure that the `assembly/target/scala-%s/jars` directory does not exist before executing the test command, we can clean up the whole project by executing follow command or clone a new local code repo. 1. run with `hadoop-3.2` profile ``` mvn clean install -Phadoop-3.2 -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite ``` **Before** ``` YarnShuffleIntegrationSuite: - external shuffle service *** FAILED *** FAILED did not equal FINISHED (stdout/stderr was not captured) (BaseYarnClusterSuite.scala:227) Run completed in 48 seconds, 137 milliseconds. 
Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0 *** 1 TEST FAILED *** ``` Error stack as follows: ``` 21/11/20 23:00:09.682 main ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (localhost executor 1): java.lang.NoClassDefFoundError: breeze/linalg/Matrix at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.util.Utils$.classForName(Utils.scala:216) at org.apache.spark.serializer.KryoSerializer$.$anonfun$loadableSparkClasses$1(KryoSerializer.scala:537) at scala.collection.immutable.List.flatMap(List.scala:366) at org.apache.spark.serializer.KryoSerializer$.loadableSparkClasses$lzycompute(KryoSerializer.scala:535) at org.apache.spark.serializer.KryoSerializer$.org$apache$spark$serializer$KryoSerializer$$loadableSparkClasses(KryoSerializer.scala:502) at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:226) at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102) at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48) at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109) at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346) at org.apache.spark.serializer.KryoSerializationStream.<init>(KryoSerializer.scala:266) at org.apache.spark.serializer.KryoSerializerInstance.serializeStream(KryoSerializer.scala:432) at org.apache.spark.shuffle.ShufflePartitionPairsWriter.open(ShufflePartitionPairsWriter.scala:76) at org.apache.spark.shuffle.ShufflePartitionPairsWriter.write(ShufflePartitionPairsWriter.scala:59) at org.apache.spark.util.collection.WritablePartitionedIterator.writeNext(WritablePartitionedPairCollection.scala:83) at org.apache.spark.util.collection.ExternalSorter.$anonfun$writePartitionedMapOutput$1(ExternalSorter.scala:772) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468) at org.apache.spark.util.collection.ExternalSorter.writePartitionedMapOutput(ExternalSorter.scala:775) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.ClassNotFoundException: breeze.linalg.Matrix at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ... 
32 more ``` **After** ``` YarnShuffleIntegrationSuite: - external shuffle service Run completed in 35 seconds, 188 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` 2. run with `hadoop-2.7` profile ``` mvn clean install -Phadoop-2.7 -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite ``` **Before** ``` YarnShuffleIntegrationSuite: - external shuffle service Run completed in 30 seconds, 828 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` **After** ``` YarnShuffleIntegrationSuite: - external shuffle service Run completed in 30 seconds, 967 milliseconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #34620 from LuciferYang/SPARK-37209. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit a7b3fc7cef4c5df0254b945fe9f6815b072b31dd) Signed-off-by: Sean Owen <srowen@gmail.com> 22 November 2021, 01:11:56 UTC
51eef12 [SPARK-37390][PYTHON][DOCS] Add default value to getattr(app, add_javascript) ### What changes were proposed in this pull request? This PR adds `default` (`None`) value to `getattr(app, add_javascript)`. ### Why are the changes needed? Current logic fails on `getattr(app, "add_javascript")` when executed with Sphinx 4.x. The problem can be illustrated with a simple snippet. ```python >>> class Sphinx: ... def add_js_file(self, filename: str, priority: int = 500, **kwargs: Any) -> None: ... ... >>> app = Sphinx() >>> getattr(app, "add_js_file", getattr(app, "add_javascript")) Traceback (most recent call last): File "<ipython-input-11-442ab6dfc933>", line 1, in <module> getattr(app, "add_js_file", getattr(app, "add_javascript")) AttributeError: 'Sphinx' object has no attribute 'add_javascript' ``` After this PR is merged we'll fallback to `getattr(app, "add_js_file")`: ```python >>> getattr(app, "add_js_file", getattr(app, "add_javascript", None)) <bound method Sphinx.add_js_file of <__main__.Sphinx object at 0x7f444456abe0>> ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual docs build. Closes #34669 from zero323/SPARK-37390. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: zero323 <mszymkiewicz@gmail.com> (cherry picked from commit b9e9167f589db372f0416425cdf35b48a6216b50) Signed-off-by: zero323 <mszymkiewicz@gmail.com> 20 November 2021, 10:45:45 UTC
31faa59 [SPARK-36900][SPARK-36464][CORE][TEST] Refactor `: size returns correct positive number even with over 2GB data` to pass with Java 8, 11 and 17 ### What changes were proposed in this pull request? Refactor `SPARK-36464: size returns correct positive number even with over 2GB data` in `ChunkedByteBufferOutputStreamSuite` to reduce the total use of memory for this test case, then this case can pass with Java 8, Java 11 and Java 17 use `-Xmx4g`. ### Why are the changes needed? `SPARK-36464: size returns correct positive number even with over 2GB data` pass with Java 8 but OOM with Java 11 and Java 17. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test ``` mvn clean install -pl core -am -Dtest=none -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite ``` with Java 8, Java 11 and Java 17, all tests passed. Closes #34284 from LuciferYang/SPARK-36900. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit cf436233072b75e083a4455dc53b22edba0b3957) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 November 2021, 23:06:00 UTC
4d05b9e [SPARK-37270][3.2][SQL] Fix push foldable into CaseWhen branches if elseValue is empty Backport #34580 ### What changes were proposed in this pull request? This pr fix push foldable into CaseWhen branches if elseValue is empty. For example: ```scala spark.sql("CREATE TABLE t1(bool boolean, number int) using parquet") spark.sql("INSERT INTO t1 VALUES(false, 1)") spark.sql( """ |SELECT * |FROM (SELECT *, | CASE | WHEN bool THEN 'I am not null' | END AS conditions | FROM t1) t |WHERE conditions IS NULL |""".stripMargin).show ``` How do we optimize the filter conditions before this pr: ``` Filter isnull(CASE WHEN bool#7 THEN I am not null END) -> Filter CASE WHEN bool#7 THEN isnull(I am not null) END -> // After PushFoldableIntoBranches Filter (bool#7 AND isnull(I am not null)) -> Filter (bool#7 AND false) -> Filter false ``` How do we optimize the filter conditions after this pr: ``` Filter isnull(CASE WHEN bool#7 THEN I am not null END) -> Filter CASE WHEN bool#7 THEN isnull(I am not null) ELSE isnull(null) END -> // After PushFoldableIntoBranches Filter CASE WHEN bool#7 THEN false ELSE isnull(null) END -> Filter CASE WHEN bool#7 THEN false ELSE true END -> Filter NOT (bool#7 <=> true) ``` ### Why are the changes needed? Fix correctness issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #34627 from wangyum/SPARK-37270-3.2. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com> 19 November 2021, 01:44:16 UTC
7c6ccee [MINOR][R][DOCS] Fix minor issues in SparkR docs examples ### What changes were proposed in this pull request? This PR fixes two minor problems in SpakR `examples` - Replace misplaced standard comment (`#`) with roxygen comment (`#'`) in `sparkR.session` `examples` - Add missing comma in `write.stream` examples. ### Why are the changes needed? - `sparkR.session` examples are not fully rendered. - `write.stream` example is not syntactically valid. ### Does this PR introduce _any_ user-facing change? Docs only. ### How was this patch tested? Manual inspection of build docs. Closes #34654 from zero323/sparkr-docs-fixes. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 207dd4ce72a5566c554f224edb046106cf97b952) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 19 November 2021, 00:47:31 UTC
1cd3e32 [SPARK-37214][SQL][3.2][FOLLOWUP] No db prefix for temp view is an analyzer error ### What changes were proposed in this pull request? followup of #34490 in branch-3.2, which moves the test for checking `notAllowedToAddDBPrefixForTempViewError` in the parser phase. But it only passes in master. In branch-3.2, the error happens in the analyzer phase. diverge happens in PR #34283 ### Why are the changes needed? fix test ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass the restored test. Closes #34633 from linhongliu-db/SPARK-37214-3.2. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 November 2021, 02:25:28 UTC
c362819 [SPARK-37702][SQL][FOLLOWUP] Store referred temp functions for CacheTableAsSelect ### What changes were proposed in this pull request? This is a follow-up of PR #34546 which uses the `AnalysisContext` to store referred temporary functions when creating temp views. But the `CacheTableAsSelect` also needs to update the temporary functions because it's a temp view under the hood. ### Why are the changes needed? Followup of #34546 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? newly added ut Closes #34603 from linhongliu-db/SPARK-37702-followup. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b6c472ee5a6a2384a4e7b795a2d3a97a06fe77c5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 November 2021, 05:02:05 UTC
a52e43f [SPARK-37346][DOC] Add migration guide link for structured streaming ### What changes were proposed in this pull request? Add migration guide link for structured streaming ### Why are the changes needed? Add migration guide link for structured streaming ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes #34618 from AngersZhuuuu/SPARK-37346. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 1b84a6b3cc6ace4f8cf3b2ca700991ae43bcf4bd) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 16 November 2021, 23:56:33 UTC
21d901b [SPARK-37322][TESTS] `run_scala_tests` should respect test module order ### What changes were proposed in this pull request? This PR aims to make `run_scala_tests` respect test module order ### Why are the changes needed? Currently the execution order is random. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually through the following and check if the catalyst module runs first. ``` $ SKIP_MIMA=1 SKIP_UNIDOC=1 ./dev/run-tests --parallelism 1 --modules "catalyst,hive-thriftserver" ``` Closes #34590 from williamhyun/order. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c2221a8e3d95ce22d76208c705179c5954318567) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 November 2021, 20:32:43 UTC
45aee7a [SPARK-37317][MLLIB][TESTS] Reduce weights in GaussianMixtureSuite ### What changes were proposed in this pull request? This PR aims to reduce the test weight `100` to `90` to improve the robustness of test case `GMM support instance weighting`. ```scala test("GMM support instance weighting") { val gm1 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed) val gm2 = new GaussianMixture().setK(k).setMaxIter(20).setSeed(seed).setWeightCol("weight") - Seq(1.0, 10.0, 100.0).foreach { w => + Seq(1.0, 10.0, 90.0).foreach { w => ``` ### Why are the changes needed? As mentioned in https://github.com/apache/spark/pull/26735#discussion_r352551174, the weights of model changes when the weights grow. And, the final weight `100` seems to be high enough to cause failures on some JVMs. This is observed in Java 17 M1 native mode. ``` $ java -version openjdk version "17" 2021-09-14 LTS OpenJDK Runtime Environment Zulu17.28+13-CA (build 17+35-LTS) OpenJDK 64-Bit Server VM Zulu17.28+13-CA (build 17+35-LTS, mixed mode, sharing) ``` **BEFORE** ``` $ build/sbt "mllib/test" ... [info] - GMM support instance weighting *** FAILED *** (1 second, 722 milliseconds) [info] Expected 0.10476714410584752 and 1.209081654091291E-14 to be within 0.001 using absolute tolerance. (TestingUtils.scala:88) ... [info] *** 1 TEST FAILED *** [error] Failed: Total 1760, Failed 1, Errors 0, Passed 1759, Ignored 7 [error] Failed tests: [error] org.apache.spark.ml.clustering.GaussianMixtureSuite [error] (mllib / Test / test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 625 s (10:25), completed Nov 13, 2021, 6:21:13 PM ``` **AFTER** ``` [info] Total number of tests run: 1638 [info] Suites: completed 205, aborted 0 [info] Tests: succeeded 1638, failed 0, canceled 0, ignored 7, pending 0 [info] All tests passed. [info] Passed: Total 1760, Failed 0, Errors 0, Passed 1760, Ignored 7 [success] Total time: 568 s (09:28), completed Nov 13, 2021, 6:09:16 PM ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #34584 from dongjoon-hyun/SPARK-37317. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a1f2ae0a637629796db95643ea7443a1c33bad41) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 November 2021, 16:48:07 UTC
96c17d3 [SPARK-37320][K8S][TESTS] Delete py_container_checks.zip after the test in DepsTestsSuite finishes ### What changes were proposed in this pull request? This PR fixes an issue that `py_container_checks.zip` still remains in `resource-managers/kubernetes/integration-tests/tests/` even after the test `Launcher python client dependencies using a zip file` in `DepsTestsSuite` finishes. ### Why are the changes needed? To keep the repository clean. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Confirmed that the zip file will be removed after the test finishes with the following command using MiniKube. ``` PVC_TESTS_HOST_PATH=/path PVC_TESTS_VM_PATH=/path build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes -Pkubernetes-integration-tests -pl resource-managers/kubernetes/integration-tests integration-test ``` Closes #34588 from sarutak/remove-zip-k8s. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit d8cee85e0264fe879f9d1eeec7541a8e94ff83f6) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 November 2021, 16:45:28 UTC
e0a0d4c [SPARK-37318][CORE][TESTS] Make `FallbackStorageSuite` robust in terms of DNS ### What changes were proposed in this pull request? This PR aims to make `FallbackStorageSuite` robust in terms of DNS. ### Why are the changes needed? The test case expects the hostname doesn't exist and it actually doesn't exist. ``` $ nslookup remote Server: 8.8.8.8 Address: 8.8.8.8#53 ** server can't find remote: NXDOMAIN $ ping remote ping: cannot resolve remote: Unknown host ``` However, in some DNS environments, all hostnames including non-existent names seems to be handled like the existing hostnames. ``` $ nslookup remote Server: 172.16.0.1 Address: 172.16.0.1#53 Non-authoritative answer: Name: remote Address: 23.217.138.110 $ ping remote PING remote (23.217.138.110): 56 data bytes 64 bytes from 23.217.138.110: icmp_seq=0 ttl=57 time=8.660 ms $ build/sbt "core/testOnly *.FallbackStorageSuite" ... [info] Run completed in 2 minutes, 31 seconds. [info] Total number of tests run: 9 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 3, failed 6, canceled 0, ignored 0, pending 0 [info] *** 6 TESTS FAILED *** [error] Failed tests: [error] org.apache.spark.storage.FallbackStorageSuite [error] (core / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` $ build/sbt "core/testOnly *.FallbackStorageSuite" ... [info] Run completed in 3 seconds, 322 milliseconds. [info] Total number of tests run: 3 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 3, failed 0, canceled 6, ignored 0, pending 0 [info] All tests passed. [success] Total time: 22 s, completed Nov 13, 2021 7:11:31 PM ``` Closes #34585 from dongjoon-hyun/SPARK-37318. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 41f9df92061ab96ce7729f0e2a107a3569046c58) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 November 2021, 09:02:28 UTC
ed9557c [SPARK-37307][CORE] Don't obtain JDBC connection for empty partitions ### What changes were proposed in this pull request? Return immediately from JdbcUtils:savePartition for empty partitions, rather than obtain a database connection that will be unused. ### Why are the changes needed? Avoids the overhead of opening and closing a DB connection that will not be used. ### Does this PR introduce _any_ user-facing change? It should not. ### How was this patch tested? Existing tests Closes #34572 from srowen/SPARK-37307. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Huaxin Gao <huaxin_gao@apple.com> (cherry picked from commit 07134c70c846e03a7928bd1ccc070d16db8768cb) Signed-off-by: Huaxin Gao <huaxin_gao@apple.com> 13 November 2021, 17:19:57 UTC
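A simplified sketch of the pattern (not the actual `JdbcUtils.savePartition` signature): bail out before opening a connection when the partition iterator is empty.

```scala
import java.sql.Connection
import org.apache.spark.sql.Row

def savePartition(iterator: Iterator[Row], openConnection: () => Connection): Unit = {
  if (!iterator.hasNext) {
    return  // nothing to write; skip the connect/close round trip entirely
  }
  val conn = openConnection()
  try {
    iterator.foreach { row =>
      // ... write the row over `conn` (elided in this sketch) ...
      ()
    }
  } finally {
    conn.close()
  }
}
```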
bdde693 [SPARK-37288][PYTHON][3.2] Backport since annotation update ### What changes were proposed in this pull request? This PR changes the `since` annotation to support `float` arguments: ```python def since(version: Union[str, float]) -> Callable[[T], T]: ... ``` ### Why are the changes needed? `since` is used with both `str` and `float` in Spark and related libraries, and this change has already been done for Spark >= 3.3 (SPARK-36906). Note that this technically fixes a bug in the downstream projects that run mypy checks against `pyspark.since`. When they use it, for example, with `pyspark.since(3.2)`, the mypy check fails; however, this case is legitimate. After this change, the mypy check can pass in their CIs. ### Does this PR introduce _any_ user-facing change? ```python @since(3.2) def f(): ... ``` is going to type check if downstream projects run mypy to validate the types. Otherwise, it introduces no invasive or user-facing behavior change. ### How was this patch tested? Existing tests and manual testing. Closes #34555 from zero323/SPARK-37288. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 13 November 2021, 14:53:24 UTC
0c99f13 [SPARK-37302][BUILD] Explicitly downloads guava and jetty-io in test-dependencies.sh ### What changes were proposed in this pull request? This PR change `dev/test-dependencies.sh` to download `guava` and `jetty-io` explicitly. `dev/run-tests.py` fails if Scala 2.13 is used and `guava` or `jetty-io` is not in the both of Maven and Coursier local repository. ``` $ rm -rf ~/.m2/repository/* $ # For Linux $ rm -rf ~/.cache/coursier/v1/* $ # For macOS $ rm -rf ~/Library/Caches/Coursier/v1/* $ dev/change-scala-version.sh 2.13 $ dev/test-dependencies.sh $ build/sbt -Pscala-2.13 clean compile ... [error] /home/kou/work/oss/spark-scala-2.13/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java:24:1: error: package com.google.common.primitives does not exist [error] import com.google.common.primitives.Ints; [error] ^ [error] /home/kou/work/oss/spark-scala-2.13/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java:30:1: error: package com.google.common.annotations does not exist [error] import com.google.common.annotations.VisibleForTesting; [error] ^ [error] /home/kou/work/oss/spark-scala-2.13/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java:31:1: error: package com.google.common.base does not exist [error] import com.google.common.base.Preconditions; ... ``` ``` [error] /home/kou/work/oss/spark-scala-2.13/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala:87:25: Class org.eclipse.jetty.io.ByteBufferPool not found - continuing with a stub. [error] val connector = new ServerConnector( [error] ^ [error] /home/kou/work/oss/spark-scala-2.13/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala:87:21: multiple constructors for ServerConnector with alternatives: [error] (x$1: org.eclipse.jetty.server.Server,x$2: java.util.concurrent.Executor,x$3: org.eclipse.jetty.util.thread.Scheduler,x$4: org.eclipse.jetty.io.ByteBufferPool,x$5: Int,x$6: Int,x$7: org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector <and> [error] (x$1: org.eclipse.jetty.server.Server,x$2: org.eclipse.jetty.util.ssl.SslContextFactory,x$3: org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector <and> [error] (x$1: org.eclipse.jetty.server.Server,x$2: org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector <and> [error] (x$1: org.eclipse.jetty.server.Server,x$2: Int,x$3: Int,x$4: org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector [error] cannot be invoked with (org.eclipse.jetty.server.Server, Null, org.eclipse.jetty.util.thread.ScheduledExecutorScheduler, Null, Int, Int, org.eclipse.jetty.server.HttpConnectionFactory) [error] val connector = new ServerConnector( [error] ^ [error] /home/kou/work/oss/spark-scala-2.13/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:207:13: Class org.eclipse.jetty.io.ClientConnectionFactory not found - continuing with a stub. 
[error] new HttpClient(new HttpClientTransportOverHTTP(numSelectors), null) [error] ^ [error] /home/kou/work/oss/spark-scala-2.13/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:287:25: multiple constructors for ServerConnector with alternatives: [error] (x$1: org.eclipse.jetty.server.Server,x$2: java.util.concurrent.Executor,x$3: org.eclipse.jetty.util.thread.Scheduler,x$4: org.eclipse.jetty.io.ByteBufferPool,x$5: Int,x$6: Int,x$7: org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector <and> [error] (x$1: org.eclipse.jetty.server.Server,x$2: org.eclipse.jetty.util.ssl.SslContextFactory,x$3: org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector <and> [error] (x$1: org.eclipse.jetty.server.Server,x$2: org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector <and> [error] (x$1: org.eclipse.jetty.server.Server,x$2: Int,x$3: Int,x$4: org.eclipse.jetty.server.ConnectionFactory*)org.eclipse.jetty.server.ServerConnector [error] cannot be invoked with (org.eclipse.jetty.server.Server, Null, org.eclipse.jetty.util.thread.ScheduledExecutorScheduler, Null, Int, Int, org.eclipse.jetty.server.ConnectionFactory) [error] val connector = new ServerConnector( ``` The reason is that the `exec-maven-plugin` used in `test-dependencies.sh` downloads the pom of guava and jetty-io but doesn't download the corresponding jars, and the script skips dependency testing if Scala 2.13 is used (if dependency testing runs, Maven downloads those jars). ``` if [[ "$SCALA_BINARY_VERSION" != "2.12" ]]; then # TODO(SPARK-36168) Support Scala 2.13 in dev/test-dependencies.sh echo "Skip dependency testing on $SCALA_BINARY_VERSION" exit 0 fi ``` ``` $ find ~/.m2 -name "guava*" ... /home/kou/.m2/repository/com/google/guava/guava/14.0.1/guava-14.0.1.pom /home/kou/.m2/repository/com/google/guava/guava/14.0.1/guava-14.0.1.pom.sha1 ... /home/kou/.m2/repository/com/google/guava/guava-parent/14.0.1/guava-parent-14.0.1.pom /home/kou/.m2/repository/com/google/guava/guava-parent/14.0.1/guava-parent-14.0.1.pom.sha1 ... $ find ~/.m2 -name "jetty*" ... /home/kou/.m2/repository/org/eclipse/jetty/jetty-io/9.4.43.v20210629/jetty-io-9.4.43.v20210629.pom /home/kou/.m2/repository/org/eclipse/jetty/jetty-io/9.4.43.v20210629/jetty-io-9.4.43.v20210629.pom.sha1 ... ``` Under these circumstances, building Spark using SBT fails. `run-tests.py` builds Spark using SBT after the dependency testing, so `run-tests.py` fails with Scala 2.13. Further, I noticed that this issue can even happen with Scala 2.12 if the script exits at the following part. ``` OLD_VERSION=$($MVN -q \ -Dexec.executable="echo" \ -Dexec.args='${project.version}' \ --non-recursive \ org.codehaus.mojo:exec-maven-plugin:1.6.0:exec | grep -E '[0-9]+\.[0-9]+\.[0-9]+') if [ $? != 0 ]; then echo -e "Error while getting version string from Maven:\n$OLD_VERSION" exit 1 fi ``` This phenomenon is similar to SPARK-34762 (#31862). So the fix is to get guava and jetty-io explicitly via `mvn dependency:get`. ### Why are the changes needed? To keep the Maven local repository sane, so we can avoid such confusing build errors. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Confirmed that we can successfully compile code with the following commands. 
``` rm -rf ~/.m2/repository/* # For Linux rm -rf ~/.cache/coursier/v1/* # For macOS rm -rf ~/Library/Caches/Coursier/v1/* dev/change-scala-version.sh 2.13 dev/test-dependencies.sh ./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pspark-ganglia-lgpl -Pscala-2.13 compile test:compile ``` Closes #34570 from sarutak/fix-test-dependencies-issue. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 01ab0bedb2ad423963ebff78272e0d045b0b9107) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 November 2021, 19:14:40 UTC
b7ec10a [SPARK-37702][SQL] Use AnalysisContext to track referred temp functions This PR uses `AnalysisContext` to track the referred temp functions in order to fix a temp function resolution issue when the function is registered with a `FunctionBuilder` and referred to by a temp view. During temporary view creation, we need to collect all the temp functions and save them to the metadata, so that the next time the view SQL text is resolved, the functions can be resolved correctly. But if the temp function is registered with a `FunctionBuilder`, it's not a `UserDefinedExpression`, so it cannot be collected as a temp function. As a result, the next time the analyzer resolves the temp view, the registered function couldn't be found. Example: ```scala val func = CatalogFunction(FunctionIdentifier("tempFunc", None), ...) val builder = (e: Seq[Expression]) => e.head spark.sessionState.catalog.registerFunction(func, Some(builder)) sql("create temp view tv as select tempFunc(a, b) from values (1, 2) t(a, b)") sql("select * from tv").collect() ``` Bug fix; no user-facing change. Tested with newly added test cases. Closes #34546 from linhongliu-db/SPARK-37702-ver3. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 68a0ab5960e847e0fa1a59da0316d0c111574af4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 November 2021, 14:22:26 UTC
86b7621 [MINOR][DOCS] Add Scala 2.13 to index.md and build-spark.md This PR aims to add Scala 2.13 to index.md and build-spark.md, since Scala 2.13 is supported as of Spark 3.2. No user-facing change. Manually reviewed the generated webpage. Closes #34562 from williamhyun/web. Authored-by: William Hyun <william@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit d018503c2d6604b4ef5d7906fa8d445b07e6d056) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 November 2021, 02:52:26 UTC
9f96524 [SPARK-37023][CORE] Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry ### What changes were proposed in this pull request? At a high level, this creates a helper method `getMapSizesByExecutorIdImpl` on which `getMapSizesByExecutorId` and `getPushBasedShuffleMapSizesByExecutorId` can rely. It takes a parameter `useMergeResult`, which indicates whether fetching merge results is needed, and passes it as `canFetchMergeResult` into `getStatuses`. ### Why are the changes needed? During some stage retry cases, `shuffleDependency.shuffleMergeEnabled` can be set to false, but a `mergeStatus` will still exist since the Driver has already collected the merged status for the shuffle dependency. In that case, the current implementation sets enableBatchFetch to false because merge statuses exist, causing the assertion in `MapOutputTracker.getMapSizesByExecutorId` to fail: ``` assert(mapSizesByExecutorId.enableBatchFetch == true) ``` The proposed fix helps resolve the issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passed the existing UTs. Closes #34461 from rmcyang/SPARK-37023. Authored-by: Minchu Yang <minyang@minyang-mn3.linkedin.biz> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit f1532a291179665a3b69dad640a770ecfcbed629) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 11 November 2021, 08:57:47 UTC
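A minimal Scala sketch of the helper-method pattern this commit describes; the types and names below are simplified stand-ins for illustration, not Spark's actual `MapOutputTracker` API:

```scala
// Illustrative sketch only: simplified stand-in types, not Spark's real MapOutputTracker.
case class MapSizesByExecutorId(blocks: Seq[String], enableBatchFetch: Boolean)

class MergeAwareTrackerSketch(hasMergeStatuses: Int => Boolean) {

  // Regular shuffle reads never consult merge results, so batch fetch stays enabled.
  def getMapSizesByExecutorId(shuffleId: Int): MapSizesByExecutorId =
    getMapSizesByExecutorIdImpl(shuffleId, useMergeResult = false)

  // Push-based shuffle reads opt in to merge results explicitly.
  def getPushBasedShuffleMapSizesByExecutorId(shuffleId: Int): MapSizesByExecutorId =
    getMapSizesByExecutorIdImpl(shuffleId, useMergeResult = true)

  private def getMapSizesByExecutorIdImpl(
      shuffleId: Int,
      useMergeResult: Boolean): MapSizesByExecutorId = {
    // Only look at merge statuses when the caller asked for them. This keeps a retry with
    // shuffleMergeEnabled == false from disabling batch fetch just because stale merge
    // statuses still exist on the driver.
    val canFetchMergeResult = useMergeResult && hasMergeStatuses(shuffleId)
    MapSizesByExecutorId(Seq.empty, enableBatchFetch = !canFetchMergeResult)
  }
}
```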
bda1ecd [SPARK-37253][PYTHON] `try_simplify_traceback` should not fail when `tb_frame.f_lineno` is None ### What changes were proposed in this pull request? This PR aims to handle the corner case when `tb_frame.f_lineno` is `None` in `try_simplify_traceback` which was added by https://github.com/apache/spark/pull/30309 at Apache Spark 3.1.0. ### Why are the changes needed? This will handle the following corner case. ```python Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/worker.py", line 630, in main tb = try_simplify_traceback(sys.exc_info()[-1]) File "/Users/dongjoon/APACHE/spark-merge/python/lib/pyspark.zip/pyspark/util.py", line 217, in try_simplify_traceback new_tb = types.TracebackType( TypeError: 'NoneType' object cannot be interpreted as an integer ``` Python GitHub Repo also has the test case for this corner case. - https://github.com/python/cpython/blob/main/Lib/test/test_exceptions.py#L2373 ```python None if frame.f_lineno is None else ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #34530 from dongjoon-hyun/SPARK-37253. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 8ae88d01b46d581367d0047b50fcfb65078ab972) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 09 November 2021, 11:03:42 UTC
c4090fc [SPARK-37252][PYTHON][TESTS] Ignore `test_memory_limit` on non-Linux environment ### What changes were proposed in this pull request? This PR aims to ignore `test_memory_limit` on non-Linux environment. ### Why are the changes needed? Like the documentation https://github.com/apache/spark/pull/23664, it fails on non-Linux environment like the following MacOS example. **BEFORE** ``` $ build/sbt -Phadoop-cloud -Phadoop-3.2 test:package $ python/run-tests --modules pyspark-core ... ====================================================================== FAIL: test_memory_limit (pyspark.tests.test_worker.WorkerMemoryTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/tests/test_worker.py", line 212, in test_memory_limit self.assertEqual(soft_limit, 2 * 1024 * 1024 * 1024) AssertionError: 9223372036854775807 != 2147483648 ---------------------------------------------------------------------- ``` **AFTER** ``` ... Tests passed in 104 seconds Skipped tests in pyspark.tests.test_serializers with /Users/dongjoon/.pyenv/versions/3.8.12/bin/python3: test_serialize (pyspark.tests.test_serializers.SciPyTests) ... skipped 'SciPy not installed' Skipped tests in pyspark.tests.test_worker with /Users/dongjoon/.pyenv/versions/3.8.12/bin/python3: test_memory_limit (pyspark.tests.test_worker.WorkerMemoryTest) ... skipped "Memory limit feature in Python worker is dependent on Python's 'resource' module on Linux; however, not found or not on Linux." ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual. Closes #34527 from dongjoon-hyun/SPARK-37252. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 2c7f20151e99c212443a1f8762350d0a96a26440) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 November 2021, 02:37:18 UTC
8a8f4f6 [SPARK-37196][SQL] HiveDecimal enforcePrecisionScale failed return null ### What changes were proposed in this pull request? For the following case: ``` withTempDir { dir => withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") { withTable("test_precision") { val df = sql("SELECT 'dummy' AS name, 1000000000000000000010.7000000000000010 AS value") df.write.mode("Overwrite").parquet(dir.getAbsolutePath) sql( s""" |CREATE EXTERNAL TABLE test_precision(name STRING, value DECIMAL(18,6)) |STORED AS PARQUET LOCATION '${dir.getAbsolutePath}' |""".stripMargin) checkAnswer(sql("SELECT * FROM test_precision"), Row("dummy", null)) } } } ``` the data is written with the DataFrame schema ``` root |-- name: string (nullable = false) |-- value: decimal(38,16) (nullable = false) ``` but the table is created with the schema ``` root |-- name: string (nullable = false) |-- value: decimal(18,6) (nullable = false) ``` This causes enforcePrecisionScale to return `null`: ``` public HiveDecimal getPrimitiveJavaObject(Object o) { return o == null ? null : this.enforcePrecisionScale(((HiveDecimalWritable)o).getHiveDecimal()); } ``` and an NPE is then thrown when calling `toCatalystDecimal`. We should check whether the return value is `null` to avoid throwing the NPE. ### Why are the changes needed? Fix a bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #34519 from AngersZhuuuu/SPARK-37196. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a4f8ffbbfb0158a03ff52f1ed0dde75241c3a90e) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 November 2021, 20:08:41 UTC
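A hedged Scala sketch of the null-guard idea described above; the helper below is illustrative only, the actual fix lives in Spark's Hive decimal handling:

```scala
// Illustrative sketch, not Spark's actual code path.
// enforcePrecisionScale returns null when a value does not fit the declared DECIMAL(18,6),
// so the conversion must tolerate null instead of dereferencing it and hitting an NPE.
def toCatalystDecimalSafely(enforced: java.math.BigDecimal): BigDecimal =
  if (enforced == null) null else BigDecimal(enforced)
```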
90b7ee0 [SPARK-37238][BUILD][3.2] Upgrade ORC to 1.6.12 ### What changes were proposed in this pull request? This PR aims to upgrade Apache ORC from 1.6.11 to 1.6.12 for Apache Spark 3.2.1. ### Why are the changes needed? Apache ORC 1.6.12 is a new maintenance version including the following bug fixes (https://orc.apache.org/news/2021/11/07/ORC-1.6.12/) - ORC-1008 Overflow detection code is incorrect in IntegerColumnStatisticsImpl - ORC-1024 BloomFilter hash computation is inconsistent between Java and C++ clients - ORC-1029 ServiceLoader is not thread-safe, avoid concurrent calls ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #34512 from dongjoon-hyun/SPARK-37238. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 08 November 2021, 07:05:31 UTC
e55bab5 [SPARK-37214][SQL] Fail query analysis earlier with invalid identifiers This is a followup of #31427, which introduced two issues: 1. When we look up `spark_catalog.t`, we used to fail early with `The namespace in session catalog must have exactly one name part` before that PR; now we fail very late in `CheckAnalysis` with `NoSuchTableException`. 2. The error message is now a bit confusing: we report `Table t not found` even if table `t` exists. This PR fixes both issues. It saves analysis time and improves the error message. No user-facing change beyond the error message; covered by updated tests. Closes #34490 from cloud-fan/table. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 8ab9d6327d7db20a4257f9fe6d0b17919576be5e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 08 November 2021, 05:35:42 UTC
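For context, a hedged spark-shell sketch of the identifier forms involved; it assumes a `spark` session and a table `t` in the `default` database, and the exact error text after this change may differ:

```scala
// Fully qualified, valid reference: catalog.database.table
spark.sql("CREATE TABLE IF NOT EXISTS t(id INT) USING parquet")
spark.sql("SELECT * FROM spark_catalog.default.t").show()

// Invalid reference: only one name part after the session catalog. Per the commit above,
// this now fails during analysis with a clearer message instead of surfacing late as a
// NoSuchTableException in CheckAnalysis.
// spark.sql("SELECT * FROM spark_catalog.t").show()
```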
6ecdde1 Revert "[SPARK-36998][CORE] Handle concurrent eviction of same application in SHS" This reverts commit 248e07b49187bc7082e6cb2b0d9daa4b48ffe3cb. 08 November 2021, 01:12:51 UTC
248e07b [SPARK-36998][CORE] Handle concurrent eviction of same application in SHS ### What changes were proposed in this pull request? Gracefully handle the error thrown when multiple threads make room for parsing different applications and try to evict the same application by deleting its directory path. Also, added a test for `deleteStore` in `HistoryServerDiskManagerSuite`. ### Why are the changes needed? Otherwise, a NoSuchFileException is thrown when the directory path no longer exists. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Added a unit test for `deleteStore` but not specifically testing the concurrency fix. Closes #34276 from thejdeep/SPARK-36998. Authored-by: Thejdeep Gudivada <tgudivada@linkedin.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 39ad0d782cd708462d9dbd870fd23d7bb7c091a3) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 06 November 2021, 00:33:00 UTC
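A minimal Scala sketch of the "tolerate a concurrent eviction" idea; this is illustrative only and not the actual `HistoryServerDiskManager` code:

```scala
import java.nio.file.{Files, NoSuchFileException, Path}

// If another thread already evicted (deleted) the same application's store path,
// the NoSuchFileException is expected and safe to ignore.
def deleteStoreQuietly(path: Path): Unit = {
  try {
    Files.delete(path)
  } catch {
    case _: NoSuchFileException => // already removed by a concurrent eviction; nothing to do
  }
}
```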
035b700 [SPARK-37218][SQL][TEST] Parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark ### What changes were proposed in this pull request? This PR aims to parameterize `spark.sql.shuffle.partitions` in TPCDSQueryBenchmark. ### Why are the changes needed? This helps a single machine benchmark with many cores. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually. ``` bin/spark-submit \ --driver-memory 12g --driver-java-options '-Dspark.sql.test.master=local[10]' \ -c spark.serializer=org.apache.spark.serializer.KryoSerializer -c spark.sql.shuffle.partitions=10 \ --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark \ --jars core/target/scala-2.12/spark-core_2.12-3.3.0-SNAPSHOT-tests.jar,sql/catalyst/target/scala-2.12/spark-catalyst_2.12-3.3.0-SNAPSHOT-tests.jar \ sql/core/target/scala-2.12/spark-sql_2.12-3.3.0-SNAPSHOT-tests.jar \ --data-location ~/data/10g-parquet-gzip --query-filter q72 ``` Closes #34496 from dongjoon-hyun/SPARK-37218. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Chao Sun <sunchao@apple.com> (cherry picked from commit 128d2c91c05253621563bc41af93b5449e395eb7) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 05 November 2021, 22:51:04 UTC
966c90c [SPARK-37208][YARN] Support mapping Spark gpu/fpga resource types to custom YARN resource types ### What changes were proposed in this pull request? Add configs to allow mapping the Spark gpu/fpga resource type to a custom YARN resource type. Currently Spark hardcodes the mapping of resource "gpu" to "yarn.io/gpu" and "fpga" to "yarn.io/fpga". This PR just allows the user to specify the YARN-side ("yarn.io/*") resource name. Note it would be nice to put this in 3.2.1 as well, let me know if any objections. ### Why are the changes needed? YARN supports custom resource types, and Hadoop 3.3.1 made it easier for users to plug in custom resource types. This means users may create a custom resource type that represents a GPU or FPGA because they want additional logic that the YARN built-in versions don't have. Ideally Spark users still just use the generic "gpu" or "fpga" types in Spark, so this adds that ability and Spark end users don't need to know about changes to YARN resource types. ### Does this PR introduce _any_ user-facing change? Configs added. ### How was this patch tested? Tested manually with Hadoop 3.3.1 plugin https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/DevelopYourOwnDevicePlugin.html and unit tests added here. Closes #34485 from tgravescs/migyarnlatest. Lead-authored-by: Thomas Graves <tgraves@nvidia.com> Co-authored-by: Thomas Graves <tgraves@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 8e6f636260a1cee242e0953973b695e1a717929c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 05 November 2021, 21:19:24 UTC
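A hedged sketch of how such a mapping could be configured; the `spark.yarn.resource*DeviceName` key below is an assumption for illustration, so check the Spark-on-YARN docs for the exact config names this change introduces:

```scala
import org.apache.spark.SparkConf

// The resource request keeps using the generic "gpu" type on the Spark side.
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "1")
  // Assumed config key (illustrative only): map Spark's "gpu" to a custom YARN
  // resource type instead of the hardcoded "yarn.io/gpu".
  .set("spark.yarn.resourceGpuDeviceName", "company.com/gpu")
```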
11c4745 [SPARK-37203][SQL] Fix NotSerializableException when observe with TypedImperativeAggregate Currently, ``` val namedObservation = Observation("named") val df = spark.range(100) val observed_df = df.observe( namedObservation, percentile_approx($"id", lit(0.5), lit(100)).as("percentile_approx_val")) observed_df.collect() namedObservation.get ``` throws exception as follows: ``` 15:16:27.994 ERROR org.apache.spark.util.Utils: Exception encountered java.io.NotSerializableException: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2(TaskResult.scala:55) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$2$adapted(TaskResult.scala:55) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:55) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1434) at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:51) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:616) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` This PR will fix the issue. 
After the change, `assert(namedObservation.get === Map("percentile_approx_val" -> 49))` passes and the `java.io.NotSerializableException` no longer occurs. This fixes the `NotSerializableException` when using observe with a `TypedImperativeAggregate`. No user-facing change, although the PR changes the implementation of `AggregatingAccumulator`, which now uses the serialize and deserialize methods of `TypedImperativeAggregate`. Tested with new tests. Closes #34474 from beliefer/SPARK-37203. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3f3201a7882b817a8a3ecbfeb369dde01e7689d8) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 November 2021, 08:28:05 UTC
8804ed9 [MINOR][SQL][DOCS] Document JDBC aggregate push down is for DSV2 only ### What changes were proposed in this pull request? Document that JDBC aggregate push down applies to DS V2 only. This change is for both 3.2 and master. ### Why are the changes needed? To make the doc clear so users won't expect aggregate push down with DS V1. ### Does this PR introduce _any_ user-facing change? No. Doc change only. ### How was this patch tested? Manually checked. Closes #34465 from huaxingao/doc_minor. Authored-by: Huaxin Gao <huaxin_gao@apple.com> Signed-off-by: Huaxin Gao <huaxin_gao@apple.com> (cherry picked from commit 44151be395852a5ef7e5784607f4e69da5e61569) Signed-off-by: Huaxin Gao <huaxin_gao@apple.com> 03 November 2021, 21:34:49 UTC
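A hedged Scala sketch contrasting the two paths, to be run in spark-shell; the JDBC catalog class and the `pushDownAggregate` option reflect my understanding of Spark 3.2 and should be verified against the docs, while the URL and table names are placeholders:

```scala
// DS V1 path: format("jdbc") — aggregate push down does not apply here.
val v1 = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")   // placeholder URL
  .option("dbtable", "sales")                        // placeholder table
  .load()

// DS V2 path: register a JDBC catalog and query through it; this is where aggregate
// push down (e.g. the pushDownAggregate option) can take effect.
spark.conf.set("spark.sql.catalog.pg",
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.pg.url", "jdbc:postgresql://host:5432/db")
spark.conf.set("spark.sql.catalog.pg.pushDownAggregate", "true")
val v2 = spark.sql("SELECT max(amount) FROM pg.public.sales")
```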
c2f147e [MINOR][PYTHON][DOCS] Fix broken link in legacy Apache Arrow in PySpark page ### What changes were proposed in this pull request? This PR proposes to fix the broken link in the legacy page. Currently it links to: ![Screen Shot 2021-11-03 at 6 34 32 PM](https://user-images.githubusercontent.com/6477701/140037221-b7963e47-12f5-49f3-8290-8560c99c62c2.png) ![Screen Shot 2021-11-03 at 6 34 30 PM](https://user-images.githubusercontent.com/6477701/140037225-c21070fc-907f-41bb-a421-747810ae5b0d.png) It should link to: ![Screen Shot 2021-11-03 at 6 34 35 PM](https://user-images.githubusercontent.com/6477701/140037246-dd14760f-5487-4b8b-b3f6-e9495f1d4ec9.png) ![Screen Shot 2021-11-03 at 6 34 38 PM](https://user-images.githubusercontent.com/6477701/140037251-3f97e992-6660-4ce9-9c66-77855d3c0a64.png) ### Why are the changes needed? For users to easily navigate from the legacy page to the newer page. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug in documentation. ### How was this patch tested? Manually built the site and checked the link. Closes #34475 from HyukjinKwon/minor-doc-fix-py. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit ab7e5030b23ccb8ef6aa43645e909457f9d68ffa) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 03 November 2021, 11:55:20 UTC
a63d2d2 [MINOR][DOCS] Corrected spacing in structured streaming programming ### What changes were proposed in this pull request? There is no space between `with` and `<code>` as shown below: `... configured with<code>spark.sql.streaming.fileSource.cleaner.numThreads</code> ...` ### Why are the changes needed? Added the missing space. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Only documentation was changed and no code was changed. Closes #34458 from mans2singh/structured_streaming_programming_guide_space. Authored-by: mans2singh <mans2singh@yahoo.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 675071a38e47dc2c55cf4f71de7ad0bebc1b4f2b) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 02 November 2021, 02:02:06 UTC
9e614e2 [SPARK-37170][PYTHON][DOCS] Pin PySpark version installed in the Binder environment for tagged commit ### What changes were proposed in this pull request? This PR proposes to pin the version of PySpark to be installed in the live notebook environment for tagged commits. ### Why are the changes needed? I noticed that PySpark `3.1.2` is installed in the live notebook environment even though the notebook is for PySpark `3.2.0`. http://spark.apache.org/docs/3.2.0/api/python/getting_started/index.html I guess someone accessed Binder and built the container image with `v3.2.0` before we published the `pyspark` package to PyPI. https://mybinder.org/ I think it's difficult to rebuild the image manually. To avoid such an accident, I propose this change. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Confirmed that if a commit is tagged, we can avoid building the container image with an unexpected version of `pyspark` in Binder. ``` ... Downloading plotly-5.3.1-py2.py3-none-any.whl (23.9 MB) ERROR: Could not find a version that satisfies the requirement pyspark[ml,mllib,pandas_on_spark,sql]==3.3.0.dev0 (from versions: 2.1.2, 2.1.3, 2.2.0.post0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.1, 3.1.2, 3.2.0) ERROR: No matching distribution found for pyspark[ml,mllib,pandas_on_spark,sql]==3.3.0.dev0 Removing intermediate container de55eed5966e The command '/bin/sh -c ./binder/postBuild' returned a non-zero code: 1Built image, launching... 
Failed to connect to event stream ``` If a commit is not tagged, an old version of `pyspark` can be installed if the exact specified version is not published to PyPI. https://hub.gke2.mybinder.org/user/sarutak-spark-ky222nbf/notebooks/python/docs/source/getting_started/quickstart_df.ipynb Closes #34449 from sarutak/pin-pyspark-version-binder. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 72a9f9c5847878b389aa09de82786d093684bf96) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 01 November 2021, 01:03:51 UTC
1c52446 [MINOR][SQL][DOCS] Correct typo in sql migration guide ### What changes were proposed in this pull request? Just correct a minor typo in the migration guide. Change from `Double.Nan` to `Double.NaN` ### Why are the changes needed? The correct `NaN` representation for `Double` is `Double.NaN` ### Does this PR introduce _any_ user-facing change? Yes, it corrects a typo in the migration guide. ### How was this patch tested? Docs change only. Closes #34452 from dieggoluis/spark-3-migration-guide-docs. Authored-by: Diego Luis <diego.luis@nubank.com.br> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit a4382fe8c87dd909833a01fbd1a02b681886157a) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> 01 November 2021, 01:01:45 UTC
67abd7a [MINOR][SS] Remove unused config "pauseBackgroundWorkForCommit" from RocksDB ### What changes were proposed in this pull request? This PR proposes to remove the unused config "pauseBackgroundWorkForCommit". ### Why are the changes needed? The config is unused, and it would effectively have to be always "true" even if it were used. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs Closes #34417 from HeartSaVioR/MINOR-remove-unused-config-in-rocksdb. Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> (cherry picked from commit 00d9ed5dd278a10a2dff8f420fafab516f85c189) Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com> 30 October 2021, 22:49:59 UTC