https://github.com/apache/spark

3fdfce3 Preparing Spark release v3.0.0-rc3 06 June 2020, 02:57:35 UTC
fa608b9 [SPARK-31904][SQL] Fix case sensitive problem of char and varchar partition columns ### What changes were proposed in this pull request? ```sql CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet; CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1; SELECT * FROM t2 WHERE b = 'A'; ``` Above SQL throws MetaException > Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:810) ... 114 more Caused by: MetaException(message:Filtering is supported only on partition keys of type string, or integral types) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:184) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:439) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:356) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:278) at org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:583) at org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:3315) at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2768) at org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:182) at org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3248) at org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3232) at org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2974) at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:3250) at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2906) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101) at com.sun.proxy.$Proxy25.getPartitionsByFilter(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:5093) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) at com.sun.proxy.$Proxy26.get_partitions_by_filter(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1232) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) at com.sun.proxy.$Proxy27.listPartitionsByFilter(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2679) ... 119 more ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add a unit test. Closes #28724 from LantaoJin/SPARK-31904. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 5079831106ba22f81a26b4a8104a253422fa1b6a) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 05 June 2020, 22:35:44 UTC
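As a quick illustration of the scenario fixed above, the following sketch (assuming a Hive-enabled SparkSession; the tables are the ones from the repro in the commit message) should now resolve the lower-case partition filter instead of hitting the MetaException:

```scala
// Minimal repro sketch, assuming a Hive-enabled SparkSession.
spark.sql("CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet")
spark.sql("CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1")
// Filtering on the partition column by a differently-cased name used to fail with
// "Filtering is supported only on partition keys of type string, or integral types".
spark.sql("SELECT * FROM t2 WHERE b = 'A'").show()
```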
1fc09e3 [SPARK-31867][SQL][FOLLOWUP] Check result differences for datetime formatting ### What changes were proposed in this pull request? In this PR, we throw `SparkUpgradeException` when getting `DateTimeException` for datetime formatting in the `EXCEPTION` legacy Time Parser Policy. ### Why are the changes needed? `DateTimeException` is also declared by `java.time.format.DateTimeFormatter#format`, but in Spark it can barely occur. So far we have only suspected one case, due to a JDK bug; see https://bugs.openjdk.java.net/browse/JDK-8079628. For the `from_unixtime` function, we currently suppress the `DateTimeException` caused by `DD` and return `NULL`, which is a silent date change that should be avoided on Java 8. ### Does this PR introduce _any_ user-facing change? Yes, when running on Java 8 and using the `from_unixtime` function with pattern `DD` to format datetimes, if the day-of-year is >= 100, a `SparkUpgradeException` will alert users instead of silently returning null. For `date_format`, `SparkUpgradeException` takes the place of `DateTimeException`. ### How was this patch tested? Add unit tests. Closes #28736 from yaooqinn/SPARK-31867-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fc6af9d900ec6f6a1cbe8f987857a69e6ef600d1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 June 2020, 16:44:29 UTC
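A hedged sketch of the user-facing change described above, assuming Java 8 (where JDK-8079628 applies) and the default EXCEPTION parser policy:

```scala
// Formatting with the 'DD' pattern when the day-of-year has three digits previously
// produced NULL silently; after this change it is expected to raise SparkUpgradeException.
spark.sql("SELECT from_unixtime(1589479788, 'DD')").show() // a mid-May 2020 timestamp, day-of-year > 100
```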
8910dae [SPARK-31859][SPARK-31861][FOLLOWUP] Fix typo in tests ### What changes were proposed in this pull request? It appears I have unintentionally used nested JDBC statements in the two tests I added. ### Why are the changes needed? Cleanup a typo. Please merge to master/branch-3.0 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #28735 from juliuszsompolski/SPARK-31859-fixup. Authored-by: Juliusz Sompolski <julek@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ea010138e95cc9c0f62603d431b906bd5104f4e3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 June 2020, 10:01:16 UTC
4ee13b1 [SPARK-31879][SQL][TEST-JAVA11] Make week-based pattern invalid for formatting too After all these attempts (https://github.com/apache/spark/pull/28692, https://github.com/apache/spark/pull/28719 and https://github.com/apache/spark/pull/28727), they all have limitations, as mentioned in their discussions. Maybe the only way is to forbid them all. These week-based fields need a Locale to express their semantics; the first day of the week varies from country to country. From the Java doc of WeekFields ```java /** * Gets the first day-of-week. * <p> * The first day-of-week varies by culture. * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday. * This method returns the first day using the standard {code DayOfWeek} enum. * * return the first day-of-week, not null */ public DayOfWeek getFirstDayOfWeek() { return firstDayOfWeek; } ``` But for the SimpleDateFormat, the day-of-week is not localized ``` u Day number of week (1 = Monday, ..., 7 = Sunday) Number 1 ``` Currently, the default locale we use is the US, so the result can move a day, a week, or a year backward. E.g. for the date `2019-12-29` (Sunday): in a Sunday-start system (e.g. en-US) it belongs to week-based-year 2020, while in a Monday-start system (en-GB) it goes to 2019. The week-of-week-based-year (w) is affected too. ```sql spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US')); 2020 spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB')); 2019 spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US')); 2020-01-01 spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB')); 2019-52-07 spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US')); 2020-02-01 spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB')); 2020-01-07 ``` For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671) With this change, users cannot use 'Y'/'w'/'u'/'W', but can use 'e' instead of 'u'. This at least turns the silent data change into an explicit error. Add unit tests. Closes #28728 from yaooqinn/SPARK-31879-NEW2. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9d5b5d0a5849ac329bbae26d9884d8843d8a8571) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 June 2020, 08:30:41 UTC
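A small hedged sketch of the replacement the commit suggests, using the localized day-of-week text letter instead of the now-forbidden week-based letters:

```scala
// Day-of-week can still be formatted with 'E'; week-based letters are rejected.
spark.sql("SELECT date_format(date'2019-12-29', 'yyyy-MM-dd E')").show() // e.g. 2019-12-29 Sun
// spark.sql("SELECT date_format(date'2019-12-29', 'YYYY-ww-uu')")       // expected to fail now
```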
706dbec [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks ### What changes were proposed in this pull request? This PR proposes to add one more newline to clearly separate JVM and Python tracebacks: Before: ``` Traceback (most recent call last): ... pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.; JVM stacktrace: org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.; ... ``` After: ``` Traceback (most recent call last): ... pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.; JVM stacktrace: org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.; ... ``` This is kind of a followup of https://github.com/apache/spark/commit/e69466056fb2c121b7bbb6ad082f09deb1c41063 (SPARK-31849). ### Why are the changes needed? To make it easier to read. ### Does this PR introduce _any_ user-facing change? It's in the unreleased branches. ### How was this patch tested? Manually tested. Closes #28732 from HyukjinKwon/python-minor. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 53ce58da3419223ce2666ba4d77aa4f2db82bdb5) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 05 June 2020, 04:31:53 UTC
bbd9373 [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI ### What changes were proposed in this pull request? In `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in the separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action and Query UI shows the plan graph without metrics. We should call `serveToStream` first, then `withAction` in it. ### Why are the changes needed? When calling toPandas, usually Query UI shows each plan node's metric and corresponding Stage ID and Task ID: ```py >>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z']) >>> df.toPandas() x y z 0 1 10 abc 1 2 20 def ``` ![Screen Shot 2020-06-03 at 4 47 07 PM](https://user-images.githubusercontent.com/506656/83815735-bec22380-a675-11ea-8ecc-bf2954731f35.png) but if Arrow execution is enabled, it shows only plan nodes and the duration is not correct: ```py >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True) >>> df.toPandas() x y z 0 1 10 abc 1 2 20 def ``` ![Screen Shot 2020-06-03 at 4 47 27 PM](https://user-images.githubusercontent.com/506656/83815804-de594c00-a675-11ea-933a-d0ffc0f534dd.png) ### Does this PR introduce _any_ user-facing change? Yes, the Query UI will show the plan with the correct metrics. ### How was this patch tested? I checked it manually in my local. ![Screen Shot 2020-06-04 at 3 19 41 PM](https://user-images.githubusercontent.com/506656/83816265-d77f0900-a676-11ea-84b8-2a8d80428bc6.png) Closes #28730 from ueshin/issues/SPARK-31903/to_pandas_with_arrow_query_ui. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 632b5bce23c94d25712b43be83252b34ebfd3e72) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 05 June 2020, 03:54:19 UTC
3bb0824 [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide ### What changes were proposed in this pull request? The Pyspark Migration Guide needs to mention a breaking change of the Pyspark ML API. ### Why are the changes needed? In SPARK-29093, all setters have been removed from `Params` mixins in `pyspark.ml.param.shared`. Those setters had been part of the public pyspark ML API, hence this is a breaking change. ### Does this PR introduce _any_ user-facing change? Only documentation. ### How was this patch tested? Visually. Closes #28663 from EnricoMi/branch-pyspark-migration-guide-setters. Authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 4bbe3c2bb49030256d7e4f6941dd5629ee6d5b66) Signed-off-by: Sean Owen <srowen@gmail.com> 03 June 2020, 23:07:49 UTC
22d5d03 [SPARK-29947][SQL][FOLLOWUP] ResolveRelations should return relations with fresh attribute IDs ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/26589, which caches the table relations to speed up the table lookup. However, it brings some side effects: the rule `ResolveRelations` may return exactly the same relations, while before it always returned relations with fresh attribute IDs. This PR is to eliminate this side effect. ### Why are the changes needed? There is no bug report yet, but this side effect may impact things like self-join. It's better to restore the 2.4 behavior and always return fresh relations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #28717 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit dc0709fa0ca75751d2de4ef95ad077f3e805a6ac) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 June 2020, 19:10:27 UTC
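A hedged sketch of why fresh attribute IDs matter for self-joins (the table name is illustrative, not from the commit):

```scala
spark.range(5).write.saveAsTable("events")
val left  = spark.table("events")
val right = spark.table("events")
// If both lookups returned relations with identical attribute IDs, the condition below
// would degenerate into a comparison of a column with itself; fresh IDs keep the two
// sides distinguishable, as in 2.4.
left.join(right, left("id") === right("id")).show()
```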
08198fa Revert "[SPARK-31879][SQL] Using GB as default Locale for datetime formatters" This reverts commit 1b7ae62bf443f20c94daa6ea5cabbd18f96b7919. 03 June 2020, 18:01:49 UTC
47edae9 [SPARK-31896][SQL] Handle am-pm timestamp parsing when hour is missing ### What changes were proposed in this pull request? This PR sets the hour to 12/0 when the AMPM_OF_DAY field exists. ### Why are the changes needed? When the hour is absent but the am-pm is present, the time is incorrect for PM. ### Does this PR introduce _any_ user-facing change? Yes, the change is user-facing, but it restores the 2.4 behavior to keep backward compatibility, e.g. ```sql spark-sql> select to_timestamp('33:33 PM', 'mm:ss a'); 1970-01-01 12:33:33 spark-sql> select to_timestamp('33:33 AM', 'mm:ss a'); 1970-01-01 00:33:33 ``` Without the fix, the results are both `1970-01-01 00:33:33`. ### How was this patch tested? Add unit tests. Closes #28713 from yaooqinn/SPARK-31896. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit afcc14c6d27f9e0bd113e0d86b64dc6fa4eed551) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 June 2020, 13:30:32 UTC
7cdd084 fix compilation 03 June 2020, 12:12:12 UTC
85a4891 [SPARK-31878][SQL] Create date formatter only once in `HiveResult` ### What changes were proposed in this pull request? 1. Replace `def dateFormatter` to `val dateFormatter`. 2. Modify the `date formatting in hive result` test in `HiveResultSuite` to check modified code on various time zones. ### Why are the changes needed? To avoid creation of `DateFormatter` per every incoming date in `HiveResult.toHiveString`. This should eliminate unnecessary creation of `SimpleDateFormat` instances and compilation of the default pattern `yyyy-MM-dd`. The changes can speed up processing of legacy date values of the `java.sql.Date` type which is collected by default. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modified a test in `HiveResultSuite`. Closes #28687 from MaxGekk/HiveResult-val-dateFormatter. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 125a89ce0880ab4c53dcb1b879f1d65ff0f589df) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 June 2020, 12:00:30 UTC
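An illustrative sketch of the `def`-to-`val` change (not the actual `HiveResult` code): a `def` rebuilds the formatter, and recompiles its pattern, on every call, while a `val` does it once and reuses it for every row.

```scala
import java.time.format.DateTimeFormatter

def dateFormatterPerCall = DateTimeFormatter.ofPattern("yyyy-MM-dd") // new instance per date
val dateFormatterOnce    = DateTimeFormatter.ofPattern("yyyy-MM-dd") // built once, reused
```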
4e9eafd [SPARK-31881][K8S][TESTS][FOLLOWUP] Activate hadoop-2.7 by default in K8S IT ### What changes were proposed in this pull request? This PR aims to activate `hadoop-2.7` profile by default in Kubernetes IT module. ### Why are the changes needed? While SPARK-31881 added Hadoop 3.2 support, one default test dependency was moved to `hadoop-2.7` profile. It works when we give one of `hadoop-2.7` and `hadoop-3.2`, but it fails when we don't give any profile. **BEFORE** ``` $ mvn test-compile -pl resource-managers/kubernetes/integration-tests -Pkubernetes-integration-tests ... [ERROR] [Error] /APACHE/spark-merge/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:23: object amazonaws is not a member of package com ``` **AFTER** ``` $ mvn test-compile -pl resource-managers/kubernetes/integration-tests -Pkubernetes-integration-tests .. [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ ``` The default activated profile will be override when we give `hadoop-3.2`. ``` $ mvn help:active-profiles -Pkubernetes-integration-tests ... Active Profiles for Project 'org.apache.spark:spark-kubernetes-integration-tests_2.12:jar:3.1.0-SNAPSHOT': The following profiles are active: - hadoop-2.7 (source: org.apache.spark:spark-kubernetes-integration-tests_2.12:3.1.0-SNAPSHOT) - kubernetes-integration-tests (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT) - test-java-home (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT) ``` ``` $ mvn help:active-profiles -Pkubernetes-integration-tests -Phadoop-3.2 ... Active Profiles for Project 'org.apache.spark:spark-kubernetes-integration-tests_2.12:jar:3.1.0-SNAPSHOT': The following profiles are active: - hadoop-3.2 (source: org.apache.spark:spark-kubernetes-integration-tests_2.12:3.1.0-SNAPSHOT) - hadoop-3.2 (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT) - kubernetes-integration-tests (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT) - test-java-home (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins UT and IT. Currently, all Jenkins build and tests (UT & IT) passes without this patch. This should be tested manually with the above command. `hadoop-3.2` K8s IT also passed like the following. ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - PVs with local storage - Launcher client dependencies - Test basic decommissioning Run completed in 8 minutes, 33 seconds. Total number of tests run: 19 Suites: completed 2, aborted 0 Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #28716 from dongjoon-hyun/SPARK-31881-2. 
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit e5b9b862e6011f37dfc0f646d6c3ae8e545e2cd6) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 03 June 2020, 09:17:40 UTC
f6a77c3 [SPARK-31886][WEBUI] Fix the wrong coloring of nodes in DAG-viz ### What changes were proposed in this pull request? This PR fixes a wrong coloring issue in the DAG-viz. In the Job Page and Stage Page, nodes which are associated with "barrier mode" in the DAG-viz will be colored pale green. But, with some types of jobs, nodes which are not associated with the mode will also be colored. You can reproduce this with the following operation. ``` sc.parallelize(1 to 10).barrier.mapPartitions(identity).repartition(1).collect() ``` <img width="376" alt="wrong-coloring" src="https://user-images.githubusercontent.com/4736016/83403670-1711df00-a444-11ea-9457-c683f75bc566.png"> In the screenshot above, `repartition` in `Stage 1` is not associated with barrier mode, so the corresponding node should not be colored pale green. The cause of this issue is that the logic which chooses HTML elements to be colored is wrong. The logic chooses such elements based on whether each element is associated with a style class (`clusterId` in the code). But when an operation crosses over a shuffle (like `repartition` above), a `clusterId` can be duplicated and a non-barrier-mode node is also associated with the same `clusterId`. ### Why are the changes needed? This is a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Newly added test case with the following command. ``` build/sbt -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite -- -z SPARK-31886" ``` Closes #28694 from sarutak/fix-wrong-barrier-color. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> (cherry picked from commit 8ed93c9355bc2af6fe456d88aa693c8db69d0bbf) Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 03 June 2020, 08:16:02 UTC
5a271c9 [SPARK-31892][SQL] Disable week-based date fields for parsing This PR disables week-based date fields for parsing (closes #28674). 1. It's an un-fixable behavior change to fill the gap between SimpleDateFormat and DateTimeFormatter and keep backward-compatibility for different JDKs. A lot of effort has been made to prove it at https://github.com/apache/spark/pull/28674 2. The existing behavior itself in 2.4 is confusing, e.g. ```sql spark-sql> select to_timestamp('1', 'w'); 1969-12-28 00:00:00 spark-sql> select to_timestamp('1', 'u'); 1970-01-05 00:00:00 ``` The 'u' here seems not to go to the Monday of the first week in week-based form or the first day of the year in non-week-based form, but to go to the Monday of the second week in week-based form. And, e.g. ```sql spark-sql> select to_timestamp('2020 2020', 'YYYY yyyy'); 2020-01-01 00:00:00 spark-sql> select to_timestamp('2020 2020', 'yyyy YYYY'); 2019-12-29 00:00:00 spark-sql> select to_timestamp('2020 2020 1', 'YYYY yyyy w'); NULL spark-sql> select to_timestamp('2020 2020 1', 'yyyy YYYY w'); 2019-12-29 00:00:00 ``` I think we don't need to introduce all the weird behavior from Java. 3. The current test coverage for week-based date fields is almost 0%, which indicates that we've never imagined anyone using it. 4. Avoiding JDK bugs https://issues.apache.org/jira/browse/SPARK-31880 Yes, the 'Y/W/w/u/F/E' patterns cannot be used in datetime parsing functions. More tests added. Closes #28706 from yaooqinn/SPARK-31892. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit afe95bd9ad7a07c49deecf05f0a1000bb8f80caa) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 June 2020, 06:51:30 UTC
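For code that still depends on week-based parsing, the legacy parser policy remains available as an escape hatch; a hedged sketch:

```scala
// Fall back to SimpleDateFormat semantics, where 'Y' and 'w' are still legal for parsing.
spark.sql("SET spark.sql.legacy.timeParserPolicy = LEGACY")
spark.sql("SELECT to_timestamp('2020 1', 'YYYY w')").show()
```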
1b7ae62 [SPARK-31879][SQL] Using GB as default Locale for datetime formatters This PR switches the default Locale from the `US` to `GB` to change the behavior of the first day of the week from Sunday-started to Monday-started as same as v2.4 ```sql spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u'); 2019-12-29 00:00:00 spark-sql> set spark.sql.legacy.timeParserPolicy=legacy; spark.sql.legacy.timeParserPolicy legacy spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u'); 2019-12-30 00:00:00 ``` These week-based fields need Locale to express their semantics, the first day of the week varies from country to country. From the Java doc of WeekFields ```java /** * Gets the first day-of-week. * <p> * The first day-of-week varies by culture. * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday. * This method returns the first day using the standard {code DayOfWeek} enum. * * return the first day-of-week, not null */ public DayOfWeek getFirstDayOfWeek() { return firstDayOfWeek; } ``` But for the SimpleDateFormat, the day-of-week is not localized ``` u Day number of week (1 = Monday, ..., 7 = Sunday) Number 1 ``` Currently, the default locale we use is the US, so the result moved a day backward. For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671) With this change, it restores the first day of week calculating for functions when using the default locale. Yes, but the behavior change is used to restore the old one of v2.4 add unit tests Closes #28692 from yaooqinn/SPARK-31879. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c59f51bcc207725b8cbc4201df9367f874f5915c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 June 2020, 06:22:07 UTC
8cea968 [SPARK-31895][PYTHON][SQL] Support DataFrame.explain(extended: str) case to be consistent with Scala side ### What changes were proposed in this pull request? Scala: ```scala scala> spark.range(10).explain("cost") ``` ``` == Optimized Logical Plan == Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B) == Physical Plan == *(1) Range (0, 10, step=1, splits=12) ``` PySpark: ```python >>> spark.range(10).explain("cost") ``` ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/dataframe.py", line 333, in explain raise TypeError(err_msg) TypeError: extended (optional) should be provided as bool, got <class 'str'> ``` In addition, it is consistent with other codes too, for example, `DataFrame.sample` also can support `DataFrame.sample(1.0)` and `DataFrame.sample(False)`. ### Why are the changes needed? To provide the consistent API support across APIs. ### Does this PR introduce _any_ user-facing change? Nope, it's only changes in unreleased branches. If this lands to master only, yes, users will be able to set `mode` as `df.explain("...")` in Spark 3.1. After this PR: ```python >>> spark.range(10).explain("cost") ``` ``` == Optimized Logical Plan == Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B) == Physical Plan == *(1) Range (0, 10, step=1, splits=12) ``` ### How was this patch tested? Unittest was added and manually tested as well to make sure: ```python spark.range(10).explain(True) spark.range(10).explain(False) spark.range(10).explain("cost") spark.range(10).explain(extended="cost") spark.range(10).explain(mode="cost") spark.range(10).explain() spark.range(10).explain(True, "cost") spark.range(10).explain(1.0) ``` Closes #28711 from HyukjinKwon/SPARK-31895. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit e1d52011401c1989f26b230eb8c82adc63e147e7) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 June 2020, 03:07:23 UTC
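For reference, the Scala side already accepts the same string modes; a brief usage sketch:

```scala
val df = spark.range(10)
df.explain("simple")     // equivalent to df.explain()
df.explain("cost")       // includes statistics such as sizeInBytes
df.explain("formatted")  // physical plan outline plus per-node details
```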
a28f6c6 [SPARK-31882][WEBUI] DAG-viz is not rendered correctly with pagination ### What changes were proposed in this pull request? This PR fix an issue related to DAG-viz. Because DAG-viz for a job fetches link urls for each stage from the stage table, rendering can fail with pagination. You can reproduce this issue with the following operation. ``` sc.parallelize(1 to 10).map(value => (value ,value)).repartition(1).repartition(1).repartition(1).reduceByKey(_ + _).collect ``` And then, visit the corresponding job page. There are 5 stages so show <5 stages in the paged table. <img width="1440" alt="dag-rendering-issue1" src="https://user-images.githubusercontent.com/4736016/83376286-c29f3d00-a40c-11ea-891b-eb8f42afbb27.png"> <img width="1439" alt="dag-rendering-issue2" src="https://user-images.githubusercontent.com/4736016/83376288-c3d06a00-a40c-11ea-8bb2-38542e5010c1.png"> ### Why are the changes needed? This is a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Newly added test case with following command. `build/sbt -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite -- -z SPARK-31882"` Closes #28690 from sarutak/fix-dag-rendering-issue. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> (cherry picked from commit 271eb26c026490f7307ae888200bfc81a645592f) Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 02 June 2020, 23:45:50 UTC
32cb37c [SPARK-31756][WEBUI][3.0] Add real headless browser support for UI test ### What changes were proposed in this pull request? This PR backports the change of #28627 . Some fixes for UI like #28690 should be merged to `branch-3.0` including testcases so #28627 is need to be backported. This PR mainly adds two things. 1. Real headless browser support for UI test 2. A test suite using headless Chrome as one instance of those browsers. Also, for environment where Chrome and Chrome driver is not installed, `ChromeUITest` tag is added to filter out the test suite. By default, test suites with `ChromeUITest` is disabled. ### Why are the changes needed? In `branch-3.0`, there are two problems for UI test. 1. Lots of tests especially JavaScript related ones are done manually. Appearance is better to be confirmed by our eyes but logic should be tested by test cases ideally. 2. Compared to the real web browsers, HtmlUnit doesn't seem to support JavaScript enough. I added a JavaScript related test before for SPARK-31534 using HtmlUnit which is simple library based headless browser for test. The test I added works somehow but some JavaScript related error is shown in unit-tests.log. ``` ======= EXCEPTION START ======== Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException] com.gargoylesoftware.htmlunit.ScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2)         at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:904)         at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)         at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515)         at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:835)         at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:807)         at com.gargoylesoftware.htmlunit.InteractivePage.executeJavaScriptFunctionIfPossible(InteractivePage.java:216)         at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:52)         at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:102)         at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:426)         at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:157)         at java.lang.Thread.run(Thread.java:748) Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". 
(http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2)         at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1009)         at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800)         at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)         at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413)         at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252)         at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264)         at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:828)         at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:889)         ... 10 more JavaScriptException value = Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". == CALLING JAVASCRIPT ==   function () {       throw e;   } ======= EXCEPTION END ======== ``` I tried to upgrade HtmlUnit to 2.40.0 but what is worse, the test become not working even though it works on real browsers like Chrome, Safari and Firefox without error. ``` [info] UISeleniumSuite: [info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (17 seconds, 745 milliseconds) [info]   The code passed to eventually never returned normally. Attempted 2 times over 12.910785232 seconds. Last failure message: com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (http://192.168.1.209:62132/static/vis-timeline-graph2d.min.js#52(Function)#1) ``` To resolve those problems, it's better to support headless browser for UI test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I tested with following patterns. Both Chrome and Chrome driver should be installed to test. 1. sbt / with default excluded tags (ChromeUISeleniumSuite is expected to be skipped and SQLQueryTestSuite is expected to succeed) `build/sbt -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite org.apache.spark.sql.SQLQueryTestSuite"` 2. sbt / overwrite default excluded tags as empty string (Both suites are expected to succeed) `build/sbt -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite org.apache.spark.sql.SQLQueryTestSuite"` 3. sbt / set `test.exclude.tags` to `org.apache.spark.tags.ExtendedSQLTest` (Both suites are expected to be skipped) `build/sbt -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite org.apache.spark.sql.SQLQueryTestSuite"` 4. Maven / with default excluded tags (ChromeUISeleniumSuite is expected to be skipped and SQLQueryTestSuite is expected to succeed) `build/mvn -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite,org.apache.spark.sql.SQLQueryTestSuite test` 5. 
Maven / overwrite default excluded tags as empty string (Both suites are expected to succeed) `build/mvn -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite,org.apache.spark.sql.SQLQueryTestSuite test` 6. Maven / set `test.exclude.tags` to `org.apache.spark.tags.ExtendedSQLTest` (Both suites are expected to be skipped) `build/mvn -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite,org.apache.spark.sql.SQLQueryTestSuite test` Closes #28702 from sarutak/real-headless-browser-support-3.0. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 02 June 2020, 23:41:19 UTC
1292bf2 [SPARK-31778][K8S][BUILD] Support cross-building docker images ### What changes were proposed in this pull request? Add cross build support to our docker image script using the new dockerx extension. ### Why are the changes needed? We have a CI for Spark on ARM, we should support building images for ARM and AMD64. ### Does this PR introduce _any_ user-facing change? Yes, a new flag is added to the docker image build script to cross-build ### How was this patch tested? Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed. Closes #28615 from holdenk/cross-build. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com> (cherry picked from commit 25702281dc0c7cc333978f51a15ebf9fd02cc684) Signed-off-by: Holden Karau <hkarau@apple.com> 02 June 2020, 18:12:29 UTC
b814cbc [SPARK-31860][BUILD] Only push release tags on success ### What changes were proposed in this pull request? Only push the release tag after the build has finished. ### Why are the changes needed? If the build fails, we don't need a release tag. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Running locally with a fake user in https://github.com/apache/spark/pull/28667 Closes #28700 from holdenk/SPARK-31860-build-master-only-push-tags-on-success. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com> (cherry picked from commit 69ba9b662e2ace592380e3cc5de041031dec2254) Signed-off-by: Holden Karau <hkarau@apple.com> 02 June 2020, 18:06:27 UTC
b85b954 [SPARK-31888][SQL] Support `java.time.Instant` in Parquet filter pushdown 1. Modified `ParquetFilters.valueCanMakeFilterOn()` to accept filters with `java.time.Instant` attributes. 2. Added `ParquetFilters.timestampToMicros()` to support both types `java.sql.Timestamp` and `java.time.Instant` in conversions to microseconds. 3. Re-used `timestampToMicros` in constructing of Parquet filters. To support pushed down filters with `java.time.Instant` attributes. Before the changes, date filters are not pushed down to Parquet datasource when `spark.sql.datetime.java8API.enabled` is `true`. No Modified tests to `ParquetFilterSuite` to check the case when Java 8 API is enabled. Closes #28696 from MaxGekk/support-instant-parquet-filters. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 June 2020, 12:22:28 UTC
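A hedged usage sketch of the pushdown described above, assuming the Java 8 time API is enabled (the path and column name are illustrative):

```scala
import java.time.Instant
import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.datetime.java8API.enabled", true)
val bound = Instant.parse("2020-06-01T00:00:00Z")
// With this change, a comparison against an Instant can be pushed down to Parquet
// as a timestamp filter instead of only being evaluated after the scan.
spark.read.parquet("/tmp/ts_table").filter(col("ts") >= bound).show()
```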
b1782af [SPARK-31834][SQL] Improve error message for incompatible data types ### What changes were proposed in this pull request? We should use dataType.catalogString to unify the data type mismatch message. Before: ```sql spark-sql> create table SPARK_31834(a int) using parquet; spark-sql> insert into SPARK_31834 select '1'; Error in query: Cannot write incompatible data to table '`default`.`spark_31834`': - Cannot safely cast 'a': StringType to IntegerType; ``` After: ```sql spark-sql> create table SPARK_31834(a int) using parquet; spark-sql> insert into SPARK_31834 select '1'; Error in query: Cannot write incompatible data to table '`default`.`spark_31834`': - Cannot safely cast 'a': string to int; ``` ### How was this patch tested? UT. Closes #28654 from lipzhu/SPARK-31834. Authored-by: lipzhu <lipzhu@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit d79a8a88b15645a29fabb245b6db3b2179d0f3c0) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 02 June 2020, 12:07:27 UTC
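A small sketch of the difference between the two renderings (`catalogString` is a public `DataType` method):

```scala
import org.apache.spark.sql.types._

StringType.toString       // "StringType" -- what the old message printed
StringType.catalogString  // "string"     -- the SQL-facing name used after this change
IntegerType.catalogString // "int"
```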
e2fded3 [SPARK-29137][ML][PYTHON][TESTS] Increase the timeout for StreamingLinearRegressionWithTests.test_train_prediction ### What changes were proposed in this pull request? It increases the timeout for `StreamingLinearRegressionWithTests.test_train_prediction` ``` Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 503, in test_train_prediction self._eventually(condition) File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 69, in _eventually lastValue = condition() File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 498, in condition self.assertGreater(errors[1] - errors[-1], 2) AssertionError: 1.672640157855923 not greater than 2 ``` This could likely happen when the PySpark tests run in parallel and it become slow. ### Why are the changes needed? To make the tests less flaky. Seems it's being reported multiple times: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123144/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123146/testReport/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123141/testReport/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123142/testReport/ ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Jenkins will test it out. Closes #28701 from HyukjinKwon/SPARK-29137. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 64cb6f7066134a0b9e441291992d2da73de5d918) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 02 June 2020, 05:04:00 UTC
e3a63a8 [SPARK-31885][SQL][3.0] Fix filter push down for old millis timestamps to Parquet ### What changes were proposed in this pull request? Fixed conversions of `java.sql.Timestamp` to milliseconds in `ParquetFilter` by using existing functions from `DateTimeUtils` `fromJavaTimestamp()` and `microsToMillis()`. ### Why are the changes needed? The changes fix the bug: ```scala scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS") scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED") scala> Seq(java.sql.Timestamp.valueOf("1000-06-14 08:28:53.123")).toDF("ts").write.mode("overwrite").parquet("/Users/maximgekk/tmp/ts_millis_old_filter") scala> spark.read.parquet("/Users/maximgekk/tmp/ts_millis_old_filter").filter($"ts" === "1000-06-14 08:28:53.123").show(false) +---+ |ts | +---+ +---+ ``` ### Does this PR introduce _any_ user-facing change? Yes, after the changes (for the example above): ```scala scala> spark.read.parquet("/Users/maximgekk/tmp/ts_millis_old_filter").filter($"ts" === "1000-06-14 08:28:53.123").show(false) +-----------------------+ |ts | +-----------------------+ |1000-06-14 08:28:53.123| +-----------------------+ ``` ### How was this patch tested? Modified tests in `ParquetFilterSuite` to check old timestamps. Closes #28699 from MaxGekk/parquet-ts-millis-filter-3.0. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 June 2020, 02:29:20 UTC
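A hedged sketch of the conversion idea named above, reusing the `DateTimeUtils` helpers mentioned in the commit rather than `Timestamp.getTime`, so old timestamps go through the same rebase path as the data:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Not the actual ParquetFilters code, just the shape of the fix.
def timestampToMillis(t: Timestamp): Long =
  DateTimeUtils.microsToMillis(DateTimeUtils.fromJavaTimestamp(t))
```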
6dec082 [SPARK-31870][SQL][TESTS] Fix "Do not optimize skew join if additional shuffle" test having no skew join ### What changes were proposed in this pull request? Fix configurations and ensure there is skew join in the test "Do not optimize skew join if additional shuffle". ### Why are the changes needed? The existing "Do not optimize skew join if additional shuffle" test has no skew join at all. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Fixed existing test. Closes #28679 from manuzhang/spark-31870. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 283814a426fb67289ca7923b9d13c5a897f9a98a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 June 2020, 02:01:10 UTC
e8045fb [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan ### What changes were proposed in this pull request? This PR proposes to check `DetectAmbiguousSelfJoin` only if there is `Join` in the plan. Currently, the checking is too strict even to non-join queries. For example, the codes below don't have join at all but it fails as the ambiguous self-join: ```scala import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.sum val df = Seq(1, 1, 2, 2).toDF("A") val w = Window.partitionBy(df("A")) df.select(df("A").alias("X"), sum(df("A")).over(w)).explain(true) ``` It is because `ExtractWindowExpressions` can create a `AttributeReference` with the same metadata but a different expression ID, see: https://github.com/apache/spark/blob/0fd98abd859049dc3b200492487041eeeaa8f737/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2679 https://github.com/apache/spark/blob/71c73d58f6e88d2558ed2e696897767d93bac60f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L63 https://github.com/apache/spark/blob/5945d46c11a86fd85f9e65f24c2e88f368eee01f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L180 Before: ``` 'Project [A#19 AS X#21, sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L] +- Relation[A#19] parquet ``` After: ``` Project [X#21, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L] +- Project [X#21, A#19, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L] +- Window [sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L], [A#19] +- Project [A#19 AS X#21, A#19] +- Relation[A#19] parquet ``` `X#21` holds the same metadata of DataFrame ID and column position with `A#19` but it has a different expression ID which ends up with the checking fails. ### Why are the changes needed? To loose the checking and make users not surprised. ### Does this PR introduce _any_ user-facing change? It's the changes in unreleased branches only. ### How was this patch tested? Manually tested and unittest was added. Closes #28695 from HyukjinKwon/SPARK-28344-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit ea45fc51921e64302b9220b264156bb4f757fe01) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 June 2020, 23:31:53 UTC
387f57c [SPARK-31889][BUILD] Docker release script does not allocate enough memory to reliably publish ### What changes were proposed in this pull request? Allow overriding the zinc options in the docker release and set a higher so the publish step can succeed consistently. ### Why are the changes needed? The publish step experiences memory pressure. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Running test locally with fake user to see if publish step (besides svn part) succeeds Closes #28698 from holdenk/SPARK-31889-docker-release-script-does-not-allocate-enough-memory-to-reliably-publish. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com> (cherry picked from commit ab9e5a2fe9a0e9e93301096a027d7409fc3c9b64) Signed-off-by: Holden Karau <hkarau@apple.com> 01 June 2020, 22:50:34 UTC
b6c8366 [SPARK-31764][CORE][3.0] JsonProtocol doesn't write RDDInfo#isBarrier ### What changes were proposed in this pull request? This PR backports the change of #28583 (SPARK-31764) to branch-3.0, which changes JsonProtocol to write RDDInfos#isBarrier. ### Why are the changes needed? JsonProtocol reads RDDInfos#isBarrier but not write it so it's a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a testcase. Closes #28660 from sarutak/SPARK-31764-branch-3.0. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> 01 June 2020, 21:33:35 UTC
8a7fb26 [SPARK-31068][SQL] Avoid IllegalArgumentException in broadcast exchange ### What changes were proposed in this pull request? Fix the IllegalArgumentException in broadcast exchange when numRows is over 341 million but less than 512 million. Since the maximum number of keys that `BytesToBytesMap` supports is 1 << 29, and only about 70% of the slots can be used before `HashedRelation` grows, the limit here should be 341 million (1 << 29 / 1.5 = 357,913,941) instead of 512 million. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested. Closes #27828 from LantaoJin/SPARK-31068. Lead-authored-by: LantaoJin <jinlantao@gmail.com> Co-authored-by: Alan Jin <jinlantao@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 08bdc9c9b2e44e066e44d35733e0f3b450d24052) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 June 2020, 18:26:00 UTC
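The arithmetic behind the new limit, spelled out:

```scala
val maxKeys     = 1 << 29                 // 536,870,912 keys supported by BytesToBytesMap
val growthLimit = (maxKeys / 1.5).toLong  // 357,913,941 ≈ 341 * 2^20, hence the "341 million" cap
```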
95f9f83 [SPARK-31881][K8S][TESTS] Support Hadoop 3.2 K8s integration tests ### What changes were proposed in this pull request? This PR aims to support Hadoop 3.2 K8s integration tests. ### Why are the changes needed? Currently, K8s integration suite assumes Hadoop 2.7 and has hard-coded parts. ### Does this PR introduce _any_ user-facing change? No. This is a dev-only change. ### How was this patch tested? Pass the Jenkins K8s IT (with Hadoop 2.7) and do the manual testing for Hadoop 3.2 as described in `README.md`. ``` ./dev/dev-run-integration-tests.sh --hadoop-profile hadoop-3.2 ``` I verified this manually like the following. ``` $ resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh \ --spark-tgz .../spark-3.1.0-SNAPSHOT-bin-3.2.0.tgz \ --exclude-tags r \ --hadoop-profile hadoop-3.2 ... KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - PVs with local storage - Launcher client dependencies - Test basic decommissioning Run completed in 8 minutes, 49 seconds. Total number of tests run: 19 Suites: completed 2, aborted 0 Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #28689 from dongjoon-hyun/SPARK-31881. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 17586f9ed25cbb1dd7dbb221aecb8749623e2175) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 June 2020, 18:19:59 UTC
76a0418 Revert "[SPARK-31885][SQL] Fix filter push down for old millis timestamps to Parquet" This reverts commit 805884ad10aca3a532314284b0f2c007d4ca1045. 01 June 2020, 17:20:17 UTC
805884a [SPARK-31885][SQL] Fix filter push down for old millis timestamps to Parquet ### What changes were proposed in this pull request? Fixed conversions of `java.sql.Timestamp` to milliseconds in `ParquetFilter` by using existing functions from `DateTimeUtils` `fromJavaTimestamp()` and `microsToMillis()`. ### Why are the changes needed? The changes fix the bug: ```scala scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS") scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED") scala> Seq(java.sql.Timestamp.valueOf("1000-06-14 08:28:53.123")).toDF("ts").write.mode("overwrite").parquet("/Users/maximgekk/tmp/ts_millis_old_filter") scala> spark.read.parquet("/Users/maximgekk/tmp/ts_millis_old_filter").filter($"ts" === "1000-06-14 08:28:53.123").show(false) +---+ |ts | +---+ +---+ ``` ### Does this PR introduce _any_ user-facing change? Yes, after the changes (for the example above): ```scala scala> spark.read.parquet("/Users/maximgekk/tmp/ts_millis_old_filter").filter($"ts" === "1000-06-14 08:28:53.123").show(false) +-----------------------+ |ts | +-----------------------+ |1000-06-14 08:28:53.123| +-----------------------+ ``` ### How was this patch tested? Modified tests in `ParquetFilterSuite` to check old timestamps. Closes #28693 from MaxGekk/parquet-ts-millis-filter. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9c0dc28a6c33361a412672d63f3b974b75944965) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 June 2020, 15:13:56 UTC
efa0269 [SPARK-31854][SQL] Invoke in MapElementsExec should not propagate null This PR intends to fix a bug of `Dataset.map` below when the whole-stage codegen enabled; ``` scala> val ds = Seq(1.asInstanceOf[Integer], null.asInstanceOf[Integer]).toDS() scala> sql("SET spark.sql.codegen.wholeStage=true") scala> ds.map(v=>(v,v)).explain == Physical Plan == *(1) SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1.intValue AS _1#69, assertnotnull(input[0, scala.Tuple2, true])._2.intValue AS _2#70] +- *(1) MapElements <function1>, obj#68: scala.Tuple2 +- *(1) DeserializeToObject staticinvoke(class java.lang.Integer, ObjectType(class java.lang.Integer), valueOf, value#1, true, false), obj#67: java.lang.Integer +- LocalTableScan [value#1] // `AssertNotNull` in `SerializeFromObject` will fail; scala> ds.map(v => (v, v)).show() java.lang.NullPointerException: Null value appeared in non-nullable fails: top level Product input object If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). // When the whole-stage codegen disabled, the query works well; scala> sql("SET spark.sql.codegen.wholeStage=false") scala> ds.map(v=>(v,v)).show() +----+----+ | _1| _2| +----+----+ | 1| 1| |null|null| +----+----+ ``` A root cause is that `Invoke` used in `MapElementsExec` propagates input null, and then [AssertNotNull](https://github.com/apache/spark/blob/1b780f364bfbb46944fe805a024bb6c32f5d2dde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L253-L255) in `SerializeFromObject` fails because a top-level row becomes null. So, `MapElementsExec` should not return `null` but `(null, null)`. NOTE: the generated code of the query above in the current master; ``` /* 033 */ private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException { /* 034 */ boolean mapelements_isNull_1 = true; /* 035 */ scala.Tuple2 mapelements_value_1 = null; /* 036 */ if (!false) { /* 037 */ mapelements_resultIsNull_0 = false; /* 038 */ /* 039 */ if (!mapelements_resultIsNull_0) { /* 040 */ mapelements_resultIsNull_0 = mapelements_exprIsNull_0_0; /* 041 */ mapelements_mutableStateArray_0[0] = mapelements_expr_0_0; /* 042 */ } /* 043 */ /* 044 */ mapelements_isNull_1 = mapelements_resultIsNull_0; /* 045 */ if (!mapelements_isNull_1) { /* 046 */ Object mapelements_funcResult_0 = null; /* 047 */ mapelements_funcResult_0 = ((scala.Function1) references[1] /* literal */).apply(mapelements_mutableStateArray_0[0]); /* 048 */ /* 049 */ if (mapelements_funcResult_0 != null) { /* 050 */ mapelements_value_1 = (scala.Tuple2) mapelements_funcResult_0; /* 051 */ } else { /* 052 */ mapelements_isNull_1 = true; /* 053 */ } /* 054 */ /* 055 */ } /* 056 */ } /* 057 */ /* 058 */ serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1); /* 059 */ /* 060 */ } ``` The generated code w/ this fix; ``` /* 032 */ private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException { /* 033 */ boolean mapelements_isNull_1 = true; /* 034 */ scala.Tuple2 mapelements_value_1 = null; /* 035 */ if (!false) { /* 036 */ mapelements_mutableStateArray_0[0] = mapelements_expr_0_0; /* 037 */ /* 038 */ mapelements_isNull_1 = false; /* 039 */ if (!mapelements_isNull_1) { /* 040 */ Object mapelements_funcResult_0 = null; /* 041 */ 
mapelements_funcResult_0 = ((scala.Function1) references[1] /* literal */).apply(mapelements_mutableStateArray_0[0]); /* 042 */ /* 043 */ if (mapelements_funcResult_0 != null) { /* 044 */ mapelements_value_1 = (scala.Tuple2) mapelements_funcResult_0; /* 045 */ mapelements_isNull_1 = false; /* 046 */ } else { /* 047 */ mapelements_isNull_1 = true; /* 048 */ } /* 049 */ /* 050 */ } /* 051 */ } /* 052 */ /* 053 */ serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1); /* 054 */ /* 055 */ } ``` Bugfix. No. Added tests. Closes #28681 from maropu/SPARK-31854. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b806fc458265578fddf544363b60fb5e122439b5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 June 2020, 04:52:12 UTC
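For reference, the behavior expected after this fix (a sketch in the spark-shell style used above; the output is taken from the non-codegen run shown in the description):

```scala
// With whole-stage codegen enabled, a null input element now maps to a (null, null)
// tuple instead of a null top-level row, so AssertNotNull in SerializeFromObject passes.
val ds = Seq(1.asInstanceOf[Integer], null.asInstanceOf[Integer]).toDS()
spark.sql("SET spark.sql.codegen.wholeStage=true")
ds.map(v => (v, v)).show()
// +----+----+
// |  _1|  _2|
// +----+----+
// |   1|   1|
// |null|null|
// +----+----+
```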
a5a8ec2 [SPARK-31849][PYTHON][SQL] Make PySpark SQL exceptions more Pythonic ### What changes were proposed in this pull request? This PR proposes to make PySpark exception more Pythonic by hiding JVM stacktrace by default. It can be enabled by turning on `spark.sql.pyspark.jvmStacktrace.enabled` configuration. ``` Traceback (most recent call last): ... pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace. Traceback (most recent call last): ... ``` If this `spark.sql.pyspark.jvmStacktrace.enabled` is enabled, it appends: ``` JVM stacktrace: org.apache.spark.Exception: ... ... ``` For example, the codes below: ```python from pyspark.sql.functions import udf udf def divide_by_zero(v): raise v / 0 spark.range(1).select(divide_by_zero("id")).show() ``` will show an error messages that looks like Python exception thrown from the local. <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace. Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero ``` </details> <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 137, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace. 
Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero JVM stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 192.168.35.193, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.head(Dataset.scala:2695) at org.apache.spark.sql.Dataset.take(Dataset.scala:2902) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300) at org.apache.spark.sql.Dataset.showString(Dataset.scala:337) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ... 1 more ``` </details> <details> <summary>Python exception message without this change</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco return f(*a, **kw) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 5.0 failed 4 times, most recent failure: Lost task 10.3 in stage 5.0 (TID 37, 192.168.35.193, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at 
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644) at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642) at org.apache.spark.sql.Dataset.head(Dataset.scala:2695) at org.apache.spark.sql.Dataset.take(Dataset.scala:2902) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300) at org.apache.spark.sql.Dataset.showString(Dataset.scala:337) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda *a: f(*a) File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 3, in divide_by_zero ZeroDivisionError: division by zero at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ... 1 more ``` </details> <br/> Another example with Python 3.7: ```python sql("a") ``` <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 646, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ ``` </details> <details> <summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary> ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 646, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ JVM stacktrace: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 
'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) ``` </details> <details> <summary>Python exception message without this change</summary> ``` Traceback (most recent call last): File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco return f(*a, **kw) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o26.sql. 
: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/session.py", line 646, in sql return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../spark/python/pyspark/sql/utils.py", line 102, in deco raise converted pyspark.sql.utils.ParseException: mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0) == SQL == a ^^^ ``` </details> ### Why are the changes needed? Currently, PySpark exceptions are very unfriendly to Python users with causing a bunch of JVM stacktrace. See "Python exception message without this change" above. ### Does this PR introduce _any_ user-facing change? Yes, it will change the exception message. See the examples above. ### How was this patch tested? Manually tested by ```bash ./bin/pyspark --conf spark.sql.pyspark.jvmStacktrace.enabled=true ``` and running the examples above. Closes #28661 from HyukjinKwon/python-debug. 
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit e69466056fb2c121b7bbb6ad082f09deb1c41063) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 01 June 2020, 00:45:38 UTC
bd7f5da [SPARK-31788][CORE][DSTREAM][PYTHON] Recover the support of union for different types of RDD and DStreams ### What changes were proposed in this pull request? This PR manually specifies the class for the input array being used in `(SparkContext|StreamingContext).union`. It fixes a regression introduced from SPARK-25737. ```python rdd1 = sc.parallelize([1,2,3,4,5]) rdd2 = sc.parallelize([6,7,8,9,10]) pairRDD1 = rdd1.zip(rdd2) sc.union([pairRDD1, pairRDD1]).collect() ``` in the current master and `branch-3.0`: ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__ File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) ``` which works in Spark 2.4.5: ``` [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)] ``` It assumed the class of the input array is the same `JavaRDD` or `JavaDStream`; however, that can be different such as `JavaPairRDD`. This fix is based on redsanket's initial approach, and will be co-authored. ### Why are the changes needed? To fix a regression from Spark 2.4.5. ### Does this PR introduce _any_ user-facing change? No, it's only in unreleased branches. This is to fix a regression. ### How was this patch tested? Manually tested, and a unittest was added. Closes #28648 from HyukjinKwon/SPARK-31788. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 29c51d682b3735123f78cf9cb8610522a9bb86fd) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 01 June 2020, 00:43:21 UTC
a53394f [SPARK-31874][SQL] Use `FastDateFormat` as the legacy fractional formatter ### What changes were proposed in this pull request? 1. Replace `SimpleDateFormat` by `FastDateFormat` as the legacy formatter of `FractionTimestampFormatter`. 2. Optimise `LegacyFastTimestampFormatter` for `java.sql.Timestamp` w/o fractional part. ### Why are the changes needed? 1. By default `HiveResult`.`hiveResultString` retrieves timestamps values as instances of `java.sql.Timestamp`, and uses the legacy parser `SimpleDateFormat` to convert the timestamps to strings. After the fix https://github.com/apache/spark/pull/28024, the fractional formatter and its companion - legacy formatter `SimpleDateFormat` are created per every value. By switching from `LegacySimpleTimestampFormatter` to `LegacyFastTimestampFormatter`, we can utilize the internal cache of `FastDateFormat`, and avoid parsing the default pattern `yyyy-MM-dd HH:mm:ss`. 2. The second change in the method `def format(ts: Timestamp): String` of `LegacyFastTimestampFormatter` is needed to optimize the formatter for patterns without the fractional part and avoid conversions to microseconds. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing tests in `TimestampFormatter`. Closes #28678 from MaxGekk/fastdateformat-as-legacy-frac-formatter. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 47dc332258bec20c460f666de50d9a8c5c0fbc0a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 May 2020, 13:05:26 UTC
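To illustrate why the switch matters: Commons Lang's `FastDateFormat.getInstance` hands out cached, thread-safe formatter instances, whereas a `SimpleDateFormat` had to be constructed per value. A rough sketch of the idea (not Spark's internal code; it only assumes commons-lang3 on the classpath, which spark-shell provides):

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.commons.lang3.time.FastDateFormat

val ts = Timestamp.valueOf("2020-05-31 13:05:26")

// Old path: a fresh, non-thread-safe formatter per value.
val viaSimple = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(ts)

// New path: getInstance caches per (pattern, time zone, locale) and is thread-safe,
// which is what LegacyFastTimestampFormatter builds on.
val viaFast = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss").format(ts)

assert(viaSimple == viaFast) // same output, cheaper formatter reuse
```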
dc0d12a [SPARK-31867][SQL] Disable year type datetime patterns which are longer than 10 As mentioned in https://github.com/apache/spark/pull/28673 and suggested via cloud-fan at https://github.com/apache/spark/pull/28673#discussion_r432817075 In this PR, we disable datetime patterns in the form of `y..y` and `Y..Y` whose lengths are greater than 10, to avoid the JDK bug described below. The new datetime formatter introduces a silent data change like, ```sql spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd'); NULL spark-sql> set spark.sql.legacy.timeParserPolicy=legacy; spark.sql.legacy.timeParserPolicy legacy spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd'); 00000001970-01-01 spark-sql> ``` For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y` (len >= 4), when using the `NumberPrinterParser` to format it ```java switch (signStyle) { case EXCEEDS_PAD: if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) { buf.append(decimalStyle.getPositiveSign()); } break; .... ``` the `minWidth` == `len(y..y)`, and the `EXCEED_POINTS` array is ```java /** * Array of 10 to the power of n. */ static final long[] EXCEED_POINTS = new long[] { 0L, 10L, 100L, 1000L, 10000L, 100000L, 1000000L, 10000000L, 100000000L, 1000000000L, 10000000000L, }; ``` So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` will be raised. At the caller side, for `from_unixtime` the exception is suppressed and a silent data change occurs; for `date_format`, the `ArrayIndexOutOfBoundsException` propagates. This fixes the silent data change. Yes, a SparkUpgradeException takes the place of the `null` result when the pattern contains 10 or more consecutive 'y' or 'Y'. New tests. Closes #28684 from yaooqinn/SPARK-31867-2. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 547c5bf55265772780098ee5e29baa6f095c246b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 31 May 2020, 12:40:11 UTC
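The new behavior can be checked directly from `spark-shell`; a hedged sketch of the expected outcomes after this change:

```scala
// More than 10 consecutive 'y' is now rejected with SparkUpgradeException instead of
// silently returning NULL (the JDK's NumberPrinterParser would index past EXCEED_POINTS).
spark.sql("SELECT from_unixtime(1, 'yyyy-MM-dd')").show()        // fine (value depends on session time zone)
spark.sql("SELECT from_unixtime(1, 'yyyyyyyyyyy-MM-dd')").show() // expected: SparkUpgradeException

// The legacy parser still accepts the long pattern when explicitly requested.
spark.sql("SET spark.sql.legacy.timeParserPolicy=legacy")
spark.sql("SELECT from_unixtime(1, 'yyyyyyyyyyy-MM-dd')").show() // 00000001970-01-01
```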
5fa46eb [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference Add the Coalesce/Repartition/Repartition_By_Range hints to the SQL Reference to make it complete. <img width="1100" alt="Screen Shot 2020-05-29 at 6 46 38 PM" src="https://user-images.githubusercontent.com/13592258/83316782-d6fcf300-a1dc-11ea-87f6-e357b9c739fd.png"> <img width="1099" alt="Screen Shot 2020-05-29 at 6 43 30 PM" src="https://user-images.githubusercontent.com/13592258/83316784-d8c6b680-a1dc-11ea-95ea-10a1f75dcef9.png"> Only the above pages are changed. The following two pages are the same as before. <img width="1100" alt="Screen Shot 2020-05-28 at 10 05 27 PM" src="https://user-images.githubusercontent.com/13592258/83223474-bfb3fc00-a12f-11ea-807a-824a618afa0b.png"> <img width="1099" alt="Screen Shot 2020-05-28 at 10 05 08 PM" src="https://user-images.githubusercontent.com/13592258/83223478-c2165600-a12f-11ea-806e-a1e57dc35ef4.png"> Manually built and checked. Closes #28672 from huaxingao/coalesce_hint. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 1b780f364bfbb46944fe805a024bb6c32f5d2dde) Signed-off-by: Sean Owen <srowen@gmail.com> 30 May 2020, 19:53:17 UTC
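The hints being documented can be exercised straight from `spark.sql`; a minimal sketch (the table name `t` and column `c` are hypothetical):

```scala
// Partitioning hints covered by the new SQL Reference pages.
spark.sql("SELECT /*+ COALESCE(3) */ * FROM t").explain()
spark.sql("SELECT /*+ REPARTITION(3) */ * FROM t").explain()
spark.sql("SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t").explain()
```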
51a68d9 [SPARK-31864][SQL] Adjust AQE skew join trigger condition This PR makes a minor change in deciding whether a partition is skewed: the partition size is compared to the median size of coalesced partitions instead of the median size of raw partitions before coalescing. This change is in line with the target size criteria for splitting skew join partitions and can also cope with extra empty partitions caused by over-partitioning. This PR has also improved the skew join tests in the AQE tests. No. Updated UTs. Closes #28669 from maryannxue/spark-31864. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b9737c3c228f465d332e41f1ea0cece2a5f7667e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 May 2020, 08:11:09 UTC
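For context, these are the configuration knobs that interact with the trigger condition described above (a hedged sketch; roughly, a partition is treated as skewed when it exceeds both the factor times the median size and the byte threshold):

```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// A partition is considered skewed when it is larger than
// skewedPartitionFactor * median(partition sizes) and also larger than
// skewedPartitionThresholdInBytes; this PR changes the median to use coalesced sizes.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```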
0507e6e [SPARK-31859][SPARK-31861][SPARK-31863] Fix Thriftserver session timezone issues ### What changes were proposed in this pull request? Timestamp literals in Spark are interpreted as timestamps in the local timezone spark.sql.session.timeZone. If the JDBC client is e.g. in TimeZone UTC-7, and sets spark.sql.session.timeZone to PST, and sends a query "SELECT timestamp '2020-05-20 12:00:00'", and the JVM timezone of the Spark cluster is e.g. UTC+2, then what currently happens is: * The timestamp literal in the query is interpreted as 12:00:00 UTC-7, i.e. 19:00:00 UTC. * When it's returned from the query, it is collected as a java.sql.Timestamp object with Dataset.collect(), and put into a Thriftserver RowSet. * Before sending it over the wire, the Timestamp is converted to String. This happens explicitly in ColumnValue for RowBasedSet, and implicitly in ColumnBuffer for ColumnBasedSet (all non-primitive types are converted toString() there). The toString() conversion uses the JVM timezone, which results in a "21:00:00" (UTC+2) string representation. * The client JDBC application then gets a "21:00:00" Timestamp back (in its JVM timezone; if the JDBC application cares about the correct UTC internal value, it should set spark.sql.session.timeZone to be consistent with its JVM timezone) The problem is caused by the conversion happening in Thriftserver RowSet with the generic toString() function, instead of using HiveResults.toHiveString() that takes care of correct, timezone respecting conversions. This PR fixes it by converting the Timestamp values to String earlier, in SparkExecuteStatementOperation, using that function. This fixes SPARK-31861. Thriftserver also did not work with spark.sql.datetime.java8API.enabled, because the conversions in RowSet expected a Timestamp object instead of an Instant object. Using HiveResults.toHiveString() also fixes that. For this reason, we convert Date values in SparkExecuteStatementOperation as well, so that HiveResults.toHiveString() handles LocalDate as well. This fixes SPARK-31859. Thriftserver also did not correctly set the active SparkSession. Because of that, configuration obtained using SQLConf.get was not the correct session configuration. This affected getting the correct spark.sql.session.timeZone. It is fixed by extending the use of SparkExecuteStatementOperation.withSchedulerPool to also set the correct active SparkSession. When the correct session is set, we also no longer need to maintain the pool mapping in a sessionToActivePool map. The scheduler pool can simply be retrieved correctly from the session config. "withSchedulerPool" is renamed to "withLocalProperties" and moved into a mixin helper trait, because it should be applied with every operation. This fixes SPARK-31863. I used the opportunity to move some repetitive code from the operations to the mixin helper trait. Closes #28671 from juliuszsompolski/SPARK-31861. Authored-by: Juliusz Sompolski <julek@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit af35691de449aca2e8db8d6dd7092255a919a04b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 30 May 2020, 06:16:47 UTC
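A hedged JDBC sketch of the scenario described above (the host, port, and credentials are hypothetical, and the Hive JDBC driver must be on the classpath):

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()
// Interpret timestamp literals in the client's zone rather than the cluster JVM's zone.
stmt.execute("SET spark.sql.session.timeZone=America/Los_Angeles")
val rs = stmt.executeQuery("SELECT timestamp '2020-05-20 12:00:00'")
rs.next()
// With the fix, the value is rendered with the session-time-zone-aware conversion
// described above, so the client sees 2020-05-20 12:00:00 rather than a string
// shifted into the cluster JVM's time zone.
println(rs.getString(1))
rs.close(); stmt.close(); conn.close()
```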
d7e6e08 [SPARK-31855][SQL][TESTS] Check reading date/timestamp from Avro files w/ and w/o Spark version ### What changes were proposed in this pull request? 1. Add the following parquet files to the resource folder `external/avro/src/test/resources`: - Files saved by Spark 2.4.5 (https://github.com/apache/spark/commit/cee4ecbb16917fa85f02c635925e2687400aa56b) without meta info `org.apache.spark.version` - `before_1582_date_v2_4_5.avro` with a date column: `avro.schema {"type":"record","name":"topLevelRecord","fields":[{"name":"dt","type":[{"type":"int","logicalType":"date"},"null"]}]}` - `before_1582_timestamp_millis_v2_4_5.avro` with a timestamp column: `avro.schema {"type":"record","name":"test","namespace":"logical","fields":[{"name":"dt","type":["null",{"type":"long","logicalType":"timestamp-millis"}],"default":null}]}` - `before_1582_timestamp_micros_v2_4_5.avro` with a timestamp column: `avro.schema {"type":"record","name":"topLevelRecord","fields":[{"name":"dt","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]}]}` - Files saved by Spark 2.4.6-rc3 (https://github.com/apache/spark/commit/570848da7c48ba0cb827ada997e51677ff672a39) with the meta info `org.apache.spark.version 2.4.6`: - `before_1582_date_v2_4_6.avro` is similar to `before_1582_date_v2_4_5.avro` except Spark version in parquet meta info. - `before_1582_timestamp_micros_v2_4_6.avro` is similar to `before_1582_timestamp_micros_v2_4_5.avro` except meta info. - `before_1582_timestamp_millis_v2_4_6.avro` is similar to `before_1582_timestamp_millis_v2_4_5.avro` except meta info. 2. Removed a few avro files becaused they are replaced by Avro files generated by Spark 2.4.5 above. 3. Add new test "generate test files for checking compatibility with Spark 2.4" to `AvroSuite` (marked as ignored). The parquet files above were generated by this test. 4. Modified "SPARK-31159: compatibility with Spark 2.4 in reading dates/timestamps" in `AvroSuite` to use new parquet files. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By `AvroV1Suite` and `AvroV2Suite`. Closes #28664 from MaxGekk/avro-update-resource-files. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 37a1fb8d089190929b77421fbb449bf0ed1ab25e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 May 2020, 05:18:56 UTC
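As a usage note, the new resource files can be read back from a Spark source checkout to see the pre-1582 values they carry (a hedged sketch; it requires the spark-avro module on the classpath, and the path is relative to the repository root):

```scala
// One of the files added by this change, written by Spark 2.4.5 without version metadata.
val df = spark.read.format("avro")
  .load("external/avro/src/test/resources/before_1582_date_v2_4_5.avro")
df.printSchema() // a single date column named dt, per the schema quoted above
df.show(false)   // a pre-1582 date used to exercise calendar rebase handling
```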
52c2988 [SPARK-31862][SQL] Remove exception wrapping in AQE ### What changes were proposed in this pull request? This PR removes the excessive exception wrapping in AQE so that error messages are less verbose and mostly consistent with non-aqe execution. Exceptions from stage materialization are now only wrapped with `SparkException` if there are multiple stage failures. Also, stage cancelling errors will not be included as part the exception thrown, but rather just be error logged. ### Why are the changes needed? This will make the AQE error reporting more readable and debuggable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated existing tests. Closes #28668 from maryannxue/spark-31862. Authored-by: Maryann Xue <maryann.xue@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 45864faaf2c9837a2ca48c456d3c2300736aa1ba) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 May 2020, 04:23:47 UTC
d2fb6e4 [SPARK-31833][SQL][TEST-HIVE1.2] Set HiveThriftServer2 with actual port while configured 0 ### What changes were proposed in this pull request? When I was developing some stuff based on the `DeveloperAPI ` `org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext`, I need to use thrift port randomly to avoid race on ports. But the `org.apache.hive.service.cli.thrift.ThriftCLIService#getPortNumber` do not respond to me with the actual bound port but always 0. And the server log is not right too, after starting the server, it's hard to form to the right JDBC connection string. ``` INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 0 with 5...500 worker threads ``` Indeed, the `53742` is the right port ```shell lsof -nP -p `cat ./pid/spark-kentyao-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1.pid` | grep LISTEN java 18990 kentyao 288u IPv6 0x22858e3e60d6a0a7 0t0 TCP 10.242.189.214:53723 (LISTEN) java 18990 kentyao 290u IPv6 0x22858e3e60d68827 0t0 TCP *:4040 (LISTEN) java 18990 kentyao 366u IPv6 0x22858e3e60d66987 0t0 TCP 10.242.189.214:53724 (LISTEN) java 18990 kentyao 438u IPv6 0x22858e3e60d65d47 0t0 TCP *:53742 (LISTEN) ``` In the PR, when the port is configured 0, the `portNum` will be set to the real used port during the start process. Also use 0 in thrift related tests to avoid potential flakiness. ### Why are the changes needed? 1 fix API bug 2 reduce test flakiness ### Does this PR introduce _any_ user-facing change? yes, `org.apache.hive.service.cli.thrift.ThriftCLIService#getPortNumber` will always give you the actual port when it is configured 0. ### How was this patch tested? modified unit tests Closes #28651 from yaooqinn/SPARK-31833. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fe1da296da152528fc159c23e3026a0c19343780) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 May 2020, 04:13:35 UTC
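A rough sketch of the developer flow described above, under the assumption (not verified here) that `hive.server2.thrift.port` set on the session conf is picked up by `startWithContext`:

```scala
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Ask for an ephemeral port so concurrent test runs do not race on a fixed port.
spark.sqlContext.setConf("hive.server2.thrift.port", "0")
HiveThriftServer2.startWithContext(spark.sqlContext)
// After this fix, ThriftCLIService#getPortNumber reports the actual bound port
// (53742 in the lsof output above) rather than 0, so a usable JDBC URL can be built.
```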
3ff4db9 [SPARK-31730][CORE][TEST][3.0] Fix flaky tests in BarrierTaskContextSuite ### What changes were proposed in this pull request? To wait until all the executors have started before submitting any job. This could avoid the flakiness caused by waiting for executors coming up. ### How was this patch tested? Existing tests. Closes #28658 from jiangxb1987/barrierTest. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> 28 May 2020, 23:29:40 UTC
9580266 [BUILD][INFRA] bump the timeout to match the jenkins PRB ### What changes were proposed in this pull request? bump the timeout to match what's set in jenkins ### Why are the changes needed? tests be timing out! ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? via jenkins Closes #28666 from shaneknapp/increase-jenkins-timeout. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com> (cherry picked from commit 9e68affd13a3875b92f0700b8ab7c9d902f1a08c) Signed-off-by: shane knapp <incomplete@gmail.com> 28 May 2020, 21:26:41 UTC
11bce50 [SPARK-31839][TESTS] Delete duplicate code in castsuit ### What changes were proposed in this pull request? Delete duplicate code in `CastSuite`. ### Why are the changes needed? Keep the Spark code clean. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No tests needed. Closes #28655 from GuoPhilipse/delete-duplicate-code-castsuit. Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by: GuoPhilipse <guofei_ok@126.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit dfbc5edf20040e8163ee3beef61f2743a948c508) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 28 May 2020, 00:57:27 UTC
e7b88e8 Revert "[SPARK-31730][CORE][TEST] Fix flaky tests in BarrierTaskContextSuite" This reverts commit cb817bb1cf6b074e075c02880001ec96f2f39de7. 28 May 2020, 00:21:10 UTC
cb817bb [SPARK-31730][CORE][TEST] Fix flaky tests in BarrierTaskContextSuite ### What changes were proposed in this pull request? To wait until all the executors have started before submitting any job. This could avoid the flakiness caused by waiting for executors coming up. ### How was this patch tested? Existing tests. Closes #28584 from jiangxb1987/barrierTest. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> (cherry picked from commit efe7fd2b6bea4a945ed7f3f486ab279c505378b4) Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> 27 May 2020, 23:37:22 UTC
0383d1e [SPARK-31827][SQL] fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form If `LLL`/`qqq` is used in the datetime pattern string, and the current JDK in use has a bug for the stand-alone form (see https://bugs.openjdk.java.net/browse/JDK-8114833), throw an exception with a clear error message. to keep backward compatibility with Spark 2.4 Yes Spark 2.4 ``` scala> sql("select date_format('1990-1-1', 'LLL')").show +---------------------------------------------+ |date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)| +---------------------------------------------+ | Jan| +---------------------------------------------+ ``` Spark 3.0 with Java 11 ``` scala> sql("select date_format('1990-1-1', 'LLL')").show +---------------------------------------------+ |date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)| +---------------------------------------------+ | Jan| +---------------------------------------------+ ``` Spark 3.0 with Java 8 ``` // before this PR +---------------------------------------------+ |date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)| +---------------------------------------------+ | 1| +---------------------------------------------+ // after this PR scala> sql("select date_format('1990-1-1', 'LLL')").show org.apache.spark.SparkUpgradeException ``` manual test with java 8 and 11 Closes #28646 from cloud-fan/format. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 May 2020, 18:59:42 UTC
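The guarded behavior can be reproduced from `spark-shell` (a sketch based on the examples above; the outcome depends on the JDK in use):

```scala
// 'MMM' (format style) is unaffected on any JDK.
spark.sql("SELECT date_format('1990-1-1', 'MMM')").show()
// 'LLL' (stand-alone style) hits JDK-8114833 on Java 8: instead of silently printing "1",
// Spark now throws SparkUpgradeException with a clear message; on Java 11 it prints "Jan".
spark.sql("SELECT date_format('1990-1-1', 'LLL')").show()
```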
4611d21 [SPARK-31835][SQL][TESTS] Add zoneId to codegen related tests in DateExpressionsSuite ### What changes were proposed in this pull request? This PR modifies some codegen related tests to test escape characters for datetime functions which are time zone aware. If the timezone is absent, the formatter could result in `null` caused by `java.util.NoSuchElementException: None.get` and bypassing the real intention of those test cases. ### Why are the changes needed? fix tests ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing the modified test cases. Closes #28653 from yaooqinn/SPARK-31835. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 311fe6a880f371c20ca5156ca6eb7dec5a15eff6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 May 2020, 17:26:20 UTC
59f9a88 [SPARK-31393][SQL][FOLLOW-UP] Show the correct alias in schema for expression Some expression aliases do not display correctly in the schema. This PR fixes them for: - `ln` - `rint` - `lcase` - `position` It improves the implementation of these expressions. 'Yes'. This PR lets users see the correct alias in the schema. Jenkins test. Closes #28551 from beliefer/show-correct-alias-in-schema. Lead-authored-by: beliefer <beliefer@163.com> Co-authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 8f2b6f3a0b9d4297cb1f62e682958239fd6f9dbd) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 27 May 2020, 06:07:39 UTC
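A quick way to inspect the aliases this follow-up corrects (a sketch; it only prints the generated column names rather than asserting specific values):

```scala
// The output column names show the alias chosen for each expression.
spark.sql("SELECT ln(10), rint(1.5), lcase('ABC'), position('b' IN 'abc')").printSchema()
```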
ec9bd06 [SPARK-31821][BUILD] Remove mssql-jdbc dependencies from Hadoop 3.2 profile ### What changes were proposed in this pull request? There is an unnecessary dependency for `mssql-jdbc`. In this PR I've removed it. ### Why are the changes needed? Unnecessary dependency. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins with the following configuration. - [x] Pass the dependency test. - [x] SBT with Hadoop-3.2 (https://github.com/apache/spark/pull/28640#issuecomment-634192512) - [ ] Maven with Hadoop-3.2 Closes #28640 from gaborgsomogyi/SPARK-31821. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 8f1d77488c07a50b90a113f7faacd3d96f25646b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 May 2020, 01:21:35 UTC
c03a875 Revert "[SPARK-31788][CORE][PYTHON] Fix UnionRDD of PairRDDs" This reverts commit be18718380fc501fe2a780debd089a1df91c1699. 27 May 2020, 01:16:08 UTC
b165b81 [SPARK-31820][SQL][TESTS] Fix flaky JavaBeanDeserializationSuite ### What changes were proposed in this pull request? Modified formatting of expected timestamp strings in the test `JavaBeanDeserializationSuite`.`testSpark22000` to correctly format timestamps with **zero** seconds fraction. Current implementation outputs `.0` but must be empty string. From SPARK-31820 failure: - should be `2020-05-25 12:39:17` - but incorrect expected string is `2020-05-25 12:39:17.0` ### Why are the changes needed? To make `JavaBeanDeserializationSuite` stable, and avoid test failures like https://github.com/apache/spark/pull/28630#issuecomment-633695723 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I changed https://github.com/apache/spark/blob/7dff3b125de23a4d6ce834217ee08973b259414c/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java#L207 to ```java new java.sql.Timestamp((System.currentTimeMillis() / 1000) * 1000), ``` to force zero seconds fraction. Closes #28639 from MaxGekk/fix-JavaBeanDeserializationSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 87d34e6b969e9ceba207b996506156bd36f3dd64) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 May 2020, 12:13:44 UTC
fcdffbd [SPARK-31771][SQL][3.0] Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q' THIS PR BACKPORTS https://github.com/apache/spark/commit/695cb617d42507eded9c7e50bc7cd5333bbe6f83 TO BRANCH-3.0 ### What changes were proposed in this pull request? Five consecutive pattern characters of 'G/M/L/E/u/Q/q' mean Narrow-Text Style now that we use `java.time.DateTimeFormatterBuilder` since 3.0.0, which outputs only the leading letter of the value, e.g. `December` would be `D`. In Spark 2.4 they mean Full-Text Style. In this PR, we explicitly disable Narrow-Text Style for these pattern characters. ### Why are the changes needed? Without this change, there will be a silent data change. ### Does this PR introduce _any_ user-facing change? Yes, queries with datetime operations using datetime patterns, e.g. `G/M/L/E/u`, will fail if the pattern length is 5; other patterns, e.g. 'k', 'm', also accept only a certain number of letters. 1. Datetime patterns that are not supported by the new parser but are supported by the legacy one will get a SparkUpgradeException, e.g. "GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa". Two options are given to end-users: one is to use legacy mode, and the other is to follow the new online doc for correct datetime patterns. 2. Datetime patterns that are supported by neither the new parser nor the legacy one, e.g. "QQQQQ", "qqqqq", will get an IllegalArgumentException, which is captured by Spark internally and results in NULL for end-users. ### How was this patch tested? Added unit tests. Closes #28637 from yaooqinn/SPARK-31771-30. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 May 2020, 11:30:14 UTC
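What "Narrow-Text Style" means in `java.time` can be seen directly with the JDK APIs, independent of Spark (a hedged illustration of the behaviour the commit disables):
```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val date = LocalDate.of(2020, 12, 1)
DateTimeFormatter.ofPattern("MMMM", Locale.US).format(date)  // "December" (full form, what 2.4 produced)
DateTimeFormatter.ofPattern("MMMMM", Locale.US).format(date) // "D" (narrow form, now explicitly rejected)
```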
cefdd09 [SPARK-31792][SS][DOCS] Introduce the structured streaming UI in the Web UI doc ### What changes were proposed in this pull request? This PR adds the structured streaming UI introduction to the Web UI doc. ![image](https://user-images.githubusercontent.com/1452518/82642209-92b99380-9bdb-11ea-9a0d-cbb26040b0ef.png) ### Why are the changes needed? The structured streaming web UI introduced before was missing from the Web UI documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N.A. Closes #28609 from xccui/ss-ui-doc. Authored-by: Xingcan Cui <xccui@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 8ba2b47737f83bd27ebef75dea9e73b3649c2ec1) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 May 2020, 05:27:58 UTC
b35bd57 [SPARK-31806][SQL][TESTS] Check reading date/timestamp from legacy parquet: dictionary encoding, w/o Spark version ### What changes were proposed in this pull request? 1. Add the following parquet files to the resource folder `sql/core/src/test/resources/test-data`: - Files saved by Spark 2.4.5 (https://github.com/apache/spark/commit/cee4ecbb16917fa85f02c635925e2687400aa56b) without meta info `org.apache.spark.version` - `before_1582_date_v2_4_5.snappy.parquet` with 2 date columns of the type **INT32 L:DATE** - `PLAIN` (8 date values of `1001-01-01`) and `PLAIN_DICTIONARY` (`1001-01-01`..`1001-01-08`). - `before_1582_timestamp_micros_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT64 L:TIMESTAMP(MICROS,true)** - `PLAIN` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`). - `before_1582_timestamp_millis_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT64 L:TIMESTAMP(MILLIS,true)** - `PLAIN` (8 date values of `1001-01-01 01:02:03.123`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123`..`1001-01-08 01:02:03.123`). - `before_1582_timestamp_int96_plain_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT96** - `PLAIN` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`). - `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` with 2 timestamp columns of the type **INT96** - `PLAIN_DICTIONARY` (8 date values of `1001-01-01 01:02:03.123456`) and `PLAIN_DICTIONARY` (`1001-01-01 01:02:03.123456`..`1001-01-08 01:02:03.123456`). - Files saved by Spark 2.4.6-rc3 (https://github.com/apache/spark/commit/570848da7c48ba0cb827ada997e51677ff672a39) with the meta info `org.apache.spark.version = 2.4.6`: - `before_1582_date_v2_4_6.snappy.parquet` replaces `before_1582_date_v2_4.snappy.parquet`. And it is similar to `before_1582_date_v2_4_5.snappy.parquet` except Spark version in parquet meta info. - `before_1582_timestamp_micros_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_micros_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_micros_v2_4_5.snappy.parquet` except meta info. - `before_1582_timestamp_millis_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_millis_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_millis_v2_4_5.snappy.parquet` except meta info. - `before_1582_timestamp_int96_plain_v2_4_6.snappy.parquet` is similar to `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` except meta info. - `before_1582_timestamp_int96_dict_v2_4_6.snappy.parquet` replaces `before_1582_timestamp_int96_v2_4.snappy.parquet`. And it is similar to `before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet` except meta info. 2. Add new test "generate test files for checking compatibility with Spark 2.4" to `ParquetIOSuite` (marked as ignored). The parquet files above were generated by this test. 3. Modified "SPARK-31159: compatibility with Spark 2.4 in reading dates/timestamps" in `ParquetIOSuite` to use new parquet files. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `ParquetIOSuite`. Closes #28630 from MaxGekk/parquet-files-update. 
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 7e4f5bbd8a40011f6f99f023b05f8c15a4a5453d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 May 2020, 05:16:01 UTC
3638992 [SPARK-31810][TEST] Fix AlterTableRecoverPartitions test using incorrect API to modify RDD_PARALLEL_LISTING_THRESHOLD ### What changes were proposed in this pull request? Use the correct API in the AlterTableRecoverPartitions tests to modify the `RDD_PARALLEL_LISTING_THRESHOLD` conf. ### Why are the changes needed? The existing AlterTableRecoverPartitions test modifies RDD_PARALLEL_LISTING_THRESHOLD as a SQLConf using the withSQLConf API. But since this is not a SQLConf, it is not overridden, so the test doesn't end up testing the required behaviour. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is a UT fix; the UTs still pass after the change. Closes #28634 from prakharjain09/SPARK-31810-fix-recover-partitions. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 452594f5a43bc5d98fb42bf0927638d139f6d17c) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 May 2020, 05:13:18 UTC
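`RDD_PARALLEL_LISTING_THRESHOLD` (`spark.rdd.parallelListingThreshold`) is a core SparkConf entry rather than a SQLConf, which is why `withSQLConf` has no effect on it. A hedged sketch of how a test can actually override it, by setting it when the session is created; this illustrates the idea only and is not the exact change in the PR:
```scala
import org.apache.spark.sql.SparkSession

// Core confs must be in place when the SparkContext is created; SQLConf overrides won't reach them.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("recover-partitions-test")
  .config("spark.rdd.parallelListingThreshold", "0") // force the parallel-listing code path
  .getOrCreate()
```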
1ebc1df [SPARK-31808][SQL] Makes struct function's output name and class name pretty ### What changes were proposed in this pull request? This PR proposes to set the alias, and class name in its `ExpressionInfo` for `struct`. - Class name in `ExpressionInfo` - from: `org.apache.spark.sql.catalyst.expressions.NamedStruct` - to:`org.apache.spark.sql.catalyst.expressions.CreateNamedStruct` - Alias name: `named_struct(col1, v, ...)` -> `struct(v, ...)` This PR takes over https://github.com/apache/spark/pull/28631 ### Why are the changes needed? To show the correct output name and class names to users. ### Does this PR introduce _any_ user-facing change? Yes. **Before:** ```scala scala> sql("DESC FUNCTION struct").show(false) +------------------------------------------------------------------------------------+ |function_desc | +------------------------------------------------------------------------------------+ |Function: struct | |Class: org.apache.spark.sql.catalyst.expressions.NamedStruct | |Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.| +------------------------------------------------------------------------------------+ ``` ```scala scala> sql("SELECT struct(1, 2)").show(false) +------------------------------+ |named_struct(col1, 1, col2, 2)| +------------------------------+ |[1, 2] | +------------------------------+ ``` **After:** ```scala scala> sql("DESC FUNCTION struct").show(false) +------------------------------------------------------------------------------------+ |function_desc | +------------------------------------------------------------------------------------+ |Function: struct | |Class: org.apache.spark.sql.catalyst.expressions.CreateNamedStruct | |Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.| +------------------------------------------------------------------------------------+ ``` ```scala scala> sql("SELECT struct(1, 2)").show(false) +------------+ |struct(1, 2)| +------------+ |[1, 2] | +------------+ ``` ### How was this patch tested? Manually tested, and Jenkins tests. Closes #28633 from HyukjinKwon/SPARK-31808. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit df2a1fe131476aac128d63df9b06ec4bca0c2c07) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 May 2020, 03:36:13 UTC
b8da5be [SPARK-31818][SQL] Fix pushing down filters with `java.time.Instant` values in ORC ### What changes were proposed in this pull request? Convert `java.time.Instant` to `java.sql.Timestamp` in pushed down filters to ORC datasource when Java 8 time API enabled. ### Why are the changes needed? The changes fix the exception raised while pushing date filters when `spark.sql.datetime.java8API.enabled` is set to `true`: ``` java.lang.IllegalArgumentException: Wrong value class java.time.Instant for TIMESTAMP.EQUALS leaf at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192) at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75) ``` ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Added tests to `OrcFilterSuite`. Closes #28636 from MaxGekk/orc-timestamp-filter-pushdown. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6c80ebbccb7f354f645dd63a73114332d26f901f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 May 2020, 01:36:24 UTC
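The conversion described above is small: with the Java 8 time API enabled, pushed-down timestamp literals arrive as `java.time.Instant`, while ORC's `PredicateLeaf` only accepts `java.sql.Timestamp` (hence the exception quoted in the message). A hedged sketch of that normalization step; the function name is illustrative:
```scala
import java.sql.Timestamp
import java.time.Instant

// Normalize a pushed-down literal before handing it to the ORC SearchArgument builder (sketch).
def toOrcLiteral(value: Any): Any = value match {
  case instant: Instant => Timestamp.from(instant) // ORC accepts java.sql.Timestamp, not Instant
  case other            => other
}
```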
942c4fc [SPARK-31755][SQL][FOLLOWUP] Update date-time, CSV and JSON benchmark results ### What changes were proposed in this pull request? Re-generate results of: - DateTimeBenchmark - CSVBenchmark - JsonBenchmark in the environment: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 | ### Why are the changes needed? 1. The PR https://github.com/apache/spark/pull/28576 changed date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing. 2. CSV/JSON datasources are affected by the above PR too. This PR updates the benchmark results in the same environment as other benchmarks to have a base line for future optimizations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running benchmarks via the script: ```python #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #28613 from MaxGekk/missing-hour-year-benchmarks. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 92685c014816e82c339f047f51c5363e32bc7338) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 May 2020, 15:00:24 UTC
1020890 [SPARK-31802][SQL] Format Java date-time types in `Row.jsonValue` directly ### What changes were proposed in this pull request? Use `format()` methods for Java date-time types in `Row.jsonValue`. The PR https://github.com/apache/spark/pull/28582 added the methods to avoid conversions to days and microseconds. ### Why are the changes needed? To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing tests in `RowJsonSuite`. Closes #28620 from MaxGekk/toJson-format-Java-datetime-types. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 7f36310500a7ab6e80cc1b1e6c764ae7b9657758) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 25 May 2020, 03:50:52 UTC
be18718 [SPARK-31788][CORE][PYTHON] Fix UnionRDD of PairRDDs ### What changes were proposed in this pull request? UnionRDD of PairRDDs causing a bug. The fix is to check for instance type before proceeding ### Why are the changes needed? Changes are needed to avoid users running into issues with union rdd operation with any other type other than JavaRDD. ### Does this PR introduce _any_ user-facing change? Yes Before: SparkSession available as 'spark'. >>> rdd1 = sc.parallelize([1,2,3,4,5]) >>> rdd2 = sc.parallelize([6,7,8,9,10]) >>> pairRDD1 = rdd1.zip(rdd2) >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in setitem File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) After: >>> rdd2 = sc.parallelize([6,7,8,9,10]) >>> pairRDD1 = rdd1.zip(rdd2) >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) >>> unionRDD1.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)] ### How was this patch tested? Tested with the reproduced piece of code above manually Closes #28603 from redsanket/SPARK-31788. Authored-by: schintap <schintap@verizonmedia.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit a61911c50c391e61038cf01611629d2186d17a76) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 25 May 2020, 01:29:20 UTC
40e5de5 [SPARK-31807][INFRA] Use python 3 style in release-build.sh ### What changes were proposed in this pull request? This PR aims to use the style that is compatible with both python 2 and 3. ### Why are the changes needed? This will help python 3 migration. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual. Closes #28632 from williamhyun/use_python3_style. Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 753636e86b1989a9b43ec691de5136a70d11f323) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 25 May 2020, 01:25:57 UTC
72c466e [SPARK-31761][SQL][3.0] cast integer to Long to avoid IntegerOverflow for IntegralDivide operator ### What changes were proposed in this pull request? The `IntegralDivide` operator returns the Long DataType, so the integer overflow case should be handled. If the operands are of type Int, they are cast to Long. ### Why are the changes needed? As `IntegralDivide` returns the Long datatype, integer overflow should not happen. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT and also tested in the local cluster. After fix ![image](https://user-images.githubusercontent.com/35216143/82603361-25eccc00-9bd0-11ea-9ca7-001c539e628b.png) SQL Test After fix ![image](https://user-images.githubusercontent.com/35216143/82637689-f0250300-9c22-11ea-85c3-886ab2c23471.png) Before Fix ![image](https://user-images.githubusercontent.com/35216143/82637984-878a5600-9c23-11ea-9e47-5ce2fb923c01.png) Closes #28628 from sandeep-katta/branch3Backport. Authored-by: sandeep katta <sandeep.katta2007@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 24 May 2020, 12:39:16 UTC
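The overflow in question is the classic `Int.MinValue div -1` case: the quotient does not fit in 32 bits, but since `IntegralDivide` already returns a Long, widening the operands first gives the correct value. A hedged illustration, assuming a `spark` session as in spark-shell:
```scala
// In 32-bit arithmetic -2147483648 / -1 wraps back to -2147483648;
// with the operands widened to Long, `div` can return 2147483648 as expected.
spark.sql("SELECT CAST(-2147483648 AS INT) div -1 AS result").show()
// expected after this fix: 2147483648
```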
576c224 [SPARK-31755][SQL][3.0] allow missing year/hour when parsing date/timestamp string ### What changes were proposed in this pull request? This PR allows missing year and hour fields when parsing date/timestamp string, with 1970 as the default year and 0 as the default hour. ### Why are the changes needed? To keep backward compatibility with Spark 2.4. ### Does this PR introduce _any_ user-facing change? Yes. Spark 2.4: ``` scala> sql("select to_timestamp('16', 'dd')").show +------------------------+ |to_timestamp('16', 'dd')| +------------------------+ | 1970-01-16 00:00:00| +------------------------+ scala> sql("select to_date('16', 'dd')").show +-------------------+ |to_date('16', 'dd')| +-------------------+ | 1970-01-16| +-------------------+ scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show +----------------------------------+ |to_timestamp('2019 40', 'yyyy mm')| +----------------------------------+ | 2019-01-01 00:40:00| +----------------------------------+ scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show +----------------------------------------------+ |to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')| +----------------------------------------------+ | 2019-01-01 10:10:10| +----------------------------------------------+ ``` in branch 3.0 ``` scala> sql("select to_timestamp('16', 'dd')").show +--------------------+ |to_timestamp(16, dd)| +--------------------+ | null| +--------------------+ scala> sql("select to_date('16', 'dd')").show +---------------+ |to_date(16, dd)| +---------------+ | null| +---------------+ scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show +------------------------------+ |to_timestamp(2019 40, yyyy mm)| +------------------------------+ | 2019-01-01 00:00:00| +------------------------------+ scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show +------------------------------------------+ |to_timestamp(2019 10:10:10, yyyy hh:mm:ss)| +------------------------------------------+ | 2019-01-01 00:00:00| +------------------------------------------+ ``` After this PR, the behavior becomes the same as 2.4. ### How was this patch tested? new tests Closes #28612 from cloud-fan/backport. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 24 May 2020, 05:27:16 UTC
f05a26a [SPARK-31786][K8S][BUILD] Upgrade kubernetes-client to 4.9.2 ### What changes were proposed in this pull request? This PR aims to upgrade `kubernetes-client` library to bring the JDK8 related fixes. Please note that JDK11 works fine without any problem. - https://github.com/fabric8io/kubernetes-client/releases/tag/v4.9.2 - JDK8 always uses http/1.1 protocol (Prevent OkHttp from wrongly enabling http/2) ### Why are the changes needed? OkHttp "wrongly" detects the Platform as Jdk9Platform on JDK 8u251. - https://github.com/fabric8io/kubernetes-client/issues/2212 - https://stackoverflow.com/questions/61565751/why-am-i-not-able-to-run-sparkpi-example-on-a-kubernetes-k8s-cluster Although there is a workaround `export HTTP2_DISABLE=true` and `Downgrade JDK or K8s`, we had better avoid this problematic situation. ### Does this PR introduce _any_ user-facing change? No. This will recover the failures on JDK 8u252. ### How was this patch tested? - [x] Pass the Jenkins UT (https://github.com/apache/spark/pull/28601#issuecomment-632474270) - [x] Pass the Jenkins K8S IT with the K8s 1.13 (https://github.com/apache/spark/pull/28601#issuecomment-632438452) - [x] Manual testing with K8s 1.17.3. (Below) **v1.17.6 result (on Minikube)** ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - PVs with local storage - Launcher client dependencies - Test basic decommissioning Run completed in 8 minutes, 27 seconds. Total number of tests run: 19 Suites: completed 2, aborted 0 Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #28601 from dongjoon-hyun/SPARK-K8S-CLIENT. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 64ffc6649623e3cb568315f57c9e06be3b547c00) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 May 2020, 01:06:55 UTC
1e79c0d [SPARK-30715][K8S][TESTS][FOLLOWUP] Update k8s client version in IT as well ### What changes were proposed in this pull request? This is a follow up for SPARK-30715 . Kubernetes client version in sync in integration-tests and kubernetes/core ### Why are the changes needed? More than once, the kubernetes client version has gone out of sync between integration tests and kubernetes/core. So brought them up in sync again and added a comment to save us from future need of this additional followup. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually. Closes #27948 from ScrapCodes/follow-up-spark-30715. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3799d2b9d842f4b9f4e78bf701f5e123f0061bad) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 May 2020, 01:06:17 UTC
5f966f9 [SPARK-30715][K8S] Bump fabric8 to 4.7.1 ### What changes were proposed in this pull request? Bump fabric8 kubernetes-client to 4.7.1 ### Why are the changes needed? New fabric8 version brings support for Kubernetes 1.17 clusters. Full release notes: - https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.0 - https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.1 ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit and integration tests cover creation of K8S objects. Adjusted them to work with the new fabric8 version Closes #27443 from onursatici/os/bump-fabric8. Authored-by: Onur Satici <onursatici@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 86fdb818bf5dfde7744bf2b358876af361ec9a68) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 May 2020, 01:05:54 UTC
d8b788e [SPARK-29854][SQL][TESTS][FOLLOWUP] Regenerate string-functions.sql.out 23 May 2020, 21:57:30 UTC
2183345 [SPARK-29854][SQL][TESTS] Add tests to check lpad/rpad throw an exception for invalid length input ### What changes were proposed in this pull request? This PR intends to add trivial tests to check https://github.com/apache/spark/pull/27024 has already been fixed in the master. Closes #27024 ### Why are the changes needed? For test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes #28604 from maropu/SPARK-29854. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 7ca73f03fbc6e213c30e725bf480709ed036a376) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 22 May 2020, 23:48:44 UTC
23019aa [SPARK-31612][SQL][DOCS][FOLLOW-UP] Fix a few issues in SQL ref ### What changes were proposed in this pull request? Fix a few issues in SQL Reference ### Why are the changes needed? To make SQL Reference look better ### Does this PR introduce _any_ user-facing change? Yes. before: <img width="189" alt="Screen Shot 2020-05-21 at 11 41 34 PM" src="https://user-images.githubusercontent.com/13592258/82639052-d0f38a80-9bbc-11ea-81a4-22def4ca5cc0.png"> after: <img width="195" alt="Screen Shot 2020-05-21 at 11 41 17 PM" src="https://user-images.githubusercontent.com/13592258/82639063-d5b83e80-9bbc-11ea-84d1-8361e6bee949.png"> before: <img width="763" alt="Screen Shot 2020-05-21 at 11 45 22 PM" src="https://user-images.githubusercontent.com/13592258/82639252-3e9fb680-9bbd-11ea-863c-e6a6c2f83a06.png"> after: <img width="724" alt="Screen Shot 2020-05-21 at 11 45 02 PM" src="https://user-images.githubusercontent.com/13592258/82639265-42cbd400-9bbd-11ea-8df2-fc5c255b84d3.png"> before: <img width="437" alt="Screen Shot 2020-05-21 at 11 41 57 PM" src="https://user-images.githubusercontent.com/13592258/82639072-db158900-9bbc-11ea-9963-731881cda4fd.png"> after <img width="347" alt="Screen Shot 2020-05-21 at 11 42 26 PM" src="https://user-images.githubusercontent.com/13592258/82639082-dfda3d00-9bbc-11ea-9bd2-f922cc91f175.png"> ### How was this patch tested? Manually build and check Closes #28608 from huaxingao/doc_fix. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit ad9532a09c70bf6acc8b79b4fdbfcd6afadcbc91) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 22 May 2020, 23:43:43 UTC
c7591ee [SPARK-31790][DOCS] cast(long as timestamp) show different result between Hive and Spark ### What changes were proposed in this pull request? add docs for sql migration-guide ### Why are the changes needed? let user know more about the cast scenarios in which Hive and Spark generate different results ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? no need to test Closes #28605 from GuoPhilipse/spark-docs. Lead-authored-by: GuoPhilipse <guofei_ok@126.com> Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 892b600ce3a579ea4acabf8ff378c19830a7d89c) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 22 May 2020, 13:01:51 UTC
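For reference, Spark's side of the discrepancy is easy to check: `CAST(<integral> AS TIMESTAMP)` interprets the value as seconds since the Unix epoch, while Hive interprets the same cast differently, which is what the new migration-guide entry documents. A hedged illustration, assuming a `spark` session and the UTC session time zone:
```scala
spark.conf.set("spark.sql.session.timeZone", "UTC")
// The integral value is treated as seconds since the epoch: this shows 1970-01-01 00:00:00.
spark.sql("SELECT CAST(0 AS TIMESTAMP) AS ts").show()
```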
ec80e4b [SPARK-31784][CORE][TEST] Fix test BarrierTaskContextSuite."share messages with allGather() call" ### What changes were proposed in this pull request? Change from `messages.toList.iterator` to `Iterator.single(messages.toList)`. ### Why are the changes needed? In this test, the expected result of `rdd2.collect().head` should actually be `List("0", "1", "2", "3")` but is `"0"` now. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated test. Thanks WeichenXu123 reported this problem. Closes #28596 from Ngone51/fix_allgather_test. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> (cherry picked from commit 83d0967dcc6b205a3fd2003e051f49733f63cb30) Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> 22 May 2020, 06:34:45 UTC
212ca86 [SPARK-31785][SQL][TESTS] Add a helper function to test all parquet readers ### What changes were proposed in this pull request? Add `withAllParquetReaders` to `ParquetTest`. The function allow to run a block of code for all available Parquet readers. ### Why are the changes needed? 1. It simplifies tests 2. Allow to test all parquet readers that could be available in projects based on Apache Spark. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running affected test suites. Closes #28598 from MaxGekk/add-withAllParquetReaders. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 60118a242639df060a9fdcaa4f14cd072ea3d056) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 22 May 2020, 00:53:49 UTC
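A hedged sketch of what such a helper can look like, assuming a suite that already mixes in Spark's test-only `SQLTestUtils` (which provides `withSQLConf`); the real helper in `ParquetTest` may differ in detail:
```scala
import org.apache.spark.sql.internal.SQLConf

// Run the same block once with the vectorized Parquet reader and once with parquet-mr.
def withAllParquetReaders(code: => Unit): Unit = {
  Seq("true", "false").foreach { vectorized =>
    withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized) {
      code
    }
  }
}
```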
cc1e568 Revert "[SPARK-31387][SQL] Handle unknown operation/session ID in HiveThriftServer2Listener" This reverts commit 5198b6853b2e3bc69fc013c653aa163c79168366. 21 May 2020, 21:20:50 UTC
0f2afd3 [SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener ## What changes were proposed in this pull request? This change was made as a result of the conversation on https://issues.apache.org/jira/browse/SPARK-31354 and is intended to continue work from that ticket here. This change fixes a memory leak where SparkSession listeners are never cleared off of the SparkContext listener bus. Before running this PR, the following code: ``` SparkSession.builder().master("local").getOrCreate() SparkSession.clearActiveSession() SparkSession.clearDefaultSession() SparkSession.builder().master("local").getOrCreate() SparkSession.clearActiveSession() SparkSession.clearDefaultSession() ``` would result in a SparkContext with the following listeners on the listener bus: ``` [org.apache.spark.status.AppStatusListener5f610071, org.apache.spark.HeartbeatReceiverd400c17, org.apache.spark.sql.SparkSession$$anon$125849aeb, <-First instance org.apache.spark.sql.SparkSession$$anon$1fadb9a0] <- Second instance ``` After this PR, the execution of the same code above results in SparkContext with the following listeners on the listener bus: ``` [org.apache.spark.status.AppStatusListener5f610071, org.apache.spark.HeartbeatReceiverd400c17, org.apache.spark.sql.SparkSession$$anon$125849aeb] <-One instance ``` ## How was this patch tested? * Unit test included as a part of the PR Closes #28128 from vinooganesh/vinooganesh/SPARK-27958. Lead-authored-by: Vinoo Ganesh <vinoo.ganesh@gmail.com> Co-authored-by: Vinoo Ganesh <vganesh@palantir.com> Co-authored-by: Vinoo Ganesh <vinoo@safegraph.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit dae79888dc6476892877d3b3b233381cdbf7fa74) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 May 2020, 16:06:41 UTC
b4df7b5 [SPARK-31762][SQL] Fix perf regression of date/timestamp formatting in toHiveString ### What changes were proposed in this pull request? 1. Add new methods that accept date-time Java types to the DateFormatter and TimestampFormatter traits. The methods format input date-time instances to strings: - TimestampFormatter: - `def format(ts: Timestamp): String` - `def format(instant: Instant): String` - DateFormatter: - `def format(date: Date): String` - `def format(localDate: LocalDate): String` 2. Re-use the added methods from `HiveResult.toHiveString` 3. Borrow the code for formatting of `java.sql.Timestamp` from Spark 2.4 `DateTimeUtils.timestampToString` to `FractionTimestampFormatter` because legacy formatters don't support variable length patterns for seconds fractions. ### Why are the changes needed? To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing tests for toHiveString and new tests in `TimestampFormatterSuite`. Closes #28582 from MaxGekk/opt-format-old-types. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5d673319af81bb826e5f532b5ff25f2d4b2da122) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 May 2020, 04:01:30 UTC
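Using the methods listed above, `toHiveString` can format the external Java values directly instead of round-tripping through Spark's internal micros/days representation. A hedged sketch of that dispatch (`DateFormatter` and `TimestampFormatter` are Spark-internal classes; the surrounding function is illustrative and the real `HiveResult` code handles many more types):
```scala
import java.sql.{Date, Timestamp}
import java.time.{Instant, LocalDate}
import org.apache.spark.sql.catalyst.util.{DateFormatter, TimestampFormatter}

def toHiveString(value: Any, tsFormatter: TimestampFormatter, dateFormatter: DateFormatter): String =
  value match {
    case t: Timestamp  => tsFormatter.format(t)    // no conversion to micros first
    case i: Instant    => tsFormatter.format(i)
    case d: Date       => dateFormatter.format(d)  // no conversion to days first
    case ld: LocalDate => dateFormatter.format(ld)
    case other         => other.toString
  }
```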
77797db [SPARK-31780][K8S][TESTS] Add R test tag to exclude R K8s image building and test ### What changes were proposed in this pull request? This PR aims to skip R image building and one R test during integration tests by using `--exclude-tags r`. ### Why are the changes needed? We have only one R integration test case, `Run SparkR on simple dataframe.R example`, for submission test coverage. Since this is rarely changed, we can skip this and save the efforts required for building the whole R image and running the single test. ``` KubernetesSuite: ... - Run SparkR on simple dataframe.R example Run completed in 10 minutes, 20 seconds. Total number of tests run: 20 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the K8S integration test and do the following manually. (Note that R test is skipped) ``` $ resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --deploy-mode docker-for-desktop --exclude-tags r --spark-tgz $PWD/spark-*.tgz ... KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - PVs with local storage - Launcher client dependencies - Test basic decommissioning Run completed in 10 minutes, 23 seconds. Total number of tests run: 19 Suites: completed 2, aborted 0 Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #28594 from dongjoon-hyun/SPARK-31780. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a06768ec4d5059d1037086fe5495e5d23cde514b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 21 May 2020, 01:33:51 UTC
6b31e1a [SPARK-30689][CORE][FOLLOW-UP] Add @since annotation for ResourceDiscoveryScriptPlugin/ResourceInformation ### What changes were proposed in this pull request? Added `since 3.0.0` for `ResourceDiscoveryScriptPlugin` and `ResourceInformation`. ### Why are the changes needed? It's required for exposed APIs(https://github.com/apache/spark/pull/27689#discussion_r426488851). ### Does this PR introduce _any_ user-facing change? Yes, they can easily know when does Spark introduces the API. ### How was this patch tested? Pass Jenkins. Closes #28591 from Ngone51/followup-30689. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 26bc6907222ddb2d800a2c1245232887422e49d9) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 20 May 2020, 18:12:09 UTC
5198b68 [SPARK-31387][SQL] Handle unknown operation/session ID in HiveThriftServer2Listener ### What changes were proposed in this pull request? This is a recreation of #28155, which was reverted due to causing test failures. The update methods in HiveThriftServer2Listener now check if the parameter operation/session ID actually exist in the `sessionList` and `executionList` respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log. To improve robustness, we also make the following changes in HiveSessionImpl.close(): - Catch any exception thrown by `operationManager.closeOperation`. If for any reason this throws an exception, other operations are not prevented from being closed. - Handle not being able to access the scratch directory. When closing, all `.pipeout` files are removed from the scratch directory, which would have resulted in an NPE if the directory does not exist. ### Why are the changes needed? The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this changes the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException. In HiveSessionImpl.close(), if an exception is thrown when closing an operation, all following operations are not closed. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests Closes #28544 from alismess-db/hive-thriftserver-listener-update-safer-2. Authored-by: Ali Smesseim <ali.smesseim@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit d40ecfa3f7b0102063075cafd54451f082ce38e1) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 May 2020, 17:30:33 UTC
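A hedged sketch of the defensive pattern the change applies (field and method names here are illustrative, not the listener's actual ones): look the ID up first, and log a warning instead of dereferencing a missing entry.
```scala
import java.util.concurrent.ConcurrentHashMap
import org.slf4j.LoggerFactory

object SafeListenerUpdate {
  private val log = LoggerFactory.getLogger(getClass)
  private val sessionList = new ConcurrentHashMap[String, AnyRef]()

  def onSessionClosed(sessionId: String): Unit = {
    val session = sessionList.get(sessionId)
    if (session == null) {
      // Previously this surfaced as a NullPointerException swallowed by the ListenerBus.
      log.warn(s"onSessionClosed called with unknown session id: $sessionId")
    } else {
      // ... update the session's state as before
    }
  }
}
```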
4e95339 Revert [SPARK-27142][SPARK-31440] SQL rest API in branch 3.0 ### What changes were proposed in this pull request? Revert https://github.com/apache/spark/pull/28208 and https://github.com/apache/spark/pull/24076 in branch 3.0 ### Why are the changes needed? Unfortunately, the PR https://github.com/apache/spark/pull/28208 is merged after Spark 3.0 RC 2 cut. Although the improvement is great, we can't break the policy to add new improvement commits into branch 3.0 now. Also, if we are going to adopt the improvement in a future release, we should not release 3.0 with https://github.com/apache/spark/pull/24076, since the API result will be changed. After discuss with cloud-fan and gatorsmile offline, we think the best choice is to revert both commits and follow community release policy. ### Does this PR introduce _any_ user-facing change? Yes, let's hold the SQL rest API until next release. ### How was this patch tested? Jenkins unit tests. Closes #28588 from gengliangwang/revertSQLRestAPI. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 May 2020, 12:20:45 UTC
0a20d32 [SPARK-31766][K8S][TESTS] Add Spark version prefix to K8s UUID test image tag ### What changes were proposed in this pull request? This PR aims to add Spark version prefix during generating test image tag for K8s integration testing. ### Why are the changes needed? This helps to distinguish the images by version. **BEFORE** ``` $ docker images | grep kubespark kubespark/spark-py F7188CBD-AE08-4705-9C8A-D0DD3DC8B86F ... kubespark/spark F7188CBD-AE08-4705-9C8A-D0DD3DC8B86F ... ``` **AFTER** ``` $ docker images | grep kubespark kubespark/spark-py 3.1.0-SNAPSHOT_F7188CBD-AE08-4705-9C8A-D0DD3DC8B86F ... kubespark/spark 3.1.0-SNAPSHOT_F7188CBD-AE08-4705-9C8A-D0DD3DC8B86F ... ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the K8s integration test. ``` ... Successfully tagged kubespark/spark:3.1.0-SNAPSHOT_688b46c8-c119-404d-aadb-d05a14262db7 ... Successfully tagged kubespark/spark-py:3.1.0-SNAPSHOT_688b46c8-c119-404d-aadb-d05a14262db7 ... Successfully tagged kubespark/spark-r:3.1.0-SNAPSHOT_688b46c8-c119-404d-aadb-d05a14262db7 ``` Closes #28587 from dongjoon-hyun/SPARK-31766. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b7947e0285f61ac2d8cc3f93c75771a64ad8e1cd) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 May 2020, 05:58:10 UTC
e3996f4 [SPARK-31767][PYTHON][CORE] Remove ResourceInformation in pyspark module's namespace This PR proposes to only allow the import of `ResourceInformation` as below: ``` pyspark.resource.ResourceInformation ``` instead of ``` pyspark.ResourceInformation pyspark.resource.ResourceInformation ``` because `pyspark.resource` is a separate module, and it is documented so. The constructor of `ResourceInformation` isn't supposed to directly call anyway. To keep the code structure coherent. No, it will be in the unreleased branches. Manually tested via importing: Before: ```python >>> import pyspark >>> pyspark.ResourceInformation <class 'pyspark.resource.information.ResourceInformation'> >>> pyspark.resource.ResourceInformation <class 'pyspark.resource.information.ResourceInformation'> ``` After: ```python >>> import pyspark >>> pyspark.ResourceInformation Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: module 'pyspark' has no attribute 'ResourceInformation' >>> pyspark.resource.ResourceInformation <class 'pyspark.resource.information.ResourceInformation'> ``` Also tested via ```bash cd python ./run-tests --python-executables=python3 --modules=pyspark-core,pyspark-resource ``` Jenkins will test and existing tests should cover. Closes #28589 from HyukjinKwon/SPARK-31767. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit dc3a606fbd37ede404a61551ff792fa94fe1780f) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 May 2020, 05:37:35 UTC
57797d9 [SPARK-31748][PYTHON][3.0] Document resource module in PySpark doc and rename/move classes ### What changes were proposed in this pull request? This PR partially backports https://github.com/apache/spark/pull/28569. 1.. Rename ``` pyspark └── resourceinformation.py    └── class ResourceInformation ``` to ``` pyspark └── resource.py    └── class ResourceInformation ``` So, the `ResourceInformation` is consistently imported via `pyspark.resource.ResourceInformation`. 2.. Document the new `pyspark.resource` module 3.. Minor docstring fix e.g.: ```diff - param name the name of the resource - param addresses an array of strings describing the addresses of the resource + :param name: the name of the resource + :param addresses: an array of strings describing the addresses of the resource + + .. versionadded:: 3.0.0 ``` ### Why are the changes needed? To document APIs, and move Python modules to fewer and simpler modules. ### Does this PR introduce _any_ user-facing change? No, the changes are in unreleased branches. ### How was this patch tested? Manually tested via: ```bash cd python ./run-tests --python-executables=python3 --modules=pyspark-core ``` Closes #28586 from HyukjinKwon/SPARK-31748. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 20 May 2020, 04:37:49 UTC
a1391cf [SPARK-31706][SQL] add back the support of streaming update mode ### What changes were proposed in this pull request? This PR adds a private `WriteBuilder` mixin trait: `SupportsStreamingUpdate`, so that the builtin v2 streaming sinks can still support the update mode. Note: it's private because we don't have a proper design yet. I didn't take the proposal in https://github.com/apache/spark/pull/23702#discussion_r258593059 because we may want something more general, like updating by an expression `key1 = key2 + 10`. ### Why are the changes needed? In Spark 2.4, all builtin v2 streaming sinks support all streaming output modes, and v2 sinks are enabled by default, see https://issues.apache.org/jira/browse/SPARK-22911 It's too risky for 3.0 to go back to v1 sinks, so I propose to add a private trait to fix builtin v2 sinks, to keep backward compatibility. ### Does this PR introduce _any_ user-facing change? Yes, now all the builtin v2 streaming sinks support all streaming output modes, which is the same as 2.4 ### How was this patch tested? existing tests. Closes #28523 from cloud-fan/update. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 34414acfa36d1a91cd6a64761625c8b0bd90c0a7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 May 2020, 03:45:23 UTC
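A rough sketch of the kind of mixin described above (hedged: the actual trait is private to Spark, and its exact shape, including the method name used here, is an assumption): a capability marker a `WriteBuilder` can implement so the built-in v2 streaming sinks can declare support for update mode.
```scala
import org.apache.spark.sql.connector.write.WriteBuilder

// Hypothetical shape of the private mixin; the method name `update` is an assumption.
trait SupportsStreamingUpdate extends WriteBuilder {
  def update(): WriteBuilder
}
```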
d0a7c6f [SPARK-31750][SQL] Eliminate UpCast if child's dataType is DecimalType ### What changes were proposed in this pull request? Eliminate the `UpCast` if it's child data type is already decimal type. ### Why are the changes needed? While deserializing internal `Decimal` value to external `BigDecimal`(Java/Scala) value, Spark should also respect `Decimal`'s precision and scale, otherwise it will cause precision lost and look weird in some cases, e.g.: ``` sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d") .write.mode("overwrite") .parquet(f.getAbsolutePath) // can fail spark.read.parquet(f.getAbsolutePath).as[BigDecimal] ``` ``` [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from decimal(38,0) to decimal(38,18). [info] The type path of the target object is: [info] - root class: "scala.math.BigDecimal" [info] You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object; [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) [info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) ``` ### Does this PR introduce _any_ user-facing change? Yes, for cases(cause precision lost) mentioned above will fail before this change but run successfully after this change. ### How was this patch tested? Added tests. Closes #28572 from Ngone51/fix_encoder. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 20 May 2020, 02:20:41 UTC
70ff959 [SPARK-31721][SQL] Assert optimized is initialized before tracking the planning time ### What changes were proposed in this pull request? The QueryPlanningTracker in QueryExecution reports a planning time that also includes the optimization time. This happens because the optimizedPlan in QueryExecution is lazy and is only initialized when first accessed. When df.queryExecution.executedPlan is called, the tracker starts recording the planning time and then evaluates the optimized plan. This causes the measured planning time to start before optimization and therefore also include the optimization time. This PR fixes this behavior by introducing a method assertOptimized, similar to assertAnalyzed, that explicitly initializes the optimized plan. This method is called before measuring the time for sparkPlan and executedPlan. We call it before sparkPlan because that also counts as planning time. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #28543 from dbaliafroozeh/AddAssertOptimized. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com> (cherry picked from commit b9cc31cd958fec4b473e757694c5f811316c858e) Signed-off-by: herman <herman@databricks.com> 19 May 2020, 09:11:33 UTC
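A hedged, heavily simplified sketch of the pattern (class and member names are stand-ins; the real `QueryExecution` tracks phases through `QueryPlanningTracker`): force the lazy optimized plan before the timed planning block so optimization time is not attributed to planning.
```scala
class QueryExecutionSketch {
  lazy val optimizedPlan: AnyRef = runOptimizer()   // its cost should be counted as optimization

  def assertOptimized(): Unit = optimizedPlan       // merely forces initialization

  lazy val sparkPlan: AnyRef = {
    assertOptimized()                               // optimize outside the timed block
    trackPlanningTime { planFromOptimized() }
  }

  private def runOptimizer(): AnyRef = new Object
  private def planFromOptimized(): AnyRef = new Object
  private def trackPlanningTime[T](body: => T): T = body  // stand-in for the tracker's planning phase
}
```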
9f34cf5 [SPARK-31651][CORE] Improve handling the case where different barrier sync types in a single sync ### What changes were proposed in this pull request? This PR improves handling the case where different barrier sync types in a single sync: - use `clear` instead of `cleanupBarrierStage ` - make sure all requesters are failed because of "different barrier sync types" ### Why are the changes needed? Currently, we use `cleanupBarrierStage` to clean up a barrier stage when we detecting the case of "different barrier sync types". But this leads to a problem that we could create new a `ContextBarrierState` for the same stage again if there're on-way requests from tasks. As a result, those task will fail because of killing instead of "different barrier sync types". Besides, we don't handle the current request which is being handling properly as it will fail due to epoch mismatch instead of "different barrier sync types". ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated a existed test. Closes #28462 from Ngone51/impr_barrier_req. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> (cherry picked from commit 653ca19b1f26155c2b7656e2ebbe227b18383308) Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com> 19 May 2020, 06:54:54 UTC
d668d67 [SPARK-31440][SQL] Improve SQL Rest API ### What changes were proposed in this pull request? SQL Rest API exposes query execution metrics as Public API. This PR aims to apply following improvements on SQL Rest API by aligning Spark-UI. **Proposed Improvements:** 1- Support Physical Operations and group metrics per physical operation by aligning Spark UI. 2- Support `wholeStageCodegenId` for Physical Operations 3- `nodeId` can be useful for grouping metrics and sorting physical operations (according to execution order) to differentiate same operators (if used multiple times during the same query execution) and their metrics. 4- Filter `empty` metrics by aligning with Spark UI - SQL Tab. Currently, Spark UI does not show empty metrics. 5- Remove line breakers(`\n`) from `metricValue`. 6- `planDescription` can be `optional` Http parameter to avoid network cost where there is specially complex jobs creating big-plans. 7- `metrics` attribute needs to be exposed at the bottom order as `nodes`. Specially, this can be useful for the user where `nodes` array size is high. 8- `edges` attribute is being exposed to show relationship between `nodes`. 9- Reverse order on `metricDetails` aims to match with Spark UI by supporting Physical Operators' execution order. ### Why are the changes needed? Proposed improvements provides more useful (e.g: physical operations and metrics correlation, grouping) and clear (e.g: filtering blank metrics, removing line breakers) result for the end-user. ### Does this PR introduce any user-facing change? Yes. Please find both current and improved versions of the results as attached for following SQL Rest Endpoint: ``` curl -X GET http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true ``` **Current version:** https://issues.apache.org/jira/secure/attachment/12999821/current_version.json **Improved version:** https://issues.apache.org/jira/secure/attachment/13000621/improved_version.json ### Backward Compatibility SQL Rest API will be started to expose with `Spark 3.0` and `3.0.0-preview2` (released on 12/23/19) does not cover this API so if PR can catch 3.0 release, this will not have any backward compatibility issue. ### How was this patch tested? 1. New Unit tests are added. 2. Also, patch has been tested manually through both **Spark Core** and **History Server** Rest APIs. Closes #28208 from erenavsarogullari/SPARK-31440. Authored-by: Eren Avsarogullari <eren.avsarogullari@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> (cherry picked from commit ab4cf49a1ca5a2578ecd0441ad6db8e88593e648) Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 19 May 2020, 06:21:58 UTC
18925c3 [SPARK-31739][PYSPARK][DOCS][MINOR] Fix docstring syntax issues and misplaced space characters This commit is published into the public domain. Some syntax issues in docstrings have been fixed. In some places, the documentation did not render as intended, e.g. parameter documentations were not formatted as such. Slight improvements in documentation. Manual testing and `dev/lint-python` run. No new Sphinx warnings arise due to this change. Closes #28559 from DavidToneian/SPARK-31739. Authored-by: David Toneian <david@toneian.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 19 May 2020, 02:01:38 UTC
847d6d4 [SPARK-31102][SQL][3.0] Spark-sql fails to parse when it contains a comment This PR resets the insideComment flag to false on a newline, fixing the issue introduced by SPARK-30049. Backport to 3.0 from #27920 Previously, on SPARK-30049, a comment containing an unclosed quote produced the following issue: ``` spark-sql> SELECT 1 -- someone's comment here > ; Error in query: extraneous input ';' expecting <EOF>(line 2, pos 0) == SQL == SELECT 1 -- someone's comment here ; ^^^ ``` This was caused because there was no flag for comment sections inside the splitSemiColon method to ignore quotes. SPARK-30049 added that flag and fixed the issue, but introduced the following problem: ``` spark-sql> select > 1, > -- two > 2; Error in query: mismatched input '<EOF>' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', ...}(line 3, pos 2) == SQL == select 1, --^^^ ``` This issue is caused by the insideComment flag not being turned off at a newline. No user-facing change. - For previous tests using line-continuity (`\`), a line-continuity rule was added to the SqlBase.g4 file to bring the functionality to the SQL context. - A new test for inline comments was added. Closes #27920 from javierivanov/SPARK-31102. Authored-by: Javier Fuentes <j.fuentes.m@icloud.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> Closes #28565 from javierivanov/SPARK-3.0-31102. Authored-by: Javier Fuentes <j.fuentes.m@icloud.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 18 May 2020, 23:57:30 UTC
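A hedged, stripped-down sketch of the state machine being discussed (not the actual `SparkSQLCLIDriver.splitSemiColon` code): a `--` comment runs to the end of the line, so the `insideComment` flag has to be reset when a newline is seen; quotes inside a comment must not toggle the quoting state, and semicolons inside quotes or comments must not split the statement.
```scala
def splitSemiColon(input: String): Seq[String] = {
  val parts = scala.collection.mutable.ArrayBuffer.empty[String]
  val current = new StringBuilder
  var insideSingleQuote = false
  var insideComment = false
  var i = 0
  while (i < input.length) {
    val c = input.charAt(i)
    c match {
      case '\'' if !insideComment =>
        insideSingleQuote = !insideSingleQuote; current.append(c)
      case '-' if !insideSingleQuote && !insideComment &&
          i + 1 < input.length && input.charAt(i + 1) == '-' =>
        insideComment = true; current.append(c)
      case '\n' =>
        insideComment = false; current.append(c)   // the fix: a comment ends at the newline
      case ';' if !insideSingleQuote && !insideComment =>
        parts += current.toString; current.clear() // statement boundary
      case _ =>
        current.append(c)
    }
    i += 1
  }
  parts += current.toString
  parts.toSeq
}
```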
f6053b9 Preparing development version 3.0.1-SNAPSHOT 18 May 2020, 13:21:43 UTC
29853ec Preparing Spark release v3.0.0-rc2 18 May 2020, 13:21:37 UTC
740da34 [SPARK-31738][SQL][DOCS] Describe 'L' and 'M' month pattern letters ### What changes were proposed in this pull request? 1. Describe standard 'M' and stand-alone 'L' text forms 2. Add examples for all supported number of month letters <img width="1047" alt="Screenshot 2020-05-18 at 08 57 31" src="https://user-images.githubusercontent.com/1580697/82178856-b16f1000-98e5-11ea-87c0-456ef94dcd43.png"> ### Why are the changes needed? To improve docs and show how to use month patterns. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By building docs and checking by eyes. Closes #28558 from MaxGekk/describe-L-M-date-pattern. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b3686a762281ce9bf595bf790f8e4198d3d186b4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 May 2020, 12:07:13 UTC
2cdf4eb [SPARK-31746][YARN][TESTS] Show the actual error message in LocalityPlacementStrategySuite This PR proposes to show the actual traceback when "handle large number of containers and tasks (SPARK-18750)" test fails in `LocalityPlacementStrategySuite`. **It does not fully resolve the JIRA SPARK-31746 yet**. I tried to reproduce in my local by controlling the factors in the tests but I couldn't. I double checked the changes in SPARK-18750 are still valid. This test is flaky for an unknown reason (see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122768/testReport/org.apache.spark.deploy.yarn/LocalityPlacementStrategySuite/handle_large_number_of_containers_and_tasks__SPARK_18750_/): ``` sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError did not equal null at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) ``` After this PR, it will help to investigate the root cause: **Before**: ``` [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (824 milliseconds) [info] java.lang.StackOverflowError did not equal null (LocalityPlacementStrategySuite.scala:49) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) [info] at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:49) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157) [info] at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) [info] at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) [info] at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) ... 
``` **After**: ``` [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (825 milliseconds) [info] StackOverflowError should not be thrown; however, got: [info] [info] java.lang.StackOverflowError [info] at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) ... ``` No, dev-only. Manually tested by reverting https://github.com/apache/spark/commit/76db394f2baedc2c7b7a52c05314a64ec9068263 locally. Closes #28566 from HyukjinKwon/SPARK-31746. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 3bf7bf99e96fab754679a4f3c893995263161341) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 18 May 2020, 05:56:24 UTC
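A minimal Scala sketch of the reporting pattern behind the improved output above: run the workload, capture any throwable, and fail with the full stack trace instead of only asserting that no error was recorded. `runPlacementTask()` is a hypothetical stand-in for the suite's actual workload, and a plain `AssertionError` stands in for ScalaTest's `fail(...)` so the sketch stays self-contained.

```scala
import java.io.{PrintWriter, StringWriter}

// Hypothetical stand-in that fails the same way the flaky test did.
def runPlacementTask(): Unit = throw new StackOverflowError()

// Capture whatever the workload throws instead of only recording a non-null error.
val error: Throwable =
  try { runPlacementTask(); null }
  catch { case e: Throwable => e }

if (error != null) {
  val sw = new StringWriter()
  error.printStackTrace(new PrintWriter(sw, true))
  // In the suite this would be ScalaTest's fail(...); the message mirrors the
  // "should not be thrown; however, got:" output shown above.
  throw new AssertionError(
    s"${error.getClass.getName} should not be thrown; however, got:\n\n$sw")
}
```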
88e00c3 Revert "[SPARK-31746][YARN][TESTS] Show the actual error message in LocalityPlacementStrategySuite" This reverts commit cbd8568ad7588cf14e5519c5aabf88d0b7fb0e33. 18 May 2020, 05:55:31 UTC
cbd8568 [SPARK-31746][YARN][TESTS] Show the actual error message in LocalityPlacementStrategySuite ### What changes were proposed in this pull request? This PR proposes to show the actual traceback when "handle large number of containers and tasks (SPARK-18750)" test fails in `LocalityPlacementStrategySuite`. **It does not fully resolve the JIRA SPARK-31746 yet**. I tried to reproduce in my local by controlling the factors in the tests but I couldn't. I double checked the changes in SPARK-18750 are still valid. ### Why are the changes needed? This test is flaky for an unknown reason (see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122768/testReport/org.apache.spark.deploy.yarn/LocalityPlacementStrategySuite/handle_large_number_of_containers_and_tasks__SPARK_18750_/): ``` sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError did not equal null at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) ``` After this PR, it will help to investigate the root cause: **Before**: ``` [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (824 milliseconds) [info] java.lang.StackOverflowError did not equal null (LocalityPlacementStrategySuite.scala:49) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) [info] at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:49) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157) [info] at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) [info] at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) [info] at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) ... 
``` **After**: ``` [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (825 milliseconds) [info] StackOverflowError should not be thrown; however, got: [info] [info] java.lang.StackOverflowError [info] at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) [info] at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877) [info] at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256) ... ``` ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested by reverting https://github.com/apache/spark/commit/76db394f2baedc2c7b7a52c05314a64ec9068263 locally. Closes #28566 from HyukjinKwon/SPARK-31746. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 3bf7bf99e96fab754679a4f3c893995263161341) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 18 May 2020, 05:35:23 UTC