https://github.com/apache/spark

55efce1 Preparing Spark release v2.4.7-rc2 21 August 2020, 08:35:14 UTC
4344a69 [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc ### What changes were proposed in this pull request? This adds guidance to the tuning doc for increasing the parallelism of directory listing. ### Why are the changes needed? Sometimes when a job's input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune this, and adding that information to the Spark tuning guide makes the knowledge better shared. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #29498 from sunchao/SPARK-32674. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit bf221debd02b11003b092201d0326302196e4ba5) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 August 2020, 07:49:39 UTC
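A hedged illustration of the kind of tuning the doc change describes; the config keys below are assumptions drawn from Spark's file-listing options rather than quoted from the commit, and the values are arbitrary.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: raise the parallelism used when listing many input directories.
val spark = SparkSession.builder()
  .appName("listing-tuning-example")
  // Hadoop-side thread count used when listing file status (assumed key)
  .config("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "20")
  // Spark SQL file index: switch to a distributed listing job sooner, with more tasks (assumed keys)
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "100")
  .getOrCreate()
```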
d846476 [SPARK-31966][ML][TESTS][PYTHON][2.4] Increase the timeout for StreamingLogisticRegressionWithSGDTests.test_training_and_prediction ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/28798 to branch-2.4 to reduce the flakiness in the tests. This is similar with https://github.com/apache/spark/commit/64cb6f7066134a0b9e441291992d2da73de5d918 The test `StreamingLogisticRegressionWithSGDTests.test_training_and_prediction` seems also flaky. This PR just increases the timeout to 3 mins too. The cause is very likely the time elapsed. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ ``` Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 330, in test_training_and_prediction eventually(condition, timeout=60.0) File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/testing/utils.py", line 90, in eventually % (timeout, lastValue)) AssertionError: Test failed due to timeout after 60 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75 ``` ### Why are the changes needed? To make PR builds more stable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Jenkins will test them out. Closes #29481 from HyukjinKwon/SPARK-31966-2.4. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 20 August 2020, 01:33:27 UTC
7c65f76 [SPARK-32249][INFRA][2.4] Run Github Actions builds in branch-2.4 ### What changes were proposed in this pull request? This PR proposes to backport the following JIRAs: - SPARK-32245 - SPARK-32292 - SPARK-32252 - SPARK-32408 - SPARK-32303 - SPARK-32363 - SPARK-32419 - SPARK-32491 - SPARK-32493 - SPARK-32496 - SPARK-32497 - SPARK-32357 - SPARK-32606 - SPARK-32605 - SPARK-32645 - Minor renaming https://github.com/apache/spark/commit/d0dfe4986b1c4cb5a47be46b2bbedeea42d81caf#diff-02d9c370a663741451423342d5869b21 in order to enable GitHub Actions in branch-2.4. ### Why are the changes needed? To be able to run the tests in branch-2.4. Jenkins jobs are unstable. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Build in this PR will test. Closes #29465 from HyukjinKwon/SPARK-32249-2.4. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 August 2020, 01:31:20 UTC
ff8a663 [SPARK-32647][INFRA] Report SparkR test results with JUnit reporter This PR proposes to generate JUnit XML test report in SparkR tests that can be leveraged in both Jenkins and GitHub Actions. **GitHub Actions** ![Screen Shot 2020-08-18 at 12 42 46 PM](https://user-images.githubusercontent.com/6477701/90467934-55b85b00-e150-11ea-863c-c8415e764ddb.png) **Jenkins** ![Screen Shot 2020-08-18 at 2 03 42 PM](https://user-images.githubusercontent.com/6477701/90472509-a5505400-e15b-11ea-9165-777ec9b96eaa.png) NOTE that while I am here, I am switching back the console reporter from "progress" to "summary". Currently non-ascii codes are broken in Jenkins console and switching it to "summary" can work around it. "summary" is the default format used in testthat 1.x. To check the test failures more easily. No, dev-only It is tested in GitHub Actions at https://github.com/HyukjinKwon/spark/pull/23/checks?check_run_id=996586446 In case of Jenkins, https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127525/testReport/ Closes #29456 from HyukjinKwon/sparkr-junit. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 19 August 2020, 06:14:40 UTC
34544d6 [SPARK-32609] Incorrect exchange reuse with DataSourceV2 ### What changes were proposed in this pull request? Compare pushedFilters in DataSourceV2ScanExec's equals function. ### Why are the changes needed? Scans with different filters were considered equal, thus causing incorrect exchange reuse. This change fixes the issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test coverage and ad hoc verification. Without this change, the unit test fails. Closes #29430 from mingjialiu/branch-2.4. Authored-by: mingjial <mingjial@google.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 August 2020, 15:23:35 UTC
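A minimal sketch of the idea behind the fix, with hypothetical names: a scan node's equality (and hashing) must include its pushed-down filters, otherwise exchange reuse wrongly deduplicates scans that differ only in their filters.

```scala
import org.apache.spark.sql.sources.Filter

// Hypothetical, simplified scan node: pushedFilters participates in equals/hashCode,
// so two scans over the same source with different filters never compare equal.
class SimplifiedScan(val source: String, val pushedFilters: Seq[Filter]) {
  override def equals(other: Any): Boolean = other match {
    case that: SimplifiedScan =>
      this.source == that.source && this.pushedFilters == that.pushedFilters
    case _ => false
  }
  override def hashCode(): Int = (source, pushedFilters).hashCode()
}
```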
e3ec2d7 Revert "[SPARK-32018][SQL][2.4] UnsafeRow.setDecimal should set null with overflowed value" This reverts commit afdad0eea81f40cab32b95af8c1bbeed55c5f10f. 17 August 2020, 13:58:13 UTC
75d45fa [SPARK-32625][SQL] Log error message when falling back to interpreter mode ### What changes were proposed in this pull request? This PR logs the error message when falling back to interpreter mode. ### Why are the changes needed? Not all error messages are in `CodeGenerator`, such as: ``` 21:48:44.612 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr codegen error and falling back to interpreter mode java.lang.IllegalArgumentException: Can not interpolate org.apache.spark.sql.types.Decimal into code block. at org.apache.spark.sql.catalyst.expressions.codegen.Block$BlockHelper$.$anonfun$code$1(javaCode.scala:240) at org.apache.spark.sql.catalyst.expressions.codegen.Block$BlockHelper$.$anonfun$code$1$adapted(javaCode.scala:236) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #29440 from wangyum/SPARK-32625. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c280c7f529e2766dd7dd45270bde340c28b9d74b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 August 2020, 19:32:05 UTC
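A rough sketch of the pattern this change follows (helper names hypothetical, not the committed code): when codegen fails, log the underlying exception before falling back to the interpreted path so the root cause is not swallowed.

```scala
import org.slf4j.LoggerFactory

// Hypothetical helper illustrating the pattern: try codegen, log the failure, fall back.
object CodegenFallbackExample {
  private val log = LoggerFactory.getLogger(getClass)

  def compileOrInterpret[T](codegen: => T, interpreted: => T): T = {
    try {
      codegen
    } catch {
      case e: Exception =>
        // Log the actual error so failures raised outside CodeGenerator are visible too.
        log.warn("Expr codegen error and falling back to interpreter mode", e)
        interpreted
    }
  }
}
```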
e5bef51 [SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on big endian platforms ### What changes were proposed in this pull request? (back-porting from https://github.com/apache/spark/commit/9a3811dbf5f1234c1587337a3d74823d1d163b53) This PR fixes the issue introduced during SPARK-26985. SPARK-26985 changes the `putDoubles()` and `putFloats()` methods to respect the platform's endian-ness. However, that causes the RLE paths in VectorizedRleValuesReader.java to read the RLE entries in parquet as BIG_ENDIAN on big endian platforms (i.e., as is), even though parquet data is always in little endian format. The comments in `WriteableColumnVector.java` say those methods are used for "ieee formatted doubles in platform native endian" (or floats), but since the data in parquet is always in little endian format, use of those methods appears to be inappropriate. To demonstrate the problem with spark-shell: ```scala import org.apache.spark._ import org.apache.spark.sql._ import org.apache.spark.sql.types._ var data = Seq( (1.0, 0.1), (2.0, 0.2), (0.3, 3.0), (4.0, 4.0), (5.0, 5.0)) var df = spark.createDataFrame(data).write.mode(SaveMode.Overwrite).parquet("/tmp/data.parquet2") var df2 = spark.read.parquet("/tmp/data.parquet2") df2.show() ``` result: ```scala +--------------------+--------------------+ | _1| _2| +--------------------+--------------------+ | 3.16E-322|-1.54234871366845...| | 2.0553E-320| 2.0553E-320| | 2.561E-320| 2.561E-320| |4.66726145843124E-62| 1.0435E-320| | 3.03865E-319|-1.54234871366757...| +--------------------+--------------------+ ``` Also tests in ParquetIOSuite that involve float/double data would fail, e.g., - basic data types (without binary) - read raw Parquet file /examples/src/main/python/mllib/isotonic_regression_example.py would fail as well. Purposed code change is to add `putDoublesLittleEndian()` and `putFloatsLittleEndian()` methods for parquet to invoke, just like the existing `putIntsLittleEndian()` and `putLongsLittleEndian()`. On little endian platforms they would call `putDoubles()` and `putFloats()`, on big endian they would read the entries as little endian like pre-SPARK-26985. No new unit-test is introduced as the existing ones are actually sufficient. ### Why are the changes needed? RLE float/double data in parquet files will not be read back correctly on big endian platforms. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? All unit tests (mvn test) were ran and OK. Closes #29419 from tinhto-000/SPARK-31703-2.4. Lead-authored-by: Tin Hang To <tinto@us.ibm.com> Co-authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 August 2020, 03:48:33 UTC
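A small standalone sketch of the underlying point, independent of the patch: Parquet stores floats/doubles little-endian, so the bytes must be decoded with an explicit byte order rather than the platform default.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Decode IEEE doubles from Parquet page bytes: the on-disk format is always little
// endian, so the buffer order is set explicitly; relying on the platform's native
// order breaks on big-endian machines.
def readLittleEndianDoubles(bytes: Array[Byte]): Array[Double] = {
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
  val out = new Array[Double](bytes.length / 8)
  var i = 0
  while (i < out.length) {
    out(i) = buf.getDouble()
    i += 1
  }
  out
}
```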
b9595c3 [MINOR] Update URL of the parquet project in code comment ### What changes were proposed in this pull request? Update URL of the parquet project in code comment. ### Why are the changes needed? The original url is not available. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? No test needed. Closes #29416 from izchen/Update-Parquet-URL. Authored-by: Chen Zhang <izchen@126.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 13 August 2020, 02:55:59 UTC
f1f6fc3 [MINOR][DOCS] Fix typos at ExecutorAllocationManager.scala ### What changes were proposed in this pull request? This PR fixes some typos in <code>core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala</code> file. ### Why are the changes needed? <code>spark.dynamicAllocation.sustainedSchedulerBacklogTimeout</code> (N) is used only after the <code>spark.dynamicAllocation.schedulerBacklogTimeout</code> (M) is exceeded. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No test needed. Closes #29351 from JoeyValentine/master. Authored-by: JoeyValentine <rlaalsdn0506@naver.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit dc3fac81848f557e3dac3f35686af325a18d0291) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 August 2020, 19:36:57 UTC
a693960 [SPARK-32556][INFRA][2.4] Fix release script to URL encode the user provided passwords ### What changes were proposed in this pull request? 1. URL encode the `ASF_PASSWORD` of the release manager. 2. force delete a `.gitignore` file as it may be absent, specially in the subsequent runs of the script. And causes the release script failure. 3. Update the image to install qpdf and jq dependency. 4. Increase the JVM HEAP memory for MAVEN build. ### Why are the changes needed? Release script takes hours to run, and if a single failure happens about somewhere midway, then either one has to get down to manually doing stuff or re run the entire script. (This is my understanding) So, I have made the fixes of a few failures, discovered so far. 1. If the release manager password contains a char, that is not allowed in URL, then it fails the build at the clone spark step. `git clone "https://$ASF_USERNAME:$ASF_PASSWORD$ASF_SPARK_REPO" -b $GIT_BRANCH` ^^^ Fails with bad URL `ASF_USERNAME` may not be URL encoded, but we need to encode `ASF_PASSWORD`. 2. If the `.gitignore` file is missing, it fails the build at `rm .gitignore` step. 3. Ran into out of memory issues. ``` [INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 57.801 s] [INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [01:13 min] [INFO] Spark Kinesis Integration .......................... SUCCESS [ 59.862 s] [INFO] Spark Project Examples ............................. SUCCESS [02:10 min] [INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 45.454 s] [INFO] Spark Avro ......................................... SUCCESS [01:58 min] [INFO] Spark Project External Flume Sink .................. SUCCESS [01:34 min] [INFO] Spark Project External Flume ....................... FAILURE [20:04 min] [INFO] Spark Project External Flume Assembly .............. SKIPPED [INFO] Spark Project Kinesis Assembly 2.4.7 ............... SKIPPED [INFO] ------------------------------------------------------------------------ [INFO] BUILD FAILURE [INFO] ------------------------------------------------------------------------ [INFO] Total time: 54:10 min [INFO] Finished at: 2020-08-06T10:10:12Z [INFO] ------------------------------------------------------------------------ [ERROR] Java heap space -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the release for branch-2.4, using both type of passwords, i.e. passwords with special chars and without it. For other branches, will followup in other PRs targeted for those branches. Closes #29371 from ScrapCodes/release-script-fixs. Lead-authored-by: Prashant Sharma <prashsh1@in.ibm.com> Co-authored-by: Prashant Sharma <prashant@apache.org> Signed-off-by: Prashant Sharma <prashant@apache.org> 07 August 2020, 06:28:46 UTC
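The release script itself is bash, but the core of fix (1) is simply percent-encoding the password before embedding it in the clone URL; a hedged Scala equivalent with placeholder values:

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Illustrative only: the real fix is in the bash release script. Special characters in
// the password (e.g. '@', '/', ':') must be percent-encoded before being placed in the
// clone URL, or git interprets them as URL syntax. All values below are placeholders.
val asfUsername  = "jane"                                  // placeholder for $ASF_USERNAME
val asfPassword  = "p@ss:word/123"                         // placeholder for $ASF_PASSWORD
val asfSparkRepo = "@gitbox.example.org/repos/asf/spark"   // placeholder for $ASF_SPARK_REPO
val encodedPassword = URLEncoder.encode(asfPassword, StandardCharsets.UTF_8.name())
val cloneUrl = s"https://$asfUsername:$encodedPassword$asfSparkRepo"
```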
701c44d [SPARK-32560][SQL] Improve exception message at InsertIntoHiveTable.processInsert ### What changes were proposed in this pull request? Improve the exception message. ### Why are the changes needed? The previous message lacks single quotes; improving it keeps the style consistent. ![image](https://user-images.githubusercontent.com/46367746/89595808-15bbc300-d888-11ea-9914-b05ea7b66461.png) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No tests; it only improves the message. Closes #29376 from GuoPhilipse/improve-exception-message. Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by: GuoPhilipse <guofei_ok@126.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit aa4d3c19fead4ec2f89b4957b4ccc7482e121e4d) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 07 August 2020, 05:30:08 UTC
7dc17cd Preparing development version 2.4.8-SNAPSHOT 06 August 2020, 07:33:46 UTC
dc04bf5 Preparing Spark release v2.4.7-rc1 06 August 2020, 06:15:37 UTC
39d31dc [SPARK-32003][CORE][2.4] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost ### What changes were proposed in this pull request? If an executor is lost, the `DAGScheduler` handles the executor loss by removing the executor but does not unregister its outputs if the external shuffle service is used. However, if the node on which the executor runs is lost, the shuffle service may not be able to serve the shuffle files. In such a case, when fetches from the executor's outputs fail in the same stage, the `DAGScheduler` again removes the executor and by right, should unregister its outputs. It doesn't because the epoch used to track the executor failure has not increased. We track the epoch for failed executors that result in lost file output separately, so we can unregister the outputs in this scenario. The idea to track a second epoch is due to Attila Zsolt Piros. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. This test fails without the change and passes with it. Closes #29182 from wypoon/SPARK-32003-2.4. Authored-by: Wing Yew Poon <wypoon@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> 04 August 2020, 19:36:59 UTC
91f2a25 [SPARK-28818][SQL][2.4] Respect source column nullability in the arrays created by `freqItems()` ### What changes were proposed in this pull request? This PR replaces the hard-coded non-nullability of the array elements returned by `freqItems()` with a nullability that reflects the original schema. Essentially [the functional change](https://github.com/apache/spark/pull/25575/files#diff-bf59bb9f3dc351f5bf6624e5edd2dcf4R122) to the schema generation is: ``` StructField(name + "_freqItems", ArrayType(dataType, false)) ``` Becomes: ``` StructField(name + "_freqItems", ArrayType(dataType, originalField.nullable)) ``` Respecting the original nullability prevents issues when Spark depends on `ArrayType`'s `containsNull` being accurate. The example that uncovered this is calling `collect()` on the dataframe (see [ticket](https://issues.apache.org/jira/browse/SPARK-28818) for full repro). Though it's likely that there a several places where this could cause a problem. I've also refactored a small amount of the surrounding code to remove some unnecessary steps and group together related operations. Note: This is the backport PR of #25575 and the credit should be MGHawes. ### Why are the changes needed? I think it's pretty clear why this change is needed. It fixes a bug that currently prevents users from calling `df.freqItems.collect()` along with potentially causing other, as yet unknown, issues. ### Does this PR introduce any user-facing change? Nullability of columns when calling freqItems on them is now respected after the change. ### How was this patch tested? I added a test that specifically tests the carry-through of the nullability as well as explicitly calling `collect()` to catch the exact regression that was observed. I also ran the test against the old version of the code and it fails as expected. Closes #29327 from maropu/SPARK-28818-2.4. Lead-authored-by: Matt Hawes <mhawes@palantir.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 02 August 2020, 23:55:28 UTC
4a8f692 [SPARK-32397][BUILD] Allow specifying of time for build to keep time consistent between modules ### What changes were proposed in this pull request? Upgrade codehaus maven build helper to allow people to specify a time during the build to avoid snapshot artifacts with different version strings. ### Why are the changes needed? During builds of snapshots the maven may assign different versions to different artifacts based on the time each individual sub-module starts building. The timestamp is used as part of the version string when run `maven deploy` on a snapshot build. This results in different sub-modules having different version strings. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual build while specifying the current time, ensured the time is consistent in the sub components. Open question: Ideally I'd like to backport this as well since it's sort of a bug fix and while it does change a dependency version it's not one that is propagated. I'd like to hear folks thoughts about this. Closes #29274 from holdenk/SPARK-32397-snapshot-artifact-timestamp-differences. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit 50911df08eb7a27494dc83bcec3d09701c2babfe) Signed-off-by: DB Tsai <d_tsai@apple.com> 29 July 2020, 21:39:42 UTC
c1421d0 [MINOR][PYTHON] Fix spacing in error message ### What changes were proposed in this pull request? Fixes spacing in an error message ### Why are the changes needed? Makes error messages easier to read ### Does this PR introduce _any_ user-facing change? Yes, it changes the error message ### How was this patch tested? This patch doesn't affect any logic, so existing tests should cover it Closes #29264 from hauntsaninja/patch-1. Authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 77f2ca6cced1c723d1c2e6082a1534f6436c6d2a) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 28 July 2020, 02:22:56 UTC
62671af [SPARK-32428][EXAMPLES] Make BinaryClassificationMetricsExample consistently print the metrics on driver's stdout ### What changes were proposed in this pull request? Call collect on the RDD before calling foreach so that it sends the result to the driver node and prints it on that node's stdout. ### Why are the changes needed? Some RDDs in this example (e.g., precision, recall) call println without calling collect. If the job runs in local mode, it sends the data to the driver node and prints the metrics on the driver's stdout. However, if the job runs in cluster mode, it prints the metrics on the executors' stdout. This is inconsistent with the other metrics that have nothing to do with RDDs (e.g., auPRC, auROC), since those metrics always output their result on the driver's stdout. All of the metrics should output their results on the driver's stdout. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is example code. It doesn't have any tests. Closes #29222 from titsuki/SPARK-32428. Authored-by: Itsuki Toyota <titsuki@cpan.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 86ead044e3789b3291a38ec2142cbb343d1290c1) Signed-off-by: Sean Owen <srowen@gmail.com> 26 July 2020, 14:13:26 UTC
6ee0eb4 [SPARK-32280][SPARK-32372][2.4][SQL] ResolveReferences.dedupRight should only rewrite attributes for ancestor nodes of the conflict plan ### What changes were proposed in this pull request? This PR refactors `ResolveReferences.dedupRight` to make sure it only rewrite attributes for ancestor nodes of the conflict plan. ### Why are the changes needed? This is a bug fix. ```scala sql("SELECT name, avg(age) as avg_age FROM person GROUP BY name") .createOrReplaceTempView("person_a") sql("SELECT p1.name, p2.avg_age FROM person p1 JOIN person_a p2 ON p1.name = p2.name") .createOrReplaceTempView("person_b") sql("SELECT * FROM person_a UNION SELECT * FROM person_b") .createOrReplaceTempView("person_c") sql("SELECT p1.name, p2.avg_age FROM person_c p1 JOIN person_c p2 ON p1.name = p2.name").show() ``` When executing the above query, we'll hit the error: ```scala [info] Failed to analyze query: org.apache.spark.sql.AnalysisException: Resolved attribute(s) avg_age#231 missing from name#223,avg_age#218,id#232,age#234,name#233 in operator !Project [name#233, avg_age#231]. Attribute(s) with the same name appear in the operation: avg_age. Please check if the right attribute(s) are used.;; ... ``` The plan below is the problematic plan which is the right plan of a `Join` operator. And, it has conflict plans comparing to the left plan. In this problematic plan, the first `Aggregate` operator (the one under the first child of `Union`) becomes a conflict plan compares to the left one and has a rewrite attribute pair as `avg_age#218` -> `avg_age#231`. With the current `dedupRight` logic, we'll first replace this `Aggregate` with a new one, and then rewrites the attribute `avg_age#218` from bottom to up. As you can see, projects with the attribute `avg_age#218` of the second child of the `Union` can also be replaced with `avg_age#231`(That means we also rewrite attributes for non-ancestor plans for the conflict plan). Ideally, the attribute `avg_age#218` in the second `Aggregate` operator (the one under the second child of `Union`) should also be replaced. But it didn't because it's an `Alias` while we only rewrite `Attribute` yet. Therefore, the project above the second `Aggregate` becomes unresolved. ```scala :
 : +- SubqueryAlias p2 +- SubqueryAlias person_c +- Distinct +- Union :- Project [name#233, avg_age#231] : +- SubqueryAlias person_a : +- Aggregate [name#233], [name#233, avg(cast(age#234 as bigint)) AS avg_age#231] : +- SubqueryAlias person : +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#232, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#233, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#234] : +- ExternalRDD [obj#165] +- Project [name#233 AS name#227, avg_age#231 AS avg_age#228] +- Project [name#233, avg_age#231] +- SubqueryAlias person_b +- !Project [name#233, avg_age#231] +- Join Inner, (name#233 = name#223) :- SubqueryAlias p1 : +- SubqueryAlias person : +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#232, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#233, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#234] : +- ExternalRDD [obj#165] +- SubqueryAlias p2 +- SubqueryAlias person_a +- Aggregate [name#223], [name#223, avg(cast(age#224 as bigint)) AS avg_age#218] +- SubqueryAlias person +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#222, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#223, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#224] +- ExternalRDD [obj#165] ``` ### Does this PR introduce _any_ user-facing change? Yes, users would no longer hit the error after this fix. ### How was this patch tested? Added test. Closes #29208 from Ngone51/cherry-pick-spark-32372. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 July 2020, 04:26:22 UTC
6a653a2 [SPARK-32364][SQL][2.4] Use CaseInsensitiveMap for DataFrameReader/Writer options ### What changes were proposed in this pull request? This PR is a backport of SPARK-32364 (https://github.com/apache/spark/pull/29160, https://github.com/apache/spark/pull/29191). When a user have multiple options like `path`, `paTH`, and `PATH` for the same key `path`, `option/options` is indeterministic because `extraOptions` is `HashMap`. This PR aims to use `CaseInsensitiveMap` instead of `HashMap` to fix this bug fundamentally. Like the following, DataFrame's `option/options` have been non-deterministic in terms of case-insensitivity because it stores the options at `extraOptions` which is using `HashMap` class. ```scala spark.read .option("paTh", "1") .option("PATH", "2") .option("Path", "3") .option("patH", "4") .load("5") ... org.apache.spark.sql.AnalysisException: Path does not exist: file:/.../1; ``` Also, this PR adds the following. 1. Add an explicit document to `DataFrameReader/DataFrameWriter`. 2. Add `toMap` to `CaseInsensitiveMap` in order to return `originalMap: Map[String, T]` because it's more consistent with the existing `case-sensitive key names` behavior for the existing code pattern like `AppendData.byName(..., extraOptions.toMap)`. Previously, it was `HashMap.toMap`. 3. During (2), we need to change the following to keep the original logic using `CaseInsensitiveMap.++`. ```scala - val params = extraOptions.toMap ++ connectionProperties.asScala.toMap + val params = extraOptions ++ connectionProperties.asScala ``` 4. Additionally, use `.toMap` in the following because `dsOptions.asCaseSensitiveMap()` is used later. ```scala - val options = sessionOptions ++ extraOptions + val options = sessionOptions.filterKeys(!extraOptions.contains(_)) ++ extraOptions.toMap val dsOptions = new CaseInsensitiveStringMap(options.asJava) ``` `extraOptions.toMap` is used in several places (e.g. `DataFrameReader`) to hand over `Map[String, T]`. In this case, `CaseInsensitiveMap[T] private (val originalMap: Map[String, T])` had better return `originalMap`. ### Why are the changes needed? This will fix indeterministic behavior. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Pass the Jenkins with the existing tests and newly add test cases. Closes #29209 from dongjoon-hyun/SPARK-32364-2.4. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 July 2020, 23:09:27 UTC
df930e1 [SPARK-32377][SQL][2.4] CaseInsensitiveMap should be deterministic for addition ### What changes were proposed in this pull request? This PR aims to fix `CaseInsensitiveMap` to be deterministic for addition. This is a backporting of https://github.com/apache/spark/pull/29172 . ### Why are the changes needed? ```scala import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap var m = CaseInsensitiveMap(Map.empty[String, String]) Seq(("paTh", "1"), ("PATH", "2"), ("Path", "3"), ("patH", "4"), ("path", "5")).foreach { kv => m = (m + kv).asInstanceOf[CaseInsensitiveMap[String]] println(m.get("path")) } ``` **BEFORE** ``` Some(1) Some(2) Some(3) Some(4) Some(1) ``` **AFTER** ``` Some(1) Some(2) Some(3) Some(4) Some(5) ``` ### Does this PR introduce _any_ user-facing change? Yes, but this is a bug fix on non-deterministic behavior. ### How was this patch tested? Pass the newly added test case. Closes #29175 from dongjoon-hyun/SPARK-32377-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 July 2020, 10:52:49 UTC
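A simplified stand-in (not the real `CaseInsensitiveMap`) sketching the semantics the fix enforces: adding a key must replace any existing entry whose key differs only in case, so the last write wins.

```scala
// Simplified stand-in: on addition, drop any existing key that matches
// case-insensitively before inserting, so the latest value wins.
case class SimpleCaseInsensitiveMap(originalMap: Map[String, String]) {
  private val keyLowerCasedMap = originalMap.map(kv => kv._1.toLowerCase -> kv._2)
  def get(k: String): Option[String] = keyLowerCasedMap.get(k.toLowerCase)
  def +(kv: (String, String)): SimpleCaseInsensitiveMap =
    SimpleCaseInsensitiveMap(originalMap.filterKeys(!_.equalsIgnoreCase(kv._1)).toMap + kv)
}

// Mirrors the repro in the commit message: the final lookup now returns Some(5).
var m = SimpleCaseInsensitiveMap(Map.empty[String, String])
Seq("paTh" -> "1", "PATH" -> "2", "Path" -> "3", "patH" -> "4", "path" -> "5").foreach { kv =>
  m = m + kv
  println(m.get("path"))
}
```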
822cc34 [SPARK-32379][BUILD] docker based spark release script should use correct CRAN repo … ### What changes were proposed in this pull request? Change repo for CRAN as per the ubuntu version used in docker image of the release script. ### Why are the changes needed? Without this change, r-base and r-dev dependency won't install. ``` [rootkyok-test-1 ~]# tail docker-build.log distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation: The following packages have unmet dependencies: r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be installed Depends: r-recommended (= 4.0.2-1.1804.0) but it is not going to be installed r-base-dev : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be installed E: Unable to correct problems, you have held broken packages. The command '/bin/sh -c apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates apt-transport-https && echo 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> /etc/apt/sources.list && gpg --keyserver keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9 && gpg -a --export E084DAB9 | apt-key add - && apt-get clean && rm -rf /var/lib/apt/lists/* && apt-get clean && apt-get update && $APT_INSTALL software-properties-common && apt-add-repository -y ppa:brightbox/ruby-ng && apt-get update && $APT_INSTALL openjdk-8-jdk && update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java && $APT_INSTALL curl wget git maven ivy subversion make gcc lsof libffi-dev pandoc pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev && ln -s -T /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar && curl -sL https://deb.nodesource.com/setup_4.x | bash && $APT_INSTALL nodejs && $APT_INSTALL libpython2.7-dev libpython3-dev python-pip python3-pip && pip install --upgrade pip && hash -r pip && pip install setuptools && pip install $BASE_PIP_PKGS && pip install $PIP_PKGS && cd && virtualenv -p python3 /opt/p35 && . /opt/p35/bin/activate && pip install setuptools && pip install $BASE_PIP_PKGS && pip install $PIP_PKGS && $APT_INSTALL r-base r-base-dev && $APT_INSTALL texlive-latex-base texlive texlive-fonts-extra texinfo qpdf && Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && Rscript -e "devtools::install_github('jimhester/lintr')" && $APT_INSTALL ruby2.3 ruby2.3-dev mkdocs && gem install jekyll --no-rdoc --no-ri -v 3.8.6 && gem install jekyll-redirect-from -v 0.15.0 && gem install pygments.rb' returned a non-zero code: 100 ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually, by applying this patch, create release script pass without errors. Closes #29177 from ScrapCodes/release-script-fix. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 July 2020, 10:09:46 UTC
e37b67a [MINOR][DOCS] Add link for Debugging your Application in running-on-yarn.html#launching-spark-on-yarn ### What changes were proposed in this pull request? Add a link for Debugging your Application in `running-on-yarn.html#launching-spark-on-yarn`. ### Why are the changes needed? Currently, the launching-spark-on-yarn section of the running-on-yarn.html page mentions referring to Debugging your Application. It is better to add a direct link to it, to save readers the time of finding the section. ![image](https://user-images.githubusercontent.com/20021316/87867542-80cc5500-c9c0-11ea-8560-5ddcb5a308bc.png) ### Does this PR introduce _any_ user-facing change? Yes. Docs changes. 1. Add a link for Debugging your Application in the `running-on-yarn.html#launching-spark-on-yarn` section. Updated behavior: ![image](https://user-images.githubusercontent.com/20021316/87867534-6eeab200-c9c0-11ea-94ee-d3fa58157156.png) 2. Update the Spark Properties link to an anchor link only. ### How was this patch tested? A manual test has been performed to verify the updated docs. Closes #29154 from brandonJY/patch-1. Authored-by: Brandon <brandonJY@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 1267d80db6abaa130384b8e7b514c39aec3a8c77) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 July 2020, 04:42:54 UTC
c9c187a [SPARK-32367][K8S][TESTS] Correct the spelling of parameter in KubernetesTestComponents ### What changes were proposed in this pull request? Correct the spelling of parameter 'spark.executor.instances' in KubernetesTestComponents ### Why are the changes needed? Parameter spelling error ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Test is not needed. Closes #29164 from merrily01/SPARK-32367. Authored-by: maruilei <maruilei@jd.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit ffdca8285ef7c7bd0da2622a81d9c21ada035794) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 July 2020, 20:49:33 UTC
58c637a [SPARK-32344][SQL][2.4] Unevaluable expr is set to FIRST/LAST ignoreNullsExpr in distinct aggregates ### What changes were proposed in this pull request? This PR intends to fix a bug in distinct FIRST/LAST aggregates in v2.4.6; ``` scala> sql("SELECT FIRST(DISTINCT v) FROM VALUES 1, 2, 3 t(v)").show() ... Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: false#37 at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258) at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:226) at org.apache.spark.sql.catalyst.expressions.aggregate.First.ignoreNulls(First.scala:68) at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions$lzycompute(First.scala:82) at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions(First.scala:81) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$15.apply(HashAggregateExec.scala:268) ``` A root cause of this bug is that the `Aggregation` strategy replaces a foldable boolean `ignoreNullsExpr` expr with an `Unevaluable` expr (`AttributeReference`) for distinct FIRST/LAST aggregate functions. But, this operation cannot be allowed because the `Analyzer` has checked that it must be foldable; https://github.com/apache/spark/blob/ffdbbae1d465fe2c710d020de62ca1a6b0b924d9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala#L74-L76 So, this PR proposes to change the variable for `IGNORE NULLS` from `Expression` to `Boolean` to avoid the case. This is the backport of https://github.com/apache/spark/pull/29143. ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a test in `DataFrameAggregateSuite`. Closes #29157 from maropu/SPARK-32344-BRANCH2.4. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 July 2020, 06:35:44 UTC
afdad0e [SPARK-32018][SQL][2.4] UnsafeRow.setDecimal should set null with overflowed value backport https://github.com/apache/spark/pull/29125 Closes #29141 from cloud-fan/backport. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 July 2020, 15:04:43 UTC
9aeeb0f [SPARK-32318][SQL][TESTS] Add a test case to EliminateSortsSuite for ORDER BY in DISTRIBUTE BY This PR aims to add a test case to EliminateSortsSuite to protect a valid use case which is using ORDER BY in DISTRIBUTE BY statement. ```scala scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t") scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/master") $ ls -al /tmp/master/ total 56 drwxr-xr-x 10 dongjoon wheel 320 Jul 14 22:12 ./ drwxrwxrwt 15 root wheel 480 Jul 14 22:12 ../ -rw-r--r-- 1 dongjoon wheel 8 Jul 14 22:12 ._SUCCESS.crc -rw-r--r-- 1 dongjoon wheel 12 Jul 14 22:12 .part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc -rw-r--r-- 1 dongjoon wheel 16 Jul 14 22:12 .part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc -rw-r--r-- 1 dongjoon wheel 16 Jul 14 22:12 .part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc -rw-r--r-- 1 dongjoon wheel 0 Jul 14 22:12 _SUCCESS -rw-r--r-- 1 dongjoon wheel 119 Jul 14 22:12 part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc -rw-r--r-- 1 dongjoon wheel 932 Jul 14 22:12 part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc -rw-r--r-- 1 dongjoon wheel 939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc ``` The following was found during SPARK-32276. If Spark optimizer removes the inner `ORDER BY`, the file size increases. ```scala scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t") scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/SPARK-32276") $ ls -al /tmp/SPARK-32276/ total 632 drwxr-xr-x 10 dongjoon wheel 320 Jul 14 22:08 ./ drwxrwxrwt 14 root wheel 448 Jul 14 22:08 ../ -rw-r--r-- 1 dongjoon wheel 8 Jul 14 22:08 ._SUCCESS.crc -rw-r--r-- 1 dongjoon wheel 12 Jul 14 22:08 .part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc -rw-r--r-- 1 dongjoon wheel 1188 Jul 14 22:08 .part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc -rw-r--r-- 1 dongjoon wheel 1188 Jul 14 22:08 .part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc -rw-r--r-- 1 dongjoon wheel 0 Jul 14 22:08 _SUCCESS -rw-r--r-- 1 dongjoon wheel 119 Jul 14 22:08 part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc -rw-r--r-- 1 dongjoon wheel 150735 Jul 14 22:08 part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc -rw-r--r-- 1 dongjoon wheel 150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc ``` No. This only improves the test coverage. Pass the GitHub Action or Jenkins. Closes #29118 from dongjoon-hyun/SPARK-32318. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 8950dcbb1cafccc2ba8bbf030ab7ac86cfe203a4) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 July 2020, 14:46:11 UTC
5084c71 [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions ### What changes were proposed in this pull request? This PR proposes to just simply by-pass the case when the number of array size is negative, when it collects data from Spark DataFrame with no partitions for `toPandas` with Arrow optimization enabled. ```python spark.sparkContext.emptyRDD().toDF("col1 int").toPandas() ``` In the master and branch-3.0, this was fixed together at https://github.com/apache/spark/commit/ecaa495b1fe532c36e952ccac42f4715809476af but it's legitimately not ported back. ### Why are the changes needed? To make empty Spark DataFrame able to be a pandas DataFrame. ### Does this PR introduce _any_ user-facing change? Yes, ```python spark.sparkContext.emptyRDD().toDF("col1 int").toPandas() ``` **Before:** ``` ... Caused by: java.lang.NegativeArraySizeException at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3293) at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3287) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) ... ``` **After:** ``` Empty DataFrame Columns: [col1] Index: [] ``` ### How was this patch tested? Manually tested and unittest were added. Closes #29098 from HyukjinKwon/SPARK-32300. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Bryan Cutler <cutlerb@gmail.com> 14 July 2020, 20:28:36 UTC
a4854d6 [MINOR][DOCS] Fix typo in PySpark example in ml-datasource.md ### What changes were proposed in this pull request? This PR changes `true` to `True` in the python code. ### Why are the changes needed? The previous example contains a syntax error. ### Does this PR introduce _any_ user-facing change? Yes, but this is a doc-only typo fix. ### How was this patch tested? Manually ran the example. Closes #29073 from ChuliangXiao/patch-1. Authored-by: Chuliang Xiao <ChuliangX@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c56c84af473547c9e9cab7ef6422f2b550084b59) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 July 2020, 16:02:16 UTC
4c82ae8 [SPARK-32238][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in ScalaUDF This PR proposes to use `Utils.getSimpleName(function)` instead of `function.getClass.getSimpleName` to get the class name. For some functions(see the demo below), using `function.getClass.getSimpleName` can hit "Malformed class name" error. Yes. For the demo, ```scala object MalformedClassObject extends Serializable { class MalformedNonPrimitiveFunction extends (String => Int) with Serializable { override def apply(v1: String): Int = v1.toInt / 0 } } OuterScopes.addOuterScope(MalformedClassObject) val f = new MalformedClassObject.MalformedNonPrimitiveFunction() Seq("20").toDF("col").select(udf(f).apply(Column("col"))).collect() ``` Before this PR, user can only see the error about "Malformed class name": ```scala An exception or error caused a run to abort: Malformed class name java.lang.InternalError: Malformed class name at java.lang.Class.getSimpleName(Class.java:1330) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage$lzycompute(ScalaUDF.scala:1157) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage(ScalaUDF.scala:1155) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.doGenCode(ScalaUDF.scala:1077) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:147) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:142) at org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:160) at org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:69) ... ``` After this PR, user can see the real root cause of the udf failure: ```scala org.apache.spark.SparkException: Failed to execute user defined function(UDFSuite$MalformedClassObject$MalformedNonPrimitiveFunction: (string) => int) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:464) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.ArithmeticException: / by zero at org.apache.spark.sql.UDFSuite$MalformedClassObject$MalformedNonPrimitiveFunction.apply(UDFSuite.scala:677) at org.apache.spark.sql.UDFSuite$MalformedClassObject$MalformedNonPrimitiveFunction.apply(UDFSuite.scala:676) ... 17 more ``` Added a test. 
Closes #29050 from Ngone51/fix-malformed-udf. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 0c9196e5493628d343ef67bb9e83d0c95ff3943a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 July 2020, 13:36:41 UTC
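A hedged sketch of the workaround pattern (not the exact `Utils.getSimpleName` implementation): guard `Class.getSimpleName` and fall back to deriving a name from the full class name when the JVM throws the "Malformed class name" `InternalError` for nested Scala classes.

```scala
// Illustrative fallback only; the real fix delegates to Spark's Utils.getSimpleName.
// Class.getSimpleName can throw java.lang.InternalError("Malformed class name") for
// some nested Scala classes, so catch it and strip the package prefix manually.
def safeSimpleName(cls: Class[_]): String = {
  try {
    cls.getSimpleName
  } catch {
    case _: InternalError =>
      val fullName = cls.getName
      fullName.substring(fullName.lastIndexOf('.') + 1)
  }
}
```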
d5b903e [SPARK-32035][DOCS][EXAMPLES] Fixed typos involving AWS Access, Secret, & Sessions tokens ### What changes were proposed in this pull request? I resolved some of the inconsistencies of AWS env variables. They're fixed in the documentation as well as in the examples. I grep-ed through the repo to try & find any more instances but nothing popped up. ### Why are the changes needed? As previously mentioned, there is a JIRA request, SPARK-32035, which encapsulates all the issues. But, in summary, the naming of items was inconsistent. ### Does this PR introduce _any_ user-facing change? Correct names: AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN These are the same that AWS uses in their libraries. However, looking through the Spark documentation and comments, I see that these are not denoted correctly across the board: docs/cloud-integration.md 106:1. `spark-submit` reads the `AWS_ACCESS_KEY`, `AWS_SECRET_KEY` <-- both different 107:and `AWS_SESSION_TOKEN` environment variables and sets the associated authentication options docs/streaming-kinesis-integration.md 232:- Set up the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_KEY` with your AWS credentials. <-- secret key different external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py 34: $ export AWS_ACCESS_KEY_ID=<your-access-key> 35: $ export AWS_SECRET_KEY=<your-secret-key> <-- different 48: Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 438: val keyId = System.getenv("AWS_ACCESS_KEY_ID") 439: val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY") 448: val sessionToken = System.getenv("AWS_SESSION_TOKEN") external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala 53: * $ export AWS_ACCESS_KEY_ID=<your-access-key> 54: * $ export AWS_SECRET_KEY=<your-secret-key> <-- different 65: * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different external/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java 59: * $ export AWS_ACCESS_KEY_ID=[your-access-key] 60: * $ export AWS_SECRET_KEY=<your-secret-key> <-- different 71: * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different These were all fixed to match names listed under the "correct names" heading. ### How was this patch tested? I built the documentation using jekyll and verified that the changes were present & accurate. Closes #29058 from Moovlin/SPARK-32035. Authored-by: moovlin <richjoerger@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 9331a5c44baa79998625829e9be624e8564c91ea) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 July 2020, 17:36:04 UTC
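For quick reference, the three corrected names as environment lookups, mirroring the `SparkHadoopUtil` snippet quoted above:

```scala
// The consistent variable names (the same ones the AWS libraries read):
val accessKeyId     = sys.env.get("AWS_ACCESS_KEY_ID")
val secretAccessKey = sys.env.get("AWS_SECRET_ACCESS_KEY")
val sessionToken    = sys.env.get("AWS_SESSION_TOKEN")
```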
49c6877 [SPARK-32024][WEBUI][FOLLOWUP] Quick fix on test failure on missing when statements ### What changes were proposed in this pull request? This patch fixes the test failure due to the missing when statements for destination path. Note that it didn't fail on master branch, because 245aee9 got rid of size call in destination path, but still good to not depend on 245aee9. ### Why are the changes needed? The build against branch-3.0 / branch-2.4 starts to fail after merging SPARK-32024 (#28859) and this patch will fix it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran modified UT against master / branch-3.0 / branch-2.4. Closes #29046 from HeartSaVioR/QUICKFIX-SPARK-32024. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 161cf2a12698bfebba94e0d406e0b110e4429b6b) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 09 July 2020, 06:27:14 UTC
146062d [SPARK-32167][2.4][SQL] Fix GetArrayStructFields to respect inner field's nullability together ### What changes were proposed in this pull request? Backport https://github.com/apache/spark/pull/28992 to 2.4 Fix nullability of `GetArrayStructFields`. It should consider both the original array's `containsNull` and the inner field's nullability. ### Why are the changes needed? Fix a correctness issue. ### Does this PR introduce _any_ user-facing change? Yes. See the added test. ### How was this patch tested? a new UT and end-to-end test Closes #29019 from cloud-fan/port. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 July 2020, 20:31:35 UTC
eddf40d [SPARK-32024][WEBUI] Update ApplicationStoreInfo.size during HistoryServerDiskManager initialization ### What changes were proposed in this pull request? Update ApplicationStoreInfo.size to the real size during HistoryServerDiskManager initialization. ### Why are the changes needed? This PR fixes bug [SPARK-32024](https://issues.apache.org/jira/browse/SPARK-32024). We found that after a history server restart, the error below would randomly happen: "java.lang.IllegalStateException: Disk usage tracker went negative (now = -***, delta = -***)" from `HistoryServerDiskManager`. ![Capture](https://user-images.githubusercontent.com/10524738/85034468-fda4ae80-b136-11ea-9011-f0c3e6508002.JPG) **Cause**: Reading data from leveldb can trigger table file compaction, which may also change the size of the leveldb directory. This size change may not be recorded in LevelDB (`ApplicationStoreInfo` in `listing`). When the service restarts, `currentUsage` is calculated from the real directory size, but `ApplicationStoreInfo` entries are loaded from leveldb, so `currentUsage` may be less than the sum of `ApplicationStoreInfo.size`. In the `makeRoom()` function, `ApplicationStoreInfo.size` is used to update usage. Then `currentUsage` becomes negative after several rounds of `release()` and `lease()` (`makeRoom()`). **Reproduce**: we can reproduce this issue in a dev environment by reducing the config values of "spark.history.retainedApplications" and "spark.history.store.maxDiskUsage" to small values. Here are steps: 1. start the history server, load some applications and access some pages (maybe the "stages" page to trigger leveldb compaction). 2. restart the HS, and refresh pages. I also added a UT to simulate this case in `HistoryServerDiskManagerSuite`. **Benefit**: this change would help improve history server reliability. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test and manually tested it. Closes #28859 from zhli1142015/update-ApplicationStoreInfo.size-during-disk-manager-initialize. Authored-by: Zhen Li <zhli@microsoft.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> (cherry picked from commit 8e7fc04637bbb8d2fdc2c758746e0eaf496c4d92) Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> 08 July 2020, 12:59:29 UTC
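A rough sketch of the repair idea (helper name hypothetical, not the committed code): at initialization, recompute each application store directory's real on-disk size instead of trusting the possibly stale recorded value.

```scala
import java.io.File

// Hypothetical helper: the real fix updates ApplicationStoreInfo.size inside
// HistoryServerDiskManager's initialization. The idea is simply "trust the disk":
// recompute the directory size rather than reusing the recorded value.
def directorySize(dir: File): Long = {
  val children = Option(dir.listFiles()).getOrElse(Array.empty[File])
  children.map { f =>
    if (f.isDirectory) directorySize(f) else f.length()
  }.sum
}
```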
2bb5ced [SPARK-32214][SQL] The type conversion function generated in makeFromJava for "other" type uses a wrong variable ### What changes were proposed in this pull request? This PR fixes an inconsistency in `EvaluatePython.makeFromJava`, which creates a type conversion function for some Java/Scala types. The catch-all branch matches the type as `other`, but the conversion should be applied to the runtime value `obj`: ```scala case other => (obj: Any) => nullSafeConvert(other)(PartialFunction.empty) ``` This does not change the output because it always returns `null` for unsupported datatypes. ### Why are the changes needed? To make the code coherent and consistent. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No behaviour change. Closes #29029 from sarutak/fix-makeFromJava. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 371b35d2e0ab08ebd853147c6673de3adfad0553) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 08 July 2020, 08:47:02 UTC
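A simplified, hypothetical stand-in showing the corrected pattern; the real code lives in `EvaluatePython.makeFromJava`, and the helper below only imitates its shape.

```scala
// Hypothetical sketch: the conversion closure must be applied to the runtime value it
// receives (obj), not to the matched type pattern (other). The catch-all branch still
// yields null for unsupported types.
def nullSafeConvert(input: Any)(f: PartialFunction[Any, Any]): Any =
  if (input == null) null else f.applyOrElse(input, (_: Any) => null)

def makeConverter(dataTypeName: String): Any => Any = dataTypeName match {
  case "int" => (obj: Any) => nullSafeConvert(obj) { case i: java.lang.Integer => i.intValue() }
  case other => (obj: Any) => nullSafeConvert(obj)(PartialFunction.empty) // pass obj, not other
}
```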
2227a16 [MINOR][TEST][SQL] Make in-limit.sql more robust ### What changes were proposed in this pull request? For queries like `t1d in (SELECT t2d FROM t2 ORDER BY t2c LIMIT 2)`, the result can be non-deterministic as the result of the subquery may output different results (it's not sorted by `t2d` and it has shuffle). This PR makes the test more robust by sorting the output column. ### Why are the changes needed? avoid flaky test ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #28976 from cloud-fan/small. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit f83415629b18d628f72a32285f0afc24f29eaa1e) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 02 July 2020, 12:06:09 UTC
7f4d452 [SPARK-31935][2.4][SQL][FOLLOWUP] Hadoop file system config should be effective in data source options ### What changes were proposed in this pull request? backport https://github.com/apache/spark/pull/28948 This is a followup of https://github.com/apache/spark/pull/28760 to fix the remaining issues: 1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand` 2. should consider data source options when inferring schema for file source 3. should consider data source options when getting the qualified path in file source v2. ### Why are the changes needed? We didn't catch these issues in https://github.com/apache/spark/pull/28760, because the test case is to check error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case actually only test the first time. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? rewrite the test to make sure the entire data source read/write action can succeed. Closes #28973 from cloud-fan/pick. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 02 July 2020, 10:25:50 UTC
bc1acfe [SPARK-32089][R][BUILD] Upgrade R version to 4.0.2 in the release Dockerfile This PR proposes to upgrade the R version to 4.0.2 in the release docker image. As of SPARK-31918, we should make a release with R 4.0.0+, which works with R 3.5+ too. To unblock releases on CRAN. No, dev-only. Manually tested via the scripts under `dev/create-release`, manually attaching to the container and checking the R version. Closes #28922 from HyukjinKwon/SPARK-32089. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 July 2020, 15:37:06 UTC
37b32c8 [SPARK-32131][SQL] Fix AnalysisException messages at UNION/EXCEPT/MINUS operations Fix incorrect exception messages raised by Union and set operations. Union and set operations can only be performed on tables with compatible column types; however, when there are more than two columns, the exception message reports the wrong column index. Steps to reproduce: ``` drop table if exists test1; drop table if exists test2; drop table if exists test3; create table if not exists test1(id int, age int, name timestamp); create table if not exists test2(id int, age timestamp, name timestamp); create table if not exists test3(id int, age int, name int); insert into test1 select 1,2,'2020-01-01 01:01:01'; insert into test2 select 1,'2020-01-01 01:01:01','2020-01-01 01:01:01'; insert into test3 select 1,3,4; ``` Query1: ```sql select * from test1 except select * from test2; ``` Result1: ``` Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types. timestamp <> int at the second column of the second table;; 'Except false :- Project [id#620, age#621, name#622] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#620, age#621, name#622] +- Project [id#623, age#624, name#625] +- SubqueryAlias `default`.`test2` +- HiveTableRelation `default`.`test2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#623, age#624, name#625] (state=,code=0) ``` Query2: ```sql select * from test1 except select * from test3; ``` Result2: ``` Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types int <> timestamp at the 2th column of the second table; ``` Query1 above produces the right exception message, while Query2 above reports the wrong error information; it needs to change to the following ``` Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types. int <> timestamp at the third column of the second table ``` NO unit test Closes #28951 from GuoPhilipse/32131-correct-error-messages. Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by: GuoPhilipse <guofei_ok@126.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 02f3b80d3a277e0c19a66c28d935fa41da7b3307) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 July 2020, 06:35:40 UTC
409930d [SPARK-32028][WEBUI][2.4] fix app id link for multi attempts app in history summary page This is simply a backport of https://github.com/apache/spark/pull/28867 to branch-2.4 Closes #28949 from srowen/SPARK-32028.2. Authored-by: Zhen Li <zhli@microsoft.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 01 July 2020, 06:10:13 UTC
1eda585 [SPARK-32115][SQL] Fix SUBSTRING to handle integer overflows Bug fix for the overflow case in `UTF8String.substringSQL`. The SQL query `SELECT SUBSTRING("abc", -1207959552, -1207959552)` incorrectly returns `"abc"` instead of the expected output `""`. For the query `SUBSTRING("abc", -100, -100)`, we get the right output of `""`. Yes, bug fix for the overflow case. New UT. Closes #28937 from xuanyuanking/SPARK-32115. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6484c14c57434dd6961cf9e9e73bbe8aa04cda15) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 28 June 2020, 19:29:08 UTC
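For readers wondering how those particular arguments misbehave: the core problem is plain 32-bit wrap-around in the bounds arithmetic. A tiny REPL-style illustration of the arithmetic only (not the actual `UTF8String.substringSQL` code):

```scala
// Adding two large negative Ints wraps around to a large positive Int, so a naive
// `pos + len` end index looks valid instead of clamping the result to "".
val pos = -1207959552
val len = -1207959552
println(pos + len)        // 1879048192 -- wrapped into a huge positive "end index"
println(pos.toLong + len) // -2415919104 -- widening to Long keeps the true value
```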
a295003 [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow When you use floats as the index of a pandas DataFrame, it creates a Spark DataFrame with wrong results as below when Arrow is enabled: ```bash ./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true ``` ```python >>> import pandas as pd >>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show() +---+ | a| +---+ | 1| | 1| | 2| +---+ ``` This is because direct slicing uses the value as the index when the index contains floats: ```python >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:] a 2.0 1 3.0 2 4.0 3 >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:] a 4.0 3 >>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:] a 4 3 ``` This PR proposes to explicitly use `iloc` to positionally slice when we create a DataFrame from a pandas DataFrame with Arrow enabled. FWIW, I was trying to investigate why direct slicing refers to the index value or the positional index sometimes, but I stopped investigating further after reading this https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection > While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and `.iloc`. To create the correct Spark DataFrame from a pandas DataFrame without data loss. Yes, it is a bug fix. ```bash ./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true ``` ```python import pandas as pd spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show() ``` Before: ``` +---+ | a| +---+ | 1| | 1| | 2| +---+ ``` After: ``` +---+ | a| +---+ | 1| | 2| | 3| +---+ ``` Manually tested, and unit tests were added. Closes #28928 from HyukjinKwon/SPARK-32098. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Bryan Cutler <cutlerb@gmail.com> (cherry picked from commit 1af19a7b6836f87a3b34189a8a13b6d21d3a37d8) Signed-off-by: Bryan Cutler <cutlerb@gmail.com> 25 June 2020, 18:17:12 UTC
77006b2 [SPARK-32073][R] Drop R < 3.5 support Spark 3.0 accidentally dropped R < 3.5. It is built with R 3.6.3, which does not support R < 3.5: ``` Error in readRDS(pfile) : cannot read workspace version 3 written by R 3.6.3; need R 3.5.0 or newer version. ``` In fact, with SPARK-31918, we will have to drop R < 3.5 entirely to support R 4.0.0. This is unavoidable for releasing on CRAN because they require the tests to pass with the latest R. To show the supported versions correctly, and to support R 4.0.0 to unblock the releases. In fact, no, because Spark 3.0.0 already does not work with R < 3.5. Compared to Spark 2.4, yes, R < 3.5 would not work. Jenkins should test it out. Closes #28908 from HyukjinKwon/SPARK-32073. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b62e2536db9def0d11605ceac8990f72a515e9a0) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 24 June 2020, 02:08:59 UTC
29873c9 [SPARK-31918][R] Ignore S4 generic methods under SparkR namespace in closure cleaning to support R 4.0.0+ ### What changes were proposed in this pull request? This PR proposes to ignore S4 generic methods under SparkR namespace in closure cleaning to support R 4.0.0+. Currently, when you run the codes that runs R native codes, it fails as below with R 4.0.0: ```r df <- createDataFrame(lapply(seq(100), function (e) list(value=e))) count(dapply(df, function(x) as.data.frame(x[x$value < 50,]), schema(df))) ``` ``` org.apache.spark.SparkException: R unexpectedly exited. R worker produced errors: Error in lapply(part, FUN) : attempt to bind a variable to R_UnboundValue ``` The root cause seems to be related to when an S4 generic method is manually included into the closure's environment via `SparkR:::cleanClosure`. For example, when an RRDD is created via `createDataFrame` with calling `lapply` to convert, `lapply` itself: https://github.com/apache/spark/blob/f53d8c63e80172295e2fbc805c0c391bdececcaa/R/pkg/R/RDD.R#L484 is added into the environment of the cleaned closure - because this is not an exposed namespace; however, this is broken in R 4.0.0+ for an unknown reason with an error message such as "attempt to bind a variable to R_UnboundValue". Actually, we don't need to add the `lapply` into the environment of the closure because it is not supposed to be called in worker side. In fact, there is no private generic methods supposed to be called in worker side in SparkR at all from my understanding. Therefore, this PR takes a simpler path to work around just by explicitly excluding the S4 generic methods under SparkR namespace to support R 4.0.0. in SparkR. ### Why are the changes needed? To support R 4.0.0+ with SparkR, and unblock the releases on CRAN. CRAN requires the tests pass with the latest R. ### Does this PR introduce _any_ user-facing change? Yes, it will support R 4.0.0 to end-users. ### How was this patch tested? Manually tested. Both CRAN and tests with R 4.0.1: ``` ══ testthat results ═══════════════════════════════════════════════════════════ [ OK: 13 | SKIPPED: 0 | WARNINGS: 0 | FAILED: 0 ] ✔ | OK F W S | Context ✔ | 11 | binary functions [2.5 s] ✔ | 4 | functions on binary files [2.1 s] ✔ | 2 | broadcast variables [0.5 s] ✔ | 5 | functions in client.R ✔ | 46 | test functions in sparkR.R [6.3 s] ✔ | 2 | include R packages [0.3 s] ✔ | 2 | JVM API [0.2 s] ✔ | 75 | MLlib classification algorithms, except for tree-based algorithms [86.3 s] ✔ | 70 | MLlib clustering algorithms [44.5 s] ✔ | 6 | MLlib frequent pattern mining [3.0 s] ✔ | 8 | MLlib recommendation algorithms [9.6 s] ✔ | 136 | MLlib regression algorithms, except for tree-based algorithms [76.0 s] ✔ | 8 | MLlib statistics algorithms [0.6 s] ✔ | 94 | MLlib tree-based algorithms [85.2 s] ✔ | 29 | parallelize() and collect() [0.5 s] ✔ | 428 | basic RDD functions [25.3 s] ✔ | 39 | SerDe functionality [2.2 s] ✔ | 20 | partitionBy, groupByKey, reduceByKey etc. [3.9 s] ✔ | 4 | functions in sparkR.R ✔ | 16 | SparkSQL Arrow optimization [19.2 s] ✔ | 6 | test show SparkDataFrame when eager execution is enabled. 
[1.1 s] ✔ | 1175 | SparkSQL functions [134.8 s] ✔ | 42 | Structured Streaming [478.2 s] ✔ | 16 | tests RDD function take() [1.1 s] ✔ | 14 | the textFile() function [2.9 s] ✔ | 46 | functions in utils.R [0.7 s] ✔ | 0 1 | Windows-specific tests ──────────────────────────────────────────────────────────────────────────────── test_Windows.R:22: skip: sparkJars tag in SparkContext Reason: This test is only for Windows, skipped ──────────────────────────────────────────────────────────────────────────────── ══ Results ═════════════════════════════════════════════════════════════════════ Duration: 987.3 s OK: 2304 Failed: 0 Warnings: 0 Skipped: 1 ... Status: OK + popd Tests passed. ``` Note that I tested to build SparkR in R 4.0.0, and run the tests with R 3.6.3. It all passed. See also [the comment in the JIRA](https://issues.apache.org/jira/browse/SPARK-31918?focusedCommentId=17142837&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17142837). Closes #28907 from HyukjinKwon/SPARK-31918. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 11d2b07b74c73ce6d59ac4f7446f1eb8bc6bbb4b) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 24 June 2020, 02:03:53 UTC
9fb3760 [SPARK-32044][SS][2.4] Kafka continuous processing prints misleading initial offset ### What changes were proposed in this pull request? Use `java.util.Optional.orElseGet` instead of `java.util.Optional.orElse` to fix an unnecessary Kafka offset fetch and a misleading info log. In Java, `orElseGet` uses lazy evaluation, while `orElse` always evaluates its argument. ### Why are the changes needed? Fix the misleading initial offsets log and the unnecessary Kafka offset fetch. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Currently, there is no test for KafkaContinuousReader. Also, it's hard to test log output. Closes #28887 from warrenzhu25/SPARK-32044. Authored-by: Warren Zhu <zhonzh@microsoft.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 June 2020, 22:19:53 UTC
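The `orElse` vs `orElseGet` distinction this fix relies on is easy to demonstrate. A minimal Scala 2.12 sketch with a hypothetical `fetchEarliestOffsets` helper (not the actual KafkaContinuousReader code):

```scala
import java.util.Optional

// Hypothetical stand-in for the expensive call whose info log was misleading.
def fetchEarliestOffsets(): Map[String, Long] = {
  println("fetching earliest offsets from Kafka...") // side effect we want to skip when offsets are already known
  Map("topic-0" -> 0L)
}

val knownOffsets: Optional[Map[String, Long]] = Optional.of(Map("topic-0" -> 42L))

knownOffsets.orElse(fetchEarliestOffsets())          // argument evaluated eagerly: the fetch and the log still happen
knownOffsets.orElseGet(() => fetchEarliestOffsets()) // the Supplier runs only when the Optional is empty
```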
e50c1e5 [SPARK-32034][SQL][2.4] Port HIVE-14817: Shutdown the SessionManager timeoutChecker thread properly upon shutdown ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/28870 which ports https://issues.apache.org/jira/browse/HIVE-14817 for spark thrift server. ### Why are the changes needed? Port HIVE-14817 to fix related issues ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing Jenkins Closes #28888 from yaooqinn/SPARK-32034-24. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 June 2020, 08:13:42 UTC
d0a2d33 [SPARK-31980][SQL][2.4] Function sequence() fails if start and end of range are equal dates ### What changes were proposed in this pull request? 1. Treat the case where start equals end like the "greater" case in the condition inside `org.apache.spark.sql.catalyst.expressions.Sequence.TemporalSequenceImpl#eval` 2. Unit tests for intervals of `day`, `month`, `year` The former PR against master is https://github.com/apache/spark/pull/28819 The current PR changes `stringToInterval => CalendarInterval.fromString` in order to be compatible with branch-2.4 ### Why are the changes needed? A bug exists when sequence() gets equal start and end dates, which makes the `while` loop run forever ### Does this PR introduce _any_ user-facing change? Yes. Before this PR, people would get a `java.lang.ArrayIndexOutOfBoundsException` when evaluating the following: `sql("select sequence(cast('2011-03-01' as date), cast('2011-03-01' as date), interval 1 year)").show(false) ` ### How was this patch tested? Unit test. Closes #28877 from TJX2014/branch-2.4-SPARK-31980-compatible. Authored-by: TJX2014 <xiaoxingstack@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 June 2020, 18:14:20 UTC
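Conceptually, the fix is about the sequence loop's bound test admitting the start == stop case. A simplified sketch of that idea using plain Ints (this is not the actual TemporalSequenceImpl code):

```scala
// With a strictly exclusive bound, start == stop would never be admitted; letting
// "equal" pass the test (like the strictly-greater case) makes sequence(x, x, step)
// well defined and yields exactly one element.
def sequenceSketch(start: Int, stop: Int, step: Int): Seq[Int] = {
  require(step != 0, "step must be non-zero")
  val result = Seq.newBuilder[Int]
  var current = start
  while (if (step > 0) current <= stop else current >= stop) {
    result += current
    current += step
  }
  result.result()
}

// sequenceSketch(3, 3, 1) == Seq(3), analogous to sequence(d, d, interval 1 year) returning a single date.
```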
16c70cf Revert "[SPARK-31980][SQL] Function sequence() fails if start and end of range are equal dates" This reverts commit dee27ee04559c75bb996161bc1e1be829354fe06. 20 June 2020, 02:29:10 UTC
dee27ee [SPARK-31980][SQL] Function sequence() fails if start and end of range are equal dates ### What changes were proposed in this pull request? 1. Treat the case where start equals end like the "greater" case in the condition inside `org.apache.spark.sql.catalyst.expressions.Sequence.TemporalSequenceImpl#eval` 2. Unit tests for intervals of `day`, `month`, `year` ### Why are the changes needed? A bug exists when sequence() gets equal start and end dates, which makes the `while` loop run forever ### Does this PR introduce _any_ user-facing change? Yes. Before this PR, people would get a `java.lang.ArrayIndexOutOfBoundsException` when evaluating the following: `sql("select sequence(cast('2011-03-01' as date), cast('2011-03-01' as date), interval 1 year)").show(false) ` ### How was this patch tested? Unit test. Closes #28819 from TJX2014/master-SPARK-31980. Authored-by: TJX2014 <xiaoxingstack@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 177a380bcf1f56982760442e8418d28415bef8b0) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 June 2020, 02:25:03 UTC
9c5c823 [SPARK-31871][CORE][WEBUI][2.4] Display the canvas element icon for sorting column What changes were proposed in this pull request? Issue link: https://issues.apache.org/jira/browse/SPARK-31871 Original PR link: https://github.com/apache/spark/pull/28680 As sarutak suggested, I opened a new PR for branch-2.4. This PR fixes the issue that sorted columns cannot display their icons. Before: Executors Page ![image](https://user-images.githubusercontent.com/28332082/84387592-d7a67800-ac25-11ea-9aee-d649380a3628.png) History Server Page ![image](https://user-images.githubusercontent.com/28332082/84384892-47663400-ac21-11ea-92c5-939aff2f0b73.png) After: Executors Page ![image](https://user-images.githubusercontent.com/28332082/84385053-85635800-ac21-11ea-96b7-1015096016b9.png) History Server Page ![image](https://user-images.githubusercontent.com/28332082/84385101-94e2a100-ac21-11ea-971a-6fb6c9e656a3.png) Why are the changes needed? The sorting icon could not be found at the correct path. Does this PR introduce any user-facing change? No. How was this patch tested? Fixed existing test. history server page: Closes #28799 from liucht-inspur/branch-2.4. Lead-authored-by: liucht <liucht@inspur.com> Co-authored-by: liucht-inspur <liucht@inspur.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 17 June 2020, 17:08:23 UTC
23ff9e6 [SPARK-32000][2.4][CORE][TESTS] Fix the flaky test for partially launched task in barrier-mode ### What changes were proposed in this pull request? This PR changes the test to get an active executorId and set it as the preferred location instead of setting a fixed preferred location. ### Why are the changes needed? The test is flaky. After checking the [log](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124086/artifact/core/), I found the root cause: two test cases from different test suites got submitted at the same time because of concurrent execution. In this particular case, the two test cases (from DistributedSuite and BarrierTaskContextSuite) both launch under local-cluster mode. The two applications are submitted at the SAME time, so they have the same application ID (app-20200615210132-0000). Thus, when the cluster of BarrierTaskContextSuite is launching executors, it failed to create the directory for executor 0, because the path (/home/jenkins/workspace/work/app-app-20200615210132-0000/0) has been used by the cluster of DistributedSuite. Therefore, it had to launch executors 1 and 2 instead, which meant none of the tasks could get their preferred locality, so they got scheduled together and caused the test failure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The test cannot be reproduced locally. We can only know it's been fixed when it's no longer flaky on Jenkins. Closes #28851 from Ngone51/fix-spark-32000-24. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 June 2020, 16:59:14 UTC
e1cb384 [SPARK-31997][SQL][TESTS] Drop test_udtf table when SingleSessionSuite test completed ### What changes were proposed in this pull request? `SingleSessionSuite` does not do `DROP TABLE IF EXISTS test_udtf` when the test completes, so if we then run `HiveThriftBinaryServerSuite` with mvn test, the test case named `SPARK-11595 ADD JAR with input path having URL scheme` will FAIL because it wants to re-create the existing table test_udtf. ### Why are the changes needed? Test suites shouldn't rely on their execution order. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test: run mvn test on SingleSessionSuite and HiveThriftBinaryServerSuite in order. Closes #28838 from LuciferYang/drop-test-table. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit d24d27f1bc39e915df23d65f8fda0d83e716b308) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 16 June 2020, 10:21:19 UTC
a89a674 Revert "[SPARK-29152][CORE][2.4] Executor Plugin shutdown when dynamic allocation is enabled" This reverts commit 90e928c05073561d8f2ee40ebe50b9f7c5208754. 15 June 2020, 01:48:17 UTC
90e928c [SPARK-29152][CORE][2.4] Executor Plugin shutdown when dynamic allocation is enabled ### What changes were proposed in this pull request? Added a Shutdown Hook in `executor.scala` which will ensure that executor's `stop()` method is always called. ### Why are the changes needed? In case executors are not going down gracefully, their `stop()` is not called. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually Closes #26901 from iRakson/SPARK-29152_2.4. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 June 2020, 21:51:27 UTC
e44190a [SPARK-31968][SQL] Duplicate partition columns check when writing data ### What changes were proposed in this pull request? A duplicate partition column check is added in `org.apache.spark.sql.execution.datasources.PartitioningUtils#validatePartitionColumn`, together with a unit test. ### Why are the changes needed? When people write data with duplicate partition columns, it causes an `org.apache.spark.sql.AnalysisException: Found duplicate column ...` when loading the written data. ### Does this PR introduce _any_ user-facing change? Yes. It will prevent people from using duplicate partition columns to write data. 1. Before the PR: `df.write.partitionBy("b", "b").csv("file:///tmp/output")` looks OK, but reading it back fails: `spark.read.csv("file:///tmp/output").show()` org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `b`; 2. After the PR: `df.write.partitionBy("b", "b").csv("file:///tmp/output")` will trigger the exception: org.apache.spark.sql.AnalysisException: Found duplicate column(s) b, b: `b`; ### How was this patch tested? Unit test. Closes #28814 from TJX2014/master-SPARK-31968. Authored-by: TJX2014 <xiaoxingstack@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a4ea599b1b9b8ebaae0100b54e6ac1d7576c6d8c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 14 June 2020, 05:22:18 UTC
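A minimal sketch of the kind of up-front validation described above, with a hypothetical helper name (this is not the actual `PartitioningUtils` code):

```scala
// Fail fast at write time instead of producing output that cannot be read back.
def checkDuplicatePartitionColumns(partitionColumns: Seq[String], caseSensitive: Boolean): Unit = {
  val normalized = if (caseSensitive) partitionColumns else partitionColumns.map(_.toLowerCase)
  val duplicates = normalized.groupBy(identity).collect { case (name, group) if group.size > 1 => name }
  if (duplicates.nonEmpty) {
    throw new IllegalArgumentException(
      s"Found duplicate column(s) in the partitioning: ${duplicates.mkString(", ")}")
  }
}

// checkDuplicatePartitionColumns(Seq("b", "b"), caseSensitive = false) now throws at write time.
```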
5de1929 [SPARK-31632][CORE][WEBUI][FOLLOWUP] Enrich the exception message when application summary is unavailable ### What changes were proposed in this pull request? This PR enriches the exception message when the application summary is not available. #28444 covers the case where the application information is not available, but the case where the application summary is not available is not covered. ### Why are the changes needed? To complement #28444 . ### Does this PR introduce _any_ user-facing change? Yes. Before this change, we get the following error message when we access `/jobs` while the application summary is not available. <img width="707" alt="no-such-element-exception-error-message" src="https://user-images.githubusercontent.com/4736016/84562182-6aadf200-ad8d-11ea-8980-d63edde6fad6.png"> After this change, we get the following error message. It is similar to what #28444 does. <img width="1349" alt="enriched-errorm-message" src="https://user-images.githubusercontent.com/4736016/84562189-85806680-ad8d-11ea-8346-4da2ec11df2b.png"> ### How was this patch tested? I checked with the following procedure. 1. Set a breakpoint on the line with `kvstore.write(appSummary)` in `AppStatusListener#onApplicationStart`. Only the thread reaching this line should be suspended. 2. Start spark-shell and wait a few seconds. 3. Access `/jobs`. Closes #28820 from sarutak/fix-no-such-element. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit c2e5012a0a76734acf94b8716ee293ba2f58ccb4) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 14 June 2020, 05:18:18 UTC
f191e2b [SPARK-31967][UI][2.4] Downgrade to vis.js 4.21.0 to fix Jobs UI loading time regression ### What changes were proposed in this pull request? After #28192, the job list page becomes very slow. For example, after the following operation, the UI loading can take >40 sec. ``` (1 to 1000).foreach(_ => sc.parallelize(1 to 10).collect) ``` This is caused by a [performance issue of `vis-timeline`](https://github.com/visjs/vis-timeline/issues/379). The serious issue affects both branch-3.0 and branch-2.4. I tried a different version, 4.21.0, from https://cdnjs.com/libraries/vis The infinite drawing issue also seems fixed if zoom is disabled by default. ### Why are the changes needed? Fix the serious perf issue in the web UI by rolling vis-timeline-graph2d back to an earlier version. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the UI perf regression. ### How was this patch tested? Manual test Closes #28813 from gengliangwang/vis2.4. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 13 June 2020, 00:33:55 UTC
e51eb3a [SPARK-31954][SQL] Delete duplicate testcase in HiveQuerySuite ### What changes were proposed in this pull request? remove duplicate test cases ### Why are the changes needed? improve test quality ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? No test Closes #28782 from GuoPhilipse/31954-delete-duplicate-testcase. Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by: GuoPhilipse <guofei_ok@126.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 912d45df7c6535336f72c971c90fecd11cfe87e9) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 11 June 2020, 13:04:18 UTC
53f1349 [SPARK-31941][CORE] Replace SparkException with NoSuchElementException for applicationInfo in AppStatusStore ### What changes were proposed in this pull request? After SPARK-31632, a SparkException is thrown from def applicationInfo: `def applicationInfo(): v1.ApplicationInfo = { try { // The ApplicationInfo may not be available when Spark is starting up. store.view(classOf[ApplicationInfoWrapper]).max(1).iterator().next().info } catch { case _: NoSuchElementException => throw new SparkException("Failed to get the application information. " + "If you are starting up Spark, please wait a while until it's ready.") } }` Whereas the caller of this method, def getSparkUser in the Spark UI, does not handle SparkException in its catch block: `def getSparkUser: String = { try { Option(store.applicationInfo().attempts.head.sparkUser) .orElse(store.environmentInfo().systemProperties.toMap.get("user.name")) .getOrElse("<unknown>") } catch { case _: NoSuchElementException => "<unknown>" } }` So, on using this method (getSparkUser), the application can error out. As part of this PR, we replace SparkException with NoSuchElementException for applicationInfo in AppStatusStore. ### Why are the changes needed? On invoking the method getSparkUser, we can get a SparkException from the call to store.applicationInfo(). This is not handled in the catch block, so getSparkUser will error out in this scenario. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Did manual testing using spark-shell and spark-submit. Closes #28768 from SaurabhChawla100/SPARK-31941. Authored-by: SaurabhChawla <saurabhc@qubole.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit 82ff29be7afa2ff6350310ab9bdf6b474398fdc1) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 10 June 2020, 07:52:25 UTC
2556d32 [SPARK-31935][2.4][SQL] Hadoop file system config should be effective in data source options ### What changes were proposed in this pull request? Make Hadoop file system config effective in data source options. From `org.apache.hadoop.fs.FileSystem.java`: ``` public static FileSystem get(URI uri, Configuration conf) throws IOException { String scheme = uri.getScheme(); String authority = uri.getAuthority(); if (scheme == null && authority == null) { // use default FS return get(conf); } if (scheme != null && authority == null) { // no authority URI defaultUri = getDefaultUri(conf); if (scheme.equals(defaultUri.getScheme()) // if scheme matches default && defaultUri.getAuthority() != null) { // & default has authority return get(defaultUri, conf); // return default } } String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme); if (conf.getBoolean(disableCacheName, false)) { return createFileSystem(uri, conf); } return CACHE.get(uri, conf); } ``` Before the changes, the file system configurations in data source options are not propagated in `DataSource.scala`. After the changes, we can specify authority and URI scheme related configurations for scanning file systems. This problem only exists in data source V1. In V2, we already use `sparkSession.sessionState.newHadoopConfWithOptions(options)` in `FileTable`. ### Why are the changes needed? Allow users to specify authority and URI scheme related Hadoop configurations for file source reading. ### Does this PR introduce _any_ user-facing change? Yes, the file system related Hadoop configurations in data source options will be effective on reading. ### How was this patch tested? Unit test Closes #28771 from gengliangwang/SPARK-31935-2.4. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 10 June 2020, 01:52:09 UTC
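The mechanism being aligned here, layering per-read options on top of the base Hadoop configuration before calling `FileSystem.get`, can be sketched as follows. This is a simplified stand-alone version, not the actual `newHadoopConfWithOptions` implementation:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Copy the base conf so per-read options never leak back into the shared session conf.
def hadoopConfWithOptions(base: Configuration, options: Map[String, String]): Configuration = {
  val conf = new Configuration(base)
  options.foreach { case (k, v) => conf.set(k, v) }
  conf
}

val conf = hadoopConfWithOptions(new Configuration(), Map("fs.file.impl" -> "org.apache.hadoop.fs.RawLocalFileSystem"))
// The per-read setting influences which FileSystem implementation is resolved for this path.
val fs = FileSystem.get(new Path("file:///tmp").toUri, conf)
```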
48017cc [SPARK-31923][CORE] Ignore internal accumulators that use unrecognized types rather than crashing (branch-2.4) ### What changes were proposed in this pull request? Backport #28744 to branch-2.4. ### Why are the changes needed? Low-risk fix for branch-2.4. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. Closes #28758 from zsxwing/SPARK-31923-2.4. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 08 June 2020, 23:52:34 UTC
476010a [SPARK-31903][SQL][PYSPARK][2.4] Fix toPandas with Arrow enabled to show metrics in Query UI ### What changes were proposed in this pull request? This is a backport of #28730. In `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in the separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action and Query UI shows the plan graph without metrics. We should call `serveToStream` first, then `withAction` in it. ### Why are the changes needed? When calling toPandas, usually Query UI shows each plan node's metric: ```py >>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z']) >>> df.toPandas() x y z 0 1 10 abc 1 2 20 def ``` ![Screen Shot 2020-06-05 at 10 58 30 AM](https://user-images.githubusercontent.com/506656/83914110-6f3b3080-a725-11ea-88c0-de83a833b05c.png) but if Arrow execution is enabled, it shows only plan nodes and the duration is not correct: ```py >>> spark.conf.set('spark.sql.execution.arrow.enabled', True) >>> df.toPandas() x y z 0 1 10 abc 1 2 20 def ``` ![Screen Shot 2020-06-05 at 10 58 42 AM](https://user-images.githubusercontent.com/506656/83914127-782c0200-a725-11ea-84e4-74d861d5c20a.png) ### Does this PR introduce _any_ user-facing change? Yes, the Query UI will show the plan with the correct metrics. ### How was this patch tested? I checked it manually in my local. ![Screen Shot 2020-06-05 at 11 29 48 AM](https://user-images.githubusercontent.com/506656/83914142-7e21e300-a725-11ea-8925-edc22df16388.png) Closes #28740 from ueshin/issues/SPARK-31903/2.4/to_pandas_with_arrow_query_ui. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 06 June 2020, 07:50:40 UTC
bf01280 [SPARK-31860][BUILD][2.4] only push release tags on success ### What changes were proposed in this pull request? Only push the release tag after the build has finished. ### Why are the changes needed? If the build fails we don't need a release tag. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Running locally with a fake user Closes #28667 from holdenk/SPARK-31860-only-push-release-tags-on-success. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com> 02 June 2020, 00:46:13 UTC
ac59a8c [SPARK-31889][BUILD] Docker release script does not allocate enough memory to reliably publish ### What changes were proposed in this pull request? Allow overriding the zinc options in the docker release and set a higher memory setting so the publish step can succeed consistently. ### Why are the changes needed? The publish step experiences memory pressure. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran the test locally with a fake user to see if the publish step (besides the svn part) succeeds. Closes #28698 from holdenk/SPARK-31889-docker-release-script-does-not-allocate-enough-memory-to-reliably-publish. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com> (cherry picked from commit ab9e5a2fe9a0e9e93301096a027d7409fc3c9b64) Signed-off-by: Holden Karau <hkarau@apple.com> 01 June 2020, 22:51:52 UTC
6f6dcc8 [SPARK-31854][SQL][2.4] Invoke in MapElementsExec should not propagate null ### What changes were proposed in this pull request? This PR intends to fix a bug of `Dataset.map` below when the whole-stage codegen enabled; ``` scala> val ds = Seq(1.asInstanceOf[Integer], null.asInstanceOf[Integer]).toDS() scala> sql("SET spark.sql.codegen.wholeStage=true") scala> ds.map(v=>(v,v)).explain == Physical Plan == *(1) SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1.intValue AS _1#69, assertnotnull(input[0, scala.Tuple2, true])._2.intValue AS _2#70] +- *(1) MapElements <function1>, obj#68: scala.Tuple2 +- *(1) DeserializeToObject staticinvoke(class java.lang.Integer, ObjectType(class java.lang.Integer), valueOf, value#1, true, false), obj#67: java.lang.Integer +- LocalTableScan [value#1] // `AssertNotNull` in `SerializeFromObject` will fail; scala> ds.map(v => (v, v)).show() java.lang.NullPointerException: Null value appeared in non-nullable fails: top level Product input object If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int). // When the whole-stage codegen disabled, the query works well; scala> sql("SET spark.sql.codegen.wholeStage=false") scala> ds.map(v=>(v,v)).show() +----+----+ | _1| _2| +----+----+ | 1| 1| |null|null| +----+----+ ``` A root cause is that `Invoke` used in `MapElementsExec` propagates input null, and then [AssertNotNull](https://github.com/apache/spark/blob/1b780f364bfbb46944fe805a024bb6c32f5d2dde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L253-L255) in `SerializeFromObject` fails because a top-level row becomes null. So, `MapElementsExec` should not return `null` but `(null, null)`. 
NOTE: the generated code of the query above in the current master; ``` /* 033 */ private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException { /* 034 */ boolean mapelements_isNull_1 = true; /* 035 */ scala.Tuple2 mapelements_value_1 = null; /* 036 */ if (!false) { /* 037 */ mapelements_resultIsNull_0 = false; /* 038 */ /* 039 */ if (!mapelements_resultIsNull_0) { /* 040 */ mapelements_resultIsNull_0 = mapelements_exprIsNull_0_0; /* 041 */ mapelements_mutableStateArray_0[0] = mapelements_expr_0_0; /* 042 */ } /* 043 */ /* 044 */ mapelements_isNull_1 = mapelements_resultIsNull_0; /* 045 */ if (!mapelements_isNull_1) { /* 046 */ Object mapelements_funcResult_0 = null; /* 047 */ mapelements_funcResult_0 = ((scala.Function1) references[1] /* literal */).apply(mapelements_mutableStateArray_0[0]); /* 048 */ /* 049 */ if (mapelements_funcResult_0 != null) { /* 050 */ mapelements_value_1 = (scala.Tuple2) mapelements_funcResult_0; /* 051 */ } else { /* 052 */ mapelements_isNull_1 = true; /* 053 */ } /* 054 */ /* 055 */ } /* 056 */ } /* 057 */ /* 058 */ serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1); /* 059 */ /* 060 */ } ``` The generated code w/ this fix; ``` /* 032 */ private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException { /* 033 */ boolean mapelements_isNull_1 = true; /* 034 */ scala.Tuple2 mapelements_value_1 = null; /* 035 */ if (!false) { /* 036 */ mapelements_mutableStateArray_0[0] = mapelements_expr_0_0; /* 037 */ /* 038 */ mapelements_isNull_1 = false; /* 039 */ if (!mapelements_isNull_1) { /* 040 */ Object mapelements_funcResult_0 = null; /* 041 */ mapelements_funcResult_0 = ((scala.Function1) references[1] /* literal */).apply(mapelements_mutableStateArray_0[0]); /* 042 */ /* 043 */ if (mapelements_funcResult_0 != null) { /* 044 */ mapelements_value_1 = (scala.Tuple2) mapelements_funcResult_0; /* 045 */ mapelements_isNull_1 = false; /* 046 */ } else { /* 047 */ mapelements_isNull_1 = true; /* 048 */ } /* 049 */ /* 050 */ } /* 051 */ } /* 052 */ /* 053 */ serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1); /* 054 */ /* 055 */ } ``` This comes from https://github.com/apache/spark/pull/28681 ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes #28691 from maropu/SPARK-31854-BRANCH2.4. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 01 June 2020, 11:09:09 UTC
f32c60e Preparing development version 2.4.7-SNAPSHOT 29 May 2020, 23:28:37 UTC
807e0a4 Preparing Spark release v2.4.6-rc8 29 May 2020, 23:28:32 UTC
8307f1a [SPARK-26095][BUILD] Disable parallelization in make-distribution.sh. It makes the build slower, but at least it doesn't hang. It seems that maven-shade-plugin has some issue with parallelization. Closes #23061 from vanzin/SPARK-26095. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit d2792046a1b10a07b65fc30be573983f1237e450) Signed-off-by: Holden Karau <hkarau@apple.com> 29 May 2020, 21:00:06 UTC
1c36f8e Preparing development version 2.4.7-SNAPSHOT 29 May 2020, 07:51:23 UTC
105de05 Preparing Spark release v2.4.6-rc7 29 May 2020, 07:51:19 UTC
d53363d Preparing development version 2.4.7-SNAPSHOT 29 May 2020, 00:40:59 UTC
787c947 Preparing Spark release v2.4.6-rc6 29 May 2020, 00:40:54 UTC
7f522d5 [BUILD][INFRA] bump the timeout to match the jenkins PRB ### What changes were proposed in this pull request? bump the timeout to match what's set in jenkins ### Why are the changes needed? tests be timing out! ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? via jenkins Closes #28666 from shaneknapp/increase-jenkins-timeout. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com> (cherry picked from commit 9e68affd13a3875b92f0700b8ab7c9d902f1a08c) Signed-off-by: shane knapp <incomplete@gmail.com> 28 May 2020, 21:26:59 UTC
8bde6ed [SPARK-31839][TESTS] Delete duplicate code in CastSuite ### What changes were proposed in this pull request? Delete duplicate code in CastSuite. ### Why are the changes needed? Keep the Spark code clean. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No tests needed. Closes #28655 from GuoPhilipse/delete-duplicate-code-castsuit. Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by: GuoPhilipse <guofei_ok@126.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit dfbc5edf20040e8163ee3beef61f2743a948c508) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 28 May 2020, 00:57:48 UTC
e1adbac Preparing development version 2.4.7-SNAPSHOT 27 May 2020, 16:38:38 UTC
f5962ca Preparing Spark release v2.4.6-rc5 27 May 2020, 16:38:33 UTC
6b055a4 Preparing development version 2.4.7-SNAPSHOT 27 May 2020, 00:07:42 UTC
e08970b Preparing Spark release v2.4.6-rc4 27 May 2020, 00:07:39 UTC
48ba885 [SPARK-31819][K8S][DOCS][TESTS][2.4] Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases ### What changes were proposed in this pull request? This PR aims to add a workaround `HTTP2_DISABLE=true` to the document and to update the K8s integration test. ### Why are the changes needed? SPARK-31786 reported that the fabric8 kubernetes-client library fails to talk to K8s 1.17.x on a Java 8u251+ environment. It's fixed in Apache Spark 3.0.0 by upgrading the library, but it turns out that we cannot take the same approach in `branch-2.4` (https://github.com/apache/spark/pull/28625) ### Does this PR introduce _any_ user-facing change? Yes. This will provide a workaround in the documentation and the testing environment. ### How was this patch tested? This PR is irrelevant to Jenkins UT because it's only updating docs and integration tests. We need the following. - [x] Pass the Jenkins K8s IT with old JDK8 and K8s versions (https://github.com/apache/spark/pull/28638#issuecomment-633837179) - [x] Manually run K8s IT on K8s 1.17/Java 8u251+ with `export HTTP2_DISABLE=true`. **K8s v1.17.6 / JDK 1.8.0_252** ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. Run completed in 5 minutes, 7 seconds. Total number of tests run: 14 Suites: completed 2, aborted 0 Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #28638 from dongjoon-hyun/SPARK-31819. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 May 2020, 21:30:35 UTC
1d1a207 [SPARK-31787][K8S][TESTS][2.4] Fix Minikube.getIfNewMinikubeStatus to understand 1.5+ ### What changes were proposed in this pull request? This PR aims to fix the testing infra to support Minikube 1.5+ in K8s IT. Also, note that this is a subset of #26488 with the same ownership. ### Why are the changes needed? This helps us testing `master/3.0/2.4` in the same Minikube version. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass the Jenkins K8s IT with Minikube v0.34.1. - Manually, test with Minikube 1.5.x. ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. Run completed in 4 minutes, 37 seconds. Total number of tests run: 14 Suites: completed 2, aborted 0 Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #28599 from dongjoon-hyun/SPARK-31787. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 21 May 2020, 21:09:51 UTC
11e97b9 [K8S][MINOR] Log minikube version when running integration tests. Closes #23893 from vanzin/minikube-version. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 9f16af636661ec1f7af057764409fe359da0026a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 21 May 2020, 17:59:14 UTC
281c591 [SPARK-31399][CORE][2.4] Support indylambda Scala closure in ClosureCleaner This is a backport of https://github.com/apache/spark/pull/28463 from Apache Spark master/3.0 to 2.4. Minor adaptation include: - Retain the Spark 2.4-specific behavior of skipping the indylambda check when using Scala 2.11 - Remove unnecessary LMF restrictions in ClosureCleaner tests - Address review comments in the original PR from kiszk Tested with the default Scala 2.11 build, and also tested ClosureCleaner-related tests in Scala 2.12 build as well: - repl: `SingletonReplSuite` - core: `ClosureCleanerSuite` and `ClosureCleanerSuite2` --- ### What changes were proposed in this pull request? This PR proposes to enhance Spark's `ClosureCleaner` to support "indylambda" style of Scala closures to the same level as the existing implementation for the old (inner class) style ones. The goal is to reach feature parity with the support of the old style Scala closures, with as close to bug-for-bug compatibility as possible. Specifically, this PR addresses one lacking support for indylambda closures vs the inner class closures: - When a closure is declared in a Scala REPL and captures the enclosing REPL line object, such closure should be cleanable (unreferenced fields on the enclosing REPL line object should be cleaned) This PR maintains the same limitations in the new indylambda closure support as the old inner class closures, in particular the following two: - Cleaning is only available for one level of REPL line object. If a closure captures state from a REPL line object further out from the immediate enclosing one, it won't be subject to cleaning. See example below. - "Sibling" closures are not handled yet. A "sibling" closure is defined here as a closure that is directly or indirectly referenced by the starting closure, but isn't lexically enclosing. e.g. ```scala { val siblingClosure = (x: Int) => x + this.fieldA // captures `this`, references `fieldA` on `this`. val startingClosure = (y: Int) => y + this.fieldB + siblingClosure(y) // captures `this` and `siblingClosure`, references `fieldB` on `this`. } ``` The changes are intended to be minimal, with further code cleanups planned in separate PRs. Jargons: - old, inner class style Scala closures, aka `delambdafy:inline`: default in Scala 2.11 and before - new, "indylambda" style Scala closures, aka `delambdafy:method`: default in Scala 2.12 and later ### Why are the changes needed? There had been previous effortsto extend Spark's `ClosureCleaner` to support "indylambda" Scala closures, which is necessary for proper Scala 2.12 support. Most notably the work done for [SPARK-14540](https://issues.apache.org/jira/browse/SPARK-14540). But the previous efforts had missed one import scenario: a Scala closure declared in a Scala REPL, and it captures the enclosing `this` -- a REPL line object. e.g. in a Spark Shell: ```scala :pa class NotSerializableClass(val x: Int) val ns = new NotSerializableClass(42) val topLevelValue = "someValue" val func = (j: Int) => { (1 to j).flatMap { x => (1 to x).map { y => y + topLevelValue } } } <Ctrl+D> sc.parallelize(0 to 2).map(func).collect ``` In this example, `func` refers to a Scala closure that captures the enclosing `this` because it needs to access `topLevelValue`, which is in turn implemented as a field on the enclosing REPL line object. The existing `ClosureCleaner` in Spark supports cleaning this case in Scala 2.11-, and this PR brings feature parity to Scala 2.12+. 
Note that the existing cleaning logic only supported one level of REPL line object nesting. This PR does not go beyond that. When a closure references state declared a few commands earlier, the cleaning will fail in both Scala 2.11 and Scala 2.12. e.g. ```scala scala> :pa // Entering paste mode (ctrl-D to finish) class NotSerializableClass1(val x: Int) case class Foo(id: String) val ns = new NotSerializableClass1(42) val topLevelValue = "someValue" // Exiting paste mode, now interpreting. defined class NotSerializableClass1 defined class Foo ns: NotSerializableClass1 = NotSerializableClass1615b1baf topLevelValue: String = someValue scala> :pa // Entering paste mode (ctrl-D to finish) val closure2 = (j: Int) => { (1 to j).flatMap { x => (1 to x).map { y => y + topLevelValue } // 2 levels } } // Exiting paste mode, now interpreting. closure2: Int => scala.collection.immutable.IndexedSeq[String] = <function1> scala> sc.parallelize(0 to 2).map(closure2).collect org.apache.spark.SparkException: Task not serializable ... ``` in the Scala 2.11 / Spark 2.4.x case: ``` Caused by: java.io.NotSerializableException: NotSerializableClass1 Serialization stack: - object not serializable (class: NotSerializableClass1, value: NotSerializableClass1615b1baf) - field (class: $iw, name: ns, type: class NotSerializableClass1) - object (class $iw, $iw64df3f4b) - field (class: $iw, name: $iw, type: class $iw) - object (class $iw, $iw66e6e5e9) - field (class: $line14.$read, name: $iw, type: class $iw) - object (class $line14.$read, $line14.$readc310aa3) - field (class: $iw, name: $line14$read, type: class $line14.$read) - object (class $iw, $iw79224636) - field (class: $iw, name: $outer, type: class $iw) - object (class $iw, $iw636d4cdc) - field (class: $anonfun$1, name: $outer, type: class $iw) - object (class $anonfun$1, <function1>) ``` in the Scala 2.12 / Spark 2.4.x case after this PR: ``` Caused by: java.io.NotSerializableException: NotSerializableClass1 Serialization stack: - object not serializable (class: NotSerializableClass1, value: NotSerializableClass16f3b4c9a) - field (class: $iw, name: ns, type: class NotSerializableClass1) - object (class $iw, $iw2945a3c1) - field (class: $iw, name: $iw, type: class $iw) - object (class $iw, $iw152705d0) - field (class: $line14.$read, name: $iw, type: class $iw) - object (class $line14.$read, $line14.$read7cf311eb) - field (class: $iw, name: $line14$read, type: class $line14.$read) - object (class $iw, $iwd980dac) - field (class: $iw, name: $outer, type: class $iw) - object (class $iw, $iw557d9532) - element of array (index: 0) - array (class [Ljava.lang.Object;, size 1) - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;) - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class $iw, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic $anonfun$closure2$1$adapted:(L$iw;Ljava/lang/Object;)Lscala/collection/immutable/IndexedSeq;, instantiatedMethodType=(Ljava/lang/Object;)Lscala/collection/immutable/IndexedSeq;, numCaptured=1]) - writeReplace data (class: java.lang.invoke.SerializedLambda) - object (class $Lambda$2103/815179920, $Lambda$2103/815179920569b57c4) ``` For more background of the new and old ways Scala lowers closures to Java bytecode, please see [A note on how NSC (New Scala Compiler) lowers lambdas](https://gist.github.com/rednaxelafx/e9ecd09bbd1c448dbddad4f4edf25d48#file-notes-md). 
For more background on how Spark's `ClosureCleaner` works and what's needed to make it support "indylambda" Scala closures, please refer to [A Note on Apache Spark's ClosureCleaner](https://gist.github.com/rednaxelafx/e9ecd09bbd1c448dbddad4f4edf25d48#file-spark_closurecleaner_notes-md). #### tl;dr The `ClosureCleaner` works like a mark-sweep algorithm on fields: - Finding (a chain of) outer objects referenced by the starting closure; - Scanning the starting closure and its inner closures and marking the fields on the outer objects accessed; - Cloning the outer objects, nulling out fields that are not accessed by any closure of concern. ##### Outer Objects For the old, inner class style Scala closures, the "outer objects" is defined as the lexically enclosing closures of the starting closure, plus an optional enclosing REPL line object if these closures are defined in a Scala REPL. All of them are on a singly-linked `$outer` chain. For the new, "indylambda" style Scala closures, the capturing implementation changed, so closures no longer refer to their enclosing closures via an `$outer` chain. However, a closure can still capture its enclosing REPL line object, much like the old style closures. The name of the field that captures this reference would be `arg$1` (instead of `$outer`). So what's missing in the `ClosureCleaner` for the "indylambda" support is find and potentially clone+clean the captured enclosing `this` REPL line object. That's what this PR implements. ##### Inner Closures The old, inner class style of Scala closures are compiled into separate inner classes, one per lambda body. So in order to discover the implementation (bytecode) of the inner closures, one has to jump over multiple classes. The name of such a class would contain the marker substring `$anonfun$`. The new, "indylambda" style Scala closures are compiled into **static methods** in the class where the lambdas were declared. So for lexically nested closures, their lambda bodies would all be compiled into static methods **in the same class**. This makes it much easier to discover the implementation (bytecode) of the nested lambda bodies. The name of such a static method would contain the marker substring `$anonfun$`. Discovery of inner closures involves scanning bytecode for certain patterns that represent the creation of a closure object for the inner closure. - For inner class style: the closure object creation site is like `new <InnerClassForTheClosure>(captured args)` - For "indylambda" style: the closure object creation site would be compiled into an `invokedynamic` instruction, with its "bootstrap method" pointing to the same one used by Java 8 for its serializable lambdas, and with the bootstrap method arguments pointing to the implementation method. ### Does this PR introduce _any_ user-facing change? Yes. Before this PR, Spark 2.4 / 3.0 / master on Scala 2.12 would not support Scala closures declared in a Scala REPL that captures anything from the REPL line objects. After this PR, such scenario is supported. ### How was this patch tested? Added new unit test case to `org.apache.spark.repl.SingletonReplSuite`. The new test case fails without the fix in this PR, and pases with the fix. Closes #28463 from rednaxelafx/closure-cleaner-indylambda. 
Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit dc01b7556f74e4a9873ceb1f78bc7df4e2ab4a8a) Signed-off-by: Kris Mok <kris.mok@databricks.com> Closes #28577 from rednaxelafx/backport-spark-31399-2.4. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: DB Tsai <d_tsai@apple.com> 19 May 2020, 19:29:33 UTC
fdbd32e [SPARK-31692][SQL] Pass hadoop confs specified via Spark confs to URLStreamHandlerFactory Pass hadoop confs specified via Spark confs to URLStreamHandlerFactory **BEFORE** ``` ➜ spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem scala> spark.sharedState res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState5793cd84 scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream res1: java.io.InputStream = org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream22846025 scala> import org.apache.hadoop.fs._ import org.apache.hadoop.fs._ scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration) res2: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.LocalFileSystem5a930c03 ``` **AFTER** ``` ➜ spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem scala> spark.sharedState res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState5c24a636 scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream res1: java.io.InputStream = org.apache.hadoop.fs.FSDataInputStream2ba8f528 scala> import org.apache.hadoop.fs._ import org.apache.hadoop.fs._ scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration) res2: org.apache.hadoop.fs.FileSystem = LocalFS scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration).getClass res3: Class[_ <: org.apache.hadoop.fs.FileSystem] = class org.apache.hadoop.fs.RawLocalFileSystem ``` The type of the FileSystem object created (check the last statement in the snippets above) differs between the two cases, which should not have been the case. No. Tested locally; added a unit test. Closes #28516 from karuppayya/SPARK-31692. Authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 72601460ada41761737f39d5dff8e69444fce2ba) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit d639a12ef243e1e8d20bd06d3a97d00e47f05517) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 May 2020, 00:07:44 UTC
d52ff4a [SPARK-25694][SQL][FOLLOW-UP] Move 'spark.sql.defaultUrlStreamHandlerFactory.enabled' into StaticSQLConf.scala This PR is a follow-up of https://github.com/apache/spark/pull/26530 and proposes to move the configuration `spark.sql.defaultUrlStreamHandlerFactory.enabled` to `StaticSQLConf.scala` for consistency. To put similar configurations together and for readability. No. Manually tested as described in https://github.com/apache/spark/pull/26530. Closes #26570 from HyukjinKwon/SPARK-25694. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 8469614c0513fbed87977d4e741649db3fdd8add) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 May 2020, 23:57:14 UTC
19cb475 [SPARK-25694][SQL] Add a config for `URL.setURLStreamHandlerFactory` Add a property `spark.sql.defaultUrlStreamHandlerFactory.enabled` to allow users to turn off the default registration of `org.apache.hadoop.fs.FsUrlStreamHandlerFactory`. This [SPARK-25694](https://issues.apache.org/jira/browse/SPARK-25694) is a long-standing issue. Originally, [[SPARK-12868][SQL] Allow adding jars from hdfs](https://github.com/apache/spark/pull/17342) added this for better Hive support. However, this has a side-effect when users use Apache Spark without `-Phive`. This causes exceptions when users try to use other custom factories or 3rd-party libraries (trying to set this). This configuration will unblock those non-Hive users. Yes. This provides a new user-configurable property. By default, the behavior is unchanged. Manual testing. **BEFORE** ``` $ build/sbt package $ bin/spark-shell scala> sql("show tables").show +--------+---------+-----------+ |database|tableName|isTemporary| +--------+---------+-----------+ +--------+---------+-----------+ scala> java.net.URL.setURLStreamHandlerFactory(new org.apache.hadoop.fs.FsUrlStreamHandlerFactory()) java.lang.Error: factory already defined at java.net.URL.setURLStreamHandlerFactory(URL.java:1134) ... 47 elided ``` **AFTER** ``` $ build/sbt package $ bin/spark-shell --conf spark.sql.defaultUrlStreamHandlerFactory.enabled=false scala> sql("show tables").show +--------+---------+-----------+ |database|tableName|isTemporary| +--------+---------+-----------+ +--------+---------+-----------+ scala> java.net.URL.setURLStreamHandlerFactory(new org.apache.hadoop.fs.FsUrlStreamHandlerFactory()) ``` Closes #26530 from jiangzho/master. Lead-authored-by: Zhou Jiang <zhou_jiang@apple.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: zhou-jiang <zhou_jiang@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit ee3bd6d76887ccc4961fd520c5d03f7edd3742ac) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 May 2020, 23:45:27 UTC
5c57171 [SPARK-31740][K8S][TESTS] Use GitHub URL instead of a broken link This PR aims to use a GitHub URL instead of a broken link in `BasicTestsSuite.scala`. Currently, the K8s integration test is broken: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20K8s%20Builds/job/spark-master-test-k8s/534/console ``` - Run SparkRemoteFileTest using a remote data file *** FAILED *** The code passed to eventually never returned normally. Attempted 130 times over 2.00109555135 minutes. Last failure message: false was not true. (KubernetesSuite.scala:370) ``` No. Pass the K8s integration test. Closes #28561 from williamhyun/williamhyun-patch-1. Authored-by: williamhyun <62487364+williamhyun@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5bb1a09b5f3a0f91409c7245847ab428c3c58322) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 18 May 2020, 05:14:16 UTC
6d8cef5 [SPARK-31655][BUILD][2.4] Upgrade snappy-java to 1.1.7.5 ### What changes were proposed in this pull request? snappy-java has released v1.1.7.5; upgrade to the latest version. Fixed in v1.1.7.4 - Caching internal buffers for SnappyFramed streams #234 - Fixed the native lib for ppc64le to work with glibc 2.17 (previously it depended on 2.22) Fixed in v1.1.7.5 - Fixes java.lang.NoClassDefFoundError: org/xerial/snappy/pool/DefaultPoolFactory in 1.1.7.4 https://github.com/xerial/snappy-java/compare/1.1.7.3...1.1.7.5 v1.1.7.5 release note: https://github.com/xerial/snappy-java/commit/edc4ec28bdb15a32b6c41ca9e8b195e635bec3a3 ### Why are the changes needed? To pick up upstream bug fixes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No tests needed. Closes #28506 from AngersZhuuuu/spark-31655-2.4. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 May 2020, 19:01:36 UTC
a4885f3 [SPARK-31663][SQL] Grouping sets with having clause returns the wrong result - Resolve the having condition while expanding the GROUPING SETS/CUBE/ROLLUP expressions together in `ResolveGroupingAnalytics`: - Change the operation's resolving direction to top-down. - Try resolving the condition of the filter as though it is in the aggregate clause by reusing the function in `ResolveAggregateFunctions` - Push the aggregate expressions into the aggregate which contains the expanded operations. - Use UnresolvedHaving for all having clauses. Correctness bug fix. See the demo and analysis in SPARK-31663. Yes, correctness bug fix for HAVING with GROUPING SETS. New UTs added. Closes #28501 from xuanyuanking/SPARK-31663. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 86bd37f37eb1e534c520dc9a02387debf9fa05a1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 May 2020, 05:32:47 UTC
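For readers unfamiliar with the affected query shape, the following is an illustrative GROUPING SETS aggregation with a HAVING clause of the kind this fix addresses; the data and query are made up, and the authoritative repro and analysis are in SPARK-31663.

```scala
// Runs in spark-shell; illustrative only.
spark.range(12)
  .selectExpr("id % 2 AS a", "id % 3 AS b", "id AS c")
  .createOrReplaceTempView("t")

// Before the fix, resolving the HAVING condition together with the expanded
// GROUPING SETS expressions could yield a wrong result for queries of this shape.
spark.sql("""
  SELECT a, b, sum(c) AS s
  FROM t
  GROUP BY a, b GROUPING SETS ((a), (a, b))
  HAVING b IS NOT NULL
""").show()
```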
18e0767 Preparing development version 2.4.7-SNAPSHOT 16 May 2020, 02:03:49 UTC
570848d Preparing Spark release v2.4.6-rc3 16 May 2020, 02:03:44 UTC
d2027e7 Preparing development version 2.4.7-SNAPSHOT 15 May 2020, 22:49:14 UTC
d6a488a Preparing Spark release v2.4.6-rc2 15 May 2020, 22:49:09 UTC
8b72eb7 [SPARK-31712][SQL][TESTS][2.4] Check casting timestamps before the epoch to Byte/Short/Int/Long types ### What changes were proposed in this pull request? Added tests to check casting timestamps before 1970-01-01 00:00:00Z to ByteType, ShortType, IntegerType and LongType in ansi and non-ansi modes. This is a backport of https://github.com/apache/spark/pull/28531. ### Why are the changes needed? To improve test coverage and prevent errors while modifying the CAST expression code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suites: ``` $ ./build/sbt "test:testOnly *CastSuite" ``` Closes #28542 from MaxGekk/test-cast-timestamp-to-byte-2.4. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 May 2020, 20:31:08 UTC
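As a quick illustration of what these tests exercise (the values below are illustrative; the authoritative cases are in the modified `CastSuite`), casting a pre-epoch timestamp to an integral type yields a negative number of seconds since the epoch:

```scala
// spark-shell, non-ANSI mode; pin the session time zone so the literal is unambiguous.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT CAST(TIMESTAMP '1969-12-31 23:59:59' AS LONG) AS secs").show()
// Expected: -1, i.e. one second before 1970-01-01 00:00:00Z.
```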
c3ae928 [SPARK-31716][SQL] Use fallback versions in HiveExternalCatalogVersionsSuite ### What changes were proposed in this pull request? This PR aims to provide fallback versions instead of `Nil` in `HiveExternalCatalogVersionsSuite`. The provided fallback Spark versions recover Jenkins jobs instead of letting them fail. ### Why are the changes needed? Currently, `HiveExternalCatalogVersionsSuite` is aborted in all Jenkins jobs except JDK11 Jenkins jobs, which don't have old Spark releases supporting JDK11. ``` HiveExternalCatalogVersionsSuite: org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** Exception encountered when invoking run on a nested suite - Fail to get the latest Spark versions to test. (HiveExternalCatalogVersionsSuite.scala:180) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins jobs. Closes #28536 from dongjoon-hyun/SPARK-HiveExternalCatalogVersionsSuite. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5d90886523c415768c65ea9cba7db24bc508a23b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 May 2020, 07:30:59 UTC
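The shape of the fix can be sketched as follows; the helper name and version numbers are illustrative, not the suite's actual code.

```scala
import scala.util.control.NonFatal

// Stand-in for the suite's lookup of released Spark versions from the Apache mirrors.
def fetchLatestSparkVersions(): Seq[String] =
  sys.error("mirror lookup failed")

// Hardcoded last-known-good releases used when the lookup fails, instead of returning Nil
// (which previously aborted the whole suite).
val fallbackVersions = Seq("2.3.4", "2.4.5")

val versionsToTest =
  try fetchLatestSparkVersions()
  catch { case NonFatal(_) => fallbackVersions }
```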
630f4dd [SPARK-31713][INFRA] Make test-dependencies.sh detect version string correctly ### What changes were proposed in this pull request? This PR makes `test-dependencies.sh` detect the version string correctly by ignoring all the other lines. ### Why are the changes needed? Currently, all SBT jobs are broken like the following. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-3.0-test-sbt-hadoop-3.2-hive-2.3/476/console ``` [error] running /home/jenkins/workspace/spark-branch-3.0-test-sbt-hadoop-3.2-hive-2.3/dev/test-dependencies.sh ; received return code 1 Build step 'Execute shell' marked build as failure ``` The reason is that the script detects the old version as `Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT` when `build/mvn` falls back. Specifically, in the script, `OLD_VERSION` became `Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT` instead of `3.1.0-SNAPSHOT` if `build/mvn` did fall back. Then, the `pom.xml` file is corrupted like the following at the end and the exit code becomes `1` instead of `0`. This causes Jenkins jobs to fail. ``` - <version>3.1.0-SNAPSHOT</version> + <version>Falling</version> ``` **NO FALLBACK** ``` $ build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn 3.1.0-SNAPSHOT ``` **FALLBACK** ``` $ build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec Falling back to archive.apache.org to download Maven Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn 3.1.0-SNAPSHOT ``` **In the script** ``` $ echo $(build/mvn -q -Dexec.executable="echo" -Dexec.args='${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:1.6.0:exec) Using `mvn` from path: /Users/dongjoon/APACHE/spark-merge/build/apache-maven-3.6.3/bin/mvn Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT ``` This PR prevents irrelevant log lines like `Falling back to archive.apache.org to download Maven` from being picked up. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the PR Builder. Closes #28532 from dongjoon-hyun/SPARK-31713. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit cd5fbcf9a0151f10553f67bcaa22b8122b3cf263) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 May 2020, 02:28:58 UTC
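The actual change lives in `dev/test-dependencies.sh`, but the filtering idea can be sketched in Scala; the regex and the sample output below are assumptions, not the script's exact logic.

```scala
// Keep only the line that looks like a Maven version, ignoring fallback/log noise.
val mvnOutput = Seq(
  "Falling back to archive.apache.org to download Maven",
  "3.1.0-SNAPSHOT")

val versionPattern = """^\d+\.\d+\.\d+(-SNAPSHOT)?$""".r
val version = mvnOutput.find(line => versionPattern.pattern.matcher(line).matches())
// version == Some("3.1.0-SNAPSHOT"); previously the noise line could be picked up instead.
```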
1ea5844 [SPARK-31676][ML] QuantileDiscretizer raises error: parameter splits given invalid value (splits array includes -0.0 and 0.0) In QuantileDiscretizer.getDistinctSplits, before invoking distinct, normalize all -0.0 and 0.0 to be 0.0 ``` for (i <- 0 until splits.length) { if (splits(i) == -0.0) { splits(i) = 0.0 } } ``` Fixes a bug. No. Unit test. ~~~scala import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) import spark.implicits._ val df1 = sc.parallelize(a1, 2).toDF("id") import org.apache.spark.ml.feature.QuantileDiscretizer val qd = new QuantileDiscretizer().setInputCol("id").setOutputCol("out").setNumBuckets(200).setRelativeError(0.0) val model = qd.fit(df1) // will raise error in spark master. ~~~ `0.0 == -0.0` is True but `0.0.hashCode == -0.0.hashCode()` is False. This breaks the contract between equals() and hashCode(): if two objects are equal, then they must have the same hash code. And array.distinct relies on elem.hashCode, so it leads to this error. Test code on distinct: ``` import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) a1.distinct.sorted.foreach(x => print(x.toString + "\n")) ``` Then you will see output like: ``` ... -0.009292684662246975 -0.0033280686465135823 -0.0 0.0 0.0022219556032221366 0.02217419561977274 ... ``` Closes #28498 from WeichenXu123/SPARK-31676. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit b2300fca1e1a22d74c6eeda37942920a6c6299ff) Signed-off-by: Sean Owen <srowen@gmail.com> 14 May 2020, 14:27:30 UTC
4ba9421 [SPARK-31632][CORE][WEBUI] Enrich the exception message when application information is unavailable ### What changes were proposed in this pull request? This PR caught the `NoSuchElementException` and enriched the error message for `AppStatusStore.applicationInfo()` when Spark is starting up and the application information is unavailable. ### Why are the changes needed? During the initialization of `SparkContext`, it first starts the Web UI and then set up the `LiveListenerBus` thread for dispatching the `SparkListenerApplicationStart` event (which will trigger writing the requested `ApplicationInfo` to `InMemoryStore`). If the Web UI is accessed before this info's being written to `InMemoryStore`, the following `NoSuchElementException` will be thrown. ``` WARN org.eclipse.jetty.server.HttpChannel: /jobs/ java.util.NoSuchElementException at java.util.Collections$EmptyIterator.next(Collections.java:4191) at org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467) at org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39) at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266) at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80) at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.Server.handle(Server.java:505) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804) at java.lang.Thread.run(Thread.java:748) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested. This can be reproduced: 1. `./bin/spark-shell` 2. 
at the same time, open `http://localhost:4040/jobs/` in your browser and refresh it quickly. Closes #28444 from xccui/SPARK-31632. Authored-by: Xingcan Cui <xccui@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 42951e6786319481220ba4abfad015a8d11749f3) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 14 May 2020, 03:08:03 UTC
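The enrichment boils down to a catch-and-rethrow with a friendlier message; below is a minimal sketch, with the helper name and message wording as illustrative stand-ins rather than the exact `AppStatusStore` code.

```scala
import java.util.NoSuchElementException

// Wraps a store lookup that can legitimately be empty while SparkContext is still starting up.
def applicationInfoOrExplain[T](lookup: => T): T =
  try lookup
  catch {
    case _: NoSuchElementException =>
      // The listener bus may not have written SparkListenerApplicationStart to the store yet.
      throw new NoSuchElementException(
        "Failed to get the application information. " +
          "If you are starting up Spark, please wait a while until it's ready.")
  }
```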
5b51880 [SPARK-31691][INFRA] release-build.sh should ignore a fallback output from `build/mvn` ### What changes were proposed in this pull request? This PR adds the `i` option to `grep` so that additional `build/mvn` output which is irrelevant to the version string is ignored. ### Why are the changes needed? SPARK-28963 added an additional output message, `Falling back to archive.apache.org to download Maven`, in `build/mvn`. This breaks `dev/create-release/release-build.sh`, and currently the Spark Packaging Jenkins job is hitting this issue consistently and is broken. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/2912/console ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This happens only when the mirror fails. So, this is verified manually by hijacking the script. It works like the following. ``` $ echo 'Falling back to archive.apache.org to download Maven' > out $ build/mvn help:evaluate -Dexpression=project.version >> out Using `mvn` from path: /Users/dongjoon/PRS/SPARK_RELEASE_2/build/apache-maven-3.6.3/bin/mvn $ cat out | grep -v INFO | grep -v WARNING | grep -v Download Falling back to archive.apache.org to download Maven 3.1.0-SNAPSHOT $ cat out | grep -v INFO | grep -v WARNING | grep -vi Download 3.1.0-SNAPSHOT ``` Closes #28514 from dongjoon-hyun/SPARK_RELEASE_2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3772154442e6341ea97a2f41cd672413de918e85) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit ce52f61f720783e8eeb3313c763493054599091a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 12 May 2020, 21:26:26 UTC