https://github.com/apache/spark

42f25f3 Preparing Spark release v2.4.0-rc2 27 September 2018, 14:30:59 UTC
3c78ea2 [SPARK-25522][SQL] Improve type promotion for input arguments of elementAt function ## What changes were proposed in this pull request? In ElementAt, when the first argument is MapType, we should coerce the key type and the second argument based on findTightestCommonType. This is not happening currently. We may produce wrong output as we will incorrectly downcast the right hand side double expression to int. ```SQL spark-sql> select element_at(map(1,"one", 2, "two"), 2.2); two ``` Also, when the first argument is ArrayType, the second argument should be an integer type or a smaller integral type that can be safely cast to an integer type. Currently we may do an unsafe cast. In the following case, we should fail with an error as 2.2 is not an integer index. But currently we downcast it to int and return a result instead. ```SQL spark-sql> select element_at(array(1,2), 1.24D); 1 ``` This PR also supports implicit casts between two MapTypes. I have followed similar logic that exists today to do implicit casts between two array types. ## How was this patch tested? Added new tests in DataFrameFunctionSuite, TypeCoercionSuite. Closes #22544 from dilipbiswal/SPARK-25522. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d03e0af80d7659f12821cc2442efaeaee94d3985) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 September 2018, 11:50:01 UTC
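A minimal sketch of the kind of query this type-promotion fix affects, assuming a Spark 2.4 shell with a `spark` session available; the literals are illustrative and not taken from the test suite:

```scala
// With coercion based on findTightestCommonType, the map key and the lookup
// value are promoted to a common type instead of the double being truncated to int.
spark.sql("SELECT element_at(map(1, 'one', 2, 'two'), 2)").show()

// For arrays, the index must be an integral type; a fractional index such as
// 1.24D is rejected after this change rather than silently downcast.
spark.sql("SELECT element_at(array(1, 2), 1)").show()
```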
53eb858 [SPARK-25314][SQL] Fix Python UDF accessing attributes from both side of join in join conditions ## What changes were proposed in this pull request? Thanks to bahchis for reporting this. It is a follow-up to #16581; this PR fixes the scenario of a Python UDF accessing attributes from both sides of a join in the join condition. ## How was this patch tested? Added regression tests in PySpark and `BatchEvalPythonExecSuite`. Closes #22326 from xuanyuanking/SPARK-25314. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 2a8cbfddba2a59d144b32910c68c22d0199093fe) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 September 2018, 07:13:39 UTC
0b4e581 [SPARK-23715][SQL][DOC] improve document for from/to_utc_timestamp ## What changes were proposed in this pull request? We have an agreement that the behavior of `from/to_utc_timestamp` is corrected, although the function itself doesn't make much sense in Spark: https://issues.apache.org/jira/browse/SPARK-23715 This PR improves the document. ## How was this patch tested? N/A Closes #22543 from cloud-fan/doc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ff876137faba1802b66ecd483ba15f6ccd83ffc5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 September 2018, 07:02:52 UTC
0cf4c5b [SPARK-25468][WEBUI] Highlight current page index in the spark UI ## What changes were proposed in this pull request? This PR highlights the current page index in the Spark UI and history server UI, https://issues.apache.org/jira/browse/SPARK-25468 I have added the following code in webui.css ``` .paginate_button.active>a { color: #999999; text-decoration: underline; } ``` ## How was this patch tested? Manual tests for Chrome, Firefox and Safari Before modifying: ![image](https://user-images.githubusercontent.com/10048468/45914897-01ca6c00-be7e-11e8-8e31-47d45db0c3bf.png) After modifying: ![image](https://user-images.githubusercontent.com/10048468/45913987-7e564e00-be70-11e8-9c16-de17e2c63308.png) Closes #22516 from Adamyuanyuan/spark-adam-25468. Lead-authored-by: 王小刚 <wangxiaogang@chinatelecom.cn> Co-authored-by: Adam Wang <Adamyuanyuan@users.noreply.github.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 8b727994edd27104d49c6d690f93c6858fb9e1fc) Signed-off-by: Sean Owen <sean.owen@databricks.com> 27 September 2018, 05:02:29 UTC
01c000b Revert "[SPARK-25540][SQL][PYSPARK] Make HiveContext in PySpark behave as the same as Scala." This reverts commit 7656358adc39eb8eb881368ab5a066fbf86149c8. 27 September 2018, 04:38:14 UTC
f12769e [SPARK-25536][CORE] metric value for METRIC_OUTPUT_RECORDS_WRITTEN is incorrect ## What changes were proposed in this pull request? changed metric value of METRIC_OUTPUT_RECORDS_WRITTEN from 'task.metrics.inputMetrics.recordsRead' to 'task.metrics.outputMetrics.recordsWritten'. This bug was introduced in SPARK-22190. https://github.com/apache/spark/pull/19426 ## How was this patch tested? Existing tests Closes #22555 from shahidki31/SPARK-25536. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5def10e61e49dba85f4d8b39c92bda15137990a2) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 September 2018, 04:14:13 UTC
7656358 [SPARK-25540][SQL][PYSPARK] Make HiveContext in PySpark behave as the same as Scala. ## What changes were proposed in this pull request? In Scala, `HiveContext` sets a config `spark.sql.catalogImplementation` of the given `SparkContext` and then passes to `SparkSession.builder`. The `HiveContext` in PySpark should behave as the same as Scala. ## How was this patch tested? Existing tests. Closes #22552 from ueshin/issues/SPARK-25540/hive_context. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c3c45cbd76d91d591d98cf8411fcfd30079f5969) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 September 2018, 01:51:42 UTC
2ff91f2 [SPARK-25454][SQL] add a new config for picking minimum precision for integral literals ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/20023 proposed to allow precision loss during decimal operations, to reduce the possibilities of overflow. This is a behavior change and is protected by the DECIMAL_OPERATIONS_ALLOW_PREC_LOSS config. However, that PR introduced another behavior change: pick a minimum precision for integral literals, which is not protected by a config. This PR adds a new config for it: `spark.sql.literal.pickMinimumPrecision`. This allows users to work around the issue in SPARK-25454, which is caused by a long-standing bug of negative scale. ## How was this patch tested? a new test Closes #22494 from cloud-fan/decimal. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit d0990e3dfee752a6460a6360e1a773138364d774) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 27 September 2018, 00:47:18 UTC
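As a rough illustration (assuming a running `spark` session), the new flag is an ordinary SQL conf and can be toggled per session; setting it to false simply restores the pre-#20023 literal handling:

```scala
// Disable minimum-precision picking for integral literals in this session.
spark.conf.set("spark.sql.literal.pickMinimumPrecision", "false")
```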
8d17200 [SPARK-24519][CORE] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS only once ## What changes were proposed in this pull request? Previously SPARK-24519 created a modifiable config SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS. However, the config is being parsed for every creation of MapStatus, which could be very expensive. Another problem with the previous approach is that it created the illusion that this can be changed dynamically at runtime, which was not true. This PR changes it so the config is computed only once. ## How was this patch tested? Removed a test case that's no longer valid. Closes #22521 from rxin/SPARK-24519. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit e702fb1d5218d062fcb8e618b92dad7958eb4062) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 26 September 2018, 17:22:50 UTC
dc60476 [SPARK-25318] Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block SPARK-4105 provided a solution to the block corruption issue by retrying the fetch or the stage. In that solution there is a step that wraps the input stream with compression and/or encryption. This step is prone to exceptions, but in the current code there is no exception handling for this step and this has caused confusion for the user. The confusion was that after SPARK-4105 the user expects to see either a fetchFailed exception or a warning about a corrupted block. However an exception during wrapping can fail the job without any of those. This change adds exception handling for the wrapping step and also adds a fetch retry if we experience a corruption during the wrapping step. The reason for adding the retry is that usually the user won't experience the same failure after rerunning the job, so it seems reasonable to try to fetch and wrap one more time instead of failing. Closes #22325 from rezasafi/localcorruption. Authored-by: Reza Safi <rezasafi@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit bd2ae857d1c5f251056de38a7a40540986756b94) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 26 September 2018, 16:30:13 UTC
9969827 [SPARK-25509][CORE] Windows doesn't support POSIX permissions ## What changes were proposed in this pull request? SHS V2 cannot be enabled on Windows, because Windows doesn't support POSIX permissions. ## How was this patch tested? The test case fails on Windows without this fix. org.apache.spark.deploy.history.HistoryServerDiskManagerSuite test("leasing space") SHS V2 cannot run successfully on Windows without this fix. java.lang.UnsupportedOperationException: 'posix:permissions' not supported as initial attribute at sun.nio.fs.WindowsSecurityDescriptor.fromAttribute(WindowsSecurityDescriptor.java:358) Closes #22520 from jianjianjiao/FixWindowsPermssionsIssue. Authored-by: Rong Tang <rotang@microsoft.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit a2ac5a72ccd2b14c8492d4a6da9e8b30f0f3c9b4) Signed-off-by: Sean Owen <sean.owen@databricks.com> 26 September 2018, 15:37:27 UTC
d44b863 [SPARK-20937][DOCS] Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide ## What changes were proposed in this pull request? Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide. ## How was this patch tested? N/A Closes #22453 from seancxmao/SPARK-20937. Authored-by: seancxmao <seancxmao@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit cf5c9c4b550c3a8ed59d7ef9404f2689ea763fa9) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 26 September 2018, 14:14:27 UTC
3f20305 [SPARK-24324][PYTHON][FOLLOW-UP] Rename the Conf to spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName ## What changes were proposed in this pull request? Add the legacy prefix for spark.sql.execution.pandas.groupedMap.assignColumnsByPosition and rename it to spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName ## How was this patch tested? The existing tests. Closes #22540 from gatorsmile/renameAssignColumnsByPosition. Lead-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 8c2edf46d0f89e5ec54968218d89f30a3f8190bc) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 26 September 2018, 01:33:13 UTC
f91247f [SPARK-25422][CORE] Don't memory map blocks streamed to disk. After data has been streamed to disk, the buffers are inserted into the memory store in some cases (eg., with broadcast blocks). But broadcast code also disposes of those buffers when the data has been read, to ensure that we don't leave mapped buffers using up memory, which then leads to garbage data in the memory store. ## How was this patch tested? Ran the old failing test in a loop. Full tests on jenkins Closes #22546 from squito/SPARK-25422-master. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 9bb3a0c67bd851b09ff4701ef1d280e2a77d791b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 September 2018, 00:45:56 UTC
544f86a [SPARK-25495][SS] FetchedData.reset should reset all fields ## What changes were proposed in this pull request? `FetchedData.reset` should reset `_nextOffsetInFetchedData` and `_offsetAfterPoll`. Otherwise it will cause inconsistent cached data and may make Kafka connector return wrong results. ## How was this patch tested? The new unit test. Closes #22507 from zsxwing/fix-kafka-reset. Lead-authored-by: Shixiong Zhu <zsxwing@gmail.com> Co-authored-by: Shixiong Zhu <shixiong@databricks.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> (cherry picked from commit 66d29870c09e6050dd846336e596faaa8b0d14ad) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 25 September 2018, 18:42:39 UTC
a709718 [SPARK-23907][SQL] Revert regr_* functions entirely ## What changes were proposed in this pull request? This patch reverts entirely all the regr_* functions added in SPARK-23907. These were added by mgaido91 (and proposed by gatorsmile) to improve compatibility with other database systems, without any actual use cases. However, they are very rarely used, and in Spark there are much better ways to compute these functions, due to Spark's flexibility in exposing real programming APIs. I'm going through all the APIs added in Spark 2.4 and I think we should revert these. If there are strong enough demands and more use cases, we can add them back in the future pretty easily. ## How was this patch tested? Reverted test cases also. Closes #22541 from rxin/SPARK-23907. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 9cbd001e2476cd06aa0bcfcc77a21a9077d5797a) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 25 September 2018, 12:13:22 UTC
4ca4ef7 [SPARK-25519][SQL] ArrayRemove function may return incorrect result when right expression is implicitly downcasted. ## What changes were proposed in this pull request? In ArrayRemove, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down casting and may return wrong result or questionable result. Example : ```SQL spark-sql> select array_remove(array(1,2,3), 1.23D); [2,3] ``` ```SQL spark-sql> select array_remove(array(1,2,3), 'foo'); NULL ``` We should safely coerce both left and right hand side expressions. ## How was this patch tested? Added tests in DataFrameFunctionsSuite Closes #22542 from dilipbiswal/SPARK-25519. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 7d8f5b62c57c9e2903edd305e8b9c5400652fdb0) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 September 2018, 04:05:37 UTC
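A small sketch, assuming a `spark` session, of how the coerced comparison behaves once both sides are promoted; the result is left to `show()` rather than asserted here:

```scala
// After safe coercion, 1.23D is compared as a double against the promoted
// array elements instead of being downcast to the int 1.
spark.sql("SELECT array_remove(array(1, 2, 3), 1.23D)").show()
```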
e4c03e8 [SPARK-25503][CORE][WEBUI] Total task message in stage page is ambiguous ## What changes were proposed in this pull request? Test steps : 1) bin/spark-shell --conf spark.ui.retainedTasks=10 2) val rdd = sc.parallelize(1 to 1000, 1000) 3) rdd.count The stage page tab in the UI will display 10 tasks, but the display message is wrong. It should be reversed. **Before fix :** ![webui_1](https://user-images.githubusercontent.com/23054875/45917921-8926d800-be9c-11e8-8da5-3998d07e3ccc.jpg) **After fix** ![spark_web_ui2](https://user-images.githubusercontent.com/23054875/45917935-b4112c00-be9c-11e8-9d10-4fcc8e88568f.jpg) ## How was this patch tested? Manually tested Closes #22525 from shahidki31/SparkUI. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 615792da42b3ee3c5f623c869fada17a3aa92884) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 25 September 2018, 03:04:26 UTC
ffc081c [SPARK-25502][CORE][WEBUI] Empty page when the page number exceeds the retainedTasks size. ## What changes were proposed in this pull request? Test steps : 1) bin/spark-shell --conf spark.ui.retainedTasks=200 ``` val rdd = sc.parallelize(1 to 1000, 1000) rdd.count ``` The stage tab in the UI will display 10 pages with 100 tasks per page, but the number of retained tasks is only 200. So, from the 3rd page onwards, nothing is displayed. We have to calculate the total pages based on the number of tasks to be displayed in the UI. **Before fix:** ![empty_4](https://user-images.githubusercontent.com/23054875/45918251-b1650580-bea1-11e8-90d3-7e0d491981a2.jpg) **After fix:** ![empty_3](https://user-images.githubusercontent.com/23054875/45918257-c2ae1200-bea1-11e8-960f-dfbdb4a90ae7.jpg) ## How was this patch tested? Manually tested Closes #22526 from shahidki31/SPARK-25502. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 3ce2e008ec1bf70adc5a4b356e09a469e94af803) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 24 September 2018, 21:18:03 UTC
ec38428 [SPARK-25460][BRANCH-2.4][SS] DataSourceV2: SS sources do not respect SessionConfigSupport ## What changes were proposed in this pull request? This PR proposes to backport SPARK-25460 to branch-2.4: This PR proposes to respect `SessionConfigSupport` in SS datasources as well. Currently these are only respected in batch sources: https://github.com/apache/spark/blob/e06da95cd9423f55cdb154a2778b0bddf7be984c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L198-L203 https://github.com/apache/spark/blob/e06da95cd9423f55cdb154a2778b0bddf7be984c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L244-L249 If a developer makes a datasource V2 that supports both structured streaming and batch jobs, batch jobs respect a specific configuration, let's say, URL to connect and fetch data (which end users might not be aware of); however, structured streaming ends up with not supporting this (and should explicitly be set into options). ## How was this patch tested? Unit tests were added. Closes #22529 from HyukjinKwon/SPARK-25460-backport. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 September 2018, 15:49:19 UTC
51d5378 [SPARK-25416][SQL] ArrayPosition function may return incorrect result when right expression is implicitly down casted ## What changes were proposed in this pull request? In ArrayPosition, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down casting and may return wrong result or questionable result. Example : ```SQL spark-sql> select array_position(array(1), 1.34); 1 ``` ```SQL spark-sql> select array_position(array(1), 'foo'); null ``` We should safely coerce both left and right hand side expressions. ## How was this patch tested? Added tests in DataFrameFunctionsSuite Closes #22407 from dilipbiswal/SPARK-25416. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit bb49661e192eed78a8a306deffd83c73bd4a9eff) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 September 2018, 13:43:34 UTC
13bc58d [SPARK-21318][SQL] Improve exception message thrown by `lookupFunction` ## What changes were proposed in this pull request? The function actually exists in the currently selected database, and it fails to initialize during `lookupFunction`, but the exception message is: ``` This function is neither a registered temporary function nor a permanent function registered in the database 'default'. ``` This makes it hard to locate the real problem. This PR fixes the problem. ## How was this patch tested? new test case + manual tests Closes #18544 from stanzhai/fix-udf-error-message. Authored-by: Stan Zhai <mail@stanzhai.site> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 804515f821086ea685815d3c8eff42d76b7d9e4e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 September 2018, 13:33:38 UTC
36e7c8f [SPARKR] Match pyspark features in SparkR communication protocol 24 September 2018, 11:28:31 UTC
c64e750 [MINOR][PYSPARK] Always Close the tempFile in _serialize_to_jvm ## What changes were proposed in this pull request? Always close the tempFile after `serializer.dump_stream(data, tempFile)` in _serialize_to_jvm ## How was this patch tested? N/A Closes #22523 from gatorsmile/fixMinor. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> 23 September 2018, 02:18:00 UTC
1303eb5 [SPARK-25321][ML] Fix local LDA model constructor ## What changes were proposed in this pull request? change back the constructor to: ``` class LocalLDAModel private[ml] ( uid: String, vocabSize: Int, private[clustering] val oldLocalModel : OldLocalLDAModel, sparkSession: SparkSession) ``` Although it is marked `private[ml]`, it is used in `mleap` and the master change breaks `mleap` building. See mleap code [here](https://github.com/combust/mleap/blob/c7860af328d519cf56441b4a7cd8e6ec9d9fee59/mleap-spark/src/main/scala/org/apache/spark/ml/bundle/ops/clustering/LDAModelOp.scala#L57) ## How was this patch tested? Manual. Closes #22510 from WeichenXu123/LDA_fix. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com> (cherry picked from commit 40edab209bdefe793b59b650099cea026c244484) Signed-off-by: Xiangrui Meng <meng@databricks.com> 21 September 2018, 20:08:11 UTC
138a631 [SPARK-25321][ML] Revert SPARK-14681 to avoid API breaking change ## What changes were proposed in this pull request? Revert SPARK-14681 to avoid API breaking change. PR [SPARK-14681] will break mleap. ## How was this patch tested? N/A Closes #22492 from WeichenXu123/revert_tree_change. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com> 21 September 2018, 20:05:24 UTC
ce66361 [SPARK-19724][SQL] allowCreatingManagedTableUsingNonemptyLocation should have legacy prefix One more legacy config to go ... Closes #22515 from rxin/allowCreatingManagedTableUsingNonemptyLocation. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 4a11209539130c6a075119bf87c5ad854d42978e) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 21 September 2018, 16:46:03 UTC
604828e [SPARK-25469][SQL] Eval methods of Concat, Reverse and ElementAt should use pattern matching only once ## What changes were proposed in this pull request? The PR proposes to avoid usage of pattern matching for each call of ```eval``` method within: - ```Concat``` - ```Reverse``` - ```ElementAt``` ## How was this patch tested? Run the existing tests for ```Concat```, ```Reverse``` and ```ElementAt``` expression classes. Closes #22471 from mn-mikke/SPARK-25470. Authored-by: Marek Novotny <mn.mikke@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 2c9d8f56c71093faf152ca7136c5fcc4a7b2a95f) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 21 September 2018, 09:30:32 UTC
e425462 [SPARK-23549][SQL] Rename config spark.sql.legacy.compareDateTimestampInTimestamp ## What changes were proposed in this pull request? See title. Makes our legacy backward compatibility configs more consistent. ## How was this patch tested? Make sure all references have been updated: ``` > git grep compareDateTimestampInTimestamp docs/sql-programming-guide.md: - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.legacy.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0. sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala: // if conf.compareDateTimestampInTimestamp is true sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala: => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType) sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala: => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else Some(StringType) sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: buildConf("spark.sql.legacy.compareDateTimestampInTimestamp") sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: def compareDateTimestampInTimestamp : Boolean = getConf(COMPARE_DATE_TIMESTAMP_IN_TIMESTAMP) sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala: "spark.sql.legacy.compareDateTimestampInTimestamp" -> convertToTS.toString) { ``` Closes #22508 from rxin/SPARK-23549. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 411ecc365ea62aef7a29d8764e783e6a58dbb1d5) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 September 2018, 06:28:09 UTC
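For reference, a hedged example (assuming a live `spark` session) of using the renamed legacy flag; setting it to false restores the pre-2.4 comparison behavior described in the migration-guide snippet quoted above:

```scala
// Compare DATE and TIMESTAMP as strings (the pre-2.4 behavior) instead of
// promoting both sides to TIMESTAMP.
spark.conf.set("spark.sql.legacy.compareDateTimestampInTimestamp", "false")
```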
aff6aed [SPARK-25384][SQL] Clarify fromJsonForceNullableSchema will be removed in Spark 3.0 See above. This should go into the 2.4 release. Closes #22509 from rxin/SPARK-25384. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit fb3276a54a2b7339e5e0fb62fb01cbefcc330c8b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 September 2018, 06:18:00 UTC
5d74449 Revert "[SPARK-23715][SQL] the input of to/from_utc_timestamp can not have timezone ## What changes were proposed in this pull request? This reverts commit 417ad92502e714da71552f64d0e1257d2fd5d3d0. We decided to keep the current behaviors unchanged and will consider whether we will deprecate these functions in 3.0. For more details, see the discussion in https://issues.apache.org/jira/browse/SPARK-23715 ## How was this patch tested? The existing tests. Closes #22505 from gatorsmile/revertSpark-23715. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5d25e154408f71d24c4829165a16014fdacdd209) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 September 2018, 02:40:13 UTC
51f3659 [SPARK-24777][SQL] Add write benchmark for AVRO ## What changes were proposed in this pull request? Refactor `DataSourceWriteBenchmark` and add write benchmark for AVRO. ## How was this patch tested? Build and run the benchmark. Closes #22451 from gengliangwang/avroWriteBenchmark. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 950ab79957fc0cdc2dafac94765787e87ece9e74) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 21 September 2018, 00:41:37 UTC
43c62e7 [SPARK-24918][CORE] Executor Plugin API ## What changes were proposed in this pull request? A continuation of squito's executor plugin task. By his request I took his code and added testing and moved the plugin initialization to a separate thread. Executor plugins now run on one separate thread, so the executor does not wait on them. Added testing. ## How was this patch tested? Added test cases that test using a sample plugin. Closes #22192 from NiharS/executorPlugin. Lead-authored-by: Nihar Sheth <niharrsheth@gmail.com> Co-authored-by: NiharS <niharrsheth@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 2f51e72356babac703cc20a531b4dcc7712f34af) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 20 September 2018, 18:52:31 UTC
c67c597 [SPARK-25450][SQL] PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation ## What changes were proposed in this pull request? The problem was caused by the PushProjectThroughUnion rule, which, when creating a new Project for each child of Union, uses the same exprId for expressions at the same position. This is wrong because, for each child of Union, the expressions are all independent, and it can lead to a wrong result if other rules like FoldablePropagation kick in, treating two different expressions as the same. This fix is to create new expressions in the new Project for each child of Union. ## How was this patch tested? Added UT. Closes #22447 from maryannxue/push-project-thru-union-bug. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 88446b6ad19371f15d06ef67052f6c1a8072c04a) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 20 September 2018, 17:00:42 UTC
fc03672 [MINOR][PYTHON] Use a helper in `PythonUtils` instead of direct accessing Scala package ## What changes were proposed in this pull request? This PR proposes to add a helper in `PythonUtils` instead of directly accessing the Scala package. ## How was this patch tested? Jenkins tests. Closes #22483 from HyukjinKwon/minor-refactoring. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 88e7e87bd5c052e10f52d4bb97a9d78f5b524128) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 20 September 2018, 16:42:17 UTC
78dd1d8 [SPARK-25417][SQL] ArrayContains function may return incorrect result when right expression is implicitly down casted ## What changes were proposed in this pull request? In ArrayContains, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down casting and may return wrong result or questionable result. Example : ```SQL spark-sql> select array_contains(array(1), 1.34); true ``` ```SQL spark-sql> select array_contains(array(1), 'foo'); null ``` We should safely coerce both left and right hand side expressions. ## How was this patch tested? Added tests in DataFrameFunctionsSuite Closes #22408 from dilipbiswal/SPARK-25417. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 67f2c6a55425d0f38e26caaf7e0b665d978d0a68) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 September 2018, 12:34:04 UTC
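A brief sketch in the same spirit as the commit's SQL examples, assuming a `spark` session; it only demonstrates the call shape, not the suite's assertions:

```scala
// With both sides safely coerced, the double literal is no longer truncated
// to 1 before the membership check.
spark.sql("SELECT array_contains(array(1), 1.34D)").show()
```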
b3bdfd7 Revert [SPARK-19355][SPARK-25352] ## What changes were proposed in this pull request? This goes to revert sequential PRs based on some discussion and comments at https://github.com/apache/spark/pull/16677#issuecomment-422650759. #22344 #22330 #22239 #16677 ## How was this patch tested? Existing tests. Closes #22481 from viirya/revert-SPARK-19355-1. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 89671a27e783d77d4bfaec3d422cc8dd468ef04c) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 September 2018, 12:18:58 UTC
e07042a [MINOR][PYTHON][TEST] Use collect() instead of show() to make the output silent ## What changes were proposed in this pull request? This PR replaces an effective `show()` with `collect()` to make the output silent. **Before:** ``` test_simple_udt_in_df (pyspark.sql.tests.SQLTests) ... +---+----------+ |key| val| +---+----------+ | 0|[0.0, 0.0]| | 1|[1.0, 1.0]| | 2|[2.0, 2.0]| | 0|[3.0, 3.0]| | 1|[4.0, 4.0]| | 2|[5.0, 5.0]| | 0|[6.0, 6.0]| | 1|[7.0, 7.0]| | 2|[8.0, 8.0]| | 0|[9.0, 9.0]| +---+----------+ ``` **After:** ``` test_simple_udt_in_df (pyspark.sql.tests.SQLTests) ... ok ``` ## How was this patch tested? Manually tested. Closes #22479 from HyukjinKwon/minor-udf-test. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 7ff5386ed934190344b2cda1069bde4bc68a3e63) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 20 September 2018, 07:03:34 UTC
dfcff38 [SPARK-4502][SQL] Rename to spark.sql.optimizer.nestedSchemaPruning.enabled ## What changes were proposed in this pull request? This patch adds an "optimizer" prefix to nested schema pruning. ## How was this patch tested? Should be covered by existing tests. Closes #22475 from rxin/SPARK-4502. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 76399d75e23f2c7d6c2a1fb77a4387c5e15c809b) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 20 September 2018, 04:23:49 UTC
06efed2 [SPARK-24341][FOLLOWUP][DOCS] Add migration note for IN subqueries behavior ## What changes were proposed in this pull request? The PR updates the migration guide in order to explain the changes introduced in the behavior of the IN operator with subqueries, in particular, the improved handling of struct attributes in these situations. ## How was this patch tested? NA Closes #22469 from mgaido91/SPARK-24341_followup. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 8aae49afc7997aa1da61029409ef6d8ce0ab256a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 September 2018, 02:10:36 UTC
535bf1c [SPARK-24157][SS][FOLLOWUP] Rename to spark.sql.streaming.noDataMicroBatches.enabled ## What changes were proposed in this pull request? This patch changes the config option `spark.sql.streaming.noDataMicroBatchesEnabled` to `spark.sql.streaming.noDataMicroBatches.enabled` to be more consistent with rest of the configs. Unfortunately there is one streaming config called `spark.sql.streaming.metricsEnabled`. For that one we should just use a fallback config and change it in a separate patch. ## How was this patch tested? Made sure no other references to this config are in the code base: ``` > git grep "noDataMicro" sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: buildConf("spark.sql.streaming.noDataMicroBatches.enabled") ``` Closes #22476 from rxin/SPARK-24157. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: Reynold Xin <rxin@databricks.com> (cherry picked from commit 936c920347e196381b48bc3656ca81a06f2ff46d) Signed-off-by: Reynold Xin <rxin@databricks.com> 20 September 2018, 01:51:31 UTC
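A minimal usage sketch (assuming a `spark` session); only the option name changes, the semantics are the same as the old `noDataMicroBatchesEnabled` key:

```scala
// Turn off the empty micro-batches that are otherwise produced to advance
// event-time watermarks when no data arrives.
spark.conf.set("spark.sql.streaming.noDataMicroBatches.enabled", "false")
```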
99ae693 [SPARK-25471][PYTHON][TEST] Fix pyspark-sql test error when using Python 3.6 and Pandas 0.23 ## What changes were proposed in this pull request? Fix test that constructs a Pandas DataFrame by specifying the column order. Previously this test assumed the columns would be sorted alphabetically, however when using Python 3.6 with Pandas 0.23 or higher, the original column order is maintained. This causes the columns to get mixed up and the test errors. Manually tested with `python/run-tests` using Python 3.6.6 and Pandas 0.23.4 Closes #22477 from BryanCutler/pyspark-tests-py36-pd23-SPARK-25471. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 90e3955f384ca07bdf24faa6cdb60ded944cf0d8) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 20 September 2018, 01:29:49 UTC
a9a8d3a [SPARK-25425][SQL][BACKPORT-2.4] Extra options should override session options in DataSource V2 ## What changes were proposed in this pull request? In the PR, I propose overriding session options by extra options in DataSource V2. Extra options are more specific and set via `.option()`, and should overwrite more generic session options. ## How was this patch tested? Added tests for read and write paths. Closes #22474 from MaxGekk/session-options-2.4. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 September 2018, 23:53:26 UTC
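To make the precedence concrete, a sketch assuming a hypothetical V2 source `com.example.MySource` whose `SessionConfigSupport` key prefix is `mysource`; the class name and keys are illustrative, not part of Spark:

```scala
// Session-level default picked up via SessionConfigSupport
// ("spark.datasource.<prefix>.<key>" is exposed to the source as "url").
spark.conf.set("spark.datasource.mysource.url", "http://default-endpoint")

// The more specific per-read option now wins over the session default.
val df = spark.read
  .format("com.example.MySource")   // hypothetical datasource class
  .option("url", "http://override-endpoint")
  .load()
```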
9031c78 [SPARK-25021][K8S][BACKPORT] Add spark.executor.pyspark.memory limit for K8S ## What changes were proposed in this pull request? Add spark.executor.pyspark.memory limit for K8S [BACKPORT] ## How was this patch tested? Unit and Integration tests Closes #22376 from ifilonenko/SPARK-25021-2.4. Authored-by: Ilan Filonenko <if56@cornell.edu> Signed-off-by: Holden Karau <holden@pigscanfly.ca> 19 September 2018, 22:37:56 UTC
83a75a8 [SPARK-22666][ML][FOLLOW-UP] Improve testcase to tolerate different schema representation ## What changes were proposed in this pull request? Improve the test case "image datasource test: read non image" to tolerate different schema representations, because file:/path and file:///path are both valid URI forms, so in some environments the test case will fail. ## How was this patch tested? Manual. Closes #22449 from WeichenXu123/image_url. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com> (cherry picked from commit 6f681d42964884d19bf22deb614550d712223117) Signed-off-by: Xiangrui Meng <meng@databricks.com> 19 September 2018, 22:16:30 UTC
9fefb47 Revert "[SPARK-23173][SQL] rename spark.sql.fromJsonForceNullableSchema" This reverts commit 6c7db7fd1ced1d143b1389d09990a620fc16be46. (cherry picked from commit cb1b55cf771018f1560f6b173cdd7c6ca8061bc7) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 September 2018, 21:38:21 UTC
538ae62 [SPARK-25445][BUILD][FOLLOWUP] Resolve issues in release-build.sh for publishing scala-2.12 build ## What changes were proposed in this pull request? This is a follow up for #22441. 1. Remove flag "-Pkafka-0-8" for Scala 2.12 build. 2. Clean up the script, simpler logic. 3. Switch to Scala version to 2.11 before script exit. ## How was this patch tested? Manual test. Closes #22454 from gengliangwang/revise_release_build. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 5534a3a58e4025624fbad527dd129acb8025f25a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 September 2018, 10:31:09 UTC
f11f445 [SPARK-24626] Add statistics prefix to parallelFileListingInStatsComputation ## What changes were proposed in this pull request? To be more consistent with other statistics based configs. ## How was this patch tested? N/A - straightforward rename of config option. Used `git grep` to make sure there are no mention of it. Closes #22457 from rxin/SPARK-24626. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 4193c7623b92765adaee539e723328ddc9048c09) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 19 September 2018, 05:41:39 UTC
00ede12 [SPARK-23173][SQL] rename spark.sql.fromJsonForceNullableSchema ## What changes were proposed in this pull request? `spark.sql.fromJsonForceNullableSchema` -> `spark.sql.function.fromJson.forceNullable` ## How was this patch tested? Made sure there are no more references to `spark.sql.fromJsonForceNullableSchema`. Closes #22459 from rxin/SPARK-23173. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 6c7db7fd1ced1d143b1389d09990a620fc16be46) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 19 September 2018, 05:39:43 UTC
ba8560a [SPARK-23200] Reset Kubernetes-specific config on Checkpoint restore Several configuration parameters related to Kubernetes need to be reset, as they are changed with each invocation of spark-submit and thus prevent recovery of Spark Streaming tasks. ## What changes were proposed in this pull request? When using the Kubernetes cluster-manager and spawning a Streaming workload, it is important to reset many spark.kubernetes.* properties that are generated by spark-submit but which would get rewritten when restoring a Checkpoint. This is because the spark-submit codepath creates Kubernetes resources, such as a ConfigMap, a Secret and other variables, which have an autogenerated name and the previous one will not resolve anymore. In short, this change enables checkpoint restoration for streaming workloads, and thus enables Spark Streaming workloads in Kubernetes, which were not possible to restore from a checkpoint before if the workload went down. ## How was this patch tested? This patch would benefit from testing in different k8s clusters. This is similar to the YARN related code for resetting a Spark Streaming workload, but for the Kubernetes scheduler. This PR removes the initcontainers properties that existed before because they are now removed in master. For a previous discussion, see the non-rebased work at: apache-spark-on-k8s#516 Closes #22392 from ssaavedra/fix-checkpointing-master. Authored-by: Santiago Saavedra <santiagosaavedra@gmail.com> Signed-off-by: Yinan Li <ynli@google.com> (cherry picked from commit 497f00f62b3ddd1f40507fdfe10f30cd9effb6cf) Signed-off-by: Yinan Li <ynli@google.com> 19 September 2018, 05:09:10 UTC
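For context, a sketch of the standard DStream checkpoint-recovery pattern that this change makes viable on Kubernetes; the checkpoint path and batch interval are assumptions, and the spark.kubernetes.* properties themselves are reset inside Spark, not in user code:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // assumed location

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-stream")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream graph here ...
  ssc
}

// On driver restart, the context is rebuilt from the checkpoint; with this fix
// the regenerated Kubernetes properties no longer break the recovery.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```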
76514a0 [SPARK-25456][SQL][TEST] Fix PythonForeachWriterSuite PythonForeachWriterSuite was failing because RowQueue now needs to have a handle on a SparkEnv with a SerializerManager, so added a mock env with a serializer manager. Also fixed a typo in the `finally` that was hiding the real exception. Tested PythonForeachWriterSuite locally, full tests via jenkins. Closes #22452 from squito/SPARK-25456. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> (cherry picked from commit a6f37b0742d87d5c8ee3e134999d665e5719e822) Signed-off-by: Imran Rashid <irashid@cloudera.com> 18 September 2018, 21:33:49 UTC
67f2cb6 [SPARK-25291][K8S] Fixing Flakiness of Executor Pod tests ## What changes were proposed in this pull request? Added fix to flakiness that was present in PySpark tests w.r.t Executors not being tested. Important fix to executorConf which was failing tests when executors *were* tested ## How was this patch tested? Unit and Integration tests Closes #22415 from ifilonenko/SPARK-25291. Authored-by: Ilan Filonenko <if56@cornell.edu> Signed-off-by: Yinan Li <ynli@google.com> (cherry picked from commit 123f0041d534f28e14343aafb4e5cec19dde14ad) Signed-off-by: Yinan Li <ynli@google.com> 18 September 2018, 18:44:06 UTC
8a2992e [SPARK-19550][DOC][FOLLOW-UP] Update tuning.md to use JDK8 ## What changes were proposed in this pull request? Update `tuning.md` and `rdd-programming-guide.md` to use JDK8. ## How was this patch tested? manual tests Closes #22446 from wangyum/java8. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 182da81e9e75ac1658a39014beb90e60495bf544) Signed-off-by: Sean Owen <sean.owen@databricks.com> 18 September 2018, 15:39:04 UTC
cc3fbea [SPARK-25445][BUILD] the release script should be able to publish a scala-2.12 build ## What changes were proposed in this pull request? update the package and publish steps, to support scala 2.12 ## How was this patch tested? manual test Closes #22441 from cloud-fan/scala. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1c0423b28705eb96237c0cb4e90f49305c64a997) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 September 2018, 14:29:26 UTC
ffd448b [SPARK-24151][SQL] Case insensitive resolution of CURRENT_DATE and CURRENT_TIMESTAMP ## What changes were proposed in this pull request? SPARK-22333 introduced a regression in the resolution of `CURRENT_DATE` and `CURRENT_TIMESTAMP`. Before that ticket, these 2 functions were resolved in a case insensitive way. After, this depends on the value of `spark.sql.caseSensitive`. The PR restores the previous behavior and makes their resolution case insensitive anyhow. The PR takes over #21217, therefore it closes #21217 and credit for this patch should be given to jamesthomp. ## How was this patch tested? added UT Closes #22440 from mgaido91/SPARK-24151. Lead-authored-by: James Thompson <jamesthomp@users.noreply.github.com> Co-authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit ba838fee001d553e30d2205337c1fa5ccbd57caf) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 18 September 2018, 06:19:19 UTC
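A short sketch of the restored behavior, assuming a `spark` session with case-sensitive analysis turned on; both spellings should resolve either way after this change:

```scala
spark.conf.set("spark.sql.caseSensitive", "true")
// Mixed-case references to the niladic functions still resolve.
spark.sql("SELECT current_date, CURRENT_TIMESTAMP").show()
```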
c403751 [SPARK-25443][BUILD] fix issues when building docs with release scripts in docker ## What changes were proposed in this pull request? These 2 changes are required to build the docs for Spark 2.4.0 RC1: 1. install `mkdocs` in the docker image 2. set locale to C.UTF-8. Otherwise jekyll fails to build the doc. ## How was this patch tested? tested manually when doing the 2.4.0 RC1 Closes #22438 from cloud-fan/infra. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 0f1413e320bbf9804dac1b00d56f30bc20dc36a6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 September 2018, 02:10:43 UTC
7beb341 [CORE] Updates to remote cache reads Covered by tests in DistributedSuite (cherry picked from commit a97001d21757ae214c86371141bd78a376200f66) 17 September 2018, 20:28:40 UTC
80e317b [PYSPARK][SQL] Updates to RowQueue Tested with updates to RowQueueSuite (cherry picked from commit 8f5a5a9e5b9f273443b2721f80c99dc7397ef4c0) 17 September 2018, 20:28:36 UTC
963af13 [PYSPARK] Updates to pyspark broadcast (cherry picked from commit 58419b92673c46911c25bc6c6b13397f880c6424) 17 September 2018, 20:28:31 UTC
08f7b14 [SPARK-24654][BUILD][FOLLOWUP] Update, fix LICENSE and NOTICE, and specialize for source vs binary ## What changes were proposed in this pull request? Fix location of licenses-binary in binary release, and remove binary items from source release ## How was this patch tested? N/A Closes #22436 from srowen/SPARK-24654.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 30aa37fca45ec0ad4f30076bc855d1a201cfc097) Signed-off-by: Sean Owen <sean.owen@databricks.com> 17 September 2018, 13:54:53 UTC
d05596e [SPARK-25431][SQL][EXAMPLES] Fix function examples and the example results. ## What changes were proposed in this pull request? There are some mistakes in examples of newly added functions. Also the format of the example results are not unified. We should fix them. ## How was this patch tested? Manually executed the examples. Closes #22437 from ueshin/issues/SPARK-25431/fix_examples_2. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 8cf6fd1c2342949916fedb5a7f712177b22585fa) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 17 September 2018, 12:40:56 UTC
56f7068 [SPARK-25427][SQL][TEST] Add BloomFilter creation test cases ## What changes were proposed in this pull request? Spark supports BloomFilter creation for ORC files. This PR aims to add test coverages to prevent accidental regressions like [SPARK-12417](https://issues.apache.org/jira/browse/SPARK-12417). ## How was this patch tested? Pass the Jenkins with newly added test cases. Closes #22418 from dongjoon-hyun/SPARK-25427. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 0dd61ec47df7078fd4f77d8c58ecf26c630c700e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 September 2018, 11:34:22 UTC
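A hedged sketch of the feature being covered, assuming a `spark` session and that the standard ORC writer properties (`orc.bloom.filter.columns`, `orc.bloom.filter.fpp`) are passed through as data source options; the output path is illustrative:

```scala
// Ask the native ORC writer to build bloom filters for the "id" column.
spark.range(1000).toDF("id")
  .write
  .option("orc.bloom.filter.columns", "id")
  .option("orc.bloom.filter.fpp", "0.05")
  .orc("/tmp/orc_with_bloom_filters")
```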
43c9b10 [SPARK-23425][SQL][FOLLOWUP] Support wildcards in HDFS path for loadtable command. What changes were proposed in this pull request Updated the Migration guide for the behavior changes done in the JIRA issue SPARK-23425. How was this patch tested? Manually verified. Closes #22396 from sujith71955/master_newtest. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 619c949019feccd3fc2c9e58a841c655d05216f3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 September 2018, 11:23:08 UTC
e368efc [SPARK-24685][BUILD][FOLLOWUP] Fix the nonexistent profile name in release script ## What changes were proposed in this pull request? The `without-hadoop` profile doesn't exist in Maven; the name should be `hadoop-provided`. This is a regression introduced by SPARK-24685, so this PR fixes it. ## How was this patch tested? Local test. Closes #22434 from jerryshao/SPARK-24685-followup. Authored-by: jerryshao <sshao@hortonworks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit b66e14dc96011a83f5ea0df8708ecb02a154ed1d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 September 2018, 07:21:41 UTC
fb1539a [SPARK-22713][CORE][TEST][FOLLOWUP] Fix flaky ExternalAppendOnlyMapSuite due to timeout ## What changes were proposed in this pull request? SPARK-22713 uses [`eventually` with the default timeout `150ms`](https://github.com/apache/spark/pull/21369/files#diff-5bbb6a931b7e4d6a31e4938f51935682R462). It causes flakiness because it's executed once when GC is slow. ```scala eventually { System.gc() ... } ``` **Failures** ```scala org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 1 times over 501.22261 milliseconds. Last failure message: tmpIsNull was false. ``` - master-test-sbt-hadoop-2.7 [4916](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4916) [4907](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4907) [4906](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4906) - spark-master-test-sbt-hadoop-2.6 [4979](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4979) [4974](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4974) [4967](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4967) [4966](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4966) ## How was this patch tested? Pass the Jenkins. Closes #22432 from dongjoon-hyun/SPARK-22713. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 538e0478783160d8fab2dc76fd8fc7b469cb4e19) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 September 2018, 03:08:46 UTC
1cb1e43 [MINOR][DOCS] Axe deprecated doc refs Continuation of #22370. Summary of discussion there: There is some inconsistency in the R manual w.r.t. supercedent functions linking back to deprecated functions. - `createOrReplaceTempView` and `createTable` both link back to functions which are deprecated (`registerTempTable` and `createExternalTable`, respectively) - `sparkR.session` and `dropTempView` do _not_ link back to deprecated functions This PR takes the view that it is preferable _not_ to link back to deprecated functions, and removes these references from `?createOrReplaceTempView` and `?createTable`. As `registerTempTable` was included in the `SparkDataFrame functions` `family` of functions, other documentation pages which included a link to `?registerTempTable` will similarly be altered. Author: Michael Chirico <michael.chirico@grabtaxi.com> Author: Michael Chirico <michaelchirico4@gmail.com> Closes #22393 from MichaelChirico/axe_deprecated_doc_refs. (cherry picked from commit a1dd78255a3ae023820b2f245cd39f0c57a32fb1) Signed-off-by: Felix Cheung <felixcheung@apache.org> 16 September 2018, 19:58:04 UTC
60af706 [SPARK-24418][FOLLOWUP][DOC] Update docs to show Scala 2.11.12 ## What changes were proposed in this pull request? SPARK-24418 upgrades Scala to 2.11.12. This PR updates Scala version in docs. - https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications (screenshot) ![screen1](https://user-images.githubusercontent.com/9700541/45590509-9c5f0400-b8ee-11e8-9293-e48d297db894.png) - https://spark.apache.org/docs/latest/rdd-programming-guide.html#working-with-key-value-pairs (Scala, Java) (These are hyperlink updates) - https://spark.apache.org/docs/latest/streaming-flume-integration.html#configuring-flume-1 (screenshot) ![screen2](https://user-images.githubusercontent.com/9700541/45590511-a123b800-b8ee-11e8-97a5-b7f2288229c2.png) ## How was this patch tested? Manual. ```bash $ cd docs $ SKIP_API=1 jekyll build ``` Closes #22431 from dongjoon-hyun/SPARK-24418. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit bfcf7426057a964b3cee90089aab6c003addc4fb) Signed-off-by: DB Tsai <d_tsai@apple.com> 16 September 2018, 04:14:34 UTC
b839721 [SPARK-25439][TESTS][SQL] Fixes TPCHQuerySuite datatype of customer.c_nationkey to BIGINT according to spec ## What changes were proposed in this pull request? Fixes TPCH DDL datatype of `customer.c_nationkey` from `STRING` to `BIGINT` according to spec and `nation.nationkey` in `TPCHQuerySuite.scala`. The rest of the keys are OK. Note, this will lead to **non-comparable previous results** to new runs involving the customer table. ## How was this patch tested? Manual tests Author: npoggi <npmnpm@gmail.com> Closes #22430 from npoggi/SPARK-25439_Fix-TPCH-customer-c_nationkey. (cherry picked from commit 02c2963f895b9d78d7f6d9972cacec4ef55fa278) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 16 September 2018, 03:06:26 UTC
b40e5fe [SPARK-25438][SQL][TEST] Fix FilterPushdownBenchmark to use the same memory assumption ## What changes were proposed in this pull request? This PR aims to fix three things in `FilterPushdownBenchmark`. **1. Use the same memory assumption.** The following configurations are used in ORC and Parquet. - Memory buffer for writing - parquet.block.size (default: 128MB) - orc.stripe.size (default: 64MB) - Compression chunk size - parquet.page.size (default: 1MB) - orc.compress.size (default: 256KB) SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`. But, it missed to match `orc.compress.size`. So, the current benchmark shows the result from ORC with 256KB memory for compression and Parquet with 1MB. To compare correctly, we need to be consistent. **2. Dictionary encoding should not be enforced for all cases.** SPARK-24206 enforced dictionary encoding for all test cases. This PR recovers the default behavior in general and enforces dictionary encoding only in case of `prepareStringDictTable`. **3. Generate test result on AWS r3.xlarge** SPARK-24206 generated the result on AWS in order to reproduce and compare easily. This PR also aims to update the result on the same machine again in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. ## How was this patch tested? Manual. Enable the test cases and run `FilterPushdownBenchmark` on `AWS r3.xlarge`. It takes about 4 hours 15 minutes. Closes #22427 from dongjoon-hyun/SPARK-25438. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit fefaa3c30df2c56046370081cb51bfe68d26976b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 16 September 2018, 00:48:53 UTC
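A rough sketch, assuming a `spark` session, of aligning the writer-side memory settings named in the commit so the two formats are compared under the same assumptions; the DataFrame, paths, and the use of plain writer options to carry these keys are assumptions here, with the 1MB values mirroring the benchmark's choice:

```scala
val df = spark.range(0, 10000000).toDF("value")  // placeholder data set

// Parquet: 1MB row-group buffer (parquet.block.size defaults to 128MB).
df.write
  .option("parquet.block.size", (1024 * 1024).toString)
  .parquet("/tmp/bench_parquet")

// ORC: match both the stripe buffer and the compression chunk size.
df.write
  .option("orc.stripe.size", (1024 * 1024).toString)
  .option("orc.compress.size", (1024 * 1024).toString)
  .orc("/tmp/bench_orc")
```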
ae2ca0e Revert "[SPARK-25431][SQL][EXAMPLES] Fix function examples and unify the format of the example results." This reverts commit 59054fa89b1f39f0d5d83cfe0b531ec39517f8fe. 15 September 2018, 03:51:46 UTC
d3f5475 [SPARK-25238][PYTHON] lint-python: Upgrade pycodestyle to v2.4.0 See https://pycodestyle.readthedocs.io/en/latest/developer.html#changes for changes made in this release. ## What changes were proposed in this pull request? Upgrade pycodestyle to v2.4.0 ## How was this patch tested? __pycodestyle__ Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #22231 from cclauss/patch-1. Authored-by: cclauss <cclauss@bluewin.ch> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 9bb798f2e6eefd9edb7b6d9980a894557c107bd3) Signed-off-by: Sean Owen <sean.owen@databricks.com> 15 September 2018, 01:13:16 UTC
59054fa [SPARK-25431][SQL][EXAMPLES] Fix function examples and unify the format of the example results. ## What changes were proposed in this pull request? There are some mistakes in examples of newly added functions. Also the format of the example results are not unified. We should fix and unify them. ## How was this patch tested? Manually executed the examples. Closes #22421 from ueshin/issues/SPARK-25431/fix_examples. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 9c25d7f735ed8c49c795babea3fda3cab226e7cb) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 14 September 2018, 16:25:43 UTC
8cdf7f4 Preparing development version 2.4.1-SNAPSHOT 14 September 2018, 12:41:28 UTC
1220ab8 Preparing Spark release v2.4.0-rc1 14 September 2018, 12:39:46 UTC
9273be0 [SPARK-25400][CORE][TEST] Increase test timeouts We've seen some flakiness in jenkins in SchedulerIntegrationSuite which looks like it just needs a longer timeout. Closes #22385 from squito/SPARK-25400. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 9deddbb13edebfefb3fd03f063679ed12e73c575) Signed-off-by: Sean Owen <sean.owen@databricks.com> 13 September 2018, 19:12:04 UTC
35a84ba [SPARK-25406][SQL] For ParquetSchemaPruningSuite.scala, move calls to `withSQLConf` inside calls to `test` (Link to Jira: https://issues.apache.org/jira/browse/SPARK-25406) ## What changes were proposed in this pull request? The current use of `withSQLConf` in `ParquetSchemaPruningSuite.scala` is incorrect. The desired configuration settings are not being set when running the test cases. This PR fixes that defective usage and addresses the test failures that were previously masked by that defect. ## How was this patch tested? I added code to relevant test cases to print the expected SQL configuration settings and found that the settings were not being set as expected. When I changed the order of calls to `test` and `withSQLConf` I found that the configuration settings were being set as expected. Closes #22394 from mallman/spark-25406-fix_broken_schema_pruning_tests. Authored-by: Michael Allman <msa@allman.ms> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit a7e5aa6cd430d0a49bb6dac92c007fab189db3a3) Signed-off-by: DB Tsai <d_tsai@apple.com> 13 September 2018, 17:09:01 UTC
cc19f42 [SPARK-25170][DOC] Add list and short description of Spark Executor Task Metrics to the documentation. ## What changes were proposed in this pull request? Add description of Executor Task Metrics to the documentation. Closes #22397 from LucaCanali/docMonitoringTaskMetrics. Authored-by: LucaCanali <luca.canali@cern.ch> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit 45c4ebc8171d75fc0d169bb8071a4c43263d283e) Signed-off-by: Sean Owen <sean.owen@databricks.com> 13 September 2018, 15:19:31 UTC
e7f511a [SPARK-25352][SQL][FOLLOWUP] Add helper method and address style issue ## What changes were proposed in this pull request? This follow-up patch addresses [the review comment](https://github.com/apache/spark/pull/22344/files#r217070658) by adding a helper method to simplify code and fixing style issue. ## How was this patch tested? Existing unit tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #22409 from viirya/SPARK-25352-followup. (cherry picked from commit 5b761c537a600115450b53817bee0679d5c2bb97) Signed-off-by: Herman van Hovell <hvanhovell@databricks.com> 13 September 2018, 12:21:17 UTC
abb5196 [SPARK-25295][K8S] Fix executor names collision ## What changes were proposed in this pull request? Fixes the collision issue with Spark executor names in client mode; see SPARK-25295 for the details. It follows the cluster-mode naming convention: the app name is used as the prefix, and if that is not defined we use "spark" as the default prefix. E.g. `spark-pi-1536781360723-exec-1`, where spark-pi is the name of the app passed on the config side, or transformed if it contains illegal characters. Also fixes the issue with the Spark app name having spaces in cluster mode. If you run the Spark Pi test in client mode it passes. The tricky part is that the user may set the app name: https://github.com/apache/spark/blob/3030b82c89d3e45a2e361c469fbc667a1e43b854/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala#L30 If I do: ``` ./bin/spark-submit ... --deploy-mode cluster --name "spark pi" ... ``` it will fail, as the app name is used as the prefix of the driver's pod name and it cannot have spaces (according to k8s conventions). ## How was this patch tested? Manually, by running a Spark job in client mode. To reproduce: ``` kubectl create -f service.yaml kubectl create -f pod.yaml ``` service.yaml: ``` kind: Service apiVersion: v1 metadata: name: spark-test-app-1-svc spec: clusterIP: None selector: spark-app-selector: spark-test-app-1 ports: - protocol: TCP name: driver-port port: 7077 targetPort: 7077 - protocol: TCP name: block-manager port: 10000 targetPort: 10000 ``` pod.yaml: ``` apiVersion: v1 kind: Pod metadata: name: spark-test-app-1 labels: spark-app-selector: spark-test-app-1 spec: containers: - name: spark-test image: skonto/spark:k8s-client-fix imagePullPolicy: Always command: - 'sh' - '-c' - "/opt/spark/bin/spark-submit --verbose --master k8s://https://kubernetes.default.svc --deploy-mode client --class org.apache.spark.examples.SparkPi --conf spark.app.name=spark --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=skonto/spark:k8s-client-fix --conf spark.kubernetes.container.image.pullPolicy=Always --conf spark.kubernetes.authenticate.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token --conf spark.kubernetes.authenticate.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt --conf spark.executor.memory=500m --conf spark.executor.cores=1 --conf spark.executor.instances=1 --conf spark.driver.host=spark-test-app-1-svc.default.svc --conf spark.driver.port=7077 --conf spark.driver.blockManager.port=10000 local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar 1000000" ``` Closes #22405 from skonto/fix-k8s-client-mode-executor-names. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: Yinan Li <ynli@google.com> (cherry picked from commit 3e75a9fa24f8629d068b5fbbc7356ce2603fa58d) Signed-off-by: Yinan Li <ynli@google.com> 13 September 2018, 05:04:17 UTC
ae5c7bb [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4 (This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231) ## What changes were proposed in this pull request? Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying on Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines. ## How was this patch tested? Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure. Closes #22400 from srowen/SPARK-25238.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit 08c76b5d39127ae207d9d1fff99c2551e6ce2581) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 13 September 2018, 03:20:27 UTC
6f4d647 [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump more information like file path to event log ## What changes were proposed in this pull request? The metadata field was removed from SparkPlanInfo in #18600. Correspondingly, much metadata was also removed from the SparkListenerSQLExecutionStart event in the Spark event log. As a result, if we want to analyze the event log to get all input paths, we can't: the simpleString of SparkPlanInfo in the JSON only displays 100 characters, which doesn't help. Before 2.3, the fragment of SparkListenerSQLExecutionStart in the event log looked like below (it contains the metadata field, which holds the intact information): >{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4..., "metadata": {"Location": "InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4/test5/snapshot/dt=20180904]","ReadSchema":"struct<snpsht_start_dt:date,snpsht_end_dt:date,am_ntlogin_name:string,am_first_name:string,am_last_name:string,isg_name:string,CRE_DATE:date,CRE_USER:string,UPD_DATE:timestamp,UPD_USER:string>"} After #18600, the metadata field was removed: >{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4..., So this PR adds the field back to the SparkPlanInfo class, so that the metadata is logged to the event log again. Intact information in the event log is very useful for offline job analysis. ## How was this patch tested? Unit test Closes #22353 from LantaoJin/SPARK-25357. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6dc5921e66d56885b95c07e56e687f9f6c1eaca7) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 September 2018, 01:57:56 UTC
776dc42 [SPARK-25387][SQL] Fix for NPE caused by bad CSV input ## What changes were proposed in this pull request? The PR fixes an NPE in `UnivocityParser` caused by malformed CSV input. In some cases, the `uniVocity` parser can return `null` for bad input. In the PR, I propose to check the result of parsing and not propagate the NPE to upper layers. ## How was this patch tested? I added a test which reproduces the issue and ran it via `CSVSuite`. Closes #22374 from MaxGekk/npe-on-bad-csv. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 083c9447671719e0bd67312e3d572f6160c06a4a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 September 2018, 01:52:12 UTC
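A hedged sketch of the scenario (made-up path, schema, and data; not the regression test itself): with an explicit schema and the default PERMISSIVE mode, a row the parser cannot handle should come back as nulls rather than triggering an NPE inside the parser.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Assume /tmp/bad.csv contains a malformed row (stray quotes, truncated fields, etc.).
val schema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)

spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE") // the default: keep malformed rows as nulls instead of failing
  .csv("/tmp/bad.csv")
  .show()
```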
71f7013 [SPARK-23820][CORE] Enable use of long form of callsite in logs This is a rework of #21433 to address some concerns there. Closes #22398 from michaelmior/long-callsite2. Authored-by: Michael Mior <mmior@uwaterloo.ca> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ab25c967905ca0973fc2f30b8523246bb9244206) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 September 2018, 01:46:06 UTC
15d2e9d [SPARK-24882][SQL] Revert [] improve data source v2 API from branch 2.4 ## What changes were proposed in this pull request? As discussed on the dev list, we don't want to include https://github.com/apache/spark/pull/22009 in Spark 2.4, as it requires data source v2 users to change their implementations extensively, and they would need to change them again in the next release. ## How was this patch tested? Existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #22388 from cloud-fan/revert. 12 September 2018, 18:25:24 UTC
4c1428f [SPARK-25363][SQL] Fix schema pruning in where clause by ignoring unnecessary root fields ## What changes were proposed in this pull request? Schema pruning doesn't work if a nested column is used in the where clause. For example, ``` sql("select name.first from contacts where name.first = 'David'") == Physical Plan == *(1) Project [name#19.first AS first#40] +- *(1) Filter (isnotnull(name#19) && (name#19.first = David)) +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:struct<first:string,middle:string,last:string>> ``` In the above query plan, the scan node reads the entire schema of the `name` column. This issue was reported at: https://github.com/apache/spark/pull/21320#issuecomment-419290197 The cause is that we infer a root field from the expression `IsNotNull(name)`. However, for such an expression, we don't really use the nested fields of this root field, so we can ignore the unnecessary nested fields. ## How was this patch tested? Unit tests. Closes #22357 from viirya/SPARK-25363. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit 3030b82c89d3e45a2e361c469fbc667a1e43b854) Signed-off-by: DB Tsai <d_tsai@apple.com> 12 September 2018, 17:44:02 UTC
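A minimal way to observe the behavior described above, reusing the table and column names from the example (the flag is the nested schema pruning switch, which is off by default in this release line):

```scala
// Enable nested schema pruning, then inspect ReadSchema in the plan for the filter query.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
spark.sql("select name.first from contacts where name.first = 'David'").explain()
// With this fix, ReadSchema should shrink to struct<name:struct<first:string>>.
```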
071babb [SPARK-25352][SQL] Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold ## What changes were proposed in this pull request? We have an optimization on global limit that evenly distributes limit rows across all partitions. This optimization doesn't work for ordered results. A query ending with sort + limit is, in most cases, performed by `TakeOrderedAndProjectExec`. But if the limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, a global limit will be used. In that case, we need to perform an ordered global limit. ## How was this patch tested? Unit tests. Closes #22344 from viirya/SPARK-25352. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 2f422398b524eacc89ab58e423bb134ae3ca3941) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 September 2018, 14:54:39 UTC
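An illustrative sketch of when the ordered-global-limit path (rather than `TakeOrderedAndProjectExec`) is taken: a sort followed by a limit that exceeds `spark.sql.execution.topKSortFallbackThreshold`. The threshold and row counts below are arbitrary, and a SparkSession named `spark` is assumed.

```scala
import spark.implicits._

// Lower the top-K fallback threshold so a modest limit already exceeds it.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "100")

spark.range(0, 10000).toDF("id")
  .orderBy($"id".desc)
  .limit(1000)   // 1000 > 100, so planning falls back to sort + global limit
  .explain()
```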
40e4db0 [SPARK-25402][SQL] Null handling in BooleanSimplification ## What changes were proposed in this pull request? This PR is to fix the null handling in BooleanSimplification. In the rule BooleanSimplification, there are two cases that do not properly handle null values. The optimization is not right if either side is null. This PR is to fix them. ## How was this patch tested? Added test cases Closes #22390 from gatorsmile/fixBooleanSimplification. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 79cc59718fdf7785bdc37a26bb8df4c6151114a6) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 September 2018, 13:14:43 UTC
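For background, a hedged illustration of why null handling matters here (this is the general three-valued-logic hazard, not the exact rewrites the patch corrects; see the PR for those): a boolean expression over a nullable column cannot always be folded the way two-valued logic suggests.

```scala
// When b is NULL, `b AND NOT b` evaluates to NULL, not false, so an optimizer rule
// must not rewrite it to a false literal for nullable inputs.
spark.sql(
  "SELECT b, b AND NOT b AS folded FROM VALUES (true), (false), (CAST(NULL AS BOOLEAN)) AS t(b)"
).show()
// expected `folded` column: false, false, null
```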
0dbf145 [SPARK-25399][SS] Continuous processing state should not affect microbatch execution jobs ## What changes were proposed in this pull request? The leftover state from running a continuous processing streaming job should not affect later microbatch execution jobs. If a continuous processing job runs and the same thread gets reused for a microbatch execution job in the same environment, the microbatch job could get wrong answers because it can attempt to load the wrong version of the state. ## How was this patch tested? New and existing unit tests Closes #22386 from mukulmurthy/25399-streamthread. Authored-by: Mukul Murthy <mukul.murthy@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> (cherry picked from commit 9f5c5b4cca7d4eaa30a3f8adb4cb1eebe3f77c7a) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com> 11 September 2018, 22:53:25 UTC
3a6ef8b Revert "[SPARK-23820][CORE] Enable use of long form of callsite in logs" This reverts commit e58dadb77ed6cac3e1b2a037a6449e5a6e7f2cec. 11 September 2018, 19:55:11 UTC
99b37a9 [SPARK-25398] Minor bugs from comparing unrelated types ## What changes were proposed in this pull request? Correct some comparisons between unrelated types to what they seem to have been trying to do ## How was this patch tested? Existing tests. Closes #22384 from srowen/SPARK-25398. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> (cherry picked from commit cfbdd6a1f5906b848c520d3365cc4034992215d9) Signed-off-by: Sean Owen <sean.owen@databricks.com> 11 September 2018, 19:46:19 UTC
16127e8 [SPARK-24889][CORE] Update block info when unpersist rdds ## What changes were proposed in this pull request? We update block info coming from executors at certain points, such as when caching an RDD. However, when removing RDDs by unpersisting them, we don't ask to update the block info, so it becomes stale. We can fix this with a few options: 1. Ask to update block info when unpersisting. This is the simplest option, but it changes driver-executor communication a bit. 2. Update block info when processing the unpersist event. We send a `SparkListenerUnpersistRDD` event when unpersisting an RDD; when processing this event, we can update the block info of the RDD. This only changes event-processing code, so the risk seems lower. Currently this patch takes option 2 for lower risk. If we agree the first option carries no risk, we can change to it. ## How was this patch tested? Unit tests. Closes #22341 from viirya/SPARK-24889. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 14f3ad20932535fe952428bf255e7eddd8fa1b58) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 11 September 2018, 17:31:17 UTC
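A minimal sketch of the scenario (not the unit test itself, and assuming a SparkSession named `spark`): cache an RDD, then unpersist it; after this change the storage information exposed through the listener/status store should reflect the removal.

```scala
val rdd = spark.sparkContext.parallelize(1 to 1000).persist()
rdd.count()          // materializes the cache; executors report the block info
rdd.unpersist(true)  // blocking unpersist; posts SparkListenerUnpersistRDD
```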
4414e02 [SPARK-25221][DEPLOY] Consistent trailing whitespace treatment of conf values ## What changes were proposed in this pull request? Stop trimming values of properties loaded from a file ## How was this patch tested? Added unit test demonstrating the issue hit in production. Closes #22213 from gerashegalov/gera/SPARK-25221. Authored-by: Gera Shegalov <gera@apache.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit bcb9a8c83f4e6835af5dc51f1be7f964b8fa49a3) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 11 September 2018, 16:28:43 UTC
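A hedged sketch of why a trailing space in a conf value can be significant (the key and value are illustrative; the production case is described in the PR): a delimiter-style setting whose value deliberately ends in whitespace would previously be silently trimmed when loaded from a properties file such as spark-defaults.conf.

```scala
// Properties-file form (illustrative):
//   spark.hadoop.textinputformat.record.delimiter=|end|<trailing space>
// Programmatic equivalent, where the trailing space is clearly intentional:
val conf = new org.apache.spark.SparkConf()
  .set("spark.hadoop.textinputformat.record.delimiter", "|end| ") // note the trailing space
```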
0b8bfbe [SPARK-25389][SQL] INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields ## What changes were proposed in this pull request? Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY STORED AS` should not generate files with duplicate fields because Spark cannot read those files back. **INSERT OVERWRITE DIRECTORY USING** ```scala scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet SELECT 'id', 'id2' id") ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ... org.apache.spark.sql.AnalysisException: Found duplicate column(s) when inserting into file:/tmp/parquet: `id`; ``` **INSERT OVERWRITE DIRECTORY STORED AS** ```scala scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS parquet SELECT 'id', 'id2' id") // It generates corrupted files scala> spark.read.parquet("/tmp/parquet").show 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data schema and the partition schema: `id`; ``` ## How was this patch tested? Pass the Jenkins with newly added test cases. Closes #22378 from dongjoon-hyun/SPARK-25389. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 77579aa8c35b0d98bbeac3c828bf68a1d190d13e) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 September 2018, 15:58:27 UTC
b7efca7 [SPARK-17916][SPARK-25241][SQL][FOLLOW-UP] Fix empty string being parsed as null when nullValue is set. ## What changes were proposed in this pull request? In the PR, I propose a new CSV option `emptyValue` and an update to the SQL Migration Guide which describes how to revert to the previous behavior, in which empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s. Closes #22234 Closes #22367 ## How was this patch tested? It was tested by `CSVSuite` and the new tests added in PR #22234 Closes #22389 from MaxGekk/csv-empty-value-master. Lead-authored-by: Mario Molina <mmolimar@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org> (cherry picked from commit c9cb393dc414ae98093c1541d09fa3c8663ce276) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 11 September 2018, 12:47:30 UTC
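A small sketch of the round trip described above (path and data are made up, and a SparkSession named `spark` is assumed): since this change an empty string survives a CSV write/read cycle instead of collapsing into a null, and the new `emptyValue` option is the knob for adjusting that representation.

```scala
import spark.implicits._

val df = Seq(("a", ""), ("b", null)).toDF("k", "v")
df.write.mode("overwrite").csv("/tmp/csv-empty-value")

// Reading back with the original schema: row "a" should keep "", row "b" should stay null.
spark.read.schema(df.schema).csv("/tmp/csv-empty-value").show()
```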
fb4965a [SPARK-25371][SQL] struct() should allow being called with 0 args ## What changes were proposed in this pull request? SPARK-21281 introduced a check for the inputs of `CreateStructLike` to be non-empty. This means that `struct()`, which was previously considered valid, now throws an Exception. This behavior change was introduced in 2.3.0. The change may break users' applications on upgrade, and it causes `VectorAssembler` to fail when an empty `inputCols` is defined. The PR removes the added check, making `struct()` valid again. ## How was this patch tested? Added UT Closes #22373 from mgaido91/SPARK-25371. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 0736e72a66735664b191fc363f54e3c522697dba) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 September 2018, 06:17:49 UTC
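A minimal illustration of the restored behavior (assuming a SparkSession named `spark`): calling `struct()` with no columns is accepted again instead of throwing.

```scala
import org.apache.spark.sql.functions.struct

// An empty struct column; with the 2.3.0 check in place this raised an exception.
spark.range(1).select(struct()).show()
```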
ffd036a [SPARK-23672][PYTHON] Document support for nested return types in scalar with arrow udfs ## What changes were proposed in this pull request? Clarify the docstring for Scalar functions ## How was this patch tested? Adds a unit test showing usage similar to wordcount; there's an existing unit test for array of floats as well. Closes #20908 from holdenk/SPARK-23672-document-support-for-nested-return-types-in-scalar-with-arrow-udfs. Authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Bryan Cutler <cutlerb@gmail.com> (cherry picked from commit da5685b5bb9ee7daaeb4e8f99c488ebd50c7aac3) Signed-off-by: Bryan Cutler <cutlerb@gmail.com> 10 September 2018, 18:02:09 UTC
5d98c31 [SPARK-25278][SQL] Avoid duplicated Exec nodes when the same logical plan appears in the query ## What changes were proposed in this pull request? In the Planner, we collect the placeholders which need to be substituted in the query execution plan and, once we plan them, we substitute each placeholder with the effective plan. In this second phase, we rely on the `==` comparison, i.e. the `equals` method. This means that if two placeholder plans - which are different instances - have the same attributes (so that they are equal according to the `equals` method), they are both substituted with their corresponding new physical plans. So, in such a situation, the first time we substitute both of them with the first of the two newly generated plans, and the second time we substitute nothing. This is usually of no harm for the execution of the query itself, as the two plans are identical. But since they are now the same instance, the local variables are shared (which is unexpected). This causes issues for the metrics collected, as the same node is executed twice, so the metrics are wrongly accumulated twice. The PR proposes to use the `eq` method when checking which placeholder needs to be substituted; thus, in the previous situation, both of the two different physical nodes that are created (one for each time the logical plan appears in the query plan) are used, and the metrics are collected properly for each of them. ## How was this patch tested? added UT Closes #22284 from mgaido91/SPARK-25278. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 12e3e9f17dca11a2cddf0fb99d72b4b97517fb56) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 September 2018, 11:42:14 UTC
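For readers less familiar with the Scala distinction the fix relies on, a plain-Scala sketch (the `Placeholder` case class is a made-up stand-in, not Spark's internal type): `==`/`equals` is structural equality, while `eq` is reference identity, which is what lets the planner tell two equal-but-distinct placeholder instances apart.

```scala
case class Placeholder(attrs: Seq[String])

val p1 = Placeholder(Seq("a"))
val p2 = Placeholder(Seq("a"))

assert(p1 == p2)      // structurally equal, so `equals`-based matching conflates them
assert(!(p1 eq p2))   // different instances, which `eq` distinguishes
```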
67bc7ef [SPARK-24849][SPARK-24911][SQL][FOLLOW-UP] Converting a value of StructType to a DDL string ## What changes were proposed in this pull request? Add the version number for the new APIs. ## How was this patch tested? N/A Closes #22377 from gatorsmile/followup24849. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6f6517837ba9934a280b11aba9d9be58bc131f25) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 September 2018, 11:18:27 UTC
c9ca359 [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirCommand output schema in Parquet issue ## What changes were proposed in this pull request? How to reproduce: ```scala spark.sql("CREATE TABLE tbl(id long)") spark.sql("INSERT OVERWRITE TABLE tbl VALUES 4") spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") spark.sql(s"INSERT OVERWRITE LOCAL DIRECTORY '/tmp/spark/parquet' " + "STORED AS PARQUET SELECT ID FROM view1") spark.read.parquet("/tmp/spark/parquet").schema scala> spark.read.parquet("/tmp/spark/parquet").schema res10: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,true)) ``` The schema should be `StructType(StructField(ID,LongType,true))` as we `SELECT ID FROM view1`. This PR fixes this issue. ## How was this patch tested? Unit tests Closes #22359 from wangyum/SPARK-25313-FOLLOW-UP. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f8b4d5aafd1923d9524415601469f8749b3d0811) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 10 September 2018, 05:47:43 UTC