https://github.com/apache/spark

2b147c4 Preparing Spark release v3.0.1-rc3 28 August 2020, 03:25:14 UTC
60f4856 [SPARK-32701][CORE][DOCS] mapreduce.fileoutputcommitter.algorithm.version default value The current documentation states that the default value of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 which is not entirely true since this configuration isn't set anywhere in Spark but rather inherited from the Hadoop FileOutputCommitter class. ### What changes were proposed in this pull request? I'm submitting this change, to clarify that the default value will entirely depend on the Hadoop version of the runtime environment. ### Why are the changes needed? An application would end up using algorithm version 1 on certain environments but without any changes the same exact application will use version 2 on environments running Hadoop 3.0 and later. This can have pretty bad consequences in certain scenarios, for example, two tasks can partially overwrite their output if speculation is enabled. Also, please refer to the following JIRA: https://issues.apache.org/jira/browse/MAPREDUCE-7282 ### Does this PR introduce _any_ user-facing change? Yes. Configuration page content was modified where previously we explicitly highlighted that the default version for the FileOutputCommitter algorithm was v1, this now has changed to "Dependent on environment" with additional information in the description column to elaborate. ### How was this patch tested? Checked changes locally in browser Closes #29541 from waleedfateem/SPARK-32701. Authored-by: waleedfateem <waleed.fateem@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 8749b2b6fae5ee0ce7b48aae6d859ed71e98491d) Signed-off-by: Sean Owen <srowen@gmail.com> 27 August 2020, 14:06:12 UTC
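A minimal sketch of pinning the committer algorithm explicitly, so the behavior no longer depends on the Hadoop version of the runtime environment (the app name, master, and output path below are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Explicitly pin the file output committer algorithm to v1 instead of
// inheriting whatever default the bundled Hadoop version provides.
val spark = SparkSession.builder()
  .appName("committer-v1")  // placeholder
  .master("local[*]")       // placeholder
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()

spark.range(10).write.mode("overwrite").parquet("/tmp/committer-v1-demo")
```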
9e8fb48 [SPARK-32659][SQL] Fix the data issue when pruning DPP on non-atomic type ### What changes were proposed in this pull request? Use `InSet` expression to fix data issue when pruning DPP on non-atomic type. for example: ```scala spark.range(1000) .select(col("id"), col("id").as("k")) .write .partitionBy("k") .format("parquet") .mode("overwrite") .saveAsTable("df1"); spark.range(100) .select(col("id"), col("id").as("k")) .write .partitionBy("k") .format("parquet") .mode("overwrite") .saveAsTable("df2") spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2") spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false") spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = struct(df2.k) AND df2.id < 2").show ``` It should return two records, but it returns empty. ### Why are the changes needed? Fix data issue ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add new unit test. Closes #29475 from wangyum/SPARK-32659. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a8b568800e64f6a163da28e5e53441f84355df14) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 August 2020, 06:57:55 UTC
24552b6 [SPARK-32695][INFRA] Explicitly cache and hash 'build' directly in GitHub Actions ### What changes were proposed in this pull request? This PR proposes to explicitly cache and hash the files/directories under 'build' for SBT and Zinc at GitHub Actions. Otherwise, it can end up with overwriting `build` directory. See also https://github.com/apache/spark/pull/29286#issuecomment-679368436 Previously, other files like `build/mvn` and `build/sbt` are also cached and overwritten. So, when you have some changes there, they are ignored. ### Why are the changes needed? To make GitHub Actions build stable. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? The builds in this PR test it out. Closes #29536 from HyukjinKwon/SPARK-32695. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b07e7429a6af27418da271ac7c374f325e843a25) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 August 2020, 03:26:15 UTC
9e937c5 [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled ### What changes were proposed in this pull request? This pr reset the `numPartitions` metric when DPP is enabled. Otherwise, it is always a [static value](https://github.com/apache/spark/blob/18cac6a9f0bf4a6d449393f1ee84004623b3c893/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L215). ### Why are the changes needed? Fix metric issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test and manual test For [this test case](https://github.com/apache/spark/blob/18cac6a9f0bf4a6d449393f1ee84004623b3c893/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala#L252-L280). Before this pr: ![image](https://user-images.githubusercontent.com/5399861/90301798-9310b480-ded4-11ea-9294-49bcaba46f83.png) After this pr: ![image](https://user-images.githubusercontent.com/5399861/90301709-0fef5e80-ded4-11ea-942d-4d45d1dd15bc.png) Closes #29436 from wangyum/SPARK-32620. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com> (cherry picked from commit 1354cf0842fe9cac2af84eeaa2fac9409db7b128) Signed-off-by: Yuming Wang <wgyumg@gmail.com> 26 August 2020, 01:46:31 UTC
68ff809 [SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check to see if it's set before applying comment logic (i.e. not set to default of `\0`), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when 'disabled'. To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example. Nothing beyond the effect of the bug fix. Existing tests plus new test case. Closes #29516 from srowen/SPARK-32614. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit a9d4e60a90d4d6765642e6bf7810da117af6437b) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 25 August 2020, 15:28:15 UTC
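A hedged sketch of opting into comment processing explicitly (spark-shell style, with a placeholder path and comment character):

```scala
// Comment handling applies only when the 'comment' option is set; with the
// default (\u0000) no lines, including those starting with a null char, are dropped.
val df = spark.read
  .option("comment", "#")  // lines starting with '#' are skipped
  .csv("/tmp/data.csv")    // placeholder path
```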
21ac7e2 [SPARK-32683][DOCS][SQL] Fix doc error and add migration guide for datetime pattern F ### What changes were proposed in this pull request? This PR fixes the doc error and add a migration guide for datetime pattern. ### Why are the changes needed? This is a bug of the doc that we inherited from JDK https://bugs.openjdk.java.net/browse/JDK-8169482 The SimpleDateFormatter(**F Day of week in month**) we used in 2.x and the DatetimeFormatter(**F week-of-month**) we use now both have the opposite meanings to what they declared in the java docs. And unfortunately, this also leads to silent data change in Spark too. The `week-of-month` is actually the pattern `W` in DatetimeFormatter, which is banned to use in Spark 3.x. If we want to keep pattern `F`, we need to accept the behavior change with proper migration guide and fix the doc in Spark ### Does this PR introduce _any_ user-facing change? Yes, doc changed ### How was this patch tested? passing ci doc generating job Closes #29538 from yaooqinn/SPARK-32683. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 1f3bb5175749816be1f0bc793ed5239abf986000) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 August 2020, 13:17:16 UTC
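A small spark-shell sketch for observing what pattern F actually produces (assuming an active `spark` session); per the note above, its documented meaning and its behavior differ:

```scala
// 'F' is documented as week-of-month but behaves like an aligned
// day-of-week-in-month count, which is why the migration guide entry was added.
spark.sql("SELECT date_format(DATE'2020-08-25', 'F') AS f").show()
```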
6c88d7c [SPARK-32646][SQL][3.0][TEST-HADOOP2.7][TEST-HIVE1.2] ORC predicate pushdown should work with case-insensitive analysis ### What changes were proposed in this pull request? This PR proposes to fix ORC predicate pushdown under case-insensitive analysis case. The field names in pushed down predicates don't need to match in exact letter case with physical field names in ORC files, if we enable case-insensitive analysis. ### Why are the changes needed? Currently ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot pushdown to ORC file with field name "A" under case-insensitive analysis. But Parquet predicate pushdown works with this case. We should make ORC predicate pushdown work with case-insensitive analysis too. ### Does this PR introduce _any_ user-facing change? Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work. ### How was this patch tested? Unit tests. Closes #29513 from viirya/fix-orc-pushdown-3.0. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 August 2020, 04:42:39 UTC
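A rough sketch of the now-working case (assuming `spark.sql.caseSensitive` is false, which is the default; the path is a placeholder): the requested schema uses lower-case `a` while the physical ORC field is `A`, and the filter should still be pushed down.

```scala
spark.conf.set("spark.sql.caseSensitive", "false")

// Physical ORC field name is "A"; the user-specified schema uses "a".
spark.range(10).toDF("A").write.mode("overwrite").orc("/tmp/orc_ci")
spark.read.schema("a BIGINT").orc("/tmp/orc_ci").filter("a < 5").show()
```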
82aef3e [MINOR][SQL] Add missing documentation for LongType mapping ### What changes were proposed in this pull request? Added Java docs for Long data types in the Row class. ### Why are the changes needed? The Long datatype is somehow missing in Row.scala's `apply` and `get` methods. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs. Closes #29534 from yeshengm/docs-fix. Authored-by: Yesheng Ma <kimi.ysma@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 3eee915b474c58cff9ea108f67073ed9c0c86224) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 25 August 2020, 02:20:16 UTC
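A minimal sketch of the two access paths the added docs cover, the typed `getLong` accessor and the generic `apply`/`get`:

```scala
import org.apache.spark.sql.Row

val row = Row(42L, "spark")
val typed   = row.getLong(0)             // typed accessor for LongType values
val generic = row(0).asInstanceOf[Long]  // generic apply/get returns Any
```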
007acba [SPARK-32676][3.0][ML] Fix double caching in KMeans/BiKMeans ### What changes were proposed in this pull request? backporting https://github.com/apache/spark/pull/29501 ### Why are the changes needed? avoid double caching ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Existing tests Closes #29528 from huaxingao/kmeans_3.0. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com> 24 August 2020, 15:47:01 UTC
4a67f1e [SPARK-32588][CORE][TEST] Fix SizeEstimator initialization in tests In order to produce consistent results from SizeEstimator the tests override some system properties that are used during SizeEstimator initialization. However there were several places where either the compressed references property wasn't set or the system properties were set but the SizeEstimator not re-initialized. This caused failures when running the tests with a large heap build of OpenJ9 because it does not use compressed references unlike most environments. ### What changes were proposed in this pull request? Initialize SizeEstimator class explicitly in the tests where required to avoid relying on a particular environment. ### Why are the changes needed? Test failures can be seen when compressed references are disabled (e.g. using an OpenJ9 large heap build or Hotspot with a large heap). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests run on machine running OpenJ9 large heap build. Closes #29407 from mundaym/fix-sizeestimator. Authored-by: Michael Munday <mike.munday@ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit bc23bb78823f4fa02385b7b2a0270cd1b98bce34) Signed-off-by: Sean Owen <srowen@gmail.com> 24 August 2020, 15:19:44 UTC
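For context, a small sketch of the developer API whose results the tests pin down; the estimate depends on JVM settings such as compressed references, which is exactly why the suites above re-initialize the estimator:

```scala
import org.apache.spark.util.SizeEstimator

// Estimated in-memory size in bytes; varies with compressed-oops settings.
val bytes = SizeEstimator.estimate(Array.fill(1000)("spark"))
```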
8aa644e [SPARK-32092][ML][PYSPARK][3.0] Removed foldCol related code ### What changes were proposed in this pull request? - Removed `foldCol` related code introduced in #29445 which is causing issues in the base branch. - Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()` on the models instead of lists of models. ### Why are the changes needed? - `foldCol` is from 3.1 hence causing tests to fail. - `CrossValidatorModel.copy()` is supposed to shallow copy models not lists of models. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Existing tests created in #29445 ran and passed. - Updated `test_copy` to make sure `copy()` is called on models instead of lists of models. Closes #29524 from Louiszr/remove-foldcol-3.0. Authored-by: Louiszr <zxhst14@gmail.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com> 24 August 2020, 04:10:52 UTC
da60de5 [SPARK-32552][SQL][DOCS] Complete the documentation for Table-valued Function # What changes were proposed in this pull request? There are two types of TVF. We only documented one type. Adding the doc for the 2nd type. ### Why are the changes needed? complete Table-valued Function doc ### Does this PR introduce _any_ user-facing change? <img width="1099" alt="Screen Shot 2020-08-06 at 5 30 25 PM" src="https://user-images.githubusercontent.com/13592258/89595926-c5eae680-d80a-11ea-918b-0c3646f9930e.png"> <img width="1100" alt="Screen Shot 2020-08-06 at 5 30 49 PM" src="https://user-images.githubusercontent.com/13592258/89595929-c84d4080-d80a-11ea-9803-30eb502ccd05.png"> <img width="1101" alt="Screen Shot 2020-08-06 at 5 31 19 PM" src="https://user-images.githubusercontent.com/13592258/89595931-ca170400-d80a-11ea-8812-2f009746edac.png"> <img width="1100" alt="Screen Shot 2020-08-06 at 5 31 40 PM" src="https://user-images.githubusercontent.com/13592258/89595934-cb483100-d80a-11ea-9e18-9357aa9f2c5c.png"> ### How was this patch tested? Manually build and check Closes #29355 from huaxingao/tvf. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit db74fd0d3320f120540133094a9975963941b98c) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 24 August 2020, 00:44:00 UTC
898211b [SPARK-32609][TEST] Add Tests for Incorrect exchange reuse with DataSourceV2 ### What changes were proposed in this pull request? Copy to master branch the unit test added for branch-2.4(https://github.com/apache/spark/pull/29430). ### Why are the changes needed? The unit test will pass at master branch, indicating that issue reported in https://issues.apache.org/jira/browse/SPARK-32609 is already fixed at master branch. But adding this unit test for future possible failure catch. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? sbt test run Closes #29435 from mingjialiu/master. Authored-by: mingjial <mingjial@google.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b9585cde31fe99aecca42146c71c552218cba591) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 August 2020, 00:41:17 UTC
f088c28 [SPARK-32594][SQL][FOLLOWUP][TEST-HADOOP2.7][TEST-HIVE1.2] Override `get()` and use Julian days in `DaysWritable` ### What changes were proposed in this pull request? Override `def get: Date` in `DaysWritable` use the `daysToMillis(int d)` from the parent class `DateWritable` instead of `long daysToMillis(int d, boolean doesTimeMatter)`. ### Why are the changes needed? It fixes failures of `HiveSerDeReadWriteSuite` with the profile `hive-1.2`. In that case, the parent class `DateWritable` has different implementation before the commit to Hive https://github.com/apache/hive/commit/da3ed68eda10533f3c50aae19731ac6d059cda87. In particular, `get()` calls `new Date(daysToMillis(daysSinceEpoch))` instead of overrided `def get(doesTimeMatter: Boolean): Date` in the child class. The `get()` method returns wrong result `1970-01-01` because it uses not updated `daysSinceEpoch`. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By running the test suite `HiveSerDeReadWriteSuite`: ``` $ build/sbt -Phive-1.2 -Phadoop-2.7 "test:testOnly org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite" ``` and ``` $ build/sbt -Phive-2.3 -Phadoop-2.7 "test:testOnly org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite" ``` Closes #29523 from MaxGekk/insert-date-into-hive-table-1.2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 1c798f973fa8307cc1f15eec067886e8e9aecb59) Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 23 August 2020, 19:43:46 UTC
f5d5422 [SPARK-32608][SQL][3.0][FOLLOW-UP][TEST-HADOOP2.7][TEST-HIVE1.2] Script Transform ROW FORMAT DELIMIT value should format value ### What changes were proposed in this pull request? As mentioned in https://github.com/apache/spark/pull/29428#issuecomment-678735163 by viirya , fix bug in UT, since in script transformation no-serde mode, output of decimal is same in both hive-1.2/hive-2.3 ### Why are the changes needed? FIX UT ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? EXISTED UT Closes #29521 from AngersZhuuuu/SPARK-32608-3.0-FOLLOW-UP. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com> 23 August 2020, 15:20:05 UTC
85c9e8c [SPARK-32092][ML][PYSPARK] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write() ### What changes were proposed in this pull request? Changed the definitions of `CrossValidatorModel.copy()/_to_java()/_from_java()` so that exposed parameters (i.e. parameters with `get()` methods) are copied in these methods. ### Why are the changes needed? Parameters are copied in the respective Scala interface for `CrossValidatorModel.copy()`. It fits the semantics to persist parameters when calling `CrossValidatorModel.save()` and `CrossValidatorModel.load()` so that the user gets the same model by saving and loading it after. Not copying across `numFolds` also causes bugs like Array index out of bound and losing sub-models because this parameters will always default to 3 (as described in the JIRA ticket). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests for `CrossValidatorModel.copy()` and `save()`/`load()` are updated so that they check parameters before and after function calls. Closes #29445 from Louiszr/master. Authored-by: Louiszr <zxhst14@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit d9eb06ea37cab185f1e49c641313be9707270252) Signed-off-by: Sean Owen <srowen@gmail.com> 22 August 2020, 14:27:43 UTC
a6df16b [SPARK-31792][SS][DOC][FOLLOW-UP] Rephrase the description for some operations ### What changes were proposed in this pull request? Rephrase the description for some operations to make it clearer. ### Why are the changes needed? Add more detail in the document. ### Does this PR introduce _any_ user-facing change? No, document only. ### How was this patch tested? Document only. Closes #29269 from xuanyuanking/SPARK-31792-follow. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> (cherry picked from commit 8b26c69ce7f9077775a3c7bbabb1c47ee6a51a23) Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> 22 August 2020, 12:33:29 UTC
87f1d51 [SPARK-32672][SQL] Fix data corruption in boolean bit set compression ### What changes were proposed in this pull request? This fixes SPARK-32672, a data corruption issue. Essentially the BooleanBitSet CompressionScheme would miss nulls at the end of a CompressedBatch. The values would then default to false. ### Why are the changes needed? It fixes data corruption. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I manually tested it against the original issue that was producing errors for me. I also added in a unit test. Closes #29506 from revans2/SPARK-32672. Authored-by: Robert (Bobby) Evans <bobby@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 12f4331b9eb563cb0cfbf6a241d1d085ca4f7676) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 22 August 2020, 02:07:38 UTC
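A rough spark-shell sketch of the failure shape described above (whether the BooleanBitSet scheme is chosen for a given batch depends on the data layout, so this is illustrative only):

```scala
import org.apache.spark.sql.functions.{col, lit}

// A boolean column whose trailing values are null; before the fix the
// in-memory columnar cache could silently turn such trailing nulls into false.
val withNulls = spark.range(10).select(((col("id") % 2) === 0).as("flag"))
  .union(spark.range(5).select(lit(null).cast("boolean").as("flag")))
withNulls.cache().count()
withNulls.filter(col("flag").isNull).count()  // should still be 5 after caching
```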
9ccc790 [MINOR][DOCS] backport PR#29443 to fix typos in docs, log messages and comments ### What changes were proposed in this pull request? backport PR #29443 to fix typos in docs, log messages and comments ### Why are the changes needed? typo fixes to increase readability ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? manual testing has been performed to verify the updated text Closes #29512 from brandonJY/branch-3.0. Authored-by: Brandon Jiang <brandonJY@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 22 August 2020, 01:08:39 UTC
a5f4230 [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc ### What changes were proposed in this pull request? This adds some tuning guide for increasing parallelism of directory listing. ### Why are the changes needed? Sometimes when job input has large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some info to Spark tuning guide to make the knowledge better shared. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #29498 from sunchao/SPARK-32674. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit bf221debd02b11003b092201d0326302196e4ba5) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 August 2020, 07:49:13 UTC
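The knobs the tuning guide discusses, shown here as a hedged sketch (the values are illustrative, not recommendations):

```scala
// Threshold (number of paths) above which listing switches to a distributed job,
// and the maximum parallelism used for that listing job.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "100")
```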
f73e6ca [SPARK-32663][CORE] Avoid individual closing of pooled TransportClients (which must be closed through the pool) ### What changes were proposed in this pull request? Removing the individual `close` method calls on the pooled `TransportClient` instances. The pooled clients should be only closed via `TransportClientFactory#close()`. ### Why are the changes needed? Reusing a closed `TransportClient` leads to the exception `java.nio.channels.ClosedChannelException`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a trivial case which is not tested by specific test. Closes #29492 from attilapiros/SPARK-32663. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com> (cherry picked from commit 79b4dea1b08adc9d4b545a1af29ac50fa603936a) Signed-off-by: Mridul Muralidharan <mridulatgmail.com> 21 August 2020, 06:04:21 UTC
2932926 [SPARK-32660][SQL][DOC] Show Avro related API in documentation ### What changes were proposed in this pull request? Currently, the Avro related APIs are missing in the documentation https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html . This PR is to: 1. Mark internal Avro related classes as private 2. Show Avro related API in Spark official API documentation ### Why are the changes needed? Better documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Build doc and preview: ![image](https://user-images.githubusercontent.com/1097932/90623042-d156ee00-e1ca-11ea-9edd-2c45b3001fd8.png) ![image](https://user-images.githubusercontent.com/1097932/90623047-d451de80-e1ca-11ea-94ba-02921b64d6f1.png) ![image](https://user-images.githubusercontent.com/1097932/90623058-d6b43880-e1ca-11ea-849a-b9ea9efe6527.png) Closes #29476 from gengliangwang/avroAPIDoc. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> (cherry picked from commit de141a32714fd2dbc4be2d540adabf328bbce2c4) Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 21 August 2020, 05:13:14 UTC
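With the Avro functions now documented, a minimal round-trip sketch (assuming the external spark-avro module is on the classpath and a spark-shell style `spark` session):

```scala
import org.apache.spark.sql.avro.functions.{from_avro, to_avro}
import org.apache.spark.sql.functions.col

// Encode a long column to Avro binary and decode it back with an Avro schema.
val encoded = spark.range(3).select(to_avro(col("id")).as("value"))
val decoded = encoded.select(from_avro(col("value"), "\"long\"").as("id"))
decoded.show()
```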
8755e3f [SPARK-32621][SQL][3.0] path' option can cause issues while inferring schema in CSV/JSON datasources ### What changes were proposed in this pull request? This PR is backporting #29437 to branch-3.0. When CSV/JSON datasources infer schema (e.g, `def inferSchema(files: Seq[FileStatus])`, they use the `files` along with the original options. `files` in `inferSchema` could have been deduced from the "path" option if the option was present, so this can cause issues (e.g., reading more data, listing the path again) since the "path" option is **added** to the `files`. ### Why are the changes needed? The current behavior can cause the following issue: ```scala class TestFileFilter extends PathFilter { override def accept(path: Path): Boolean = path.getParent.getName != "p=2" } val path = "/tmp" val df = spark.range(2) df.write.json(path + "/p=1") df.write.json(path + "/p=2") val extraOptions = Map( "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName, "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName ) // This works fine. assert(spark.read.options(extraOptions).json(path).count == 2) // The following with "path" option fails with the following: // assertion failed: Conflicting directory structures detected. Suspicious paths // file:/tmp // file:/tmp/p=1 assert(spark.read.options(extraOptions).format("json").option("path", path).load.count() === 2) ``` ### Does this PR introduce _any_ user-facing change? Yes, the above failure doesn't happen and you get the consistent experience when you use `spark.read.csv(path)` or `spark.read.format("csv").option("path", path).load`. ### How was this patch tested? Updated existing tests. Closes #29478 from imback82/backport-29437. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 August 2020, 17:18:29 UTC
29a10a4 [SPARK-28863][SQL][FOLLOWUP] Do not reuse the physical plan ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/29469 Instead of passing the physical plan to the fallbacked v1 source directly and skipping analysis, optimization, planning altogether, this PR proposes to pass the optimized plan. ### Why are the changes needed? It's a bit risky to pass the physical plan directly. When the fallbacked v1 source applies more operations to the input DataFrame, it will re-apply the post-planning physical rules like `CollapseCodegenStages`, `InsertAdaptiveSparkPlan`, etc., which is very tricky. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing test suite with some new tests Closes #29489 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d378dc5f6db6fe37426728bea714f44b94a94861) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 August 2020, 15:24:01 UTC
87d7ab6 [SPARK-32608][SQL][3.0] Script Transform ROW FORMAT DELIMIT value should format value ### What changes were proposed in this pull request? For SQL ``` SELECT TRANSFORM(a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' NULL DEFINED AS 'null' USING 'cat' AS (a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' NULL DEFINED AS 'NULL' FROM testData ``` The correct TOK_TABLEROWFORMATFIELD should be `,` but is actually `','`, and TOK_TABLEROWFORMATLINES should be `\n` but is actually `'\n'`. ### Why are the changes needed? Fix string value format ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #29487 from AngersZhuuuu/SPARK-32608-3.0. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 August 2020, 13:43:15 UTC
b87ec5d [SPARK-32658][CORE] Fix `PartitionWriterStream` partition length overflow ### What changes were proposed in this pull request? The `count` in `PartitionWriterStream` should be a long value, instead of an int. The issue is introduced by apache/spark@abef84a. When the overflow happens, the shuffle index file would record a wrong index for a reduceId, thus leading to a `FetchFailedException: Stream is corrupted` error. Besides the fix, I also added some debug logs, so in the future it's easier to debug similar issues. ### Why are the changes needed? This is a regression and bug fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? A Spark user reported this issue when migrating their workload to 3.0. One of the jobs failed deterministically on Spark 3.0 without the patch, and the job succeeded after applying the fix. Closes #29474 from jiangxb1987/fixPartitionWriteStream. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f793977e9ac2ef597fca4a95356affbfbf864f88) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 August 2020, 07:08:50 UTC
c4a12f2 [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters ### What changes were proposed in this pull request? This PR introduces a LogicalNode AlreadyPlanned, and related physical plan and preparation rule. With the DataSourceV2 write operations, we have a way to fallback to the V1 writer APIs using InsertableRelation. The gross part is that we're in physical land, but the InsertableRelation takes a logical plan, so we have to pass the logical plans to these physical nodes, and then potentially go through re-planning. This re-planning can cause issues for an already optimized plan. A useful primitive could be specifying that a plan is ready for execution through a logical node AlreadyPlanned. This would wrap a physical plan, and then we can go straight to execution. ### Why are the changes needed? To avoid having a physical plan that is disconnected from the physical plan that is being executed in V1WriteFallback execution. When a physical plan node executes a logical plan, the inner query is not connected to the running physical plan. The physical plan that actually runs is not visible through the Spark UI and its metrics are not exposed. In some cases, the EXPLAIN plan doesn't show it. ### Does this PR introduce _any_ user-facing change? Nope ### How was this patch tested? V1FallbackWriterSuite tests that writes still work Closes #29469 from brkyvz/alreadyAnalyzed2. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 278d0dd25bc1479ecda42d6f722106e4763edfae) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 19 August 2020, 16:25:45 UTC
2b41bc9 [SPARK-32451][R][3.0] Support Apache Arrow 1.0.0 ### What changes were proposed in this pull request? This PR ports back https://github.com/apache/spark/pull/29252 to support Arrow 1.0.0. Currently, SparkR with Arrow tests fails with the latest Arrow version in branch-3.0, see https://github.com/apache/spark/pull/29460/checks?check_run_id=996972267 ### Why are the changes needed? To support higher Arrow R version with SparkR. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to use SparkR with Arrow 1.0.0+. ### How was this patch tested? Manually tested, GitHub Actions will test it. Closes #29462 from HyukjinKwon/SPARK-32451-3.0. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 August 2020, 14:22:52 UTC
56ec5dd [SPARK-32249][INFRA][3.0] Run Github Actions builds in branch-3.0 ### What changes were proposed in this pull request? This PR proposes to backport the following JIRAs: - SPARK-32245 - SPARK-32292 - SPARK-32252 - SPARK-32316 - SPARK-32408 - SPARK-32303 - SPARK-32363 - SPARK-32419 - SPARK-32422 - SPARK-32491 - SPARK-32493 - SPARK-32496 - SPARK-32497 - SPARK-32357 - SPARK-32606 - SPARK-32605 - SPARK-32248 - SPARK-32645 - Minor renaming d0dfe49#diff-02d9c370a663741451423342d5869b21 in order to enable GitHub Actions in branch-3.0. ### Why are the changes needed? To be able to run the tests in branch-3.0. Jenkins jobs are unstable. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Build in this PR will test. Closes #29460 from HyukjinKwon/SPARK-32249. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 19 August 2020, 14:20:26 UTC
6dc7457 [SPARK-32624][SQL] Use getCanonicalName to fix byte[] compile issue ### What changes were proposed in this pull request? ```scala scala> Array[Byte](1, 2).getClass.getName res13: String = [B scala> Array[Byte](1, 2).getClass.getCanonicalName res14: String = byte[] ``` This pr replace `getClass.getName` with `getClass.getCanonicalName` in `CodegenContext.addReferenceObj` to fix `byte[]` compile issue: ``` ... /* 030 */ value_1 = org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) references[0] /* min */)) >= 0 && org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) references[1] /* max */)) <= 0; /* 031 */ } /* 032 */ return !isNull_1 && value_1; /* 033 */ } /* 034 */ /* 035 */ /* 036 */ } 20:49:54.886 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr codegen error and falling back to interpreter mode java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, Column 81: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, Column 81: Unexpected token "[" in primary ... ``` ### Why are the changes needed? Fix compile issue when compiling generated code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #29439 from wangyum/SPARK-32624. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com> (cherry picked from commit 409fea30cc40ce24a17325ec63d2f847ce49f5a6) Signed-off-by: Yuming Wang <wgyumg@gmail.com> 19 August 2020, 12:21:00 UTC
b3a971a [SPARK-32647][INFRA] Report SparkR test results with JUnit reporter This PR proposes to generate a JUnit XML test report in SparkR tests that can be leveraged in both Jenkins and GitHub Actions. **GitHub Actions** ![Screen Shot 2020-08-18 at 12 42 46 PM](https://user-images.githubusercontent.com/6477701/90467934-55b85b00-e150-11ea-863c-c8415e764ddb.png) **Jenkins** ![Screen Shot 2020-08-18 at 2 03 42 PM](https://user-images.githubusercontent.com/6477701/90472509-a5505400-e15b-11ea-9165-777ec9b96eaa.png) NOTE that while I am here, I am switching back the console reporter from "progress" to "summary". Currently non-ASCII characters are broken in the Jenkins console and switching it to "summary" can work around it. "summary" is the default format used in testthat 1.x. This makes it easier to check the test failures. No, dev-only. It is tested in GitHub Actions at https://github.com/HyukjinKwon/spark/pull/23/checks?check_run_id=996586446 In case of Jenkins, https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127525/testReport/ Closes #29456 from HyukjinKwon/sparkr-junit. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 19 August 2020, 06:12:17 UTC
753d414 [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder ### What changes were proposed in this pull request? Update `ObjectSerializerPruning.alignNullTypeInIf`, to consider the isNull check generated in `RowEncoder`, which is `Invoke(inputObject, "isNullAt", BooleanType, Literal(index) :: Nil)`. ### Why are the changes needed? Query fails if we don't fix this bug, due to type mismatch in `If`. ### Does this PR introduce _any_ user-facing change? Yes, the failed query can run after this fix. ### How was this patch tested? new tests Closes #29467 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit f33b64a6567f93e5515521b8b1e7761e16f667d0) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 19 August 2020, 04:50:44 UTC
a36514e [3.0][SPARK-32518][CORE] CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources ### What changes were proposed in this pull request? 1. Make `CoarseGrainedSchedulerBackend.maxNumConcurrentTasks()` consider all kinds of resources when calculating the max concurrent tasks 2. Refactor `calculateAvailableSlots()` so that it can be used for both `CoarseGrainedSchedulerBackend` and `TaskSchedulerImpl` ### Why are the changes needed? Currently, `CoarseGrainedSchedulerBackend.maxNumConcurrentTasks()` only considers the CPU for the max concurrent tasks. This can cause the application to hang when a barrier stage requires extra custom resources but the cluster doesn't have enough corresponding resources. Because, without checking for other custom resources in `maxNumConcurrentTasks`, the barrier stage can be submitted to the `TaskSchedulerImpl`. But the `TaskSchedulerImpl` won't launch tasks for the barrier stage due to the insufficient task slots calculated by `TaskSchedulerImpl.calculateAvailableSlots` (which does check all kinds of resources). If the barrier stage can't launch all its tasks at once, the application will fail and suggest the user disable delay scheduling. However, this is actually a misleading suggestion since the real root cause is not enough resources. ### Does this PR introduce _any_ user-facing change? Yes. In case a barrier stage requires more custom resources than the cluster has, previously the application would fail with a misleading suggestion to disable delay scheduling. After this PR, the application will fail with an error message saying there are not enough resources. ### How was this patch tested? Added a unit test. Closes #29395 from Ngone51/backport-spark-32518. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 August 2020, 06:50:05 UTC
6cdc32f [SPARK-32622][SQL][TEST] Add case-sensitivity test for ORC predicate pushdown ### What changes were proposed in this pull request? During working on SPARK-25557, we found that ORC predicate pushdown doesn't have case-sensitivity test. This PR proposes to add case-sensitivity test for ORC predicate pushdown. ### Why are the changes needed? Increasing test coverage for ORC predicate pushdown. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Jenkins tests. Closes #29427 from viirya/SPARK-25557-followup3. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b33066f42bd474f5f80b14221f97d09a76e0b398) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 17 August 2020, 20:20:08 UTC
ee12374 [3.0][SQL] Revert SPARK-32018 ### What changes were proposed in this pull request? Revert SPARK-32018 related changes in branch 3.0: https://github.com/apache/spark/pull/29125 and https://github.com/apache/spark/pull/29404 ### Why are the changes needed? https://github.com/apache/spark/pull/29404 is made to fix correctness regression introduced by https://github.com/apache/spark/pull/29125. However, the behavior of decimal overflow is strange in non-ansi mode: 1. from 3.0.0 to 3.0.1: decimal overflow will throw exceptions instead of returning null on decimal overflow 2. from 3.0.1 to 3.1.0: decimal overflow will return null instead of throwing exceptions. So, this PR proposes to revert both https://github.com/apache/spark/pull/29404 and https://github.com/apache/spark/pull/29125. So that Spark will return null on decimal overflow in Spark 3.0.0 and Spark 3.0.1. ### Does this PR introduce _any_ user-facing change? Yes, Spark will return null on decimal overflow in Spark 3.0.1. ### How was this patch tested? Unit tests Closes #29450 from gengliangwang/revertDecimalOverflow. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 August 2020, 13:46:41 UTC
c4807ce [SPARK-32610][DOCS] Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version ### What changes were proposed in this pull request? This PR fixes the link to metrics.dropwizard.io in monitoring.md to refer the proper version of the library. ### Why are the changes needed? There are links to metrics.dropwizard.io in monitoring.md but the link targets refer the version 3.1.0, while we use 4.1.1. Now that users can create their own metrics using the dropwizard library, it's better to fix the links to refer the proper version. ### Does this PR introduce _any_ user-facing change? Yes. The modified links refer the version 4.1.1. ### How was this patch tested? Build the docs and visit all the modified links. Closes #29426 from sarutak/fix-dropwizard-url. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 9a79bbc8b6e426e7b29a9f4867beb396014d8046) Signed-off-by: Sean Owen <srowen@gmail.com> 16 August 2020, 17:07:50 UTC
6a88924 [SPARK-32625][SQL] Log error message when falling back to interpreter mode ### What changes were proposed in this pull request? This PR logs the error message when falling back to interpreter mode. ### Why are the changes needed? Not all error messages are in `CodeGenerator`, such as: ``` 21:48:44.612 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr codegen error and falling back to interpreter mode java.lang.IllegalArgumentException: Can not interpolate org.apache.spark.sql.types.Decimal into code block. at org.apache.spark.sql.catalyst.expressions.codegen.Block$BlockHelper$.$anonfun$code$1(javaCode.scala:240) at org.apache.spark.sql.catalyst.expressions.codegen.Block$BlockHelper$.$anonfun$code$1$adapted(javaCode.scala:236) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #29440 from wangyum/SPARK-32625. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c280c7f529e2766dd7dd45270bde340c28b9d74b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 15 August 2020, 19:31:45 UTC
38ab936 Preparing development version 3.0.2-SNAPSHOT 15 August 2020, 01:37:54 UTC
05144a5 Preparing Spark release v3.0.1-rc1 15 August 2020, 01:37:47 UTC
81d7747 [MINOR][SQL] Fixed approx_count_distinct rsd param description ### What changes were proposed in this pull request? In the docs concerning approx_count_distinct, I have changed the description of the rsd parameter from **_maximum estimation error allowed_** to _**maximum relative standard deviation allowed**_ ### Why are the changes needed? "Maximum estimation error allowed" can be misleading. You can set the target relative standard deviation, which affects the estimation error, but on given runs the estimation error can still be above the rsd parameter. ### Does this PR introduce _any_ user-facing change? This PR should make it easier for users reading the docs to understand that the rsd parameter in approx_count_distinct doesn't cap the estimation error, but just sets the target deviation instead. ### How was this patch tested? No tests, as no code changes were made. Closes #29424 from Comonut/fix-approx_count_distinct-rsd-param-description. Authored-by: alexander-daskalov <alexander.daskalov@adevinta.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 10edeafc69250afef8c71ed7b3c77992f67aa4ff) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 14 August 2020, 13:11:07 UTC
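A short sketch of the parameter in question; `rsd` steers the HyperLogLog++ estimate's target relative standard deviation rather than capping the error of any single run:

```scala
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// 0.05 is the target relative standard deviation, not a hard error bound.
spark.range(100000).select(approx_count_distinct(col("id"), 0.05)).show()
```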
89765f5 [SPARK-32018][SQL][FOLLOWUP][3.0] Throw exception on decimal value overflow of sum aggregation ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/29125 In branch 3.0: 1. for hash aggregation, before https://github.com/apache/spark/pull/29125 there will be a runtime exception on decimal overflow of sum aggregation; after https://github.com/apache/spark/pull/29125, there could be a wrong result. 2. for sort aggregation, with/without https://github.com/apache/spark/pull/29125, there could be a wrong result on decimal overflow. While in the master branch (the future 3.1 release), the problem doesn't exist since in https://github.com/apache/spark/pull/27627 there is a flag for marking whether overflow happens in the aggregation buffer. However, the aggregation buffer is written into streaming checkpoints. Thus, we can't change the aggregation buffer to resolve the issue. As there is no easy solution for returning null/throwing an exception regarding `spark.sql.ansi.enabled` on overflow in branch 3.0, we have to make a choice here: always throw an exception on decimal value overflow of sum aggregation. ### Why are the changes needed? Avoid returning a wrong result in decimal value sum aggregation. ### Does this PR introduce _any_ user-facing change? Yes, there is always an exception on decimal value overflow of sum aggregation, instead of a possible wrong result. ### How was this patch tested? Unit test case Closes #29404 from gengliangwang/fixSum. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 13 August 2020, 03:52:12 UTC
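A hedged sketch of the overflow scenario discussed here and in the SPARK-32018 revert above; depending on which of those commits is in the build, the overflowing sum either throws or returns null:

```scala
import org.apache.spark.sql.functions.{col, lit, sum}

// Sum ten values close to the maximum of decimal(38,0) so the aggregate overflows.
val df = spark.range(10).select(lit("9" * 38).cast("decimal(38,0)").as("d"))
df.select(sum(col("d"))).show()
```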
21c2fa4 [MINOR] Update URL of the parquet project in code comment ### What changes were proposed in this pull request? Update URL of the parquet project in code comment. ### Why are the changes needed? The original url is not available. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? No test needed. Closes #29416 from izchen/Update-Parquet-URL. Authored-by: Chen Zhang <izchen@126.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 08d86ebc057165865c58b4fa3d4dd67039946e26) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 13 August 2020, 02:54:57 UTC
742c4df [SPARK-32250][SPARK-27510][CORE][TEST] Fix flaky MasterSuite.test(...) in Github Actions ### What changes were proposed in this pull request? Set more dispatcher threads for the flaky test. ### Why are the changes needed? When running test on Github Actions machine, the available processors in JVM is only 2, while on Jenkins it's 32. For this specific test, 2 available processors, which also decides the number of threads in Dispatcher, are not enough to consume the messages. In the worst situation, `MockExecutorLaunchFailWorker` would occupy these 2 threads for handling messages `LaunchDriver`, `LaunchExecutor` at the same time but leave no thread for the driver to handle the message `RegisteredApplication`. At the end, it results in a deadlock situation and causes the test failure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We can check whether the test is still flaky in Github Actions after this fix. Closes #29408 from Ngone51/spark-32250. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit c6ea98323fd23393541efadd814a611a25fa78b2) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 12 August 2020, 12:06:04 UTC
ecc2997 [SPARK-32599][SQL][TESTS] Check the TEXTFILE file format in `HiveSerDeReadWriteSuite` ### What changes were proposed in this pull request? - Test TEXTFILE together with the PARQUET and ORC file formats in `HiveSerDeReadWriteSuite` - Remove the "SPARK-32594: insert dates to a Hive table" added by #29409 ### Why are the changes needed? - To improve test coverage, and test other row SerDe - `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`. - The removed test is not needed anymore because the bug reported in SPARK-32594 is triggered by the TEXTFILE file format too. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suite `HiveSerDeReadWriteSuite`. Closes #29417 from MaxGekk/textfile-HiveSerDeReadWriteSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit f664aaaab13997bf61381aecfd4703f7e32e8fa1) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 12 August 2020, 12:00:19 UTC
9a3811d [SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on big endian platforms ### What changes were proposed in this pull request? This PR fixes the issue introduced during SPARK-26985. SPARK-26985 changes the `putDoubles()` and `putFloats()` methods to respect the platform's endian-ness. However, that causes the RLE paths in VectorizedRleValuesReader.java to read the RLE entries in parquet as BIG_ENDIAN on big endian platforms (i.e., as is), even though parquet data is always in little endian format. The comments in `WriteableColumnVector.java` say those methods are used for "ieee formatted doubles in platform native endian" (or floats), but since the data in parquet is always in little endian format, use of those methods appears to be inappropriate. To demonstrate the problem with spark-shell: ```scala import org.apache.spark._ import org.apache.spark.sql._ import org.apache.spark.sql.types._ var data = Seq( (1.0, 0.1), (2.0, 0.2), (0.3, 3.0), (4.0, 4.0), (5.0, 5.0)) var df = spark.createDataFrame(data).write.mode(SaveMode.Overwrite).parquet("/tmp/data.parquet2") var df2 = spark.read.parquet("/tmp/data.parquet2") df2.show() ``` result: ```scala +--------------------+--------------------+ | _1| _2| +--------------------+--------------------+ | 3.16E-322|-1.54234871366845...| | 2.0553E-320| 2.0553E-320| | 2.561E-320| 2.561E-320| |4.66726145843124E-62| 1.0435E-320| | 3.03865E-319|-1.54234871366757...| +--------------------+--------------------+ ``` Also tests in ParquetIOSuite that involve float/double data would fail, e.g., - basic data types (without binary) - read raw Parquet file /examples/src/main/python/mllib/isotonic_regression_example.py would fail as well. Purposed code change is to add `putDoublesLittleEndian()` and `putFloatsLittleEndian()` methods for parquet to invoke, just like the existing `putIntsLittleEndian()` and `putLongsLittleEndian()`. On little endian platforms they would call `putDoubles()` and `putFloats()`, on big endian they would read the entries as little endian like pre-SPARK-26985. No new unit-test is introduced as the existing ones are actually sufficient. ### Why are the changes needed? RLE float/double data in parquet files will not be read back correctly on big endian platforms. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? All unit tests (mvn test) were ran and OK. Closes #29383 from tinhto-000/SPARK-31703. Authored-by: Tin Hang To <tinto@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a418548dad57775fbb10b4ea690610bad1a8bfb0) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 12 August 2020, 06:39:20 UTC
e7d45f8 [SPARK-32594][SQL] Fix serialization of dates inserted to Hive tables ### What changes were proposed in this pull request? Fix `DaysWritable` by overriding parent's method `def get(doesTimeMatter: Boolean): Date` from `DateWritable` instead of `Date get()` because the former one uses the first one. The bug occurs because `HiveOutputWriter.write()` call `def get(doesTimeMatter: Boolean): Date` transitively with default implementation from the parent class `DateWritable` which doesn't respect date rebases and uses not initialized `daysSinceEpoch` (0 which `1970-01-01`). ### Why are the changes needed? The changes fix the bug: ```sql spark-sql> CREATE TABLE table1 (d date); spark-sql> INSERT INTO table1 VALUES (date '2020-08-11'); spark-sql> SELECT * FROM table1; 1970-01-01 ``` The expected result of the last SQL statement must be **2020-08-11** but got **1970-01-01**. ### Does this PR introduce _any_ user-facing change? Yes. After the fix, `INSERT` work correctly: ```sql spark-sql> SELECT * FROM table1; 2020-08-11 ``` ### How was this patch tested? Add new test to `HiveSerDeReadWriteSuite` Closes #29409 from MaxGekk/insert-date-into-hive-table. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 0477d234672d6b02f906428dcf2536f26fb4fd04) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 12 August 2020, 04:32:29 UTC
292bfc3 [SPARK-32586][SQL] Fix NumberFormatException error message when ansi is enabled ### What changes were proposed in this pull request? This pr fixes the error message of `NumberFormatException` when casting invalid input to FractionalType and enabling **ansi**: ``` spark-sql> set spark.sql.ansi.enabled=true; spark.sql.ansi.enabled true spark-sql> create table SPARK_32586 using parquet as select 's' s; spark-sql> select * from SPARK_32586 where s > 1.13D; java.lang.NumberFormatException: invalid input syntax for type numeric: columnartorow_value_0 ``` After this pr: ``` spark-sql> select * from SPARK_32586 where s > 1.13D; java.lang.NumberFormatException: invalid input syntax for type numeric: s ``` ### Why are the changes needed? Improve error message. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #29405 from wangyum/SPARK-32586. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 5d130f03607d2448e2f01814de7d330c512901b7) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 12 August 2020, 04:17:12 UTC
bfe9489 [SPARK-32543][R] Remove arrow::as_tibble usage in SparkR SparkR increased the minimal Arrow R version to 1.0.0 at SPARK-32452, and Arrow R 0.14 dropped `as_tibble`. We can remove the usage in SparkR. To remove code that is no longer needed. No. GitHub Actions will test them out. Closes #29361 from HyukjinKwon/SPARK-32543. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 11 August 2020, 02:00:26 UTC
93eb567 [SPARK-32528][SQL][TEST][3.0] The analyze method should make sure the plan is analyzed ### What changes were proposed in this pull request? backport https://github.com/apache/spark/pull/29349 to 3.0. This PR updates the `analyze` method to make sure the plan can be resolved. It also fixes some miswritten optimizer tests. ### Why are the changes needed? It's error-prone if the `analyze` method can return an unresolved plan. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? test only Closes #29400 from cloud-fan/backport. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 August 2020, 23:24:08 UTC
6749ad8 [SPARK-32409][DOC] Document dependency between spark.metrics.staticSources.enabled and JVMSource registration ### What changes were proposed in this pull request? Document the dependency between the config `spark.metrics.staticSources.enabled` and JVMSource registration. ### Why are the changes needed? This PR just documents the dependency between config `spark.metrics.staticSources.enabled` and JVM source registration. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested. Closes #29203 from LucaCanali/bugJVMMetricsRegistration. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 99f50c6286f4d86c02a15a0efd3046888ac45c75) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 August 2020, 16:32:16 UTC
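A hedged sketch of the documented interaction: registering the JVM source through the metrics configuration only takes effect while static sources are enabled (the default):

```scala
import org.apache.spark.SparkConf

// JvmSource registration below is skipped if staticSources is disabled.
val conf = new SparkConf()
  .set("spark.metrics.staticSources.enabled", "true")
  .set("spark.metrics.conf.*.source.jvm.class",
       "org.apache.spark.metrics.source.JvmSource")
```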
843ff03 [SPARK-32576][SQL][TEST][FOLLOWUP] Add tests for all the character array types in PostgresIntegrationSuite ### What changes were proposed in this pull request? This is a follow-up PR of #29192 that adds integration tests for character arrays in `PostgresIntegrationSuite`. ### Why are the changes needed? For better test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add tests. Closes #29397 from maropu/SPARK-32576-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 7990ea14090c13e1fd1e42bc519b54144bd3aa76) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 10 August 2020, 10:06:25 UTC
eaae91b [MINOR] add test_createDataFrame_empty_partition in pyspark arrow tests ### What changes were proposed in this pull request? add test_createDataFrame_empty_partition in pyspark arrow tests ### Why are the changes needed? test edge cases. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? N/A Closes #29398 from WeichenXu123/add_one_pyspark_arrow_test. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit fc62d720769e3267132f31ee847f2783923b3195) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 10 August 2020, 09:43:59 UTC
8ff615d [SPARK-32456][SS] Check the Distinct by assuming it as Aggregate for Structured Streaming ### What changes were proposed in this pull request? Check the Distinct nodes by assuming it as Aggregate in `UnsupportOperationChecker` for streaming. ### Why are the changes needed? We want to fix 2 things here: 1. Give better error message for Distinct related operations in append mode that doesn't have a watermark We use the union streams as the example, distinct in SQL has the same issue. Since the union clause in SQL has the requirement of deduplication, the parser will generate `Distinct(Union)` and the optimizer rule `ReplaceDistinctWithAggregate` will change it to `Aggregate(Union)`. This logic is of both batch and streaming queries. However, in the streaming, the aggregation will be wrapped by state store operations so we need extra checking logic in `UnsupportOperationChecker`. Before this change, the SS union queries in Append mode will get the following confusing error when the watermark is lacking. ``` java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561) at org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112) ... ``` 2. Make `Distinct` in complete mode runnable. Before this fix, the distinct in complete mode will throw the exception: ``` Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets; ``` ### Does this PR introduce _any_ user-facing change? Yes, return a better error message. ### How was this patch tested? New UT added. Closes #29256 from xuanyuanking/SPARK-32456. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b03761e3303e979999d4faa5cf4d1719a82e06cb) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 10 August 2020, 05:01:54 UTC
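A minimal sketch of point 2 above (assuming a spark-shell session and the console sink): a SQL DISTINCT over a stream is planned as an Aggregate and can now run in complete output mode.

```scala
import org.apache.spark.sql.functions.col

spark.readStream.format("rate").load()
  .select((col("value") % 10).as("bucket"))
  .createOrReplaceTempView("buckets")

// DISTINCT is rewritten to an Aggregate, which complete mode supports.
val query = spark.sql("SELECT DISTINCT bucket FROM buckets")
  .writeStream
  .format("console")
  .outputMode("complete")
  .start()
```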
e4c6ebf [SPARK-32576][SQL] Support PostgreSQL `bpchar` type and array of char type ### What changes were proposed in this pull request? This PR fixes the support for char(n)[], character(n)[] data types. Prior to this change, a user would get `Unsupported type ARRAY` exception when attempting to interact with the table with such types. The description is a bit more detailed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-32393) itself, but the crux of the issue is that postgres driver names char and character types as `bpchar`. The relevant driver code can be found [here](https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/TypeInfoCache.java#L85-L87). `char` is very likely to be still needed, as it seems that pg makes a distinction between `char(1)` and `char(n > 1)` as per [this code](https://github.com/pgjdbc/pgjdbc/blob/b7fd9f3cef734b4c219e2f6bc6c19acf68b2990b/pgjdbc/src/main/java/org/postgresql/core/Oid.java#L64). ### Why are the changes needed? For completeness of the pg dialect support. ### Does this PR introduce _any_ user-facing change? Yes, successful reads of tables with bpchar array instead of errors after this fix. ### How was this patch tested? Unit tests Closes #29192 from kujon/fix_postgres_bpchar_array_support. Authored-by: kujon <jakub.korzeniowski@vortexa.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 0ae94ad32fc19abb9845528b10f79915a03224f2) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 August 2020, 02:03:52 UTC
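A hedged sketch (not the actual `PostgresDialect` code) of how a JDBC dialect can map the driver-reported `bpchar` type name, and an assumed `_bpchar` array form, onto Catalyst types via the public `JdbcDialect.getCatalystType` hook:

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types._

// Illustrative only: the real fix lives in PostgresDialect; "_bpchar" as the array
// type name is an assumption about the driver's naming convention.
object BpcharAwareDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    (sqlType, typeName) match {
      case (Types.CHAR, "bpchar") => Some(StringType)              // char(n) / character(n)
      case (Types.ARRAY, "_bpchar") => Some(ArrayType(StringType)) // char(n)[] / character(n)[]
      case _ => None
    }
  }
}
```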
0f4989c [SPARK-32393][SQL][TEST] Add tests for all the character types in PostgresIntegrationSuite ### What changes were proposed in this pull request? This PR intends to add tests to check if all the character types in PostgreSQL are supported. The document for character types in PostgreSQL: https://www.postgresql.org/docs/current/datatype-character.html Closes #29192. ### Why are the changes needed? For better test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add tests. Closes #29394 from maropu/pr29192. Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: kujon <jakub.korzeniowski@vortexa.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit b2c45f7dcfe62e76f74726c97385440fead70646) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 10 August 2020, 01:36:48 UTC
9391705 [SPARK-32559][SQL][3.0] Fix the trim logic in UTF8String.toInt/toLong that didn't handle non-ASCII characters correctly ### What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/29375 The trim logic in the Cast expression introduced in https://github.com/apache/spark/pull/26622 trims non-ASCII characters unexpectedly. Before this patch ![image](https://user-images.githubusercontent.com/1312321/89513154-caad9b80-d806-11ea-9ebe-17c9e7d1b5b3.png) After this patch ![image](https://user-images.githubusercontent.com/1312321/89513196-d731f400-d806-11ea-959c-6a7dc29dcd49.png) ### Why are the changes needed? The behavior described above doesn't make sense, and also isn't consistent with the behavior when casting a string to double/float, nor with the behavior of Hive. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added more UT Closes #29393 from WangGuangxin/cast-bugfix-branch-3.0. Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 09 August 2020, 19:12:14 UTC
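An illustrative before/after sketch, assuming an active SparkSession `spark` and default (non-ANSI) cast semantics; `\u3000` is a full-width (non-ASCII) space used here only as an example of a character that must not be treated as trimmable:

```scala
// ASCII whitespace around the digits is still trimmed.
spark.sql("SELECT CAST(' 123 ' AS INT)").show()      // 123

// A non-ASCII padding character is no longer silently stripped, so the string
// is not a valid integer and the cast yields NULL instead of a wrong value.
spark.sql("SELECT CAST('\u3000123' AS INT)").show()  // NULL
```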
1f1fc8b [SPARK-32564][SQL][TEST][FOLLOWUP] Re-enable TPCDSQuerySuite with empty tables ### What changes were proposed in this pull request? This is the follow-up PR of #29384 to address the cloud-fan comment: https://github.com/apache/spark/pull/29384#issuecomment-670595111 This PR re-enables `TPCDSQuerySuite` with empty tables for better test coverages. ### Why are the changes needed? For better test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #29391 from maropu/SPARK-32564-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 1df855bef2b2dbe330cafb0d10e0b4af813a311a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 August 2020, 23:33:45 UTC
7caecae [MINOR][DOCS] Fix typos at ExecutorAllocationManager.scala ### What changes were proposed in this pull request? This PR fixes some typos in <code>core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala</code> file. ### Why are the changes needed? <code>spark.dynamicAllocation.sustainedSchedulerBacklogTimeout</code> (N) is used only after the <code>spark.dynamicAllocation.schedulerBacklogTimeout</code> (M) is exceeded. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No test needed. Closes #29351 from JoeyValentine/master. Authored-by: JoeyValentine <rlaalsdn0506@naver.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit dc3fac81848f557e3dac3f35686af325a18d0291) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 August 2020, 19:36:24 UTC
cfe62fc [SPARK-32564][SQL][TEST][3.0] Inject data statistics to simulate plan generation on actual TPCDS data ### What changes were proposed in this pull request? `TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then checks if plans can be generated correctly. But, the generated plans can be different from actual ones because the input tables are empty (e.g., the plans always use broadcast-hash joins, but actual ones use sort-merge joins for larger tables). To mitigate the issue, this PR defines data statistics constants extracted from generated TPCDS data in `TPCDSTableStats`, then injects the statistics via `spark.sessionState.catalog.alterTableStats` when defining TPCDS tables in `TPCDSQuerySuite`. Please see a link below about how to extract the table statistics: - https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3 For example, the generated plans of TPCDS `q2` are different with/without this fix: ``` ==== w/ this fix: q2 ==== == Physical Plan == * Sort (43) +- Exchange (42) +- * Project (41) +- * SortMergeJoin Inner (40) :- * Sort (28) : +- Exchange (27) : +- * Project (26) : +- * BroadcastHashJoin Inner BuildRight (25) : :- * HashAggregate (19) : : +- Exchange (18) : : +- * HashAggregate (17) : : +- * Project (16) : : +- * BroadcastHashJoin Inner BuildRight (15) : : :- Union (9) : : : :- * Project (4) : : : : +- * Filter (3) : : : : +- * ColumnarToRow (2) : : : : +- Scan parquet default.web_sales (1) : : : +- * Project (8) : : : +- * Filter (7) : : : +- * ColumnarToRow (6) : : : +- Scan parquet default.catalog_sales (5) : : +- BroadcastExchange (14) : : +- * Project (13) : : +- * Filter (12) : : +- * ColumnarToRow (11) : : +- Scan parquet default.date_dim (10) : +- BroadcastExchange (24) : +- * Project (23) : +- * Filter (22) : +- * ColumnarToRow (21) : +- Scan parquet default.date_dim (20) +- * Sort (39) +- Exchange (38) +- * Project (37) +- * BroadcastHashJoin Inner BuildRight (36) :- * HashAggregate (30) : +- ReusedExchange (29) +- BroadcastExchange (35) +- * Project (34) +- * Filter (33) +- * ColumnarToRow (32) +- Scan parquet default.date_dim (31) ==== w/o this fix: q2 ==== == Physical Plan == * Sort (40) +- Exchange (39) +- * Project (38) +- * BroadcastHashJoin Inner BuildRight (37) :- * Project (26) : +- * BroadcastHashJoin Inner BuildRight (25) : :- * HashAggregate (19) : : +- Exchange (18) : : +- * HashAggregate (17) : : +- * Project (16) : : +- * BroadcastHashJoin Inner BuildRight (15) : : :- Union (9) : : : :- * Project (4) : : : : +- * Filter (3) : : : : +- * ColumnarToRow (2) : : : : +- Scan parquet default.web_sales (1) : : : +- * Project (8) : : : +- * Filter (7) : : : +- * ColumnarToRow (6) : : : +- Scan parquet default.catalog_sales (5) : : +- BroadcastExchange (14) : : +- * Project (13) : : +- * Filter (12) : : +- * ColumnarToRow (11) : : +- Scan parquet default.date_dim (10) : +- BroadcastExchange (24) : +- * Project (23) : +- * Filter (22) : +- * ColumnarToRow (21) : +- Scan parquet default.date_dim (20) +- BroadcastExchange (36) +- * Project (35) +- * BroadcastHashJoin Inner BuildRight (34) :- * HashAggregate (28) : +- ReusedExchange (27) +- BroadcastExchange (33) +- * Project (32) +- * Filter (31) +- * ColumnarToRow (30) +- Scan parquet default.date_dim (29) ``` This comes from the cloud-fan comment: https://github.com/apache/spark/pull/29270#issuecomment-666098964 This is the backport of #29384. ### Why are the changes needed? For better test coverage. ### Does this PR introduce _any_ user-facing change? No. 
### How was this patch tested? Existing tests. Closes #29390 from maropu/SPARK-32564-BRANCH3.0. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 08 August 2020, 15:53:54 UTC
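A hedged sketch of the statistics-injection idea from this backport, assuming an active SparkSession `spark` and using the public `SessionCatalog.alterTableStats` API mentioned above; the table name and numbers below are placeholders, not the actual values recorded in `TPCDSTableStats`:

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogStatistics

// Attach externally measured size/row-count statistics to an (empty) TPCDS table so the
// planner chooses joins as it would on real data (e.g. sort-merge instead of broadcast).
val stats = CatalogStatistics(
  sizeInBytes = BigInt(2L * 1024 * 1024 * 1024), // placeholder
  rowCount = Some(BigInt(1441548L)))             // placeholder
spark.sessionState.catalog.alterTableStats(TableIdentifier("web_sales"), Some(stats))
```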
17ce605 [SPARK-32556][INFRA] Fix release script to have urlencoded passwords where required ### What changes were proposed in this pull request? 1. URL-encode the `ASF_PASSWORD` of the release manager. 2. Update the image to install the `qpdf` and `jq` dependencies. 3. Increase the JVM heap memory for the Maven build. ### Why are the changes needed? The release script takes hours to run, and if a single failure happens somewhere midway, one either has to finish the remaining steps manually or re-run the entire script. (This is my understanding.) So, I have fixed a few failures discovered so far. 1. If the release manager's password contains a character that is not allowed in a URL, the build fails at the clone-Spark step. `git clone "https://$ASF_USERNAME:$ASF_PASSWORD@$ASF_SPARK_REPO" -b $GIT_BRANCH` ^^^ Fails with a bad URL. `ASF_USERNAME` may not need to be URL-encoded, but we need to encode `ASF_PASSWORD`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the release for branch-2.4, using both types of passwords, i.e. passwords with and without special characters. Closes #29373 from ScrapCodes/release-script-fix2. Lead-authored-by: Prashant Sharma <prashant@apache.org> Co-authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: Prashant Sharma <prashant@apache.org> (cherry picked from commit 6c3d0a44056cc5ca1d304b5a8a03d2f02974e58b) Signed-off-by: Prashant Sharma <prashant@apache.org> 07 August 2020, 08:33:07 UTC
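A small illustration of fix (1), sketched in Scala rather than the actual bash of the release script; the password and host below are placeholders:

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

val asfPassword = "p@ss/word!" // placeholder secret
// Percent-encodes '@', '/', '!' etc. (note: URLEncoder applies form encoding, so a
// literal space would become '+'; that caveat does not matter for this sketch).
val encoded = URLEncoder.encode(asfPassword, StandardCharsets.UTF_8.name()) // "p%40ss%2Fword%21"
val cloneUrl = s"https://someuser:$encoded@example.org/apache/spark.git"    // hypothetical URL
```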
4d8642f [SPARK-32560][SQL] Improve exception message at InsertIntoHiveTable.processInsert ### What changes were proposed in this pull request? Improve the exception message. ### Why are the changes needed? The previous message lacked single quotes; we improve it to keep the messages consistent. ![image](https://user-images.githubusercontent.com/46367746/89595808-15bbc300-d888-11ea-9914-b05ea7b66461.png) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No tests; it only improves the message. Closes #29376 from GuoPhilipse/improve-exception-message. Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by: GuoPhilipse <guofei_ok@126.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit aa4d3c19fead4ec2f89b4957b4ccc7482e121e4d) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 07 August 2020, 05:29:47 UTC
c7af0be [SPARK-32538][CORE][TEST] Use local time zone for the timestamp logged in unit-tests.log ### What changes were proposed in this pull request? This PR lets the logger log timestamps based on the local time zone during tests. `SparkFunSuite` fixes the default time zone to America/Los_Angeles, so the timestamp logged in unit-tests.log is also based on that fixed time zone. ### Why are the changes needed? It's confusing for developers whose time zone is not America/Los_Angeles. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran existing tests and confirmed unit-tests.log. If your local time zone is America/Los_Angeles, you can test by setting the environment variable `TZ` as follows. ``` $ TZ=Asia/Tokyo build/sbt "testOnly org.apache.spark.executor.ExecutorSuite" $ tail core/target/unit-tests.log ``` Closes #29356 from sarutak/fix-unit-test-log-timezone. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 4e267f3eb9ca0df18647c859b75b61b1af800120) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 07 August 2020, 02:29:42 UTC
30c3a50 [SPARK-32506][TESTS] Flaky test: StreamingLinearRegressionWithTests ### What changes were proposed in this pull request? The test creates 10 batches of data to train the model and expects to see error on test data improves as model is trained. If the difference between the 2nd error and the 10th error is smaller than 2, the assertion fails: ``` FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests) Test that error on test data improves as model is trained. ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 466, in test_train_prediction eventually(condition, timeout=180.0) File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 81, in eventually lastValue = condition() File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 461, in condition self.assertGreater(errors[1] - errors[-1], 2) AssertionError: 1.672640157855923 not greater than 2 ``` I saw this quite a few time on Jenkins but was not able to reproduce this on my local. These are the ten errors I got: ``` 4.517395047937127 4.894265404350079 3.0392090466559876 1.8786361640757654 0.8973106042078115 0.3715780507684368 0.20815690742907672 0.17333033743125845 0.15686783249863873 0.12584413600569616 ``` I am thinking of having 15 batches of data instead of 10, so the model can be trained for a longer time. Hopefully the 15th error - 2nd error will always be larger than 2 on Jenkins. These are the 15 errors I got on my local: ``` 4.517395047937127 4.894265404350079 3.0392090466559876 1.8786361640757658 0.8973106042078115 0.3715780507684368 0.20815690742907672 0.17333033743125845 0.15686783249863873 0.12584413600569616 0.11883853835108477 0.09400261862100823 0.08887491447353497 0.05984929624986607 0.07583948141520978 ``` ### Why are the changes needed? Fix flaky test ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested Closes #29380 from huaxingao/flaky_test. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com> (cherry picked from commit 75c2c53e931187912a92e0b52dae0f772fa970e3) Signed-off-by: Huaxin Gao <huaxing@us.ibm.com> 06 August 2020, 20:54:39 UTC
d3eea05 [SPARK-32546][SQL][3.0] Get table names directly from Hive tables ### What changes were proposed in this pull request? Get table names directly from a sequence of Hive tables in `HiveClientImpl.listTablesByType()` by skipping conversions Hive tables to Catalog tables. ### Why are the changes needed? A Hive metastore can be shared across many clients. A client can create tables using a SerDe which is not available on other clients, for instance `ROW FORMAT SERDE "com.ibm.spss.hive.serde2.xml.XmlSerDe"`. In the current implementation, other clients get the following exception while getting views: ``` java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class com.ibm.spss.hive.serde2.xml.XmlSerDe not found) ``` when `com.ibm.spss.hive.serde2.xml.XmlSerDe` is not available. ### Does this PR introduce _any_ user-facing change? Yes. For example, `SHOW VIEWS` returns a list of views instead of throwing an exception. ### How was this patch tested? - By existing test suites like: ``` $ build/sbt -Phive-2.3 "test:testOnly org.apache.spark.sql.hive.client.VersionsSuite" ``` - And manually: 1. Build Spark with Hive 1.2: `./build/sbt package -Phive-1.2 -Phive -Dhadoop.version=2.8.5` 2. Run spark-shell with a custom Hive SerDe, for instance download [json-serde-1.3.8-jar-with-dependencies.jar](https://github.com/cdamak/Twitter-Hive/blob/master/json-serde-1.3.8-jar-with-dependencies.jar) from https://github.com/cdamak/Twitter-Hive: ``` $ ./bin/spark-shell --jars ../Downloads/json-serde-1.3.8-jar-with-dependencies.jar ``` 3. Create a Hive table using this SerDe: ```scala scala> :paste // Entering paste mode (ctrl-D to finish) sql(s""" |CREATE TABLE json_table2(page_id INT NOT NULL) |ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' |""".stripMargin) // Exiting paste mode, now interpreting. res0: org.apache.spark.sql.DataFrame = [] scala> sql("SHOW TABLES").show +--------+-----------+-----------+ |database| tableName|isTemporary| +--------+-----------+-----------+ | default|json_table2| false| +--------+-----------+-----------+ scala> sql("SHOW VIEWS").show +---------+--------+-----------+ |namespace|viewName|isTemporary| +---------+--------+-----------+ +---------+--------+-----------+ ``` 4. Quit from the current `spark-shell` and run it without jars: ``` $ ./bin/spark-shell ``` 5. Show views. Without the fix, it throws the exception: ```scala scala> sql("SHOW VIEWS").show 20/08/06 10:53:36 ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found java.lang.ClassNotFoundException: Class org.openx.data.jsonserde.JsonSerDe not found at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258) at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605) ``` After the fix: ```scala scala> sql("SHOW VIEWS").show +---------+--------+-----------+ |namespace|viewName|isTemporary| +---------+--------+-----------+ +---------+--------+-----------+ ``` Authored-by: Max Gekk <max.gekkgmail.com> Signed-off-by: Wenchen Fan <wenchendatabricks.com> (cherry picked from commit dc96f2f8d6e08c4bc30bc11d6b29109d2aeb604b) Signed-off-by: Max Gekk <max.gekkgmail.com> Closes #29377 from MaxGekk/fix-listTablesByType-for-views-3.0. 
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 August 2020, 13:30:42 UTC
ab5034f [SPARK-32529][CORE] Fix Historyserver log scan aborted by application status change ### What changes were proposed in this pull request? This PR adds a `FileNotFoundException` try/catch block while adding a new entry to the history server application listing, to skip the non-existing path. ### Why are the changes needed? If there is a large number (>100k) of applications in the log dir, listing the log dir will take a few seconds. After getting the path list, some applications might have finished already, and the filename will change from `foo.inprogress` to `foo`. This leads to a problem when adding an entry to the listing: querying the file status, e.g. `fileSizeForLastIndex`, will throw a `FileNotFoundException` if the application has finished. The exception aborts the current loop; in a busy cluster, this prevents the history server from listing and loading any application logs. ``` 20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event log updates java.io.FileNotFoundException: File does not exist: hdfs://xx/logs/spark/application_11111111111111.lz4.inprogress at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520) at org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 1. Set up another script that keeps changing the filenames of applications under the history log dir. 2. Launch the history server. 3. Check that the `File does not exist` error log is gone. Closes #29350 from yanxiaole/SPARK-32529. Authored-by: Yan Xiaole <xiaole.yan@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c1d17df826541580162c9db8ebfbc408ec0c9922) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 05 August 2020, 17:57:37 UTC
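A hedged, standalone sketch of the guard described above (not the actual `FsHistoryProvider` code); `loadEntry` stands for whatever per-application work reads the file status:

```scala
import java.io.FileNotFoundException

def safelyAddListing(path: String)(loadEntry: String => Unit): Unit = {
  try {
    loadEntry(path)
  } catch {
    case _: FileNotFoundException =>
      // The log was renamed (e.g. "foo.inprogress" -> "foo") or removed between listing
      // and reading; skip it so the scan loop continues. It will be picked up, under its
      // new name, on the next scan.
  }
}
```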
fd445cb [SPARK-32003][CORE][3.0] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost ### What changes were proposed in this pull request? If an executor is lost, the `DAGScheduler` handles the executor loss by removing the executor but does not unregister its outputs if the external shuffle service is used. However, if the node on which the executor runs is lost, the shuffle service may not be able to serve the shuffle files. In such a case, when fetches from the executor's outputs fail in the same stage, the `DAGScheduler` again removes the executor and by right, should unregister its outputs. It doesn't because the epoch used to track the executor failure has not increased. We track the epoch for failed executors that result in lost file output separately, so we can unregister the outputs in this scenario. The idea to track a second epoch is due to Attila Zsolt Piros. ### Why are the changes needed? Without the changes, the loss of a node could require two stage attempts to recover instead of one. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. This test fails without the change and passes with it. Closes #29193 from wypoon/SPARK-32003-3.0. Authored-by: Wing Yew Poon <wypoon@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> 04 August 2020, 16:33:56 UTC
6d7ae4a [SPARK-32160][CORE][PYSPARK][3.0][FOLLOWUP] Change the config name to switch allow/disallow SparkContext in executors ### What changes were proposed in this pull request? This is a follow-up of #29294. This PR changes the config name to switch allow/disallow `SparkContext` in executors as per the comment https://github.com/apache/spark/pull/29278#pullrequestreview-460256338. ### Why are the changes needed? The config name `spark.executor.allowSparkContext` is more reasonable. ### Does this PR introduce _any_ user-facing change? Yes, the config name is changed. ### How was this patch tested? Updated tests. Closes #29341 from ueshin/issues/SPARK-32160/3.0/change_config_name. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 04 August 2020, 03:43:01 UTC
c148a98 [SPARK-32160][CORE][PYSPARK][3.0] Add a config to switch allow/disallow to create SparkContext in executors ### What changes were proposed in this pull request? This is a backport of #29278, but with allowing to create `SparkContext` in executors by default. This PR adds a config to switch allow/disallow to create `SparkContext` in executors. - `spark.driver.allowSparkContextInExecutors` ### Why are the changes needed? Some users or libraries actually create `SparkContext` in executors. We shouldn't break their workloads. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to disallow to create `SparkContext` in executors with the config disabled. ### How was this patch tested? More tests are added. Closes #29294 from ueshin/issues/SPARK-32160/3.0/add_configs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 August 2020, 13:57:24 UTC
64b3b56 [SPARK-32083][SQL][3.0] AQE should not coalesce partitions for SinglePartition This is a partial backport of https://github.com/apache/spark/pull/29307 Most of the changes are not needed because https://github.com/apache/spark/pull/28226 is in master only. This PR only backports the safeguard in `ShuffleExchangeExec.canChangeNumPartitions` Closes #29321 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 August 2020, 12:56:23 UTC
d8d3e87 [SPARK-32509][SQL] Ignore unused DPP True Filter in Canonicalization ### What changes were proposed in this pull request? This PR fixes issues related to the canonicalization of FileSourceScanExec when it contains an unused DPP filter. ### Why are the changes needed? As part of the PlanDynamicPruningFilter rule, unused DPP filters are simply replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be avoided. But these unnecessary `DynamicPruningExpression(TrueLiteral)` partition filters inside the FileSourceScanExec affect the canonicalization of the node, and in many cases this can prevent ReuseExchange from happening. This PR fixes this issue by ignoring the unused DPP filter in the `def doCanonicalize` method. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT. Closes #29318 from prakharjain09/SPARK-32509_df_reuse. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 7a09e71198a094250f04e0f82f0c7c9860169540) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 August 2020, 03:26:21 UTC
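A hedged, standalone helper illustrating the canonicalization idea (this is not the actual `FileSourceScanExec.doCanonicalize` code): dropping the `DynamicPruningExpression(TrueLiteral)` placeholders before canonicalizing lets two scans that differ only in an unused DPP filter canonicalize identically, so ReuseExchange can kick in:

```scala
import org.apache.spark.sql.catalyst.expressions.{DynamicPruningExpression, Expression, Literal}

def withoutUnusedDppFilters(partitionFilters: Seq[Expression]): Seq[Expression] =
  partitionFilters.filterNot {
    case DynamicPruningExpression(Literal.TrueLiteral) => true // unused DPP placeholder
    case _ => false
  }
```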
ea4b288 [SPARK-32467][UI] Avoid encoding URL twice on https redirect ### What changes were proposed in this pull request? When https is enabled for Spark UI, an HTTP request will be redirected as an encoded HTTPS URL: https://github.com/apache/spark/pull/10238/files#diff-f79a5ead735b3d0b34b6b94486918e1cR312 When we create the redirect URL, we call getRequestURI and getQueryString. Both methods may return an encoded string. However, we pass them directly to the following URI constructor ``` URI(String scheme, String authority, String path, String query, String fragment) ``` As this URI constructor assumes both the path and query parameters are decoded strings, it will encode them again. This makes the redirect URL encoded twice. This problem shows up on the stage page with HTTPS enabled. The URL of "/taskTable" contains the query parameter `order%5B0%5D%5Bcolumn%5D`. After being encoded again it becomes `order%255B0%255D%255Bcolumn%255D`, and it will be decoded as `order%5B0%5D%5Bcolumn%5D` instead of `order[0][dir]`. When the parameter `order[0][dir]` is missing, there will be an exception from: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176 and the stage page fails to load. To fix the problem, we decode the query parameters before encoding them. This makes sure the URL is encoded exactly once. ### Why are the changes needed? Fix a UI issue when HTTPS is enabled ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? A new unit test + manual test on a cluster Closes #29271 from gengliangwang/urlEncode. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> (cherry picked from commit 71aea02e9ffb0c6f7c72c91054c2a4653e22e801) Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 01 August 2020, 05:09:57 UTC
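A self-contained illustration of the double-encoding pitfall: the multi-argument `java.net.URI` constructor expects decoded components and always quotes `%`, so passing an already-encoded query string encodes it a second time. Decoding first (as the fix does for the redirect URL) keeps the result encoded exactly once. The host and path below are placeholders:

```scala
import java.net.{URI, URLDecoder}
import java.nio.charset.StandardCharsets

val encodedQuery = "order%5B0%5D%5Bcolumn%5D=0" // i.e. order[0][column]=0

// '%' gets quoted again: the query now contains "order%255B0%255D%255Bcolumn%255D=0".
val twiceEncoded = new URI("https", "host:4040", "/stages/stage/taskTable/", encodedQuery, null)

// Decode first, then let the constructor encode exactly once.
val decoded = URLDecoder.decode(encodedQuery, StandardCharsets.UTF_8.name())
val encodedOnce = new URI("https", "host:4040", "/stages/stage/taskTable/", decoded, null)
```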
7c91b15 [SPARK-32332][SQL][3.0] Support columnar exchanges ### What changes were proposed in this pull request? Backports SPARK-32332 to 3.0 branch. ### Why are the changes needed? Plugins cannot replace exchanges with columnar versions when AQE is enabled without this patch. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests included. Closes #29310 from andygrove/backport-SPARK-32332. Authored-by: Andy Grove <andygrove@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org> 31 July 2020, 16:14:33 UTC
2a38090 [SPARK-32175][SPARK-32175][FOLLOWUP] Remove flaky test added in ### What changes were proposed in this pull request? This PR removes a test added in SPARK-32175(#29002). ### Why are the changes needed? That test is flaky. It can be mitigated by increasing the timeout but it would rather be simpler to remove the test. See also the [discussion](https://github.com/apache/spark/pull/29002#issuecomment-666746857). ### Does this PR introduce _any_ user-facing change? No. Closes #29314 from sarutak/remove-flaky-test. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> (cherry picked from commit 9d7b1d935f7a2b770d8b2f264cfe4a4db2ad64b6) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com> 31 July 2020, 01:39:28 UTC
b40df01 [SPARK-32227] Fix regression bug in load-spark-env.cmd with Spark 3.0.0 ### What changes were proposed in this pull request? Fix a regression bug in load-spark-env.cmd with Spark 3.0.0. ### Why are the changes needed? cmd doesn't support setting an env variable twice, so `set SPARK_ENV_CMD=%SPARK_CONF_DIR%\%SPARK_ENV_CMD%` doesn't take effect, which caused the regression. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested: 1. Create a spark-env.cmd under the conf folder; inside it, `echo spark-env.cmd`. 2. Run the old load-spark-env.cmd: nothing is printed in the output. 3. Run the fixed load-spark-env.cmd: `spark-env.cmd` shows in the output. Closes #29044 from warrenzhu25/32227. Lead-authored-by: Warren Zhu <zhonzh@microsoft.com> Co-authored-by: Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 743772095273b464f845efefb3eb59284b06b9be) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 30 July 2020, 12:46:28 UTC
235552a [SPARK-32478][R][SQL] Error message to show the schema mismatch in gapply with Arrow vectorization ### What changes were proposed in this pull request? This PR proposes to: 1. Fix the error message when the output schema is mismatched with the R DataFrame returned from the given function. For example, ```R df <- createDataFrame(list(list(a=1L, b="2"))) count(gapply(df, "a", function(key, group) { group }, structType("a int, b int"))) ``` **Before:** ``` Error in handleErrors(returnStatus, conn) : ... java.lang.UnsupportedOperationException ... ``` **After:** ``` Error in handleErrors(returnStatus, conn) : ... java.lang.AssertionError: assertion failed: Invalid schema from gapply: expected IntegerType, IntegerType, got IntegerType, StringType ... ``` 2. Update documentation about the schema matching for `gapply` and `dapply`. ### Why are the changes needed? To show which schema is not matched, and let users know what's going on. ### Does this PR introduce _any_ user-facing change? Yes, the error message is updated as above, and documentation is updated. ### How was this patch tested? Manually tested and unit tests were added. Closes #29283 from HyukjinKwon/r-vectorized-error. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 30 July 2020, 06:16:59 UTC
d00e104 [SPARK-32397][BUILD] Allow specifying of time for build to keep time consistent between modules ### What changes were proposed in this pull request? Upgrade codehaus maven build helper to allow people to specify a time during the build to avoid snapshot artifacts with different version strings. ### Why are the changes needed? During builds of snapshots the maven may assign different versions to different artifacts based on the time each individual sub-module starts building. The timestamp is used as part of the version string when run `maven deploy` on a snapshot build. This results in different sub-modules having different version strings. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual build while specifying the current time, ensured the time is consistent in the sub components. Open question: Ideally I'd like to backport this as well since it's sort of a bug fix and while it does change a dependency version it's not one that is propagated. I'd like to hear folks thoughts about this. Closes #29274 from holdenk/SPARK-32397-snapshot-artifact-timestamp-differences. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com> (cherry picked from commit 50911df08eb7a27494dc83bcec3d09701c2babfe) Signed-off-by: DB Tsai <d_tsai@apple.com> 29 July 2020, 21:39:26 UTC
e5b5b7e [SPARK-32175][CORE] Fix the order between initialization for ExecutorPlugin and starting heartbeat thread ### What changes were proposed in this pull request? This PR changes the order between initialization for ExecutorPlugin and starting the heartbeat thread in Executor. ### Why are the changes needed? In the current master, the heartbeat thread in an executor starts after plugin initialization, so if the initialization takes a long time, no heartbeat is sent to the driver and the executor will be removed from the cluster. ### Does this PR introduce _any_ user-facing change? Yes. Plugins for executors will be allowed to take a long time for initialization. ### How was this patch tested? New test case. Closes #29002 from sarutak/fix-heartbeat-issue. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Thomas Graves <tgraves@apache.org> (cherry picked from commit 9be088357eff4328248b29a3a49a816756745345) Signed-off-by: Thomas Graves <tgraves@apache.org> 29 July 2020, 13:46:01 UTC
9f18d54 [SPARK-32283][CORE] Kryo should support multiple user registrators ### What changes were proposed in this pull request? `spark.kryo.registrator` in 3.0 has a regression problem. From [SPARK-12080](https://issues.apache.org/jira/browse/SPARK-12080), it supports multiple user registrators by ```scala private val userRegistrators = conf.get("spark.kryo.registrator", "") .split(',').map(_.trim) .filter(!_.isEmpty) ``` But it doesn't work in 3.0. Fix it by using `toSequence` in `Kryo.scala`. ### Why are the changes needed? In previous Spark versions (2.x), multiple user registrators were supported by ```scala private val userRegistrators = conf.get("spark.kryo.registrator", "") .split(',').map(_.trim) .filter(!_.isEmpty) ``` But it doesn't work in 3.0, so this is a regression. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests. Closes #29123 from LantaoJin/SPARK-32283. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 26e6574d58429add645db820a83b70ef9dcd49fe) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 29 July 2020, 03:58:12 UTC
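A usage sketch of the restored behavior; the registrator class names are hypothetical:

```scala
import org.apache.spark.SparkConf

// Multiple registrators, comma-separated, as supported since SPARK-12080.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.RegistratorA,com.example.RegistratorB")
```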
8cfb718 [SPARK-32339][ML][DOC] Improve MLlib BLAS native acceleration docs ### What changes were proposed in this pull request? Rewrite a clearer and complete BLAS native acceleration enabling guide. ### Why are the changes needed? The document of enabling BLAS native acceleration in ML guide (https://spark.apache.org/docs/latest/ml-guide.html#dependencies) is incomplete and unclear to the user. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #29139 from xwu99/blas-doc. Lead-authored-by: Xiaochang Wu <xiaochang.wu@intel.com> Co-authored-by: Wu, Xiaochang <xiaochang.wu@intel.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com> (cherry picked from commit 44c868b73a7cb293ec81927c28991677bf33ea90) Signed-off-by: Huaxin Gao <huaxing@us.ibm.com> 28 July 2020, 15:36:41 UTC
f349a78 [SPARK-32424][SQL][3.0] Fix silent data change for timestamp parsing if overflow happens This PR backports https://github.com/apache/spark/commit/d315ebf3a739a05a68d0f0ab319920765bf65b0f to branch-3.0 ### What changes were proposed in this pull request? When using the `Seconds.toMicros` API to convert epoch seconds to microseconds, ```scala /** * Equivalent to * {@link #convert(long, TimeUnit) MICROSECONDS.convert(duration, this)}. * @param duration the duration * @return the converted duration, * or {@code Long.MIN_VALUE} if conversion would negatively * overflow, or {@code Long.MAX_VALUE} if it would positively overflow. */ ``` This PR changes it to `Math.multiplyExact(epochSeconds, MICROS_PER_SECOND)` ### Why are the changes needed? Fix the silent data change between 3.x and 2.x: ``` ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200722  bin/spark-sql -S -e "select to_timestamp('300000', 'y');" +294247-01-10 12:00:54.775807 ``` ``` kentyaohulk  ~/Downloads/spark/spark-2.4.5-bin-hadoop2.7  bin/spark-sql -S -e "select to_timestamp('300000', 'y');" 284550-10-19 15:58:1010.448384 ``` ### Does this PR introduce _any_ user-facing change? Yes, we will raise `ArithmeticException` instead of giving the wrong answer if overflow happens. ### How was this patch tested? Add unit test Closes #29267 from yaooqinn/SPARK-32424-30. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 July 2020, 10:03:39 UTC
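A small illustration of why `Math.multiplyExact` is preferred here: `TimeUnit.SECONDS.toMicros` saturates at `Long.MaxValue`/`Long.MinValue` on overflow (a silently wrong value), whereas `multiplyExact` raises `ArithmeticException`. The input value is arbitrary:

```scala
import java.util.concurrent.TimeUnit

val hugeEpochSeconds = Long.MaxValue / 10
val saturated = TimeUnit.SECONDS.toMicros(hugeEpochSeconds) // == Long.MaxValue, silently wrong
// Math.multiplyExact(hugeEpochSeconds, 1000000L)           // would throw ArithmeticException
```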
b35b3eb [MINOR][PYTHON] Fix spacing in error message ### What changes were proposed in this pull request? Fixes spacing in an error message ### Why are the changes needed? Makes error messages easier to read ### Does this PR introduce _any_ user-facing change? Yes, it changes the error message ### How was this patch tested? This patch doesn't affect any logic, so existing tests should cover it Closes #29264 from hauntsaninja/patch-1. Authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 77f2ca6cced1c723d1c2e6082a1534f6436c6d2a) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 28 July 2020, 02:22:34 UTC
6ed93c3 [SPARK-31753][SQL][DOCS] Add missing keywords in the SQL docs ### What changes were proposed in this pull request? Update the sql-ref docs; the following keywords are added in this PR: CASE/ELSE WHEN/THEN MAP KEYS TERMINATED BY NULL DEFINED AS LINES TERMINATED BY ESCAPED BY COLLECTION ITEMS TERMINATED BY PIVOT LATERAL VIEW OUTER? ROW FORMAT SERDE ROW FORMAT DELIMITED FIELDS TERMINATED BY IGNORE NULLS FIRST LAST ### Why are the changes needed? Let more users know how to use these SQL keywords. ### Does this PR introduce _any_ user-facing change? ![image](https://user-images.githubusercontent.com/46367746/88148830-c6dc1f80-cc31-11ea-81ea-13bc9dc34550.png) ![image](https://user-images.githubusercontent.com/46367746/88148968-fb4fdb80-cc31-11ea-8649-e8297cf5813e.png) ![image](https://user-images.githubusercontent.com/46367746/88149000-073b9d80-cc32-11ea-9aa4-f914ecd72663.png) ![image](https://user-images.githubusercontent.com/46367746/88149021-0f93d880-cc32-11ea-86ed-7db8672b5aac.png) ### How was this patch tested? No Closes #29056 from GuoPhilipse/add-missing-keywords. Lead-authored-by: GuoPhilipse <guofei_ok@126.com> Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 8de43338be879f0cfeebca328dbbcfd1e5bd70da) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 28 July 2020, 00:42:11 UTC
4b8761e [SPARK-32448][K8S][TESTS] Use single version for exec-maven-plugin/scalatest-maven-plugin ### What changes were proposed in this pull request? Two different versions are used for the same artifacts, `exec-maven-plugin` and `scalatest-maven-plugin`. This PR aims to use the same versions for `exec-maven-plugin` and `scalatest-maven-plugin`. In addition, this PR removes `scala-maven-plugin.version` from `K8s` integration suite because it's unused. ### Why are the changes needed? This will prevent the mistake which upgrades only one place and forgets the others. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins K8S IT. Closes #29248 from dongjoon-hyun/SPARK-32448. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 13c64c298016eb3882ed20a6f6c60f1ea3988b3b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 27 July 2020, 05:31:50 UTC
d71be73 [SPARK-32428][EXAMPLES] Make BinaryClassificationMetricsExample consistently print the metrics on driver's stdout ### What changes were proposed in this pull request? Call collect on the RDD before calling foreach so that the result is sent to the driver node and printed on the driver's stdout. ### Why are the changes needed? Some RDDs in this example (e.g., precision, recall) call println without calling collect. If the job runs in local mode, it sends the data to the driver node and prints the metrics on the driver's stdout. However, if the job runs in cluster mode, the job prints the metrics on the executor's stdout. This is inconsistent with the other metrics that have nothing to do with RDDs (e.g., auPRC, auROC), since those metrics always output their result on the driver's stdout. All of the metrics should output their results on the driver's stdout. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is example code. It doesn't have any tests. Closes #29222 from titsuki/SPARK-32428. Authored-by: Itsuki Toyota <titsuki@cpan.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 86ead044e3789b3291a38ec2142cbb343d1290c1) Signed-off-by: Sean Owen <srowen@gmail.com> 26 July 2020, 14:13:03 UTC
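A hedged sketch of the pattern applied by this change, shown for one metric; `scoreAndLabels` is assumed to be the (score, label) RDD already built in the example:

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

def printPrecisionByThreshold(scoreAndLabels: RDD[(Double, Double)]): Unit = {
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  // collect() first, so the output lands on the driver's stdout even in cluster mode.
  metrics.precisionByThreshold().collect().foreach { case (threshold, p) =>
    println(s"Threshold: $threshold, Precision: $p")
  }
}
```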
ab14d05 [SPARK-32287][TESTS] Flaky Test: ExecutorAllocationManagerSuite.add executors default profile ### What changes were proposed in this pull request? I wasn't able to reproduce the failure, but as best I can tell the allocation manager timer triggers and calls doRequest. The timeout is 10s, so increase it to 30 seconds. ### Why are the changes needed? Flaky test failure. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #29225 from tgravescs/SPARK-32287. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit e6ef27be52dcd14dc94384c2ada85861be44d843) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 July 2020, 18:14:03 UTC
7004c98 [SPARK-32430][SQL] Extend SparkSessionExtensions to inject rules into AQE query stage preparation ### What changes were proposed in this pull request? Provide a generic mechanism for plugins to inject rules into the AQE "query prep" stage that happens before query stage creation. This goes along with https://issues.apache.org/jira/browse/SPARK-32332 where the current AQE implementation doesn't allow for users to properly extend it for columnar processing. ### Why are the changes needed? The issue here is that we create new query stages but we do not have access to the parent plan of the new query stage so certain things can not be determined because you have to know what the parent did. With this change it would allow you to add TAGs to be able to figure out what is going on. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? A new unit test is included in the PR. Closes #29224 from andygrove/insert-aqe-rule. Authored-by: Andy Grove <andygrove@nvidia.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 64a01c0a559396fccd615dc00576a80bc8cc5648) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 July 2020, 18:04:14 UTC
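A hedged usage sketch, assuming the `injectQueryStagePrepRule` hook added by this change; the rule and extension class names are placeholders and the rule body is a no-op:

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.SparkPlan

case class MyQueryStagePrepRule() extends Rule[SparkPlan] {
  // Runs before query stages are created; a real plugin could tag or rewrite the plan here.
  override def apply(plan: SparkPlan): SparkPlan = plan
}

class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectQueryStagePrepRule((session: SparkSession) => MyQueryStagePrepRule())
  }
}

// Enabled via --conf spark.sql.extensions=MyExtensions (hypothetical class name).
```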
8a52bda [SPARK-32310][ML][PYSPARK][3.0] ML params default value parity ### What changes were proposed in this pull request? Backports the changes to 3.0: set default values of params in trait Params for feature and tuning, in both Scala and Python. ### Why are the changes needed? Make ML have the same default param values between an estimator and its corresponding transformer, and also between Scala and Python. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing and modified tests Closes #29159 from huaxingao/set_default_3.0. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com> 24 July 2020, 16:55:20 UTC
f50432f [SPARK-32363][PYTHON][BUILD][3.0] Fix flakiness in pip package testing in Jenkins ### What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/29117 to branch-3.0 as the flakiness was found in branch-3.0 too: https://github.com/apache/spark/pull/29201#issuecomment-663114741 and https://github.com/apache/spark/pull/29201#issuecomment-663114741 This PR proposes: - ~~Don't use `--user` in pip packaging test~~ - ~~Pull `source` out of the subshell, and place it first.~~ - Exclude user sitepackages in Python path during pip installation test to address the flakiness of the pip packaging test in Jenkins. ~~(I think) #29116 caused this flakiness given my observation in the Jenkins log. I had to work around by specifying `--user` but it turned out that it does not properly work in old Conda on Jenkins for some reasons. Therefore, reverting this change back.~~ (I think) the installation at user site-packages affects other environments created by Conda in the old Conda version that Jenkins has. Seems it fails to isolate the environments for some reasons. So, it excludes user sitepackages in the Python path during the test. ~~In addition, #29116 also added some fallback logics of `conda (de)activate` and `source (de)activate` because Conda prefers to use `conda (de)activate` now per the official documentation and `source (de)activate` doesn't work for some reasons in certain environments (see also https://github.com/conda/conda/issues/7980). The problem was that `source` loads things to the current shell so does not affect the current shell. Therefore, this PR pulls `source` out of the subshell.~~ Disclaimer: I made the analysis purely based on Jenkins machine's log in this PR. It may have a different reason I missed during my observation. ### Why are the changes needed? To make the build and tests pass in Jenkins. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Jenkins tests should test it out. Closes #29215 from HyukjinKwon/SPARK-32363-3.0. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 24 July 2020, 14:18:15 UTC
be1b282 [SPARK-32237][SQL][3.0] Resolve hint in CTE ### What changes were proposed in this pull request? The backport of #29062 This PR is to move `Substitution` rule before `Hints` rule in `Analyzer` to avoid hint in CTE not working. ### Why are the changes needed? Below SQL in Spark3.0 will throw AnalysisException, but it works in Spark2.x ```sql WITH cte AS (SELECT /*+ REPARTITION(3) */ T.id, T.data FROM $t1 T) SELECT cte.id, cte.data FROM cte ``` ``` Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`cte.id`' given input columns: [cte.data, cte.id]; line 3 pos 7; 'Project ['cte.id, 'cte.data] +- SubqueryAlias cte +- Project [id#21L, data#22] +- SubqueryAlias T +- SubqueryAlias testcat.ns1.ns2.tbl +- RelationV2[id#21L, data#22] testcat.ns1.ns2.tbl 'Project ['cte.id, 'cte.data] +- SubqueryAlias cte +- Project [id#21L, data#22] +- SubqueryAlias T +- SubqueryAlias testcat.ns1.ns2.tbl +- RelationV2[id#21L, data#22] testcat.ns1.ns2.tbl ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add a unit test Closes #29201 from LantaoJin/SPARK-32237_branch-3.0. Lead-authored-by: LantaoJin <jinlantao@gmail.com> Co-authored-by: Alan Jin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 July 2020, 03:48:16 UTC
ebac47b [SPARK-32280][SPARK-32372][SQL] ResolveReferences.dedupRight should only rewrite attributes for ancestor nodes of the conflict plan This PR refactors `ResolveReferences.dedupRight` to make sure it only rewrite attributes for ancestor nodes of the conflict plan. This is a bug fix. ```scala sql("SELECT name, avg(age) as avg_age FROM person GROUP BY name") .createOrReplaceTempView("person_a") sql("SELECT p1.name, p2.avg_age FROM person p1 JOIN person_a p2 ON p1.name = p2.name") .createOrReplaceTempView("person_b") sql("SELECT * FROM person_a UNION SELECT * FROM person_b") .createOrReplaceTempView("person_c") sql("SELECT p1.name, p2.avg_age FROM person_c p1 JOIN person_c p2 ON p1.name = p2.name").show() ``` When executing the above query, we'll hit the error: ```scala [info] Failed to analyze query: org.apache.spark.sql.AnalysisException: Resolved attribute(s) avg_age#231 missing from name#223,avg_age#218,id#232,age#234,name#233 in operator !Project [name#233, avg_age#231]. Attribute(s) with the same name appear in the operation: avg_age. Please check if the right attribute(s) are used.;; ... ``` The plan below is the problematic plan which is the right plan of a `Join` operator. And, it has conflict plans comparing to the left plan. In this problematic plan, the first `Aggregate` operator (the one under the first child of `Union`) becomes a conflict plan compares to the left one and has a rewrite attribute pair as `avg_age#218` -> `avg_age#231`. With the current `dedupRight` logic, we'll first replace this `Aggregate` with a new one, and then rewrites the attribute `avg_age#218` from bottom to up. As you can see, projects with the attribute `avg_age#218` of the second child of the `Union` can also be replaced with `avg_age#231`(That means we also rewrite attributes for non-ancestor plans for the conflict plan). Ideally, the attribute `avg_age#218` in the second `Aggregate` operator (the one under the second child of `Union`) should also be replaced. But it didn't because it's an `Alias` while we only rewrite `Attribute` yet. Therefore, the project above the second `Aggregate` becomes unresolved. ```scala :
 : +- SubqueryAlias p2 +- SubqueryAlias person_c +- Distinct +- Union :- Project [name#233, avg_age#231] : +- SubqueryAlias person_a : +- Aggregate [name#233], [name#233, avg(cast(age#234 as bigint)) AS avg_age#231] : +- SubqueryAlias person : +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#232, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#233, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#234] : +- ExternalRDD [obj#165] +- Project [name#233 AS name#227, avg_age#231 AS avg_age#228] +- Project [name#233, avg_age#231] +- SubqueryAlias person_b +- !Project [name#233, avg_age#231] +- Join Inner, (name#233 = name#223) :- SubqueryAlias p1 : +- SubqueryAlias person : +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#232, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#233, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#234] : +- ExternalRDD [obj#165] +- SubqueryAlias p2 +- SubqueryAlias person_a +- Aggregate [name#223], [name#223, avg(cast(age#224 as bigint)) AS avg_age#218] +- SubqueryAlias person +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#222, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#223, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#224] +- ExternalRDD [obj#165] ``` Yes, users would no longer hit the error after this fix. Added test. Closes #29166 from Ngone51/impr-dedup. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit a8e3de36e7d543f1c7923886628ac3178f45f512) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 July 2020, 14:27:18 UTC
01c88be [SPARK-32251][SQL][TESTS][FOLLOWUP] Improve SQL keyword test ### What changes were proposed in this pull request? Improve the `SQLKeywordSuite` so that: 1. it checks keywords under default mode as well 2. it checks if there are typos in the doc (found one and fixed in this PR) ### Why are the changes needed? Better test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #29200 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit aa54dcf193a2149182da779191cf12f087305726) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 23 July 2020, 14:04:56 UTC
f6f6026 [SPARK-32364][SQL][FOLLOWUP] Add toMap to return originalMap and documentation ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/29160. We already removed the indeterministicity. This PR aims the following for the existing code base. 1. Add an explicit document to `DataFrameReader/DataFrameWriter`. 2. Add `toMap` to `CaseInsensitiveMap` in order to return `originalMap: Map[String, T]` because it's more consistent with the existing `case-sensitive key names` behavior for the existing code pattern like `AppendData.byName(..., extraOptions.toMap)`. Previously, it was `HashMap.toMap`. 3. During (2), we need to change the following to keep the original logic using `CaseInsensitiveMap.++`. ```scala - val params = extraOptions.toMap ++ connectionProperties.asScala.toMap + val params = extraOptions ++ connectionProperties.asScala ``` 4. Additionally, use `.toMap` in the following because `dsOptions.asCaseSensitiveMap()` is used later. ```scala - val options = sessionOptions ++ extraOptions + val options = sessionOptions.filterKeys(!extraOptions.contains(_)) ++ extraOptions.toMap val dsOptions = new CaseInsensitiveStringMap(options.asJava) ``` ### Why are the changes needed? `extraOptions.toMap` is used in several places (e.g. `DataFrameReader`) to hand over `Map[String, T]`. In this case, `CaseInsensitiveMap[T] private (val originalMap: Map[String, T])` had better return `originalMap`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins or GitHub Action with the existing tests and newly add test case at `JDBCSuite`. Closes #29191 from dongjoon-hyun/SPARK-32364-3. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit aed8dbab1d6725eb17f743c300451fcbdbfa3e97) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 23 July 2020, 13:28:24 UTC
ad9e7a2 [SPARK-32364][SQL] Use CaseInsensitiveMap for DataFrameReader/Writer options ### What changes were proposed in this pull request? When a user have multiple options like `path`, `paTH`, and `PATH` for the same key `path`, `option/options` is non-deterministic because `extraOptions` is `HashMap`. This PR aims to use `CaseInsensitiveMap` instead of `HashMap` to fix this bug fundamentally. ### Why are the changes needed? Like the following, DataFrame's `option/options` have been non-deterministic in terms of case-insensitivity because it stores the options at `extraOptions` which is using `HashMap` class. ```scala spark.read .option("paTh", "1") .option("PATH", "2") .option("Path", "3") .option("patH", "4") .load("5") ... org.apache.spark.sql.AnalysisException: Path does not exist: file:/.../1; ``` ### Does this PR introduce _any_ user-facing change? Yes. However, this is a bug fix for the indeterministic cases. ### How was this patch tested? Pass the Jenkins or GitHub Action with newly added test cases. Closes #29160 from dongjoon-hyun/SPARK-32364. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit cd16a10475c110dbf5739a37e8f5f103b5541234) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 22 July 2020, 14:59:00 UTC
9729144 [SPARK-32377][SQL] CaseInsensitiveMap should be deterministic for addition ### What changes were proposed in this pull request? This PR aims to fix `CaseInsensitiveMap` to be deterministic for addition. ### Why are the changes needed? ```scala import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap var m = CaseInsensitiveMap(Map.empty[String, String]) Seq(("paTh", "1"), ("PATH", "2"), ("Path", "3"), ("patH", "4"), ("path", "5")).foreach { kv => m = (m + kv).asInstanceOf[CaseInsensitiveMap[String]] println(m.get("path")) } ``` **BEFORE** ``` Some(1) Some(2) Some(3) Some(4) Some(1) ``` **AFTER** ``` Some(1) Some(2) Some(3) Some(4) Some(5) ``` ### Does this PR introduce _any_ user-facing change? Yes, but this is a bug fix on non-deterministic behavior. ### How was this patch tested? Pass the newly added test case. Closes #29172 from dongjoon-hyun/SPARK-32377. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 8c7d6f9733751503f80d5a1b2463904dfefd6843) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 21 July 2020, 05:20:33 UTC
aaee1f8 [MINOR][DOCS] add link for Debugging your Application in running-on-yarn.html#launching-spark-on-yarn ### What changes were proposed in this pull request? Add a link for Debugging your Application in `running-on-yarn.html#launching-spark-on-yarn`. ### Why are the changes needed? Currently, the launching-spark-on-yarn section of the running-on-yarn.html page mentions referring to Debugging your Application. It is better to add a direct link to it, to save readers the time of finding the section. ![image](https://user-images.githubusercontent.com/20021316/87867542-80cc5500-c9c0-11ea-8560-5ddcb5a308bc.png) ### Does this PR introduce _any_ user-facing change? Yes. Docs changes. 1. Add a link for Debugging your Application in the `running-on-yarn.html#launching-spark-on-yarn` section. Updated behavior: ![image](https://user-images.githubusercontent.com/20021316/87867534-6eeab200-c9c0-11ea-94ee-d3fa58157156.png) 2. Update the Spark Properties link to an anchor link only. ### How was this patch tested? Manual testing has been performed on the updated docs. Closes #29154 from brandonJY/patch-1. Authored-by: Brandon <brandonJY@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 1267d80db6abaa130384b8e7b514c39aec3a8c77) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 July 2020, 04:42:32 UTC
ba15146 [SPARK-32362][SQL][TEST] AdaptiveQueryExecSuite misses verifying AE results ### What changes were proposed in this pull request? Verify results for `AdaptiveQueryExecSuite` ### Why are the changes needed? `AdaptiveQueryExecSuite` misses verifying AE results ```scala QueryTest.sameRows(result.toSeq, df.collect().toSeq) ``` Even if the results are different, the test does not fail. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests. Closes #29158 from LantaoJin/SPARK-32362. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 8a1c24bb0364313f20382e2d14d5670b111a5742) Signed-off-by: HyukjinKwon <gurwls223@apache.org> 21 July 2020, 03:48:00 UTC
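A hedged sketch of the verification idea the patch applies: run the same query with and without adaptive execution and require the row sets to match. It assumes Spark's sql test-jar is on the classpath for `QueryTest`; the query itself is an arbitrary example, not taken from the suite:

```scala
import org.apache.spark.sql.{QueryTest, SparkSession}

// Sketch only: compare adaptive vs. non-adaptive results for one query.
object CompareAdaptiveResults {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ae-check").getOrCreate()
    val query = "SELECT id % 7 AS k, count(*) AS c FROM range(1000) GROUP BY id % 7"

    spark.conf.set("spark.sql.adaptive.enabled", "false")
    val expected = spark.sql(query).collect().toSeq

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    val adaptive = spark.sql(query).collect().toSeq

    // sameRows returns None when the row multisets match, Some(message) otherwise.
    QueryTest.sameRows(expected, adaptive).foreach(msg => sys.error(s"AE mismatch: $msg"))
    spark.stop()
  }
}
```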
6d8b95c [SPARK-32365][SQL] Add a boundary condition for negative index in regexp_extract ### What changes were proposed in this pull request? The current implementation of regexp_extract throws an unhandled exception, shown below: SELECT regexp_extract('1a 2b 14m', '\\d+', -1) ``` java.lang.IndexOutOfBoundsException: No group -1 java.util.regex.Matcher.group(Matcher.java:538) org.apache.spark.sql.catalyst.expressions.RegExpExtract.nullSafeEval(regexpExpressions.scala:455) org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:704) org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:52) org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:45) ``` ### Why are the changes needed? Fix a bug: `java.lang.IndexOutOfBoundsException: No group -1` ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? New unit test. Closes #29161 from beliefer/regexp_extract-group-not-allow-less-than-zero. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 02114f96d64ec5be23fc61be6f6b32df7ad48a6c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 21 July 2020, 03:35:06 UTC
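A hedged sketch of the user-facing effect, assuming a local-mode session: a non-negative group index keeps working, while a negative one now fails with a descriptive error rather than an `IndexOutOfBoundsException` (the exact message is not reproduced here):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: demonstrates the boundary check from the caller's point of view.
object RegexpExtractGroupIndex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("regexp").getOrCreate()
    // Group index 0 returns the whole first match, "1".
    spark.sql("SELECT regexp_extract('1a 2b 14m', '\\\\d+', 0)").show()
    // A negative group index is now rejected with a descriptive error
    // instead of surfacing java.lang.IndexOutOfBoundsException.
    try {
      spark.sql("SELECT regexp_extract('1a 2b 14m', '\\\\d+', -1)").show()
    } catch {
      case e: Exception => println(s"Rejected as expected: ${e.getMessage}")
    }
    spark.stop()
  }
}
```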
9e7b130 [SPARK-32368][SQL] pathGlobFilter, recursiveFileLookup and basePath should respect case insensitivity ### What changes were proposed in this pull request? This PR proposes to make the datasource options at `PartitioningAwareFileIndex` respect case insensitivity consistently: - `pathGlobFilter` - `recursiveFileLookup` - `basePath` ### Why are the changes needed? To support consistent case insensitivity in datasource options. ### Does this PR introduce _any_ user-facing change? Yes, now users can also use case-insensitive options such as `PathglobFilter`. ### How was this patch tested? Unit tests were added. They reuse existing tests and add extra clues to make it easier to track when a test is broken. Closes #29165 from HyukjinKwon/SPARK-32368. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 133c5edc807ca87825f61dd9a5d36018620033ee) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 July 2020, 20:56:17 UTC
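A hedged usage sketch, assuming a hypothetical parquet directory at `/tmp/data`: after this change, any casing of these option names is treated the same by the file index:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: option names below are deliberately miscased to show the fix.
object CaseInsensitiveFileIndexOptions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("file-opts").getOrCreate()
    val df = spark.read
      .option("PathglobFilter", "*.parquet")  // matched case-insensitively to pathGlobFilter
      .option("recursiveFileLOOKUP", "true")  // matched case-insensitively to recursiveFileLookup
      .parquet("/tmp/data")                   // hypothetical input directory
    df.show()
    spark.stop()
  }
}
```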
2d94386 [SPARK-32367][K8S][TESTS] Correct the spelling of parameter in KubernetesTestComponents ### What changes were proposed in this pull request? Correct the spelling of the parameter 'spark.executor.instances' in KubernetesTestComponents ### Why are the changes needed? The parameter name was misspelled. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests are not needed. Closes #29164 from merrily01/SPARK-32367. Authored-by: maruilei <maruilei@jd.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit ffdca8285ef7c7bd0da2622a81d9c21ada035794) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> 20 July 2020, 20:49:13 UTC