ebd83a5 [SPARK-30009][CORE][SQL][FOLLOWUP] Remove OrderingUtil and Utils.nanSafeCompare{Doubles,Floats} and use java.lang.{Double,Float}.compare directly ### What changes were proposed in this pull request? Follow up on https://github.com/apache/spark/pull/26654#discussion_r353826162 Instead of OrderingUtil or Utils.nanSafeCompare{Doubles,Floats}, just use java.lang.{Double,Float}.compare directly. All work identically w.r.t. NaN when used to `compare`. ### Why are the changes needed? Simplification of the previous change, which existed to support Scala 2.13 migration. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests Closes #26761 from srowen/SPARK-30009.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 05 December 2019, 03:27:25 UTC
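A minimal standalone sketch (not the Spark source) of why `java.lang.Double.compare` is already NaN-safe: it orders NaN after every other value and treats NaN as equal to itself, which is the behavior the removed `nanSafeCompare{Doubles,Floats}` helpers provided.

```scala
object NaNCompareDemo {
  def main(args: Array[String]): Unit = {
    val xs = Seq(Double.NaN, 1.0, -0.0, 0.0, Double.NegativeInfinity)
    // Sorting with java.lang.Double.compare places NaN last and keeps -0.0 before 0.0.
    val sorted = xs.sortWith((a, b) => java.lang.Double.compare(a, b) < 0)
    println(sorted.mkString(", "))                            // -Infinity, -0.0, 0.0, 1.0, NaN
    println(java.lang.Double.compare(Double.NaN, Double.NaN)) // 0: NaN compares equal to itself
  }
}
```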
c5f312a [SPARK-30129][CORE] Set client's id in TransportClient after successful auth The new auth code was missing this bit, so it was not possible to know which app a client belonged to when auth was on. I also refactored the SASL test that checks for this so it also checks the new protocol (test failed before the fix, passes now). Closes #26760 from vanzin/SPARK-30129. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 05 December 2019, 01:11:50 UTC
29e09a8 [SPARK-30084][DOCS] Document how to trigger Jekyll build on Python API doc changes ### What changes were proposed in this pull request? This PR adds a note to the docs README showing how to get Jekyll to automatically pick up changes to the Python API docs. ### Why are the changes needed? `jekyll serve --watch` doesn't watch for changes to the API docs. Without the technique documented in this note, or something equivalent, developers have to manually retrigger a Jekyll build any time they update the Python API docs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? I tested this PR manually by making changes to Python docstrings and confirming that Jekyll automatically picks them up and serves them locally. Closes #26719 from nchammas/SPARK-30084-watch-api-docs. Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 04 December 2019, 23:31:23 UTC
2ceed6f [SPARK-29392][CORE][SQL][FOLLOWUP] Avoid deprecated (in 2.13) Symbol syntax 'foo in favor of simpler expression, where it generated deprecation warnings ### What changes were proposed in this pull request? Where it generates a deprecation warning in Scala 2.13, replace Symbol shorthand syntax `'foo` with an equivalent. ### Why are the changes needed? Symbol syntax `'foo` is deprecated in Scala 2.13. The lines changed below otherwise generate about 440 warnings when building for 2.13. The previous PR directly replaced many usages with `Symbol("foo")`. But it's also used to specify Columns via implicit conversion (`.select('foo)`) or even where simple Strings are used (`.as('foo)`), as it's kind of an abstraction for interned Strings. While I find this syntax confusing and would like to deprecate it, here I just replaced it where it generates a build warning (not sure why all occurrences don't): `$"foo"` or just `"foo"`. ### Does this PR introduce any user-facing change? Should not change behavior. ### How was this patch tested? Existing tests. Closes #26748 from srowen/SPARK-29392.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 04 December 2019, 23:03:26 UTC
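A hedged before/after sketch of the replacement pattern described above; the session setup, DataFrame, and column name are illustrative, not taken from the PR.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("symbol-syntax").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("foo")
df.select('foo)    // Symbol shorthand: deprecated in Scala 2.13
df.select($"foo")  // equivalent ColumnName interpolator used as the replacement
df.select("foo")   // or the plain String overload, e.g. .as("foo") instead of .as('foo)
```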
a2102c8 [SPARK-29453][WEBUI] Improve tooltips information for SQL tab ### What changes were proposed in this pull request? Adding tooltip to SQL tab for better usability. ### Why are the changes needed? There are a few common points of confusion in the UI that could be clarified with tooltips. We should add tooltips to explain. ### Does this PR introduce any user-facing change? yes. ![Screenshot 2019-11-23 at 9 47 41 AM](https://user-images.githubusercontent.com/8948111/69472963-aaec5980-0dd6-11ea-881a-fe6266171054.png) ### How was this patch tested? Manual test. Closes #26641 from 07ARB/SPARK-29453. Authored-by: 07ARB <ankitrajboudh@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 04 December 2019, 18:33:43 UTC
710ddab [SPARK-29914][ML] ML models attach metadata in `transform`/`transformSchema` ### What changes were proposed in this pull request? 1, `predictionCol` in `ml.classification` & `ml.clustering` add `NominalAttribute` 2, `rawPredictionCol` in `ml.classification` add `AttributeGroup` containing vectorsize=`numClasses` 3, `probabilityCol` in `ml.classification` & `ml.clustering` add `AttributeGroup` containing vectorsize=`numClasses`/`k` 4, `leafCol` in GBT/RF add `AttributeGroup` containing vectorsize=`numTrees` 5, `leafCol` in DecisionTree add `NominalAttribute` 6, `outputCol` in models in `ml.feature` add `AttributeGroup` containing vectorsize 7, `outputCol` in `UnaryTransformer`s in `ml.feature` add `AttributeGroup` containing vectorsize ### Why are the changes needed? Appended metadata can be used in downstream ops, like `Classifier.getNumClasses`. There are many impls (like `Binarizer`/`Bucketizer`/`VectorAssembler`/`OneHotEncoder`/`FeatureHasher`/`HashingTF`/`VectorSlicer`/...) in `.ml` that append appropriate metadata in the `transform`/`transformSchema` method. However, there are also many impls that return no metadata from transformation, even when metadata like `vector.size`/`numAttrs`/`attrs` can be easily inferred. ### Does this PR introduce any user-facing change? Yes, adds some metadata to the transformed dataset. ### How was this patch tested? existing testsuites and added testsuites Closes #26547 from zhengruifeng/add_output_vecSize. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com> 04 December 2019, 08:39:57 UTC
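A hedged, self-contained sketch of how downstream code can read the metadata these models now attach; the tiny training set and the choice of LogisticRegression are illustrative only.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ml-metadata").getOrCreate()
import spark.implicits._

val train = Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0))
).toDF("label", "features")

val predictions = new LogisticRegression().fit(train).transform(train)

// With an attached AttributeGroup, the vector size (== numClasses) is readable
// directly from the schema; when no metadata is attached, size is reported as -1.
println(AttributeGroup.fromStructField(predictions.schema("rawPrediction")).size)
println(AttributeGroup.fromStructField(predictions.schema("probability")).size)
```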
55132ae [SPARK-30099][SQL] Improve Analyzed Logical Plan ### What changes were proposed in this pull request? Avoid duplicate error message in Analyzed Logical plan. ### Why are the changes needed? Currently, when any query throws `AnalysisException`, same error message will be repeated because of following code segment. https://github.com/apache/spark/blob/04a5b8f5f80ee746bdc16267e44a993a9941d335/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L157-L166 ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually. Result of `explain extended select * from wrong;` BEFORE > == Parsed Logical Plan == > 'Project [*] > +- 'UnresolvedRelation [wrong] > > == Analyzed Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [*] > +- 'UnresolvedRelation [wrong] > > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [*] > +- 'UnresolvedRelation [wrong] > > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [*] > +- 'UnresolvedRelation [wrong] > > == Physical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [*] > +- 'UnresolvedRelation [wrong] > AFTER > == Parsed Logical Plan == > 'Project [*] > +- 'UnresolvedRelation [wrong] > > == Analyzed Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [*] > +- 'UnresolvedRelation [wrong] > > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [*] > +- 'UnresolvedRelation [wrong] > > == Physical Plan == > org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31; > 'Project [*] > +- 'UnresolvedRelation [wrong] > Closes #26734 from amanomer/cor_APlan. Authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 December 2019, 05:51:40 UTC
c8922d9 [SPARK-30113][SQL][PYTHON] Expose mergeSchema option in PySpark's ORC APIs ### What changes were proposed in this pull request? This PR is a follow-up to #24043 and cousin of #26730. It exposes the `mergeSchema` option directly in the ORC APIs. ### Why are the changes needed? So the Python API matches the Scala API. ### Does this PR introduce any user-facing change? Yes, it adds a new option directly in the ORC reader method signatures. ### How was this patch tested? I tested this manually as follows: ``` >>> spark.range(3).write.orc('test-orc') >>> spark.range(3).withColumnRenamed('id', 'name').write.orc('test-orc/nested') >>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=True) DataFrame[id: bigint, name: bigint] >>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False) DataFrame[id: bigint] >>> spark.conf.set('spark.sql.orc.mergeSchema', True) >>> spark.read.orc('test-orc', recursiveFileLookup=True) DataFrame[id: bigint, name: bigint] >>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False) DataFrame[id: bigint] ``` Closes #26755 from nchammas/SPARK-30113-ORC-mergeSchema. Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 04 December 2019, 02:44:24 UTC
e766a32 [SPARK-30091][SQL][PYTHON] Document mergeSchema option directly in the PySpark Parquet APIs ### What changes were proposed in this pull request? This change properly documents the `mergeSchema` option directly in the Python APIs for reading Parquet data. ### Why are the changes needed? The docstring for `DataFrameReader.parquet()` mentions `mergeSchema` but doesn't show it in the API. It seems like a simple oversight. Before this PR, you'd have to do this to use `mergeSchema`: ```python spark.read.option('mergeSchema', True).parquet('test-parquet').show() ``` After this PR, you can use the option as (I believe) it was intended to be used: ```python spark.read.parquet('test-parquet', mergeSchema=True).show() ``` ### Does this PR introduce any user-facing change? Yes, this PR changes the signatures of `DataFrameReader.parquet()` and `DataStreamReader.parquet()` to match their docstrings. ### How was this patch tested? Testing the `mergeSchema` option directly seems to be left to the Scala side of the codebase. I tested my change manually to confirm the API works. I also confirmed that setting `spark.sql.parquet.mergeSchema` at the session does not get overridden by leaving `mergeSchema` at its default when calling `parquet()`: ``` >>> spark.conf.set('spark.sql.parquet.mergeSchema', True) >>> spark.range(3).write.parquet('test-parquet/id') >>> spark.range(3).withColumnRenamed('id', 'name').write.parquet('test-parquet/name') >>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet').show() +----+----+ | id|name| +----+----+ |null| 1| |null| 2| |null| 0| | 1|null| | 2|null| | 0|null| +----+----+ >>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=False).show() +----+ | id| +----+ |null| |null| |null| | 1| | 2| | 0| +----+ ``` Closes #26730 from nchammas/parquet-merge-schema. Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 04 December 2019, 02:31:57 UTC
708cf16 [SPARK-30111][K8S] Apt-get update to fix debian issues ### What changes were proposed in this pull request? Added apt-get update as per [docker best-practices](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#apt-get) ### Why are the changes needed? Builder is failing because: Without doing apt-get update, the APT lists get outdated and begins referring to package versions that no longer exist, hence the 404 trying to download them (Debian does not keep old versions in the archive when a package is updated). ### Does this PR introduce any user-facing change? no ### How was this patch tested? k8s builder Closes #26753 from ifilonenko/SPARK-30111. Authored-by: Ilan Filonenko <ifilonenko@bloomberg.net> Signed-off-by: shane knapp <incomplete@gmail.com> 04 December 2019, 01:59:02 UTC
5496e98 [SPARK-30109][ML] PCA use BLAS.gemv for sparse vectors ### What changes were proposed in this pull request? When PCA was first implemented in [SPARK-5521](https://issues.apache.org/jira/browse/SPARK-5521), Matrix.multiply (BLAS.gemv internally) did not support sparse vectors, so this was worked around by applying a sparse matrix multiplication. Since [SPARK-7681](https://issues.apache.org/jira/browse/SPARK-7681), BLAS.gemv has supported sparse vectors, so we can use Matrix.multiply directly now. ### Why are the changes needed? For simplicity. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing testsuites Closes #26745 from zhengruifeng/pca_mul. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com> 04 December 2019, 01:50:00 UTC
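A hedged illustration, independent of the PCA internals changed here, that `Matrix.multiply` accepts sparse vectors directly, so no sparse-specific workaround is needed; the matrix and vector values are made up.

```scala
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// 2x2 identity as a stand-in for a principal-components matrix.
val pc = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
val sparse = Vectors.sparse(2, Array(1), Array(3.0))

// Matrix.multiply dispatches to BLAS.gemv and handles sparse input directly.
println(pc.transpose.multiply(sparse))   // [0.0,3.0]
```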
3dd3a62 [SPARK-27990][SPARK-29903][PYTHON] Add recursiveFileLookup option to Python DataFrameReader ### What changes were proposed in this pull request? As a follow-up to #24830, this PR adds the `recursiveFileLookup` option to the Python DataFrameReader API. ### Why are the changes needed? This PR maintains Python feature parity with Scala. ### Does this PR introduce any user-facing change? Yes. Before this PR, you'd only be able to use this option as follows: ```python spark.read.option("recursiveFileLookup", True).text("test-data").show() ``` With this PR, you can reference the option from within the format-specific method: ```python spark.read.text("test-data", recursiveFileLookup=True).show() ``` This option now also shows up in the Python API docs. ### How was this patch tested? I tested this manually by creating the following directories with dummy data: ``` test-data ├── 1.txt └── nested └── 2.txt test-parquet ├── nested │ ├── _SUCCESS │ ├── part-00000-...-.parquet ├── _SUCCESS ├── part-00000-...-.parquet ``` I then ran the following tests and confirmed the output looked good: ```python spark.read.parquet("test-parquet", recursiveFileLookup=True).show() spark.read.text("test-data", recursiveFileLookup=True).show() spark.read.csv("test-data", recursiveFileLookup=True).show() ``` `python/pyspark/sql/tests/test_readwriter.py` seems pretty sparse. I'm happy to add my tests there, though it seems we have been deferring testing like this to the Scala side of things. Closes #26718 from nchammas/SPARK-27990-recursiveFileLookup-python. Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 04 December 2019, 01:10:30 UTC
f3abee3 [SPARK-30051][BUILD] Clean up hadoop-3.2 dependency ### What changes were proposed in this pull request? This PR aims to cut `org.eclipse.jetty:jetty-webapp`and `org.eclipse.jetty:jetty-xml` transitive dependency from `hadoop-common`. ### Why are the changes needed? This will simplify our dependency management by the removal of unused dependencies. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the GitHub Action with all combinations and the Jenkins UT with (Hadoop-3.2). Closes #26742 from dongjoon-hyun/SPARK-30051. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 December 2019, 22:33:36 UTC
60f20e5 [SPARK-30060][CORE] Rename metrics enable/disable configs ### What changes were proposed in this pull request? This proposes to introduce a naming convention for Spark metrics configuration parameters used to enable/disable metrics source reporting using the Dropwizard metrics library: `spark.metrics.sourceNameCamelCase.enabled` and update 2 parameters to use this naming convention. ### Why are the changes needed? Currently Spark has a few parameters to enable/disable metrics reporting. Their naming pattern is not uniform and this can create confusion. Currently we have: `spark.metrics.static.sources.enabled` `spark.app.status.metrics.enabled` `spark.sql.streaming.metricsEnabled` ### Does this PR introduce any user-facing change? Update parameters for enabling/disabling metrics reporting new in Spark 3.0: `spark.metrics.static.sources.enabled` -> `spark.metrics.staticSources.enabled`, `spark.app.status.metrics.enabled` -> `spark.metrics.appStatusSource.enabled`. Note: `spark.sql.streaming.metricsEnabled` is left unchanged as it is already in use in Spark 2.x. ### How was this patch tested? Manually tested Closes #26692 from LucaCanali/uniformNamingMetricsEnableParameters. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 December 2019, 22:31:06 UTC
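A small hedged sketch of opting in to the renamed parameters via `SparkConf`; the config names come from the description above, while the surrounding setup is illustrative.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.metrics.staticSources.enabled", "true")   // was spark.metrics.static.sources.enabled
  .set("spark.metrics.appStatusSource.enabled", "true") // was spark.app.status.metrics.enabled
```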
196ea93 [SPARK-30106][SQL][TEST] Fix the test of DynamicPartitionPruningSuite ### What changes were proposed in this pull request? Changed the test **DPP triggers only for certain types of query** in **DynamicPartitionPruningSuite**. ### Why are the changes needed? The sql has no partition key. The description "no predicate on the dimension table" is not right. So fix it. ``` Given("no predicate on the dimension table") withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true") { val df = sql( """ |SELECT * FROM fact_sk f |JOIN dim_store s |ON f.date_id = s.store_id """.stripMargin) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Updated UT Closes #26744 from deshanxiao/30106. Authored-by: xiaodeshan <xiaodeshan@xiaomi.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 December 2019, 22:27:48 UTC
4193d2f [SPARK-30012][CORE][SQL] Change classes extending scala collection classes to work with 2.13 ### What changes were proposed in this pull request? Move some classes extending Scala collections into parallel source trees, to support 2.13; other minor collection-related modifications. Modify some classes extending Scala collections to work with 2.13 as well as 2.12. In many cases, this means introducing parallel source trees, as the type hierarchy changed in ways that one class can't support both. ### Why are the changes needed? To support building for Scala 2.13 in the future. ### Does this PR introduce any user-facing change? There should be no behavior change. ### How was this patch tested? Existing tests. Note that the 2.13 changes are not tested by the PR builder, of course. They compile in 2.13 but can't even be tested locally. Later, once the project can be compiled for 2.13, thus tested, it's possible the 2.13 implementations will need updates. Closes #26728 from srowen/SPARK-30012. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 03 December 2019, 16:59:43 UTC
a3394e4 [SPARK-29477] Improve tooltip for Streaming tab ### What changes were proposed in this pull request? Added tooltip for duration columns in the batch table of streaming tab of Web UI. ### Why are the changes needed? Tooltips will help users in understanding columns of batch table of streaming tab. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Manually tested. Closes #26467 from iRakson/streaming_tab_tooltip. Authored-by: root1 <raksonrakesh@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 03 December 2019, 16:45:49 UTC
8c2849a [SPARK-30082][SQL] Do not replace Zeros when replacing NaNs ### What changes were proposed in this pull request? Do not cast `NaN` to an `Integer`, `Long`, `Short` or `Byte`. This is because casting `NaN` to those types results in a `0` which erroneously replaces `0`s while only `NaN`s should be replaced. ### Why are the changes needed? This Scala code snippet: ``` import scala.math; println(Double.NaN.toLong) ``` returns `0` which is problematic as if you run the following Spark code, `0`s get replaced as well: ``` >>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value")) >>> df.show() +-----+-----+ |index|value| +-----+-----+ | 1.0| 0| | 0.0| 3| | NaN| 0| +-----+-----+ >>> df.replace(float('nan'), 2).show() +-----+-----+ |index|value| +-----+-----+ | 1.0| 2| | 0.0| 3| | 2.0| 2| +-----+-----+ ``` ### Does this PR introduce any user-facing change? Yes, after the PR, running the same above code snippet returns the correct expected results: ``` >>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value")) >>> df.show() +-----+-----+ |index|value| +-----+-----+ | 1.0| 0| | 0.0| 3| | NaN| 0| +-----+-----+ >>> df.replace(float('nan'), 2).show() +-----+-----+ |index|value| +-----+-----+ | 1.0| 0| | 0.0| 3| | 2.0| 0| +-----+-----+ ``` ### How was this patch tested? Added unit tests to verify replacing `NaN` only affects columns of type `Float` and `Double` Closes #26738 from johnhany97/SPARK-30082. Lead-authored-by: John Ayad <johnhany97@gmail.com> Co-authored-by: John Ayad <jayad@palantir.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 December 2019, 16:04:55 UTC
65552a8 [SPARK-30083][SQL] visitArithmeticUnary should wrap PLUS case with UnaryPositive for type checking ### What changes were proposed in this pull request? `UnaryPositive` only accepts numeric and interval as we defined, but what we do for this in `AstBuider.visitArithmeticUnary` is just bypassing it. This should not be omitted for the type checking requirement. ### Why are the changes needed? bug fix, you can find a pre-discussion here https://github.com/apache/spark/pull/26578#discussion_r347350398 ### Does this PR introduce any user-facing change? yes, +non-numeric-or-interval is now invalid. ``` -- !query 14 select +date '1900-01-01' -- !query 14 schema struct<DATE '1900-01-01':date> -- !query 14 output 1900-01-01 -- !query 15 select +timestamp '1900-01-01' -- !query 15 schema struct<TIMESTAMP '1900-01-01 00:00:00':timestamp> -- !query 15 output 1900-01-01 00:00:00 -- !query 16 select +map(1, 2) -- !query 16 schema struct<map(1, 2):map<int,int>> -- !query 16 output {1:2} -- !query 17 select +array(1,2) -- !query 17 schema struct<array(1, 2):array<int>> -- !query 17 output [1,2] -- !query 18 select -'1' -- !query 18 schema struct<(- CAST(1 AS DOUBLE)):double> -- !query 18 output -1.0 -- !query 19 select -X'1' -- !query 19 schema struct<> -- !query 19 output org.apache.spark.sql.AnalysisException cannot resolve '(- X'01')' due to data type mismatch: argument 1 requires (numeric or interval) type, however, 'X'01'' is of binary type.; line 1 pos 7 -- !query 20 select +X'1' -- !query 20 schema struct<X'01':binary> -- !query 20 output ``` ### How was this patch tested? add ut check Closes #26716 from yaooqinn/SPARK-30083. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 December 2019, 15:42:21 UTC
39291cf [SPARK-30048][SQL] Enable aggregates with interval type values for RelationalGroupedDataset ### What changes were proposed in this pull request? Now that min/max/sum/avg are supported for intervals, we should also enable them in RelationalGroupedDataset. ### Why are the changes needed? API consistency improvement ### Does this PR introduce any user-facing change? Yes, Dataset supports min/max/sum/avg(mean) on intervals. ### How was this patch tested? Added UT. Closes #26681 from yaooqinn/SPARK-30048. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 December 2019, 10:40:14 UTC
d7b268a [SPARK-29348][SQL] Add observable Metrics for Streaming queries ### What changes were proposed in this pull request? Observable metrics are named arbitrary aggregate functions that can be defined on a query (Dataframe). As soon as the execution of a Dataframe reaches a completion point (e.g. finishes batch query or reaches streaming epoch) a named event is emitted that contains the metrics for the data processed since the last completion point. A user can observe these metrics by attaching a listener to spark session, it depends on the execution mode which listener to attach: - Batch: `QueryExecutionListener`. This will be called when the query completes. A user can access the metrics by using the `QueryExecution.observedMetrics` map. - (Micro-batch) Streaming: `StreamingQueryListener`. This will be called when the streaming query completes an epoch. A user can access the metrics by using the `StreamingQueryProgress.observedMetrics` map. Please note that we currently do not support continuous execution streaming. ### Why are the changes needed? This enabled observable metrics. ### Does this PR introduce any user-facing change? Yes. It adds the `observe` method to `Dataset`. ### How was this patch tested? - Added unit tests for the `CollectMetrics` logical node to the `AnalysisSuite`. - Added unit tests for `StreamingProgress` JSON serialization to the `StreamingQueryStatusAndProgressSuite`. - Added integration tests for streaming to the `StreamingQueryListenerSuite`. - Added integration tests for batch to the `DataFrameCallbackSuite`. Closes #26127 from hvanhovell/SPARK-29348. Authored-by: herman <herman@databricks.com> Signed-off-by: herman <herman@databricks.com> 03 December 2019, 10:25:49 UTC
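A hedged batch-mode sketch of the `Dataset.observe` method this PR adds; the metric name and expressions are examples. The emitted metrics would be surfaced through `QueryExecution.observedMetrics` or a registered `QueryExecutionListener`, as described above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("observe-demo").getOrCreate()
import spark.implicits._

val observed = spark.range(100)
  .observe("my_metrics", count(lit(1)).as("rows"), max($"id").as("max_id"))

observed.collect()   // completing the action emits the named metrics event
```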
075ae1e [SPARK-29537][SQL] throw exception when user defined a wrong base path ### What changes were proposed in this pull request? When the user defines a base path which is not an ancestor directory of all the input paths, throw an exception immediately. ### Why are the changes needed? Assume we have a DataFrame[c1, c2] written out in parquet and partitioned by c1. When using `spark.read.parquet("/path/to/data/c1=1")` to read the data, we'll get a DataFrame with column c2 only. But if we use `spark.read.option("basePath", "/path/from").parquet("/path/to/data/c1=1")` to read the data, we'll get a DataFrame with columns c1 and c2. This happens because a wrong base path does not actually work in `parsePartition()`, so parsing continues until it reaches a directory without "=". The result of the second read doesn't make sense. ### Does this PR introduce any user-facing change? Yes, with this change, the user hits an `IllegalArgumentException` when given a wrong base path, whereas the previous behavior did not fail. ### How was this patch tested? Added UT. Closes #26195 from Ngone51/dev-wrong-basePath. Lead-authored-by: wuyi <ngone_5451@163.com> Co-authored-by: wuyi <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 03 December 2019, 09:02:50 UTC
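A hedged sketch of the behavior change, built only from the scenario described above; the paths and data are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local[*]").appName("basepath-demo").getOrCreate()

spark.range(3).withColumn("c1", lit(1))
  .write.mode("overwrite").partitionBy("c1").parquet("/tmp/base/data")

// Reading a single partition directory: only the non-partition columns appear.
spark.read.parquet("/tmp/base/data/c1=1").printSchema()
// A valid basePath (an ancestor of the input path) recovers the partition column c1.
spark.read.option("basePath", "/tmp/base/data").parquet("/tmp/base/data/c1=1").printSchema()
// After this change, a basePath that is NOT an ancestor of the input path fails fast:
// spark.read.option("basePath", "/tmp/other").parquet("/tmp/base/data/c1=1")  // IllegalArgumentException
```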
4021354 [SPARK-30044][ML] MNB/CNB/BNB use empty sigma matrix instead of null ### What changes were proposed in this pull request? MNB/CNB/BNB use an empty sigma matrix instead of null. ### Why are the changes needed? 1. Using an empty sigma matrix simplifies the implementation. 2. While reviewing the FM implementation, I noted that FMModels have an optional bias and linear part; it seems more reasonable to set an optional part to an empty vector/matrix or a zero value than to `null`. ### Does this PR introduce any user-facing change? Yes, sigma changes from `null` to an empty matrix. ### How was this patch tested? updated testsuites Closes #26679 from zhengruifeng/nb_use_empty_sigma. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com> 03 December 2019, 02:02:23 UTC
332e593 [SPARK-29943][SQL] Improve error messages for unsupported data type ### What changes were proposed in this pull request? Improve error messages for unsupported data types. ### Why are the changes needed? When Spark reads a Hive table and encounters an unsupported field type, the exception message only names the unsupported type, so the user cannot tell which column of which table caused it. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? ```create view t AS SELECT STRUCT('a' AS `$a`, 1 AS b) as q;``` current: org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int> change: org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>, column: q ```select * from t,t_normal_1,t_normal_2``` current: org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int> change: org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>, column: q, db: default, table: t Closes #26577 from cxzl25/unsupport_data_type_msg. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 03 December 2019, 01:07:09 UTC
68034a8 [SPARK-30072][SQL] Create dedicated planner for subqueries ### What changes were proposed in this pull request? This PR changes subquery planning by calling the planner and plan preparation rules on the subquery plan directly. Before, we were creating a `QueryExecution` instance for subqueries to get the executedPlan. This would re-run analysis and optimization on the subquery's plan. Running the analysis again on an optimized query plan can have unwanted consequences, as some rules, for example `DecimalPrecision`, are not idempotent. As an example, consider the expression `1.7 * avg(a)` which after applying the `DecimalPrecision` rule becomes: ``` promote_precision(1.7) * promote_precision(avg(a)) ``` After the optimization, more specifically the constant folding rule, this expression becomes: ``` 1.7 * promote_precision(avg(a)) ``` Now if we run the analyzer on this optimized query again, we will get: ``` promote_precision(1.7) * promote_precision(promote_precision(avg(a))) ``` Which will later be optimized as: ``` 1.7 * promote_precision(promote_precision(avg(a))) ``` As can be seen, re-running the analysis and optimization on this expression results in an expression with extra nested promote_precision nodes. Adding unneeded nodes to the plan is problematic because it can eliminate situations where we can reuse the plan. We opted to introduce dedicated planners for subqueries, instead of making the DecimalPrecision rule idempotent, because this eliminates this entire category of problems. Another benefit is that planning time for subqueries is reduced. ### How was this patch tested? Unit tests Closes #26705 from dbaliafroozeh/CreateDedicatedPlannerForSubqueries. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com> 02 December 2019, 19:56:40 UTC
e04a634 [SPARK-30075][CORE][TESTS] Fix the hashCode implementation of ArrayKeyIndexType correctly ### What changes were proposed in this pull request? This patch fixes the bug on ArrayKeyIndexType.hashCode() as it is simply calling Array.hashCode() which in turn calls Object.hashCode(). That should be Arrays.hashCode() to reflect the elements in the array. ### Why are the changes needed? I've encountered the bug in #25811 while adding test codes for #25811, and I've split the fix into individual PR to speed up reviewing. Without this patch, ArrayKeyIndexType would bring various issues when it's used as type of collections. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? I've skipped adding UT as ArrayKeyIndexType is in test and the patch is pretty simple one-liner. Closes #26709 from HeartSaVioR/SPARK-30075. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 02 December 2019, 15:06:37 UTC
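A short hedged illustration (not the ArrayKeyIndexType code itself) of the underlying pitfall: Scala `Array.hashCode()` is the JVM's identity hash, so arrays with equal contents do not hash equally unless `java.util.Arrays.hashCode` is used.

```scala
object ArrayHashDemo {
  def main(args: Array[String]): Unit = {
    val a = Array(1, 2, 3)
    val b = Array(1, 2, 3)
    println(a.hashCode == b.hashCode)                                     // false: identity-based
    println(java.util.Arrays.hashCode(a) == java.util.Arrays.hashCode(b)) // true: element-based
  }
}
```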
babefde [SPARK-30085][SQL][DOC] Standardize sql reference ### What changes were proposed in this pull request? Standardize sql reference ### Why are the changes needed? To have consistent docs ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Tested using `jekyll build --serve` Closes #26721 from huaxingao/spark-30085. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 02 December 2019, 15:05:40 UTC
e842033 [SPARK-27721][BUILD] Switch to use right leveldbjni according to the platforms This change adds a profile to switch to the right leveldbjni package according to the platform: aarch64 uses org.openlabtesting.leveldbjni:leveldbjni-all.1.8, and other platforms use the old org.fusesource.leveldbjni:leveldbjni-all.1.8. Some hadoop dependency packages also depend on org.fusesource.leveldbjni:leveldbjni-all, and hadoop has merged a similar change on trunk (details: https://issues.apache.org/jira/browse/HADOOP-16614), so the org.fusesource.leveldbjni dependency is excluded for the related hadoop packages. Spark can then build/test on the aarch64 platform successfully. Closes #26636 from huangtianhua/add-aarch64-leveldbjni. Authored-by: huangtianhua <huangtianhua@huawei.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 02 December 2019, 15:04:00 UTC
54edaee [MINOR][SS] Add implementation note on overriding serialize/deserialize in HDFSMetadataLog methods' scaladoc ### What changes were proposed in this pull request? The patch adds scaladoc on `HDFSMetadataLog.serialize` and `HDFSMetadataLog.deserialize` for adding implementation note when overriding - HDFSMetadataLog calls `serialize` and `deserialize` inside try-finally and caller will do the resource (input stream, output stream) cleanup, so resource cleanup should not be performed in these methods, but there's no note on this (only code comment, not scaladoc) which is easy to be missed. ### Why are the changes needed? Contributors who are unfamiliar with the intention seem to think it as a bug if the resource is not cleaned up in serialize/deserialize of subclass of HDFSMetadataLog, and they couldn't know about the intention without reading the code of HDFSMetadataLog. Adding the note as scaladoc would expand the visibility. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Just a doc change. Closes #26732 from HeartSaVioR/MINOR-SS-HDFSMetadataLog-serde-scaladoc. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Co-authored-by: dz <953396112@qq.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 02 December 2019, 15:01:45 UTC
e271664 [MINOR][SQL] Rename config name to spark.sql.analyzer.failAmbiguousSelfJoin.enabled ### What changes were proposed in this pull request? add `.enabled` postfix to `spark.sql.analyzer.failAmbiguousSelfJoin`. ### Why are the changes needed? to follow the existing naming style ### Does this PR introduce any user-facing change? no ### How was this patch tested? not needed Closes #26694 from cloud-fan/conf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 December 2019, 13:05:06 UTC
4e073f3 [SPARK-30047][SQL] Support interval types in UnsafeRow ### What changes were proposed in this pull request? Optimize aggregates on interval values from sort-based to hash-based, so that we can use `org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch` for better performance. ### Why are the changes needed? Improve aggregates. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Added UT; also covered by existing ones. Closes #26680 from yaooqinn/SPARK-30047. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 December 2019, 12:47:23 UTC
04a5b8f [SPARK-29839][SQL] Supporting STORED AS in CREATE TABLE LIKE ### What changes were proposed in this pull request? In SPARK-29421 (#26097) , we can specify a different table provider for `CREATE TABLE LIKE` via `USING provider`. Hive support `STORED AS` new file format syntax: ```sql CREATE TABLE tbl(a int) STORED AS TEXTFILE; CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; ``` For Hive compatibility, we should also support `STORED AS` in `CREATE TABLE LIKE`. ### Why are the changes needed? See https://github.com/apache/spark/pull/26097#issue-327424759 ### Does this PR introduce any user-facing change? Add a new syntax based on current CTL: CREATE TABLE tbl2 LIKE tbl [STORED AS hiveFormat]; ### How was this patch tested? Add UTs. Closes #26466 from LantaoJin/SPARK-29839. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 December 2019, 08:11:58 UTC
169415f [SPARK-30025][CORE] Continuous shuffle block fetching should be disabled by default when the old fetch protocol is used ### What changes were proposed in this pull request? Disable continuous shuffle block fetching when the old fetch protocol in use. ### Why are the changes needed? The new feature of continuous shuffle block fetching depends on the latest version of the shuffle fetch protocol. We should keep this constraint in `BlockStoreShuffleReader.fetchContinuousBlocksInBatch`. ### Does this PR introduce any user-facing change? Users will not get the exception related to continuous shuffle block fetching when old version of the external shuffle service is used. ### How was this patch tested? Existing UT. Closes #26663 from xuanyuanking/SPARK-30025. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 December 2019, 07:59:12 UTC
03ac1b7 [SPARK-29959][ML][PYSPARK] Summarizer support more metrics ### What changes were proposed in this pull request? Summarizer supports more metrics: sum, std. ### Why are the changes needed? Those metrics are widely used; it is more convenient to obtain them directly than via a conversion. In `NaiveBayes` we want the sum of vectors, which currently requires computing mean & weightSum and then multiplying them. In `StandardScaler`, `AFTSurvivalRegression`, `LinearRegression`, `LinearSVC`, and `LogisticRegression` we need to obtain `variance` and then take its sqrt to get std. ### Does this PR introduce any user-facing change? Yes, new metrics are exposed to end users. ### How was this patch tested? added testsuites Closes #26596 from zhengruifeng/summarizer_add_metrics. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com> 02 December 2019, 06:44:31 UTC
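A hedged sketch of requesting the newly exposed metrics through the Scala `Summarizer` API; the DataFrame contents and column names are illustrative.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("summarizer-demo").getOrCreate()
import spark.implicits._

val df = Seq(Tuple1(Vectors.dense(1.0, 2.0)), Tuple1(Vectors.dense(3.0, 4.0))).toDF("features")

// "sum" and "std" are among the metrics this PR exposes.
df.select(Summarizer.metrics("sum", "std").summary($"features").as("s"))
  .select("s.sum", "s.std")
  .show(truncate = false)
```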
85cb388 [SPARK-30050][SQL] analyze table and rename table should not erase hive table bucketing info ### What changes were proposed in this pull request? This patch adds the Hive provider into the table metadata in `HiveExternalCatalog.alterTableStats`. When we call `HiveClient.alterTable`, `alterTable` will erase the bucketing info if it can not find the Hive provider in the given table metadata. Rename table also has this issue. ### Why are the changes needed? Because running `ANALYZE TABLE` on a Hive table that has bucketing info erases the existing bucket info. ### Does this PR introduce any user-facing change? Yes. After this PR, running `ANALYZE TABLE` on a Hive table won't erase existing bucketing info. ### How was this patch tested? Unit test. Closes #26685 from viirya/fix-hive-bucket. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 December 2019, 05:40:11 UTC
51e69fe [SPARK-29851][SQL][FOLLOW-UP] Use foreach instead of misusing map ### What changes were proposed in this pull request? This PR proposes to use foreach instead of misusing map as a small followup of #26476. This could cause some weird errors potentially and it's not a good practice anyway. See also SPARK-16694 ### Why are the changes needed? To avoid potential issues like SPARK-16694 ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests should cover. Closes #26729 from HyukjinKwon/SPARK-29851. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 02 December 2019, 04:40:00 UTC
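A tiny hedged illustration of the anti-pattern being cleaned up: using `map` purely for side effects discards its result and, on lazy collections, may never run at all; `foreach` states the intent.

```scala
val xs = Seq(1, 2, 3)
xs.map(println)       // misuse: builds and throws away a Seq[Unit]
xs.foreach(println)   // correct: side effect only

// With a lazy collection the difference is not just stylistic:
Iterator(1, 2, 3).map(println)       // prints nothing until the iterator is consumed
Iterator(1, 2, 3).foreach(println)   // prints immediately
```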
d1465a1 [SPARK-30074][SQL] The maxNumPostShufflePartitions config should obey reducePostShufflePartitions enabled ### What changes were proposed in this pull request? 1. Make the maxNumPostShufflePartitions config obey the reducePostShufflePartitions config. 2. Update the descriptions for all the SQLConf entries affected by `spark.sql.adaptive.enabled`. ### Why are the changes needed? Make the relation between these confs clearer. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT. Closes #26664 from xuanyuanking/SPARK-9853-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 December 2019, 04:37:06 UTC
5a1896a [SPARK-30065][SQL] DataFrameNaFunctions.drop should handle duplicate columns ### What changes were proposed in this pull request? `DataFrameNaFunctions.drop` doesn't handle duplicate columns even when column names are not specified. ```Scala val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") val df = left.join(right, Seq("col1")) df.printSchema df.na.drop("any").show ``` produces ``` root |-- col1: string (nullable = true) |-- col2: string (nullable = true) |-- col2: string (nullable = true) org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.; at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) ``` The reason for the above failure is that columns are resolved by name and if there are multiple columns with the same name, it will fail due to ambiguity. This PR updates `DataFrameNaFunctions.drop` such that if the columns to drop are not specified, it will resolve ambiguity gracefully by applying `drop` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity). ### Why are the changes needed? If column names are not specified, `drop` should not fail due to ambiguity since it should still be able to apply `drop` to the eligible columns. ### Does this PR introduce any user-facing change? Yes, now all the rows with nulls are dropped in the above example: ``` scala> df.na.drop("any").show +----+----+----+ |col1|col2|col2| +----+----+----+ +----+----+----+ ``` ### How was this patch tested? Added new unit tests. Closes #26700 from imback82/na_drop. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 December 2019, 04:25:28 UTC
87ebfaf [SPARK-29956][SQL] A literal number with an exponent should be parsed to Double ### What changes were proposed in this pull request? For a literal number with an exponent(e.g. 1e-45, 1E2), we'd parse it to Double by default rather than Decimal. And user could still use `spark.sql.legacy.exponentLiteralToDecimal.enabled=true` to fall back to previous behavior. ### Why are the changes needed? According to ANSI standard of SQL, we see that the (part of) definition of `literal` : ``` <approximate numeric literal> ::= <mantissa> E <exponent> ``` which indicates that a literal number with an exponent should be approximate numeric(e.g. Double) rather than exact numeric(e.g. Decimal). And when we test Presto, we found that Presto also conforms to this standard: ``` presto:default> select typeof(1E2); _col0 -------- double (1 row) ``` ``` presto:default> select typeof(1.2); _col0 -------------- decimal(2,1) (1 row) ``` We also find that, actually, literals like `1E2` are parsed as Double before Spark2.1, but changed to Decimal after #14828 due to *The difference between the two confuses most users* as it said. But we also see support(from DB2 test) of original behavior at #14828 (comment). Although, we also see that PostgreSQL has its own implementation: ``` postgres=# select pg_typeof(1E2); pg_typeof ----------- numeric (1 row) postgres=# select pg_typeof(1.2); pg_typeof ----------- numeric (1 row) ``` We still think that Spark should also conform to this standard while considering SQL standard and Spark own history and majority DBMS and also user experience. ### Does this PR introduce any user-facing change? Yes. For `1E2`, before this PR: ``` scala> spark.sql("select 1E2") res0: org.apache.spark.sql.DataFrame = [1E+2: decimal(1,-2)] ``` After this PR: ``` scala> spark.sql("select 1E2") res0: org.apache.spark.sql.DataFrame = [100.0: double] ``` And for `1E-45`, before this PR: ``` org.apache.spark.sql.catalyst.parser.ParseException: decimal can only support precision up to 38 == SQL == select 1E-45 at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:131) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided ``` after this PR: ``` scala> spark.sql("select 1E-45"); res1: org.apache.spark.sql.DataFrame = [1.0E-45: double] ``` And before this PR, user may feel super weird to see that `select 1e40` works but `select 1e-40 fails`. And now, both of them work well. ### How was this patch tested? updated `literals.sql.out` and `ansi/literals.sql.out` Closes #26595 from Ngone51/SPARK-29956. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 02 December 2019, 03:34:56 UTC
708ab57 [SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column ## What changes were proposed in this pull request? [HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063. > HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1). However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0. The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however. **Spark SQL**: ```sql // bin/spark-sql spark-sql> select cast(1 as decimal(38, 18)); 1 spark-sql> // bin/beeline 0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18)); +----------------------------+--+ | CAST(1 AS DECIMAL(38,18)) | +----------------------------+--+ | 1.000000000000000000 | +----------------------------+--+ // bin/spark-shell scala> spark.sql("select cast(1 as decimal(38, 18))").show(false) +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ |1.000000000000000000 | +-------------------------+ // bin/pyspark >>> spark.sql("select cast(1 as decimal(38, 18))").show() +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ | 1.000000000000000000| +-------------------------+ // bin/sparkR > showDF(sql("SELECT cast(1 as decimal(38, 18))")) +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ | 1.000000000000000000| +-------------------------+ ``` **PostgreSQL**: ```sql postgres=# select cast(1 as decimal(38, 18)); numeric ---------------------- 1.000000000000000000 (1 row) ``` **Presto**: ```sql presto> select cast(1 as decimal(38, 18)); _col0 ---------------------- 1.000000000000000000 (1 row) ``` ## How was this patch tested? unit tests and manual test: ```sql spark-sql> select cast(1 as decimal(38, 18)); 1.000000000000000000 ``` Spark SQL Upgrading Guide: ![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png) Closes #26697 from wangyum/SPARK-28461. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 02 December 2019, 00:02:39 UTC
f32ca4b [SPARK-30076][BUILD][TESTS] Upgrade Mockito to 3.1.0 ### What changes were proposed in this pull request? We used 2.28.2 of Mockito as of https://github.com/apache/spark/pull/25139 because 3.0.0 might be unstable. Now 3.1.0 is released. See release notes - https://github.com/mockito/mockito/blob/v3.1.0/doc/release-notes/official.md ### Why are the changes needed? To bring the fixes made in the dependency. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Jenkins will test. Closes #26707 from HyukjinKwon/upgrade-Mockito. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com> 30 November 2019, 18:23:11 UTC
700a2ed [SPARK-30057][DOCS] Add a statement of platforms Spark runs on Closes #26690 from huangtianhua/add-note-spark-runs-on-arm64. Authored-by: huangtianhua <huangtianhua@huawei.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 30 November 2019, 15:07:01 UTC
f22177c [SPARK-29486][SQL][FOLLOWUP] Document the reason to add days field ### What changes were proposed in this pull request? Follow-up of #26134 to document the reason for adding the days field and explain how we use it. ### Why are the changes needed? Comment-only change. ### Does this PR introduce any user-facing change? no ### How was this patch tested? No test needed. Closes #26701 from LinhongLiu/spark-29486-followup. Authored-by: Liu,Linhong <liulinhong@baidu.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 30 November 2019, 14:43:34 UTC
91b83de [SPARK-30086][SQL][TESTS] Run HiveThriftServer2ListenerSuite on a dedicated JVM to fix flakiness ### What changes were proposed in this pull request? This PR tries to fix flakiness in `HiveThriftServer2ListenerSuite` by using a dedicated JVM (after we switch to Hive 2.3 by default in PR builders). Likewise in https://github.com/apache/spark/commit/4a73bed3180aeb79c92bb19aea2ac5a97899731a, there's no explicit evidence for this fix. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114653/testReport/org.apache.spark.sql.hive.thriftserver.ui/HiveThriftServer2ListenerSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ ``` sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.LinkageError: loader constraint violation: loader (instance of net/bytebuddy/dynamic/loading/MultipleParentClassLoader) previously initiated loading for a different type with name "org/apache/hive/service/ServiceStateChangeListener" at org.mockito.codegen.HiveThriftServer2$MockitoMock$1974707245.<clinit>(Unknown Source) at sun.reflect.GeneratedSerializationConstructorAccessor164.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.objenesis.instantiator.sun.SunReflectionFactoryInstantiator.newInstance(SunReflectionFactoryInstantiator.java:48) at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73) at org.mockito.internal.creation.instance.ObjenesisInstantiator.newInstance(ObjenesisInstantiator.java:19) at org.mockito.internal.creation.bytebuddy.SubclassByteBuddyMockMaker.createMock(SubclassByteBuddyMockMaker.java:47) at org.mockito.internal.creation.bytebuddy.ByteBuddyMockMaker.createMock(ByteBuddyMockMaker.java:25) at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:35) at org.mockito.internal.MockitoCore.mock(MockitoCore.java:62) at org.mockito.Mockito.mock(Mockito.java:1908) at org.mockito.Mockito.mock(Mockito.java:1880) at org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite.createAppStatusStore(HiveThriftServer2ListenerSuite.scala:156) at org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite.$anonfun$new$3(HiveThriftServer2ListenerSuite.scala:47) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) ``` ### Why are the changes needed? To make test cases more robust. ### Does this PR introduce any user-facing change? No (dev only). ### How was this patch tested? Jenkins build. Closes #26720 from shahidki31/mock. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 30 November 2019, 11:30:04 UTC
32af700 [SPARK-25016][INFRA][FOLLOW-UP] Remove leftover for dropping Hadoop 2.6 in Jenkins's test script ### What changes were proposed in this pull request? This PR proposes to remove the leftover. After https://github.com/apache/spark/pull/22615, we don't have Hadoop 2.6 profile anymore in master. ### Why are the changes needed? Using "test-hadoop2.6" against master branch in a PR wouldn't work. ### Does this PR introduce any user-facing change? No (dev only). ### How was this patch tested? Manually tested at https://github.com/apache/spark/pull/26707 and Jenkins build will test. Without this fix, and hadoop2.6 in the pr title, it shows as below: ``` ======================================================================== Building Spark ======================================================================== [error] Could not find hadoop2.6 in the list. Valid options are dict_keys(['hadoop2.7', 'hadoop3.2']) Attempting to post to Github... ``` Closes #26708 from HyukjinKwon/SPARK-25016. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 30 November 2019, 03:49:14 UTC
4a73bed [SPARK-29991][INFRA] Support Hive 1.2 and Hive 2.3 (default) in PR builder ### What changes were proposed in this pull request? Currently, Apache Spark PR Builder using `hive-1.2` for `hadoop-2.7` and `hive-2.3` for `hadoop-3.2`. This PR aims to support - `[test-hive1.2]` in PR builder - `[test-hive2.3]` in PR builder to be consistent and independent of the default profile - After this PR, all PR builders will use Hive 2.3 by default (because Spark uses Hive 2.3 by default as of https://github.com/apache/spark/commit/c98e5eb3396a6db92f2420e743afa9ddff319ca2) - Use default profile in AppVeyor build. Note that this was reverted due to unexpected test failure at `ThriftServerPageSuite`, which was investigated in https://github.com/apache/spark/pull/26706 . This PR fixed it by letting it use their own forked JVM. There is no explicit evidence for this fix and it was just my speculation, and thankfully it fixed at least. ### Why are the changes needed? This new tag allows us more flexibility. ### Does this PR introduce any user-facing change? No. (This is a dev-only change.) ### How was this patch tested? Check the Jenkins triggers in this PR. Default: ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using SBT with these arguments: -Phadoop-2.7 -Phive-2.3 -Phive-thriftserver -Pmesos -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pkubernetes -Pkinesis-asl -Pyarn test:package streaming-kinesis-asl-assembly/assembly ``` `[test-hive1.2][test-hadoop3.2]`: ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using SBT with these arguments: -Phadoop-3.2 -Phive-1.2 -Phadoop-cloud -Pyarn -Pspark-ganglia-lgpl -Phive -Phive-thriftserver -Pmesos -Pkubernetes -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly ``` `[test-maven][test-hive-2.3]`: ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using Maven with these arguments: -Phadoop-2.7 -Phive-2.3 -Pspark-ganglia-lgpl -Pyarn -Phive -Phadoop-cloud -Pkinesis-asl -Pmesos -Pkubernetes -Phive-thriftserver clean package -DskipTests ``` Closes #26710 from HyukjinKwon/SPARK-29991. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 30 November 2019, 03:48:15 UTC
b182ed8 [SPARK-29724][SPARK-29726][WEBUI][SQL] Support JDBC/ODBC tab for HistoryServer WebUI ### What changes were proposed in this pull request? Support JDBC/ODBC tab for HistoryServer WebUI. Currently from Historyserver we can't access the JDBC/ODBC tab for thrift server applications. In this PR, I am doing 2 main changes 1. Refactor existing thrift server listener to support kvstore 2. Add history server plugin for thrift server listener and tab. ### Why are the changes needed? Users can access Thriftserver tab from History server for both running and finished applications, ### Does this PR introduce any user-facing change? Support for JDBC/ODBC tab for the WEBUI from History server ### How was this patch tested? Add UT and Manual tests 1. Start Thriftserver and Historyserver ``` sbin/stop-thriftserver.sh sbin/stop-historyserver.sh sbin/start-thriftserver.sh sbin/start-historyserver.sh ``` 2. Launch beeline `bin/beeline -u jdbc:hive2://localhost:10000` 3. Run queries Go to the JDBC/ODBC page of the WebUI from History server ![image](https://user-images.githubusercontent.com/23054875/68365501-cf013700-0156-11ea-84b4-fda8008c92c4.png) Closes #26378 from shahidki31/ThriftKVStore. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 30 November 2019, 03:44:31 UTC
9351e3e Revert "[SPARK-29991][INFRA] Support `test-hive1.2` in PR Builder" This reverts commit dde0d2fcadbdf59a6ec696da12bd72bdfc968bc5. 29 November 2019, 04:23:22 UTC
43556e4 [SPARK-29877][GRAPHX] static PageRank allow checkPoint from previous computations ### What changes were proposed in this pull request? Add an optional parameter to the staticPageRank computation that takes the result of a previous PageRank computation. This lets the algorithm start from a point closer to convergence. ### Why are the changes needed? https://issues.apache.org/jira/browse/SPARK-29877 It would be really helpful, when computing staticPageRank, to be able to use a previous computation as a checkpoint and continue the iterations from there. ### Does this PR introduce any user-facing change? Yes, it allows starting the static PageRank computation from the point where an earlier one finished. Example: Compute 10 iterations first, and continue for 3 more iterations ```scala val partialPageRank = graph.ops.staticPageRank(numIter=10, resetProb=0.15) val continuationPageRank = graph.ops.staticPageRank(numIter=3, resetProb=0.15, Some(partialPageRank)) ``` ### How was this patch tested? Yes, some tests were added. Testing was done as follows: - Check how many iterations it takes for a static PageRank computation to converge - Run the static PageRank computation for half of these iterations and take the result as a checkpoint - Restart the computation and check the number of iterations it takes to converge. It never has to be larger than the original one, and in most cases it is much smaller. Due to the presence of sinks and the normalization done in [[SPARK-18847]], computing static PageRank for 2 iterations, taking the result as a checkpoint, and running 2 more iterations is not exactly equivalent to computing 4 iterations directly. However, this checkpointing can give the algorithm a hint about the true distribution of PageRanks in the graph. Closes #26608 from JoanFM/pageRank_checkPoint. Authored-by: joanfontanals <jfontanals@ntent.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 28 November 2019, 14:36:54 UTC
dde0d2f [SPARK-29991][INFRA] Support `test-hive1.2` in PR Builder ### What changes were proposed in this pull request? Currently, the Apache Spark PR Builder uses `hive-1.2` for `hadoop-2.7` and `hive-2.3` for `hadoop-3.2`. This PR aims to support `[test-hive1.2]` in the PR Builder in order to cut the correlation between `hive-1.2/2.3` and `hadoop-2.7/3.2`. After this PR, the PR Builder will use `hive-2.3` by default for all profiles (if there is no `test-hive1.2`.) ### Why are the changes needed? This new tag allows us more flexibility. ### Does this PR introduce any user-facing change? No. (This is a dev-only change.) ### How was this patch tested? Check the Jenkins triggers in this PR. **BEFORE** ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using SBT with these arguments: -Phadoop-2.7 -Phive-1.2 -Pyarn -Pkubernetes -Phive -Phadoop-cloud -Pspark-ganglia-lgpl -Phive-thriftserver -Pkinesis-asl -Pmesos test:package streaming-kinesis-asl-assembly/assembly ``` **AFTER** 1. Title: [[SPARK-29991][INFRA][test-hive1.2] Support `test-hive1.2` in PR Builder](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114550/testReport) ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using SBT with these arguments: -Phadoop-2.7 -Phive-1.2 -Pkinesis-asl -Phadoop-cloud -Pyarn -Phive -Pmesos -Pspark-ganglia-lgpl -Pkubernetes -Phive-thriftserver test:package streaming-kinesis-asl-assembly/assembly ``` 2. Title: [[SPARK-29991][INFRA] Support `test hive1.2` in PR Builder](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114551/testReport) - Note that I removed the hyphen intentionally from `test-hive1.2`. ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark using SBT with these arguments: -Phadoop-2.7 -Phive-thriftserver -Pkubernetes -Pspark-ganglia-lgpl -Phadoop-cloud -Phive -Pmesos -Pyarn -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly ``` Closes #26695 from dongjoon-hyun/SPARK-29991. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 28 November 2019, 05:14:40 UTC
9459833 [SPARK-29989][INFRA] Add `hadoop-2.7/hive-2.3` pre-built distribution ### What changes were proposed in this pull request? This PR aims to add another pre-built binary distribution with `-Phadoop-2.7 -Phive-1.2` at `Apache Spark 3.0.0`. **PRE-BUILT BINARY DISTRIBUTION** ``` spark-3.0.0-SNAPSHOT-bin-hadoop2.7-hive1.2.tgz spark-3.0.0-SNAPSHOT-bin-hadoop2.7-hive1.2.tgz.asc spark-3.0.0-SNAPSHOT-bin-hadoop2.7-hive1.2.tgz.sha512 ``` **CONTENTS (snippet)** ``` $ ls *hadoop-* hadoop-annotations-2.7.4.jar hadoop-mapreduce-client-shuffle-2.7.4.jar hadoop-auth-2.7.4.jar hadoop-yarn-api-2.7.4.jar hadoop-client-2.7.4.jar hadoop-yarn-client-2.7.4.jar hadoop-common-2.7.4.jar hadoop-yarn-common-2.7.4.jar hadoop-hdfs-2.7.4.jar hadoop-yarn-server-common-2.7.4.jar hadoop-mapreduce-client-app-2.7.4.jar hadoop-yarn-server-web-proxy-2.7.4.jar hadoop-mapreduce-client-common-2.7.4.jar parquet-hadoop-1.10.1.jar hadoop-mapreduce-client-core-2.7.4.jar parquet-hadoop-bundle-1.6.0.jar hadoop-mapreduce-client-jobclient-2.7.4.jar $ ls *hive-* hive-beeline-1.2.1.spark2.jar hive-jdbc-1.2.1.spark2.jar hive-cli-1.2.1.spark2.jar hive-metastore-1.2.1.spark2.jar hive-exec-1.2.1.spark2.jar spark-hive-thriftserver_2.12-3.0.0-SNAPSHOT.jar ``` ### Why are the changes needed? Since Apache Spark switched to use `-Phive-2.3` by default, all pre-built binary distributions will use `-Phive-2.3`. This PR adds a `hadoop-2.7/hive-1.2` distribution to provide a combination similar to the `Apache Spark 2.4` line. ### Does this PR introduce any user-facing change? Yes. This is an additional distribution that resembles the `Apache Spark 2.4` line in terms of the `hive` version. ### How was this patch tested? Manual. Please note that we need a dry-run mode, but the as-is release script does not generate additional combinations, including this one, in `dry-run` mode. Closes #26688 from dongjoon-hyun/SPARK-29989. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com> 27 November 2019, 23:55:52 UTC
9cd174a Revert "[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column" This reverts commit 19af1fe3a2b604a653c9f736d11648b79b93bb17. 27 November 2019, 19:07:08 UTC
16da714 [SPARK-29979][SQL][FOLLOW-UP] improve the output of DescribeTableExec ### What changes were proposed in this pull request? Refine the output of the "DESC TABLE" command. After this PR, the output of the "DESC TABLE" command is like below: ``` id bigint data string # Partitioning Part 0 id # Detailed Table Information Name testca.table_name Comment this is a test table Location /tmp/testcat/table_name Provider foo Table Properties [bar=baz] ``` ### Why are the changes needed? Currently, "DESC TABLE" will show reserved properties (e.g. location, comment) in the "Table Property" section. Since reserved properties are different from common properties, it is more reasonable to display reserved properties together with the other detailed table information and to display the remaining properties in a single field; this is also consistent with Hive and the DescribeTableCommand action. ### Does this PR introduce any user-facing change? Yes, the output of the "DESC TABLE" command is refined as above. ### How was this patch tested? Update existing unit tests. Closes #26677 from fuwhu/SPARK-29979-FOLLOWUP-1. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 November 2019, 15:16:53 UTC
d075b33 [SPARK-28366][CORE][FOLLOW-UP] Improve the conf IO_WARNING_LARGEFILETHRESHOLD ### What changes were proposed in this pull request? Improve conf `IO_WARNING_LARGEFILETHRESHOLD` (a.k.a `spark.io.warning.largeFileThreshold`): * reword documentation * change type from `long` to `bytes` ### Why are the changes needed? Improvements according to https://github.com/apache/spark/pull/25134#discussion_r350570804 & https://github.com/apache/spark/pull/25134#discussion_r350570917. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #26691 from Ngone51/SPARK-28366-followup. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 November 2019, 12:34:22 UTC
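A small sketch of what the `long` → `bytes` type change above means in practice: the threshold can now be given with a size suffix. The value `2g` is purely illustrative, and treating this conf as something users would set directly is an assumption of this sketch.

```scala
import org.apache.spark.SparkConf

// Paste into spark-shell or use when constructing a SparkContext/SparkSession.
val conf = new SparkConf()
  // Previously a plain long number of bytes; as a bytes conf it can take suffixed
  // values such as "2g" (assumption: this conf is meant to be set by users at all).
  .set("spark.io.warning.largeFileThreshold", "2g")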
7c0ce28 [SPARK-30056][INFRA] Skip building test artifacts in `dev/make-distribution.sh` ### What changes were proposed in this pull request? This PR aims to skip building test artifacts in `dev/make-distribution.sh`. Since Apache Spark 3.0.0 needs to build additional binary distributions, this helps the release process by speeding up building multiple binary distributions. ### Why are the changes needed? Since the test jars are irrelevant to the generated binary artifacts, we can skip building them. **BEFORE** ``` $ time dev/make-distribution.sh 726.86 real 2526.04 user 45.63 sys ``` **AFTER** ``` $ time dev/make-distribution.sh 305.54 real 1099.99 user 26.52 sys ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually check the `dev/make-distribution.sh` result and time. Closes #26689 from dongjoon-hyun/SPARK-30056. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 27 November 2019, 09:19:21 UTC
19af1fe [SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column ## What changes were proposed in this pull request? [HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063. > HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1). However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0. The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however. **Spark SQL**: ```sql // bin/spark-sql spark-sql> select cast(1 as decimal(38, 18)); 1 spark-sql> // bin/beeline 0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18)); +----------------------------+--+ | CAST(1 AS DECIMAL(38,18)) | +----------------------------+--+ | 1.000000000000000000 | +----------------------------+--+ // bin/spark-shell scala> spark.sql("select cast(1 as decimal(38, 18))").show(false) +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ |1.000000000000000000 | +-------------------------+ // bin/pyspark >>> spark.sql("select cast(1 as decimal(38, 18))").show() +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ | 1.000000000000000000| +-------------------------+ // bin/sparkR > showDF(sql("SELECT cast(1 as decimal(38, 18))")) +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ | 1.000000000000000000| +-------------------------+ ``` **PostgreSQL**: ```sql postgres=# select cast(1 as decimal(38, 18)); numeric ---------------------- 1.000000000000000000 (1 row) ``` **Presto**: ```sql presto> select cast(1 as decimal(38, 18)); _col0 ---------------------- 1.000000000000000000 (1 row) ``` ## How was this patch tested? unit tests and manual test: ```sql spark-sql> select cast(1 as decimal(38, 18)); 1.000000000000000000 ``` Spark SQL Upgrading Guide: ![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png) Closes #25214 from wangyum/SPARK-28461. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 27 November 2019, 09:13:33 UTC
a58d91b [SPARK-29768][SQL] Column pruning through nondeterministic expressions ### What changes were proposed in this pull request? Support column pruning through non-deterministic expressions. ### Why are the changes needed? In some cases, columns can still be pruned even though nondeterministic expressions appear. E.g., for the plan `Filter('a = 1, Project(Seq('a, rand() as 'r), LogicalRelation('a, 'b)))`, we should still prune column `b` even though a non-deterministic expression appears (see the sketch after this entry). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a new test file: `ScanOperationSuite`. Added test in `FileSourceStrategySuite` to verify the right prune behavior for both DS v1 and v2. Closes #26629 from Ngone51/SPARK-29768. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 November 2019, 07:37:01 UTC
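A minimal sketch of the kind of query this change targets, assuming a hypothetical Parquet table with columns `a` and `b` (the path and column names are made up; the plan shape mirrors the example in the entry above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object PruneThroughNondeterministicSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical two-column table.
    Seq((1, "x"), (2, "y")).toDF("a", "b").write.mode("overwrite").parquet("/tmp/prune_sketch")

    val q = spark.read.parquet("/tmp/prune_sketch")
      .select($"a", rand().as("r")) // nondeterministic projection
      .filter($"a" === 1)

    // With column pruning through nondeterministic expressions, the scan should read
    // only column `a`; `b` is referenced by neither the filter nor the projection.
    q.explain(true)
    spark.stop()
  }
}
```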
4fd585d [SPARK-30008][SQL] The dataType of collect_list/collect_set aggs should be ArrayType(_, false) ### What changes were proposed in this pull request? ```scala // Do not allow null values. We follow the semantics of Hive's collect_list/collect_set here. // See: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMkCollectionEvaluator ``` These two functions do not allow null values as they are defined, so their elements should not contain null. ### Why are the changes needed? Casting collect_list(a) to ArrayType(_, false) fails before this fix. ### Does this PR introduce any user-facing change? no ### How was this patch tested? add ut Closes #26651 from yaooqinn/SPARK-30008. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 November 2019, 04:40:21 UTC
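A minimal sketch of the data-type change described above (column names and values are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list
import org.apache.spark.sql.types.{ArrayType, IntegerType}

object CollectListTypeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    val agg = Seq(1, 2, 3).toDF("a").agg(collect_list($"a").as("xs"))

    // With this change the aggregate reports ArrayType(IntegerType, containsNull = false),
    // so casting to an array with non-nullable elements no longer fails.
    println(agg.schema("xs").dataType)
    agg.select($"xs".cast(ArrayType(IntegerType, containsNull = false))).show()
    spark.stop()
  }
}
```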
08e2a39 [SPARK-29997][WEBUI] Show job name for empty jobs in WebUI ### What changes were proposed in this pull request? In the current implementation, the job name for empty jobs is not shown, so I've made a change to show it. ### Why are the changes needed? To make debugging easier. ### Does this PR introduce any user-facing change? Yes. Before applying my change, the `Job Page` shows the following as the result of submitting a job which contains no partitions. ![fix-ui-for-empty-job-before](https://user-images.githubusercontent.com/4736016/69410847-33bfb280-0d4f-11ea-9878-d67638cbe4cb.png) After applying my change, the page shows a display like the following screenshot. ![fix-ui-for-empty-job-after](https://user-images.githubusercontent.com/4736016/69411021-86996a00-0d4f-11ea-8dea-bb8456159d18.png) ### How was this patch tested? Manual test. Closes #26637 from sarutak/fix-ui-for-empty-job. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 27 November 2019, 03:38:46 UTC
5b628f8 Revert "[SPARK-26081][SPARK-29999]" ### What changes were proposed in this pull request? This reverts commit 31c4fab (#23052) to make sure the partition calling `ManifestFileCommitProtocol.newTaskTempFile` creates actual file. This also reverts part of commit 0d3d46d (#26639) since the commit fixes the issue raised from 31c4fab and we're reverting back. The reason of partial revert is that we found the UT be worth to keep as it is, preventing regression - given the UT can detect the issue on empty partition -> no actual file. This makes one more change to UT; moved intentionally to test both DSv1 and DSv2. ### Why are the changes needed? After the changes in SPARK-26081 (commit 31c4fab / #23052), CSV/JSON/TEXT don't create actual file if the partition is empty. This optimization causes a problem in `ManifestFileCommitProtocol`: the API `newTaskTempFile` is called without actual file creation. Then `fs.getFileStatus` throws `FileNotFoundException` since the file is not created. SPARK-29999 (commit 0d3d46d / #26639) fixes the problem. But it is too costly to check file existence on each task commit. We should simply restore the behavior before SPARK-26081. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Jenkins build will follow. Closes #26671 from HeartSaVioR/revert-SPARK-26081-SPARK-29999. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com> 27 November 2019, 02:36:08 UTC
bdf0c60 [SPARK-28752][BUILD][FOLLOWUP] Fix to install `rouge` instead of `rogue` ### What changes were proposed in this pull request? This PR aims to fix a typo: `rogue` -> `rouge`. This is a follow-up of https://github.com/apache/spark/pull/26521. ### Why are the changes needed? To support `Python 3`, we upgraded from `pygments` to `rouge`. ### Does this PR introduce any user-facing change? No. (This is for document generation only.) ### How was this patch tested? Manually. ``` $ docker build -t test dev/create-release/spark-rm/ ... 1 gem installed Successfully installed rouge-3.13.0 Parsing documentation for rouge-3.13.0 Installing ri documentation for rouge-3.13.0 Done installing documentation for rouge after 4 seconds 1 gem installed Removing intermediate container 9bd8707d9e84 ---> a18b2f6b0bb9 ... ``` Closes #26686 from dongjoon-hyun/SPARK-28752. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 November 2019, 22:58:20 UTC
fd2bf55 [SPARK-27651][CORE] Avoid the network when shuffle blocks are fetched from the same host ## What changes were proposed in this pull request? Before this PR, `ShuffleBlockFetcherIterator` was partitioning the block fetches into two distinct sets: local reads and remote fetches. With this PR (when the feature is enabled by "spark.shuffle.readHostLocalDisk.enabled"; see the sketch after this entry) a new category is introduced: host-local reads. They are shuffle block fetches where, although the block manager is different, it runs on the same host as the requester. Moreover, to get the local directories of the other executors/block managers, a new RPC message `GetLocalDirs` is introduced, which is sent to the block manager master and answered with `BlockManagerLocalDirs`. In `BlockManagerMasterEndpoint`, to answer this request, the `localDirs` is extracted from the `BlockManagerInfo` and stored separately in a hash map called `executorIdLocalDirs`, because the previously used `blockManagerInfo` only contains data for the alive block managers (see `org.apache.spark.storage.BlockManagerMasterEndpoint#removeBlockManager`). Now `executorIdLocalDirs` knows all the local dirs since the application start (like the external shuffle service does), so in case of an RDD recalculation both host-local shuffle blocks and disk-persisted RDD blocks on the same host can be served by reading the files behind the blocks directly. ## How was this patch tested? ### Unit tests `ExternalShuffleServiceSuite`: - "SPARK-27651: host local disk reading avoids external shuffle service on the same node" `ShuffleBlockFetcherIteratorSuite`: - "successful 3 local reads + 4 host local reads + 2 remote reads" And by extending existing suites where shuffle metrics are tested. ### Manual tests Running Spark on YARN in a 4-node cluster with 6 executors and 12 shuffle blocks. ``` $ grep host-local experiment.log 19/07/30 03:57:12 INFO storage.ShuffleBlockFetcherIterator: Getting 12 (1496.8 MB) non-empty blocks including 2 (299.4 MB) local blocks and 2 (299.4 MB) host-local blocks and 8 (1197.4 MB) remote blocks 19/07/30 03:57:12 DEBUG storage.ShuffleBlockFetcherIterator: Start fetching host-local blocks: shuffle_0_2_1, shuffle_0_6_1 19/07/30 03:57:12 DEBUG storage.ShuffleBlockFetcherIterator: Got host-local blocks in 38 ms 19/07/30 03:57:12 INFO storage.ShuffleBlockFetcherIterator: Getting 12 (1496.8 MB) non-empty blocks including 2 (299.4 MB) local blocks and 2 (299.4 MB) host-local blocks and 8 (1197.4 MB) remote blocks 19/07/30 03:57:12 DEBUG storage.ShuffleBlockFetcherIterator: Start fetching host-local blocks: shuffle_0_0_0, shuffle_0_8_0 19/07/30 03:57:12 DEBUG storage.ShuffleBlockFetcherIterator: Got host-local blocks in 35 ms ``` Closes #25299 from attilapiros/SPARK-27651. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 26 November 2019, 19:02:25 UTC
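A minimal sketch of turning the feature on, using the flag name quoted in the entry above (treat the flag name as an assumption of this sketch; the job is just a placeholder shuffle, and on a single-host `local` master the setting is harmless but has no visible effect):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HostLocalShuffleReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // Flag name as quoted in the entry above.
      .set("spark.shuffle.readHostLocalDisk.enabled", "true")

    val spark = SparkSession.builder().config(conf).master("local[2]").appName("sketch").getOrCreate()

    // Any shuffle will do; with multiple executors per host on a real cluster,
    // blocks held by other executors on the same host are read directly from disk.
    spark.range(0, 1000000L).repartition(10).count()
    spark.stop()
  }
}
```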
e23c135 [SPARK-29293][BUILD] Move scalafmt to Scala 2.12 profile; bump to 0.12 ### What changes were proposed in this pull request? Move scalafmt to Scala 2.12 profile; bump to 0.12. ### Why are the changes needed? To facilitate a future Scala 2.13 build. ### Does this PR introduce any user-facing change? None. ### How was this patch tested? This isn't covered by tests, it's a convenience for contributors. Closes #26655 from srowen/SPARK-29293. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 November 2019, 17:59:19 UTC
ed0c33f [SPARK-30026][SQL] Whitespaces can be identified as delimiters in interval string ### What changes were proposed in this pull request? We are now able to handle whitespace for integral and fractional types, and leading or trailing whitespace for interval, date, and timestamp values. But the current interval parser is not able to treat whitespace as separators the way PostgreSQL can. This PR makes whitespace handling consistent for interval values. Typed interval literals, the multi-unit representation, and casting from strings are all supported. ```sql postgres=# select interval E'1 \t day'; interval ---------- 1 day (1 row) postgres=# select interval E'1\t' day; interval ---------- 1 day (1 row) ``` ### Why are the changes needed? Whitespace handling should be consistent for interval values and across different types in Spark. PostgreSQL feature parity. ### Does this PR introduce any user-facing change? Yes, multi-unit interval strings whose units are separated by whitespace are now valid. ### How was this patch tested? add ut. Closes #26662 from yaooqinn/SPARK-30026. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 November 2019, 17:20:38 UTC
2901802 [SPARK-30009][CORE][SQL] Support different floating-point Ordering for Scala 2.12 / 2.13 ### What changes were proposed in this pull request? Make separate source trees for Scala 2.12/2.13 in order to accommodate mutually-incompatible support for Ordering of double, float. Note: This isn't the last change that will need a split source tree for 2.13. But this particular change could go several ways: - (Split source tree) - Inline the Scala 2.12 implementation - Reflection For this change alone any of these are possible, and splitting the source tree is a bit of an overkill. But if it will be necessary for other JIRAs (see umbrella SPARK-25075), then it might be the easiest way to implement this. ### Why are the changes needed? Scala 2.13 split Ordering.Double into Ordering.Double.TotalOrdering and Ordering.Double.IeeeOrdering. Neither can be used in a single build that supports 2.12 and 2.13. TotalOrdering works like java.lang.Double.compare. IeeeOrdering works like Scala 2.12 Ordering.Double. They differ in how NaN is handled: does it always compare above other values, or do comparisons involving it always return 'false'? (See the sketch after this entry.) In theory they have different uses: TotalOrdering is important if floating-point values are sorted. IeeeOrdering behaves like 2.12 and the JVM comparison operators. I chose TotalOrdering as I think we care more about stable sorting, and because elsewhere we rely on java.lang comparisons. It would also be possible to support both via two methods. ### Does this PR introduce any user-facing change? Pending tests, will see if it obviously affects any sort order. We need to see if it changes NaN sort order. ### How was this patch tested? Existing tests so far. Closes #26654 from srowen/SPARK-30009. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 November 2019, 16:25:53 UTC
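A small sketch of the NaN difference described above. The `TotalOrdering`/`IeeeOrdering` lines are Scala 2.13-only and are left as comments so the snippet compiles on 2.12 as well; the `java.lang.Double.compare` line shows the "total" semantics the entry says was chosen.

```scala
object NanOrderingSketch {
  def main(args: Array[String]): Unit = {
    // Total ordering (java.lang.Double.compare): NaN compares above every other value.
    println(java.lang.Double.compare(Double.NaN, 1.0) > 0) // true

    // IEEE operator semantics (what Scala 2.12's Ordering.Double mirrors):
    println(Double.NaN > 1.0) // false
    println(Double.NaN < 1.0) // false

    // Scala 2.13 only (kept as comments so this also compiles on 2.12):
    // Seq(1.0, Double.NaN, -1.0).sorted(Ordering.Double.TotalOrdering) // List(-1.0, 1.0, NaN)
    // Ordering.Double.IeeeOrdering.lt(Double.NaN, 1.0)                 // false
  }
}
```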
7b1b60c [SPARK-28574][CORE][FOLLOW-UP] Several minor improvements for event queue capacity config ### What changes were proposed in this pull request? * Replace the hard-coded conf prefix `spark.scheduler.listenerbus.eventqueue` with a constant (`LISTENER_BUS_EVENT_QUEUE_PREFIX`) defined in `config/package.scala`. * Update the documentation for `spark.scheduler.listenerbus.eventqueue.capacity` in both `config/package.scala` and `docs/configuration.md`. ### Why are the changes needed? * Better code maintainability * Better user guidance for the conf ### Does this PR introduce any user-facing change? No behavior changes, but users will see the updated documentation. ### How was this patch tested? Pass Jenkins. Closes #26676 from Ngone51/SPARK-28574-followup. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 November 2019, 16:20:26 UTC
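For context, a minimal sketch of how the conf documented above is typically set (the value 20000 is purely illustrative):

```scala
import org.apache.spark.SparkConf

// Paste into spark-shell or use when building a SparkContext/SparkSession.
val conf = new SparkConf()
  // Default capacity of each listener-bus event queue; raise it if events are being dropped.
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
```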
c2d513f [SPARK-30035][BUILD] Upgrade to Apache Commons Lang 3.9 ### What changes were proposed in this pull request? This PR aims to upgrade to `Apache Commons Lang 3.9`. ### Why are the changes needed? `Apache Commons Lang 3.9` is the first official release to support JDK9+. The following is the full release note. - https://commons.apache.org/proper/commons-lang/release-notes/RELEASE-NOTES-3.9.txt ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #26672 from dongjoon-hyun/SPARK-30035. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 November 2019, 12:31:02 UTC
9b9d130 [SPARK-30030][BUILD][FOLLOWUP] Remove unused org.apache.commons.lang ### What changes were proposed in this pull request? This PR aims to remove the unused test dependency `commons-lang:commons-lang` from `core` module. ### Why are the changes needed? SPARK-30030 already removed all usage of `Apache Commons Lang2` in `core`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins. Closes #26673 from dongjoon-hyun/SPARK-30030-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 26 November 2019, 11:55:02 UTC
373c2c3 [SPARK-29862][SQL] CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add CreateViewStatement and make CREATE VIEW go through the same catalog/table resolution framework as v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC v // success and describe the view v from my_catalog CREATE VIEW v AS SELECT 1 // report view not found as there is no view v in the session catalog ``` ### Does this PR introduce any user-facing change? Yes. When running CREATE VIEW ..., Spark fails the command if the current catalog is set to a v2 catalog, or the view name specifies a v2 catalog. ### How was this patch tested? unit tests Closes #26649 from huaxingao/spark-29862. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 November 2019, 06:10:46 UTC
780555b [MINOR][CORE] Make EventLogger codec be consistent between EventLogFileWriter and SparkContext ### What changes were proposed in this pull request? Use the same function (`codecName(conf: SparkConf)`) in both `EventLogFileWriter` and `SparkContext` to get a consistent codec name for the EventLogger. ### Why are the changes needed? #24921 added a new conf for the EventLogger's compression codec. We should reflect this change in `SparkContext` as well. Though I didn't find any place where `SparkContext.eventLogCodec` really takes effect, I think it'd be better for it to hold the right value. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #26665 from Ngone51/consistent-eventLogCodec. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 November 2019, 04:54:34 UTC
8b0121b [MINOR][DOC] Fix the CalendarIntervalType description ### What changes were proposed in this pull request? Fix the outdated and incorrect description of CalendarIntervalType. ### Why are the changes needed? API doc correctness. ### Does this PR introduce any user-facing change? no ### How was this patch tested? no Closes #26659 from yaooqinn/intervaldoc. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 26 November 2019, 04:49:56 UTC
53e19f3 [SPARK-30032][BUILD] Upgrade to ORC 1.5.8 ### What changes were proposed in this pull request? This PR aims to upgrade to Apache ORC 1.5.8. ### Why are the changes needed? This will bring the latest bug fixes. The following is the full release note. - https://issues.apache.org/jira/projects/ORC/versions/12346462 ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #26669 from dongjoon-hyun/SPARK-ORC-1.5.8. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 26 November 2019, 04:08:11 UTC
2a28c73 [SPARK-30031][BUILD][SQL] Remove `hive-2.3` profile from `sql/hive` module ### What changes were proposed in this pull request? This PR aims to remove `hive-2.3` profile from `sql/hive` module. ### Why are the changes needed? Currently, we need `-Phive-1.2` or `-Phive-2.3` additionally to build `hive` or `hive-thriftserver` module. Without specifying it, the build fails like the following. This PR will recover it. ``` $ build/mvn -DskipTests compile --pl sql/hive ... [ERROR] [Error] /Users/dongjoon/APACHE/spark-merge/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:32: object serde is not a member of package org.apache.hadoop.hive ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? 1. Pass GitHub Action dependency check with no manifest change. 2. Pass GitHub Action build for all combinations. 3. Pass the Jenkins UT. Closes #26668 from dongjoon-hyun/SPARK-30031. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 November 2019, 23:17:27 UTC
38240a7 [SPARK-30030][INFRA] Use RegexChecker instead of TokenChecker to check `org.apache.commons.lang.` ### What changes were proposed in this pull request? This PR replaces `TokenChecker` with `RegexChecker` in `scalastyle` and fixes the missed instances. ### Why are the changes needed? This will remove the old `commons-lang` 2 dependency from the `core` module **BEFORE** ``` $ dev/scalastyle Scalastyle checks failed at following occurrences: [error] /Users/dongjoon/PRS/SPARK-SerializationUtils/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:7: Use Commons Lang 3 classes (package org.apache.commons.lang3.*) instead [error] of Commons Lang 2 (package org.apache.commons.lang.*) [error] Total time: 23 s, completed Nov 25, 2019 11:47:44 AM ``` **AFTER** ``` $ dev/scalastyle Scalastyle checks passed. ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the GitHub Action linter. Closes #26666 from dongjoon-hyun/SPARK-29081-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 November 2019, 20:03:15 UTC
1466863 [SPARK-30015][BUILD] Move hive-storage-api dependency from `hive-2.3` to `sql/core` # What changes were proposed in this pull request? This PR aims to relocate the following internal dependencies to compile `sql/core` without `-Phive-2.3` profile. 1. Move the `hive-storage-api` to `sql/core` which is using `hive-storage-api` really. **BEFORE (sql/core compilation)** ``` $ ./build/mvn -DskipTests --pl sql/core --am compile ... [ERROR] [Error] /Users/dongjoon/APACHE/spark/sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala:21: object hive is not a member of package org.apache.hadoop ... [INFO] ------------------------------------------------------------------------ [INFO] BUILD FAILURE [INFO] ------------------------------------------------------------------------ ``` **AFTER (sql/core compilation)** ``` $ ./build/mvn -DskipTests --pl sql/core --am compile ... [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 02:04 min [INFO] Finished at: 2019-11-25T00:20:11-08:00 [INFO] ------------------------------------------------------------------------ ``` 2. For (1), add `commons-lang:commons-lang` test dependency to `spark-core` module to manage the dependency explicitly. Without this, `core` module fails to build the test classes. ``` $ ./build/mvn -DskipTests --pl core --am package -Phadoop-3.2 ... [INFO] --- scala-maven-plugin:4.3.0:testCompile (scala-test-compile-first) spark-core_2.12 --- [INFO] Using incremental compilation using Mixed compile order [INFO] Compiler bridge file: /Users/dongjoon/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar [INFO] Compiling 271 Scala sources and 26 Java sources to /spark/core/target/scala-2.12/test-classes ... [ERROR] [Error] /spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23: object lang is not a member of package org.apache.commons [ERROR] [Error] /spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49: not found: value SerializationUtils [ERROR] two errors found ``` **BEFORE (commons-lang:commons-lang)** The following is the previous `core` module's `commons-lang:commons-lang` dependency. 1. **branch-2.4** ``` $ mvn dependency:tree -Dincludes=commons-lang:commons-lang [INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) spark-core_2.11 --- [INFO] org.apache.spark:spark-core_2.11:jar:2.4.5-SNAPSHOT [INFO] \- org.spark-project.hive:hive-exec:jar:1.2.1.spark2:provided [INFO] \- commons-lang:commons-lang:jar:2.6:compile ``` 2. **v3.0.0-preview (-Phadoop-3.2)** ``` $ mvn dependency:tree -Dincludes=commons-lang:commons-lang -Phadoop-3.2 [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 --- [INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-preview [INFO] \- org.apache.hive:hive-storage-api:jar:2.6.0:compile [INFO] \- commons-lang:commons-lang:jar:2.6:compile ``` 3. 
**v3.0.0-preview(default)** ``` $ mvn dependency:tree -Dincludes=commons-lang:commons-lang [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 --- [INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-preview [INFO] \- org.apache.hadoop:hadoop-client:jar:2.7.4:compile [INFO] \- org.apache.hadoop:hadoop-common:jar:2.7.4:compile [INFO] \- commons-lang:commons-lang:jar:2.6:compile ``` **AFTER (commons-lang:commons-lang)** ``` $ mvn dependency:tree -Dincludes=commons-lang:commons-lang [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 --- [INFO] org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT [INFO] \- commons-lang:commons-lang:jar:2.6:test ``` Since we wanted to verify that this PR doesn't change `hive-1.2` profile, we merged [SPARK-30005 Update `test-dependencies.sh` to check `hive-1.2/2.3` profile](a1706e2fa7) before this PR. ### Why are the changes needed? - Apache Spark 2.4's `sql/core` is using `Apache ORC (nohive)` jars including shaded `hive-storage-api` to access ORC data sources. - Apache Spark 3.0's `sql/core` is using `Apache Hive` jars directly. Previously, `-Phadoop-3.2` hid this `hive-storage-api` dependency. Now, we are using `-Phive-2.3` instead. As I mentioned [previously](https://github.com/apache/spark/pull/26619#issuecomment-556926064), this PR is required to compile `sql/core` module without `-Phive-2.3`. - For `sql/hive` and `sql/hive-thriftserver`, it's natural that we need `-Phive-1.2` or `-Phive-2.3`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? This will pass the Jenkins (with the dependency check and unit tests). We need to check manually with `./build/mvn -DskipTests --pl sql/core --am compile`. This closes #26657 . Closes #26658 from dongjoon-hyun/SPARK-30015. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 November 2019, 18:54:14 UTC
bec2068 [SPARK-26260][CORE] For disk store tasks summary table should show only successful tasks summary …sks metrics for disk store ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/23088 the task summary table in the stage page shows successful task metrics for the InMemory store. In this PR, it is added for the disk store also. ### Why are the changes needed? Now both the InMemory and disk stores will be consistent in showing the task summary table in the UI, if there are unsuccessful tasks. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Added UT. Manually verified. Test steps: 1. add the config in spark-defaults.conf -> **spark.history.store.path /tmp/store** 2. sbin/start-history-server.sh 3. bin/spark-shell 4. `sc.parallelize(1 to 1000, 2).map(x => throw new Exception("fail")).count` ![Screenshot 2019-11-14 at 3 51 39 AM](https://user-images.githubusercontent.com/23054875/68809546-268d2e80-0692-11ea-8b2c-bee767478135.png) Closes #26508 from shahidki31/task. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 November 2019, 18:04:25 UTC
29ebd93 [SPARK-29979][SQL] Add basic/reserved property key constants in TableCatalog and SupportsNamespaces ### What changes were proposed in this pull request? Add "comment" and "location" property key constants in TableCatalog and SupportsNamespaces. Also update the implementation classes to use these constants instead of hard-coded strings. ### Why are the changes needed? Currently, some basic/reserved keys (e.g. "location", "comment") of table and namespace properties are hard-coded or defined in specific logical plan implementation classes. These keys can be centralized in the TableCatalog and SupportsNamespaces interfaces and shared across different implementation classes. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Existing unit tests Closes #26617 from fuwhu/SPARK-29979. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 November 2019, 17:24:43 UTC
f09c1a3 [SPARK-29890][SQL] DataFrameNaFunctions.fill should handle duplicate columns ### What changes were proposed in this pull request? `DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified. ```Scala val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") val df = left.join(right, Seq("col1")) df.printSchema df.na.fill("hello").show ``` produces ``` root |-- col1: string (nullable = true) |-- col2: string (nullable = true) |-- col2: string (nullable = true) org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.; at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121) at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221) at org.apache.spark.sql.Dataset.col(Dataset.scala:1268) ``` The reason for the above failure is that columns are looked up with `DataSet.col()` which tries to resolve a column by name and if there are multiple columns with the same name, it will fail due to ambiguity. This PR updates `DataFrameNaFunctions.fill` such that if the columns to fill are not specified, it will resolve ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity). ### Why are the changes needed? If column names are not specified, `fill` should not fail due to ambiguity since it should still be able to apply `fill` to the eligible columns. ### Does this PR introduce any user-facing change? Yes, now the above example displays the following: ``` +----+-----+-----+ |col1| col2| col2| +----+-----+-----+ | 1|hello| 2| | 3| 4|hello| +----+-----+-----+ ``` ### How was this patch tested? Added new unit tests. Closes #26593 from imback82/na_fill. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 November 2019, 16:06:19 UTC
2d5de25 [SPARK-29415][CORE] Stage Level Sched: Add base ResourceProfile and Request classes ### What changes were proposed in this pull request? This PR is adding the base classes needed for Stage level scheduling. It's adding a ResourceProfile and the executor and task resource request classes. These are made private for now until we get all the parts implemented, at which point they will become public interfaces. I am adding them first as all the other subtasks for this feature require these classes. If people have better ideas on breaking this feature up, please let me know. See https://issues.apache.org/jira/browse/SPARK-29415 for a more detailed design. ### Why are the changes needed? New API for stage level scheduling. It's easier to add these first because the other jiras for this feature will all use them. ### Does this PR introduce any user-facing change? Yes, this adds an API to create a ResourceProfile with executor/task resources; see the SPIP jira https://issues.apache.org/jira/browse/SPARK-27495 Example of the API (a fenced copy follows this entry): val rp = new ResourceProfile() rp.require(new ExecutorResourceRequest("cores", 2)) rp.require(new ExecutorResourceRequest("gpu", 1, Some("/opt/gpuScripts/getGpus"))) rp.require(new TaskResourceRequest("gpu", 1)) ### How was this patch tested? Tested using Unit tests added with this PR. Closes #26284 from tgravescs/SPARK-29415. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org> 25 November 2019, 15:36:39 UTC
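The inline example above, reproduced as a fenced block for readability. The entry notes these classes are still private at this stage, so treat this strictly as an illustration of the intended API shape rather than runnable user code:

```scala
// Shape of the proposed stage-level scheduling API, copied from the entry above.
val rp = new ResourceProfile()
rp.require(new ExecutorResourceRequest("cores", 2))
rp.require(new ExecutorResourceRequest("gpu", 1, Some("/opt/gpuScripts/getGpus")))
rp.require(new TaskResourceRequest("gpu", 1))
```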
bd9ce83 [SPARK-29975][SQL][FOLLOWUP] document --CONFIG_DIM ### What changes were proposed in this pull request? add document to address https://github.com/apache/spark/pull/26612#discussion_r349844327 ### Why are the changes needed? help people understand how to use --CONFIG_DIM ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #26661 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> 25 November 2019, 11:45:31 UTC
de21f28 [SPARK-29986][SQL] casting string to date/timestamp/interval should trim all whitespaces ### What changes were proposed in this pull request? A Java-like string trim method trims all whitespace characters that are less than or equal to 0x20. Currently, our UTF8String handles the space (0x20) ONLY. This is not suitable for many cases in Spark, like trimming interval strings, dates, timestamps, and PostgreSQL-like casting of strings to booleans. ### Why are the changes needed? Improve the whitespace handling in UTF8String, with some bugs fixed as well. ### Does this PR introduce any user-facing change? Yes, strings with `control characters` at either end can be converted to date/timestamp and interval now. ### How was this patch tested? add ut Closes #26626 from yaooqinn/SPARK-29986. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 November 2019, 06:37:04 UTC
456cfe6 [SPARK-29939][CORE] Add spark.shuffle.mapStatus.compression.codec conf ### What changes were proposed in this pull request? Add a new conf named `spark.shuffle.mapStatus.compression.codec` for users to decide which codec should be used (default: `zstd`) for `MapStatus` compression. ### Why are the changes needed? We already have this functionality for `broadcast`/`rdd`/`shuffle`/`shuffleSpill`, so it might be better to have the same functionality for `MapStatus` as well. ### Does this PR introduce any user-facing change? Yes, users can now use `spark.shuffle.mapStatus.compression.codec` to decide which codec should be used during `MapStatus` compression (see the sketch after this entry). ### How was this patch tested? N/A Closes #26611 from Ngone51/SPARK-29939. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 November 2019, 05:21:19 UTC
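A minimal sketch of overriding the new conf; `lz4` is just an illustrative choice of one of Spark's built-in codecs, with `zstd` remaining the default as described above:

```scala
import org.apache.spark.SparkConf

// Paste into spark-shell or use when constructing a SparkContext/SparkSession.
val conf = new SparkConf()
  .setAppName("mapstatus-codec-sketch")
  // Compress serialized MapStatus data with lz4 instead of the zstd default.
  .set("spark.shuffle.mapStatus.compression.codec", "lz4")
```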
5cf475d [SPARK-30000][SQL] Trim the string when cast string type to decimals ### What changes were proposed in this pull request? https://bugs.openjdk.java.net/browse/JDK-8170259 https://bugs.openjdk.java.net/browse/JDK-8170563 When we cast string type to decimal type, we rely on java.math. BigDecimal. It can't accept leading and training spaces, as you can see in the above links. This behavior is not consistent with other numeric types now. we need to fix it and keep consistency. ### Why are the changes needed? make string to numeric types be consistent ### Does this PR introduce any user-facing change? yes, string removed trailing or leading white spaces will be able to convert to decimal if the trimmed is valid ### How was this patch tested? 1. modify ut #### Benchmark ```scala /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ package org.apache.spark.sql.execution.benchmark import org.apache.spark.benchmark.Benchmark /** * Benchmark trim the string when casting string type to Boolean/Numeric types. * To run this benchmark: * {{{ * 1. without sbt: * bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar> * 2. build/sbt "sql/test:runMain <this class>" * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>" * Results will be written to "benchmarks/CastBenchmark-results.txt". 
* }}} */ object CastBenchmark extends SqlBasedBenchmark { override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { val title = "Cast String to Integral" runBenchmark(title) { withTempPath { dir => val N = 500L << 14 val df = spark.range(N) val types = Seq("decimal") (1 to 5).by(2).foreach { i => df.selectExpr(s"concat(id, '${" " * i}') as str") .write.mode("overwrite").parquet(dir + i.toString) } val benchmark = new Benchmark(title, N, minNumIters = 5, output = output) Seq(true, false).foreach { trim => types.foreach { t => val str = if (trim) "trim(str)" else "str" val expr = s"cast($str as $t) as c_$t" (1 to 5).by(2).foreach { i => benchmark.addCase(expr + s" - with $i spaces") { _ => spark.read.parquet(dir + i.toString).selectExpr(expr).collect() } } } } benchmark.run() } } } } ``` #### string trim vs not trim ```java [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1 [info] Intel(R) Core(TM) i9-9980HK CPU 2.40GHz [info] Cast String to Integral: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] cast(trim(str) as decimal) as c_decimal - with 1 spaces 3362 5486 NaN 2.4 410.4 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 3 spaces 3251 5655 NaN 2.5 396.8 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 5 spaces 3208 5725 NaN 2.6 391.7 1.0X [info] cast(str as decimal) as c_decimal - with 1 spaces 13962 16233 1354 0.6 1704.3 0.2X [info] cast(str as decimal) as c_decimal - with 3 spaces 14273 14444 179 0.6 1742.4 0.2X [info] cast(str as decimal) as c_decimal - with 5 spaces 14318 14535 125 0.6 1747.8 0.2X ``` #### string trim vs this fix ```java [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1 [info] Intel(R) Core(TM) i9-9980HK CPU 2.40GHz [info] Cast String to Integral: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] cast(trim(str) as decimal) as c_decimal - with 1 spaces 3265 6299 NaN 2.5 398.6 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 3 spaces 3183 6241 693 2.6 388.5 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 5 spaces 3167 5923 1151 2.6 386.7 1.0X [info] cast(str as decimal) as c_decimal - with 1 spaces 3161 5838 1126 2.6 385.9 1.0X [info] cast(str as decimal) as c_decimal - with 3 spaces 3046 3457 837 2.7 371.8 1.1X [info] cast(str as decimal) as c_decimal - with 5 spaces 3053 4445 NaN 2.7 372.7 1.1X [info] ``` Closes #26640 from yaooqinn/SPARK-30000. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> 25 November 2019, 04:47:07 UTC
13896e4 [SPARK-30013][SQL] For scala 2.13, omit parens in various BigDecimal value() methods ### What changes were proposed in this pull request? Omit parens on calls like BigDecimal.longValue() ### Why are the changes needed? For some reason, this won't compile in Scala 2.13. The calls are otherwise equivalent in 2.12. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests Closes #26653 from srowen/SPARK-30013. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 November 2019, 02:23:34 UTC
a8d907c [SPARK-29937][SQL] Make FileSourceScanExec class fields lazy ### What changes were proposed in this pull request? Since JIRA SPARK-28346, PR [25111](https://github.com/apache/spark/pull/25111), QueryExecution copies every node stage-by-stage. This makes almost every node be instantiated twice. So we should make all class fields lazy to avoid creating more unneeded objects (see the sketch after this entry). ### Why are the changes needed? Avoid creating more unneeded objects. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Closes #26565 from ulysses-you/make-val-lazy. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 25 November 2019, 00:32:09 UTC
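Not the real FileSourceScanExec, just a minimal sketch (with made-up names) of why `lazy val` helps when plan nodes get copied: the expensive field is computed at most once per instance, and only if it is actually read, so copies that are never inspected pay nothing.

```scala
// Hypothetical plan-node stand-in; the listing body is a placeholder for real work.
final case class ScanNodeSketch(path: String) {
  lazy val partitionListing: Seq[String] = {
    println(s"listing $path (expensive, runs at most once per instance)")
    Seq(s"$path/part-0", s"$path/part-1")
  }
}

object LazyFieldSketch {
  def main(args: Array[String]): Unit = {
    val node = ScanNodeSketch("/tmp/table")
    val copied = node.copy()              // copying the node does not trigger the listing
    println(copied.partitionListing.size) // the listing runs here, on first access only
  }
}
```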
0d3d46d [SPARK-29999][SS] Handle FileStreamSink metadata correctly for empty partition ### What changes were proposed in this pull request? This patch checks the existence of output file for each task while committing the task, so that it doesn't throw FileNotFoundException while creating SinkFileStatus. The check is newly required for DSv2 implementation of FileStreamSink, as it is changed to create the output file lazily (as an improvement). JSON writer for example: https://github.com/apache/spark/blob/9ec2a4e58caa4128e9c690d72239cebd6b732084/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonOutputWriter.scala#L49-L60 ### Why are the changes needed? Without this patch, FileStreamSink throws FileNotFoundException when writing empty partition. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Closes #26639 from HeartSaVioR/SPARK-29999. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 November 2019, 23:31:06 UTC
cb68e58 [MINOR][INFRA] Use GitHub Action Cache for `build` ### What changes were proposed in this pull request? This PR adds `GitHub Action Cache` task on `build` directory. ### Why are the changes needed? This will replace the Maven downloading with the cache. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually check the GitHub Action log of this PR. Closes #26652 from dongjoon-hyun/SPARK-MAVEN-CACHE. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 November 2019, 20:35:57 UTC
a1706e2 [SPARK-30005][INFRA] Update `test-dependencies.sh` to check `hive-1.2/2.3` profile ### What changes were proposed in this pull request? This PR aims to update `test-dependencies.sh` to validate all available `Hadoop/Hive` combinations. ### Why are the changes needed? Previously, we have been checking only `Hadoop2.7/Hive1.2` and `Hadoop3.2/Hive2.3`. We need to validate `Hadoop2.7/Hive2.3` additionally for Apache Spark 3.0. ### Does this PR introduce any user-facing change? No. (This is a dev-only change). ### How was this patch tested? Pass the GitHub Action (Linter) with the newly updated manifest, because this is only a dependency check. Closes #26646 from dongjoon-hyun/SPARK-30005. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 November 2019, 18:14:02 UTC
3f3a18f [SPARK-24690][SQL] Add a config to control plan stats computation in LogicalRelation ### What changes were proposed in this pull request? This PR proposes a new independent config so that `LogicalRelation` can use `rowCount` to compute data statistics in logical plans even if CBO is disabled. In master, we currently cannot enable `StarSchemaDetection.reorderStarJoins` because we need to turn off CBO to enable it, but `StarSchemaDetection` internally references the `rowCount`, which is only used in LogicalRelation when CBO is enabled. ### Why are the changes needed? Plan stats are pretty useful beyond CBO, e.g., for the star-schema detector and dynamic partition pruning. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests in `DataFrameJoinSuite`. Closes #21668 from maropu/PlanStatsConf. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 November 2019, 16:30:24 UTC
3d74090 [SPARK-29973][SS] Make `processedRowsPerSecond` calculated more accurately and meaningfully ### What changes were proposed in this pull request? Give `processingTimeSec` a value of 0.001 when a micro-batch completes in under 1 ms (see the sketch after this entry). ### Why are the changes needed? The processing time of a batch may be less than 1 ms. As `processingTimeSec` is calculated from a millisecond duration, it then ends up as 0. If there is no data in the batch, `processedRowsPerSecond` equals `0/0.0d`, i.e. `Double.NaN`. If there is some data in the batch, `processedRowsPerSecond` equals `N/0.0d`, i.e. `Double.Infinity`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Add new UT Closes #26610 from uncleGen/SPARK-29973. Authored-by: uncleGen <hustyugm@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> 24 November 2019, 14:08:15 UTC
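A tiny, self-contained sketch of the arithmetic described above. The row count of 100 is made up, and the `math.max(..., 0.001)` line mirrors the described fix of treating a sub-millisecond batch as 1 ms:

```scala
object RateSketch {
  def main(args: Array[String]): Unit = {
    val processingTimeMs = 0L // micro-batch finished in < 1 ms, truncated to 0
    val numRows = 100L

    val before = numRows / (processingTimeMs / 1000.0)                // 100 / 0.0 => Infinity (NaN for 0 rows)
    val after  = numRows / math.max(processingTimeMs / 1000.0, 0.001) // treat the batch as 1 ms
    println(s"before: $before, after: $after")
  }
}
```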
a60da23 [SPARK-30007][INFRA] Publish snapshot/release artifacts with `-Phive-2.3` only ### What changes were proposed in this pull request? This PR aims to add `-Phive-2.3` to the publish profiles. Since Apache Spark 3.0.0, Maven artifacts will be published with the Apache Hive 2.3 profile only. This PR will also recover the `SNAPSHOT`-publishing Jenkins job. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/ We will provide the pre-built distributions (with Hive 1.2.1 also) like Apache Spark 2.4. SPARK-29989 will update the release script to generate all combinations. ### Why are the changes needed? This will reduce the explicit dependency on the illegitimate Hive fork in the Maven repository. ### Does this PR introduce any user-facing change? Yes, but these are dev-only changes. ### How was this patch tested? Manual. Closes #26648 from dongjoon-hyun/SPARK-30007. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 November 2019, 06:34:21 UTC
13338ea [SPARK-29554][SQL][FOLLOWUP] Rename Version to SparkVersion ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/26209 . This renames class `Version` to class `SparkVersion`. ### Why are the changes needed? According to the review comment, this uses more specific class name. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #26647 from dongjoon-hyun/SPARK-29554. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 November 2019, 03:53:52 UTC
564826d [SPARK-28812][SQL][DOC] Document SHOW PARTITIONS in SQL Reference ### What changes were proposed in this pull request? Document the SHOW PARTITIONS statement in the SQL Reference Guide. ### Why are the changes needed? Currently, Spark lacks documentation on the supported SQL constructs, causing confusion among users who sometimes have to look at the code to understand the usage. This PR is aimed at addressing that issue. ### Does this PR introduce any user-facing change? Yes. **Before** **After** ![image](https://user-images.githubusercontent.com/14225158/69405056-89468180-0cb3-11ea-8eb7-93046eaf551c.png) ![image](https://user-images.githubusercontent.com/14225158/69405067-93688000-0cb3-11ea-810a-11cab9e4a041.png) ![image](https://user-images.githubusercontent.com/14225158/69405120-c01c9780-0cb3-11ea-91c0-91eeaa9238a0.png) Closes #26635 from dilipbiswal/show_partitions. Authored-by: Dilip Biswal <dkbiswal@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 November 2019, 03:34:19 UTC
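For reference, a small spark-shell-style sketch of the statement being documented; the `sales` table, its columns, and the inserted row are made up purely for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("show-partitions-sketch").getOrCreate()

// Hypothetical partitioned table, created only so there is something to list.
spark.sql("CREATE TABLE IF NOT EXISTS sales (amount INT, year INT, quarter INT) USING parquet PARTITIONED BY (year, quarter)")
spark.sql("INSERT INTO sales PARTITION (year = 2019, quarter = 4) VALUES (100)")

// List every partition, or only those matching a partial partition spec.
spark.sql("SHOW PARTITIONS sales").show(truncate = false)
spark.sql("SHOW PARTITIONS sales PARTITION (year = 2019)").show(truncate = false)
```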
6898be9 [SPARK-29681][WEBUI] Support column sorting in Environment tab ### What changes were proposed in this pull request? Add extra class names to the table headers in the EnvironmentPage tables in the Spark UI. ### Why are the changes needed? The Spark UI uses sorttable.js to provide sort functionality in different tables. This library tries to guess the datatype of each column during its initialization phase (numeric, alphanumeric, etc.) and uses that guess to sort the columns whenever the user clicks on a column. It guesses incorrect data types for the Environment tab. sorttable.js has a way to hint the datatype of table columns explicitly, by passing a custom HTML class attribute. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested sorting in the tables in the Environment tab of the Spark UI. ![Annotation 2019-11-22 154058](https://user-images.githubusercontent.com/2551496/69417432-a8d6bc00-0d3e-11ea-865b-f8017976c6f4.png) ![Annotation 2019-11-22 153600](https://user-images.githubusercontent.com/2551496/69417433-a8d6bc00-0d3e-11ea-9a75-8e1f4d66107e.png) ![Annotation 2019-11-22 153841](https://user-images.githubusercontent.com/2551496/69417435-a96f5280-0d3e-11ea-85f6-9f61b015e161.png) Closes #26638 from prakharjain09/SPARK-29681-SPARK-UI-SORT. Authored-by: Prakhar Jain <prakjai@microsoft.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 November 2019, 02:09:02 UTC
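A hedged sketch of the kind of class-attribute hint described above, as it might look in Scala XML code that renders UI table headers. The function, header names, and mapping are illustrative (not Spark's actual EnvironmentPage code); the class names follow sorttable.js's convention of type hints such as `sorttable_numeric`:

```scala
import scala.xml.Node

// Illustrative only: render <th> cells with an explicit sorttable.js type hint
// (via the class attribute) so the library does not have to guess the datatype.
def headerRow(headers: Seq[(String, String)]): Seq[Node] = {
  headers.map { case (title, sortHint) =>
    // e.g. ("Cores", "sorttable_numeric") or ("Name", "sorttable_alpha")
    <th class={sortHint}>{title}</th>
  }
}

// Usage sketch:
// headerRow(Seq(("Name", "sorttable_alpha"), ("Cores", "sorttable_numeric")))
```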
6cd6d5f [SPARK-29970][WEBUI] Preserve open/close state of Timeline view ### What changes were proposed in this pull request? Fix a bug in the Timeline view where the open/close state is not preserved properly. ### Why are the changes needed? Preserving the open/close state was originally intended, but it doesn't work. ### Does this PR introduce any user-facing change? Yes. The open/close state of the Timeline view is now preserved, so if you open the Timeline view on the Stage page, go to another page, and then come back to the Stage page, the Timeline view stays open. ### How was this patch tested? Manual test. Closes #26607 from sarutak/fix-timeline-view-state. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 24 November 2019, 00:16:24 UTC
6625b69 [SPARK-29981][BUILD][FOLLOWUP] Change hive.version.short ### What changes were proposed in this pull request? This is a follow-up based on liancheng's advice. - https://github.com/apache/spark/pull/26619#discussion_r349326090 ### Why are the changes needed? Previously, we carefully chose the full version. As of today, the `Apache Hive 2.3` branch seems to have become stable. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the compile combinations on GitHub Action. 1. hadoop-2.7/hive-1.2/JDK8 2. hadoop-2.7/hive-2.3/JDK8 3. hadoop-3.2/hive-2.3/JDK8 4. hadoop-3.2/hive-2.3/JDK11 Also, pass the Jenkins with `hadoop-2.7` and `hadoop-3.2` for (1) and (4). (2) and (3) are not ready in Jenkins. Closes #26645 from dongjoon-hyun/SPARK-RENAME-HIVE-DIRECTORY. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 November 2019, 20:50:50 UTC
c98e5eb [SPARK-29981][BUILD] Add hive-1.2/2.3 profiles ### What changes were proposed in this pull request? This PR aims to do the following. - Add two profiles, `hive-1.2` and `hive-2.3` (default) - Validate that we keep at least the existing combinations (Hadoop-2.7 + Hive 1.2 / Hadoop-3.2 + Hive 2.3). For now, we assume that `hive-1.2` is explicitly used with `hadoop-2.7` and `hive-2.3` with `hadoop-3.2`. The following are beyond the scope of this PR. - SPARK-29988 Adjust Jenkins jobs for the `hive-1.2/2.3` combinations - SPARK-29989 Update the release script for the `hive-1.2/2.3` combinations - SPARK-29991 Support `hive-1.2/2.3` in the PR Builder ### Why are the changes needed? This will make it easier to switch our Hive dependency and update the exposed dependencies. ### Does this PR introduce any user-facing change? This is a dev-only change; the build profile combinations change as follows. - `-Phadoop-2.7` => `-Phadoop-2.7 -Phive-1.2` - `-Phadoop-3.2` => `-Phadoop-3.2 -Phive-2.3` ### How was this patch tested? Pass the Jenkins with the dependency check and tests to make sure we don't change anything for now. - [Jenkins (-Phadoop-2.7 -Phive-1.2)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull) - [Jenkins (-Phadoop-3.2 -Phive-2.3)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull) Also, from now on, GitHub Action validates the following combinations. ![gha](https://user-images.githubusercontent.com/9700541/69355365-822d5e00-0c36-11ea-93f7-e00e5459e1d0.png) Closes #26619 from dongjoon-hyun/SPARK-29981. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 23 November 2019, 18:02:22 UTC
fc7a37b [SPARK-30003][SQL] Do not throw stack overflow exception in non-root unknown hint resolution ### What changes were proposed in this pull request? This is rather a followup of https://github.com/apache/spark/pull/25464 (see https://github.com/apache/spark/pull/25464/files#r349543286) The previous code causes an infinite recursion via mapping children - we should return the hint rather than its parent plan in unknown hint resolution. ### Why are the changes needed? Prevent a stack overflow during hint resolution. ### Does this PR introduce any user-facing change? Yes, it avoids a stack overflow exception. The issue was introduced by https://github.com/apache/spark/pull/25464 and exists only on master, so there are no behaviour changes for end users. ### How was this patch tested? A unit test was added. Closes #26642 from HyukjinKwon/SPARK-30003. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org> 23 November 2019, 08:24:56 UTC
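A deliberately hypothetical, self-contained sketch of the general principle behind the fix; this is not Spark's Analyzer code, and `Plan`, `Hint`, `Scan`, `knownHints`, and `resolveHints` are invented names. The point it illustrates is handling the unknown hint node locally instead of re-mapping the enclosing parent plan, which is what can recurse forever:

```scala
// Hypothetical mini plan hierarchy, for illustration only.
sealed trait Plan
case class Hint(name: String, child: Plan) extends Plan
case class Scan(table: String) extends Plan

val knownHints = Set("BROADCAST", "MERGE")

def resolveHints(plan: Plan): Plan = plan match {
  case Hint(name, child) if !knownHints.contains(name.toUpperCase) =>
    // Unknown hint: warn (omitted here) and resolve its child directly,
    // rather than re-running resolution on the parent plan.
    resolveHints(child)
  case h @ Hint(_, child) =>
    h.copy(child = resolveHints(child))
  case other =>
    other
}

assert(resolveHints(Hint("UNKNOWN_HINT", Scan("t"))) == Scan("t"))
```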
f28eab2 [SPARK-29971][CORE] Fix buffer leaks in `TransportFrameDecoder/TransportCipher` ### What changes were proposed in this pull request? - Correctly release `ByteBuf` in `TransportCipher` in all cases - Move the closing / releasing logic to `handlerRemoved(...)` so we are guaranteed it is always called - Correctly release `frameBuf` if it is not null when the handler is removed (and so also when the channel becomes inactive) ### Why are the changes needed? We need to carefully manage the ownership / lifecycle of `ByteBuf` instances so we don't leak any of them. We did not do this correctly in all cases: - when we end up in an invalid cipher state - when partial data was received and the channel is closed before the full frame is decoded Fixes https://github.com/netty/netty/issues/9784. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the newly added UTs. Closes #26609 from normanmaurer/fix_leaks. Authored-by: Norman Maurer <norman_maurer@apple.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 22 November 2019, 23:20:54 UTC
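A hedged Scala sketch, against Netty's public API, of the release-on-`handlerRemoved` pattern described above; `FrameDecoderSketch` and its fields are illustrative names, not Spark's actual `TransportFrameDecoder` or `TransportCipher`:

```scala
import io.netty.buffer.ByteBuf
import io.netty.channel.{ChannelHandlerContext, ChannelInboundHandlerAdapter}

// Illustrative frame decoder: any partially accumulated buffer is released in
// handlerRemoved(), which Netty calls when the handler leaves the pipeline
// (including when the channel becomes inactive and is torn down).
class FrameDecoderSketch extends ChannelInboundHandlerAdapter {
  private var frameBuf: ByteBuf = _

  override def channelRead(ctx: ChannelHandlerContext, msg: AnyRef): Unit = msg match {
    case buf: ByteBuf =>
      // Accumulate into frameBuf; ownership of `buf` is either transferred or released here.
      if (frameBuf == null) {
        frameBuf = buf
      } else {
        frameBuf.writeBytes(buf)
        buf.release()
      }
      // ... decode any complete frames from frameBuf and forward them downstream ...
    case other =>
      ctx.fireChannelRead(other)
  }

  override def handlerRemoved(ctx: ChannelHandlerContext): Unit = {
    // Release the partial frame, if any, so the buffer is never leaked.
    if (frameBuf != null) {
      frameBuf.release()
      frameBuf = null
    }
  }
}
```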
6b0e391 [SPARK-29427][SQL] Add API to convert RelationalGroupedDataset to KeyValueGroupedDataset ### What changes were proposed in this pull request? This PR proposes to add an `as` API to RelationalGroupedDataset. It creates a KeyValueGroupedDataset instance using the given grouping expressions, instead of a typed function as in the groupByKey API. Because it can leverage existing columns, it can reuse the existing data partitioning, if any, when doing operations like cogroup. ### Why are the changes needed? Currently, if users want to do cogroup on DataFrames, there is no good way to do so except via KeyValueGroupedDataset. 1. KeyValueGroupedDataset ignores any existing data partitioning. That is a problem. 2. groupByKey calls a typed function to create additional keys. You cannot reuse existing columns if you just need to group by them. ```scala // df1 and df2 are already partitioned and sorted. val df1 = Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c") .repartition($"a").sortWithinPartitions("a") val df2 = Seq((1, 2, 4), (2, 3, 5)).toDF("a", "b", "c") .repartition($"a").sortWithinPartitions("a") ``` ```scala // This groupBy.as.cogroup won't unnecessarily repartition the data val df3 = df1.groupBy("a").as[Int] .cogroup(df2.groupBy("a").as[Int]) { case (key, data1, data2) => data1.zip(data2).map { p => p._1.getInt(2) + p._2.getInt(2) } } ``` ``` == Physical Plan == *(5) SerializeFromObject [input[0, int, false] AS value#11247] +- CoGroup org.apache.spark.sql.DataFrameSuite$$Lambda$4922/12067092816eec1b6f, a#11209: int, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [a#11209], [a#11225], [a#11209, b#11210, c#11211], [a#11225, b#11226, c#11227], obj#11246: int :- *(2) Sort [a#11209 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(a#11209, 5), false, [id=#10218] : +- *(1) Project [_1#11202 AS a#11209, _2#11203 AS b#11210, _3#11204 AS c#11211] : +- *(1) LocalTableScan [_1#11202, _2#11203, _3#11204] +- *(4) Sort [a#11225 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(a#11225, 5), false, [id=#10223] +- *(3) Project [_1#11218 AS a#11225, _2#11219 AS b#11226, _3#11220 AS c#11227] +- *(3) LocalTableScan [_1#11218, _2#11219, _3#11220] ``` ```scala // The current approach creates an additional AppendColumns and repartitions the data again val df4 = df1.groupByKey(r => r.getInt(0)).cogroup(df2.groupByKey(r => r.getInt(0))) { case (key, data1, data2) => data1.zip(data2).map { p => p._1.getInt(2) + p._2.getInt(2) } } ``` ``` == Physical Plan == *(7) SerializeFromObject [input[0, int, false] AS value#11257] +- CoGroup org.apache.spark.sql.DataFrameSuite$$Lambda$4933/138102700737171997, value#11252: int, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [value#11252], [value#11254], [a#11209, b#11210, c#11211], [a#11225, b#11226, c#11227], obj#11256: int :- *(3) Sort [value#11252 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(value#11252, 5), true, [id=#10302] : +- AppendColumns org.apache.spark.sql.DataFrameSuite$$Lambda$4930/19529195347ce07f47, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [input[0, int, false] AS value#11252] : +- *(2) Sort [a#11209 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(a#11209, 5), false, [id=#10297] : +- *(1) Project [_1#11202 AS a#11209, _2#11203 AS b#11210, _3#11204 AS c#11211] : +- *(1) LocalTableScan [_1#11202, _2#11203, _3#11204] +- *(6) Sort [value#11254 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(value#11254, 5), true, [id=#10312] +- AppendColumns org.apache.spark.sql.DataFrameSuite$$Lambda$4932/15265288491f0e0c1f, createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [input[0, int, false] AS value#11254] +- *(5) Sort [a#11225 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(a#11225, 5), false, [id=#10307] +- *(4) Project [_1#11218 AS a#11225, _2#11219 AS b#11226, _3#11220 AS c#11227] +- *(4) LocalTableScan [_1#11218, _2#11219, _3#11220] ``` ### Does this PR introduce any user-facing change? Yes, this adds a new `as` API to RelationalGroupedDataset. Users can use it to create a KeyValueGroupedDataset and do cogroup. ### How was this patch tested? Unit tests. Closes #26509 from viirya/SPARK-29427-2. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> 22 November 2019, 18:34:26 UTC