https://github.com/apache/spark

fc28ba3 Preparing Spark release v2.2.2-rc2 27 June 2018, 13:55:11 UTC
72575d0 [SPARK-24552][CORE][BRANCH-2.2] Use unique id instead of attempt number for writes. This passes a unique attempt id to the Hadoop APIs, because attempt number is reused when stages are retried. When attempt numbers are reused, sources that track data by partition id and attempt number may incorrectly clean up data because the same attempt number can be both committed and aborted. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21616 from vanzin/SPARK-24552-2.2. 25 June 2018, 23:56:12 UTC
a600004 [SPARK-22897][CORE] Expose stageAttemptId in TaskContext stageAttemptId is added to TaskContext, with the corresponding construction changes. Added a new test in TaskContextSuite; two cases are tested: 1. Normal case without failure 2. Exception case with resubmitted stages Link to [SPARK-22897](https://issues.apache.org/jira/browse/SPARK-22897) Author: Xianjin YE <advancedxy@gmail.com> Closes #20082 from advancedxy/SPARK-22897. Conflicts: project/MimaExcludes.scala Author: Xianjin YE <advancedxy@gmail.com> Author: Thomas Graves <tgraves@unharmedunarmed.corp.ne1.yahoo.com> Closes #21609 from tgravescs/SPARK-22897. 22 June 2018, 12:56:45 UTC
751b008 [SPARK-24589][CORE] Correctly identify tasks in output commit coordinator. When an output stage is retried, it's possible that tasks from the previous attempt are still running. In that case, there would be a new task for the same partition in the new attempt, and the coordinator would allow both tasks to commit their output since it did not keep track of stage attempts. The change adds more information to the stage state tracked by the coordinator, so that only one task is allowed to commit the output in the above case. The stage state in the coordinator is also maintained across stage retries, so that a stray speculative task from a previous stage attempt is not allowed to commit. This also removes some code added in SPARK-18113 that allowed for duplicate commit requests; with the RPC code used in Spark 2, that situation cannot happen, so there is no need to handle it. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21577 from vanzin/SPARK-24552. (cherry picked from commit c8e909cd498b67b121fa920ceee7631c652dac38) Signed-off-by: Thomas Graves <tgraves@apache.org> 21 June 2018, 19:10:16 UTC
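To illustrate the idea behind this change, here is a minimal sketch with made-up names (`TaskIdentifier`, `CommitState`), not the actual OutputCommitCoordinator code: commit permission is tracked per partition by both stage attempt and task attempt, so only one task attempt can ever be authorized to commit a partition.

```scala
// Sketch only: hypothetical types showing per-partition commit tracking
// keyed by (stage attempt, task attempt).
case class TaskIdentifier(stageAttempt: Int, taskAttempt: Int)

class CommitState(numPartitions: Int) {
  // None means no task has been authorized to commit this partition yet.
  private val authorized = Array.fill[Option[TaskIdentifier]](numPartitions)(None)

  def canCommit(partition: Int, stageAttempt: Int, taskAttempt: Int): Boolean = synchronized {
    val candidate = TaskIdentifier(stageAttempt, taskAttempt)
    authorized(partition) match {
      case None           => authorized(partition) = Some(candidate); true // first committer wins
      case Some(existing) => existing == candidate // only the already-authorized attempt may proceed
    }
  }
}
```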
7bfefc9 Preparing development version 2.2.3-SNAPSHOT 18 June 2018, 16:21:21 UTC
e2e4d58 Preparing development version 2.2-3-SNAPSHOT 18 June 2018, 14:45:19 UTC
8ce9e2a Preparing Spark release v2.2.2-rc1 18 June 2018, 14:45:11 UTC
090b883 [SPARK-23732][DOCS] Fix source links in generated scaladoc. Apply the suggestion on the bug to fix source links. Tested with the 2.3.1 release docs. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21521 from vanzin/SPARK-23732. (cherry picked from commit dc22465f3e1ef5ad59306b1f591d6fd16d674eb7) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 13 June 2018, 23:12:20 UTC
1f81ade [SPARK-24506][UI] Add UI filters to tabs added after binding Currently, `spark.ui.filters` are not applied to the handlers added after binding the server. This means that every page which is added after starting the UI will not have the filters configured on it. This can allow unauthorized access to the pages. The PR adds the filters also to the handlers added after the UI starts. manual tests (without the patch, starting the thriftserver with `--conf spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter --conf spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=simple"` you can access `http://localhost:4040/sqlserver`; with the patch, 401 is the response as for the other pages). Author: Marco Gaido <marcogaido91@gmail.com> Closes #21523 from mgaido91/SPARK-24506. (cherry picked from commit f53818d35bdef5d20a2718b14a2fed4c468545c6) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 12 June 2018, 23:43:53 UTC
bf0b212 [SPARK-24531][TESTS] Remove version 2.2.0 from testing versions in HiveExternalCatalogVersionsSuite Removing version 2.2.0 from testing versions in HiveExternalCatalogVersionsSuite as it is not present anymore in the mirrors and this is blocking all the open PRs. running UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21540 from mgaido91/SPARK-24531. (cherry picked from commit 2824f1436bb0371b7216730455f02456ef8479ce) Signed-off-by: Xiao Li <gatorsmile@gmail.com> 12 June 2018, 16:58:29 UTC
c306a84 [WEBUI] Avoid possibility of script in query param keys As discussed separately, this avoids the possibility of XSS on certain request param keys. CC vanzin Author: Sean Owen <srowen@gmail.com> Closes #21464 from srowen/XSS2. (cherry picked from commit 698b9a0981f0ec322e15d6ac89cc38c8f49ed33d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 31 May 2018, 16:43:30 UTC
fb86eb0 [MINOR] Add port SSL config in toString and scaladoc ## What changes were proposed in this pull request? SPARK-17874 introduced a new configuration to set the port where SSL services bind to. We missed to update the scaladoc and the `toString` method, though. The PR adds it in the missing places ## How was this patch tested? checked the `toString` output in the logs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21429 from mgaido91/minor_ssl. (cherry picked from commit fd315f5884c03c6dd21abca178897584dee83f1a) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 25 May 2018, 19:49:54 UTC
8abd0a7 fix compilation caused by SPARK-24257 24 May 2018, 04:44:26 UTC
2379074 [SPARK-24257][SQL] LongToUnsafeRowMap calculate the new size may be wrong LongToUnsafeRowMap has a mistake when growing its page array: it blindly grows to `oldSize * 2`, while the new record may be larger than `oldSize * 2`. Then we may have a malformed UnsafeRow when querying this map, whose actual data is smaller than its declared size, and the data is corrupted. Author: sychen <sychen@ctrip.com> Closes #21311 from cxzl25/fix_LongToUnsafeRowMap_page_size. (cherry picked from commit 888340151f737bb68d0e419b1e949f11469881f9) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 May 2018, 03:19:37 UTC
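The general shape of the fix, as a hedged sketch rather than the exact LongToUnsafeRowMap code: grow to whichever is larger, the doubled page size or the size the incoming record actually needs.

```scala
// Sketch: never grow the page to less than what the new record requires.
def nextPageSize(oldSize: Long, neededForNewRecord: Long): Long =
  math.max(oldSize * 2, neededForNewRecord)
```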
6a55d8b [SPARK-23850][SQL][BRANCH-2.2] Add separate config for SQL options redaction. The old code was relying on a core configuration and extended its default value to include things that redact desired things in the app's environment. Instead, add a SQL-specific option for which options to redact, and apply both the core and SQL-specific rules when redacting the options in the save command. This is a little sub-optimal since it adds another config, but it retains the current default behavior. While there I also fixed a typo and a couple of minor config API usage issues in the related redaction option that SQL already had. Tested with existing unit tests, plus checking the env page on a shell UI. (cherry picked from commit ed7ba7db8fa344ff182b72d23ae458e711f63432) Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21365 from vanzin/SPARK-23850-2.2. 18 May 2018, 23:25:30 UTC
8c223b6 [R][BACKPORT-2.2] backport lint fix ## What changes were proposed in this pull request? backport part of the commit that addresses lintr issue Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #21325 from felixcheung/rlintfix22. 16 May 2018, 02:31:14 UTC
f96d13d [SPARKR] Match pyspark features in SparkR communication protocol. (cherry picked from commit 628c7b517969c4a7ccb26ea67ab3dd61266073ca) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 16cd9ac5264831e061c033b26fe1173ebc88e5d1) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 10 May 2018, 19:42:58 UTC
850b7d8 [PYSPARK] Update py4j to version 0.10.7. (cherry picked from commit cc613b552e753d03cb62661591de59e1c8d82c74) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> (cherry picked from commit 323dc3ad02e63a7c99b5bd6da618d6020657ecba) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 10 May 2018, 19:20:58 UTC
f9d6a16 [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6 This PR aims to bump Py4J in order to fix the following float/double bug. Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6. **BEFORE** ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() +--------------------+ |(id + 17.1335742042)| +--------------------+ | 17.1335742042| +--------------------+ ``` **AFTER** ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() +-------------------------+ |(id + 17.133574204226083)| +-------------------------+ | 17.133574204226083| +-------------------------+ ``` Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18546 from dongjoon-hyun/SPARK-21278. (cherry picked from commit c8d0aba198c0f593c2b6b656c23b3d0fb7ea98a2) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 08 May 2018, 18:21:22 UTC
866270e [SPARK-23697][CORE] LegacyAccumulatorWrapper should define isZero correctly ## What changes were proposed in this pull request? It's possible that Accumulators of Spark 1.x may no longer work with Spark 2.x. This is because `LegacyAccumulatorWrapper.isZero` may return wrong answer if `AccumulableParam` doesn't define equals/hashCode. This PR fixes this by using reference equality check in `LegacyAccumulatorWrapper.isZero`. ## How was this patch tested? a new test Author: Wenchen Fan <wenchen@databricks.com> Closes #21229 from cloud-fan/accumulator. (cherry picked from commit 4d5de4d303a773b1c18c350072344bd7efca9fc4) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 04 May 2018, 11:21:16 UTC
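A small sketch of why a value-equality zero check breaks here (`MyAcc` is a made-up accumulator value type): without equals/hashCode, `==` falls back to reference equality, so a freshly created zero never compares equal to the wrapped value, and a reference check is the only reliable test.

```scala
// Sketch: a value type without equals/hashCode compares by reference.
class MyAcc(var sum: Long) // no equals/hashCode defined

val zero    = new MyAcc(0L)
val another = new MyAcc(0L)
val byValue = zero == another       // false: Object.equals falls back to reference equality
val current: AnyRef = zero
val isZero  = current eq zero       // true: the kind of check isZero can rely on
```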
768d0b7 [SPARK-23489][SQL][TEST][BRANCH-2.2] HiveExternalCatalogVersionsSuite should verify the downloaded file ## What changes were proposed in this pull request? This is a backport of #21210 because `branch-2.2` also faces the same failures. Although [SPARK-22654](https://issues.apache.org/jira/browse/SPARK-22654) made `HiveExternalCatalogVersionsSuite` download from Apache mirrors three times, it has been flaky because it didn't verify the downloaded file. Some Apache mirrors terminate the downloading abnormally, the *corrupted* file shows the following errors. ``` gzip: stdin: not in gzip format tar: Child returned status 1 tar: Error is not recoverable: exiting now 22:46:32.700 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.hive.HiveExternalCatalogVersionsSuite, thread names: Keep-Alive-Timer ===== *** RUN ABORTED *** java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.0"): error=2, No such file or directory ``` This has been reported weirdly in two ways. For example, the above case is reported as Case 2 `no failures`. - Case 1. [Test Result (1 failure / +1)](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389/) - Case 2. [Test Result (no failures)](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/) This PR aims to make `HiveExternalCatalogVersionsSuite` more robust by verifying the downloaded `tgz` file by extracting and checking the existence of `bin/spark-submit`. If it turns out that the file is empty or corrupted, `HiveExternalCatalogVersionsSuite` will do retry logic like the download failure. ## How was this patch tested? Pass the Jenkins. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #21232 from dongjoon-hyun/SPARK-23489-2. 04 May 2018, 00:10:15 UTC
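A hedged sketch of the verification step described above (file and command names are illustrative): after extracting the downloaded tgz, treat a missing `bin/spark-submit` as a corrupted download and fall back to the retry logic.

```scala
import java.io.File
import scala.sys.process._

// Sketch: extract the archive, then check that bin/spark-submit actually exists.
def extract(tgz: File, targetDir: File): Int =
  Seq("tar", "-xzf", tgz.getAbsolutePath, "-C", targetDir.getAbsolutePath).!

def downloadLooksValid(sparkHome: File): Boolean =
  new File(sparkHome, "bin/spark-submit").isFile // empty or corrupted tgz: retry the download
```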
154bbc9 [SPARK-23433][CORE] Late zombie task completions update all tasksets A fetch failure can lead to multiple tasksets which are active for a given stage. While there is only one "active" version of the taskset, the earlier attempts can still have running tasks, which can complete successfully. So a task completion needs to update every taskset so that it knows the partition is completed. That way the final active taskset does not try to submit another task for the same partition, and so that it knows when it is completed and when it should be marked as a "zombie". Added a regression test. Author: Imran Rashid <irashid@cloudera.com> Closes #21131 from squito/SPARK-23433. (cherry picked from commit 94641fe6cc68e5977dd8663b8f232a287a783acb) Signed-off-by: Imran Rashid <irashid@cloudera.com> 03 May 2018, 15:59:45 UTC
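A hedged sketch of the bookkeeping (names are illustrative, not the TaskSetManager API): when any attempt finishes a partition, every taskset attempt for that stage is told about it so none of them re-submits the partition.

```scala
import scala.collection.mutable

// Sketch: propagate a partition completion to all taskset attempts of a stage.
class StageAttemptsSketch {
  // taskset attempt id -> partitions still pending in that attempt
  private val pendingByAttempt = mutable.Map.empty[Int, mutable.Set[Int]]

  def registerAttempt(attempt: Int, partitions: Seq[Int]): Unit =
    pendingByAttempt(attempt) = mutable.Set(partitions: _*)

  def markPartitionCompleted(partition: Int): Unit =
    pendingByAttempt.values.foreach(_ -= partition) // every attempt learns about it

  def allPartitionsDone(attempt: Int): Boolean =
    pendingByAttempt.get(attempt).exists(_.isEmpty)
}
```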
4f1ae3a [SPARK-23941][MESOS] Mesos task failed on specific spark app name ## What changes were proposed in this pull request? Shell escaped the name passed to spark-submit and change how conf attributes are shell escaped. ## How was this patch tested? This test has been tested manually with Hive-on-spark with mesos or with the use case described in the issue with the sparkPi application with a custom name which contains illegal shell characters. With this PR, hive-on-spark on mesos works like a charm with hive 3.0.0-SNAPSHOT. I state that this contribution is my original work and that I license the work to the project under the project’s open source license Author: Bounkong Khamphousone <bounkong.khamphousone@ebiznext.com> Closes #21014 from tiboun/fix/SPARK-23941. (cherry picked from commit 6782359a04356e4cde32940861bf2410ef37f445) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 01 May 2018, 15:28:49 UTC
e77d62a [MINOR][DOCS] Fix comments of SQLExecution#withExecutionId ## What changes were proposed in this pull request? Fix comment. Change `BroadcastHashJoin.broadcastFuture` to `BroadcastExchangeExec.relationFuture`: https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L66 ## How was this patch tested? N/A Author: seancxmao <seancxmao@gmail.com> Closes #21113 from seancxmao/SPARK-13136. (cherry picked from commit c303b1b6766a3dc5961713f98f62cd7d7ac7972a) Signed-off-by: hyukjinkwon <gurwls223@apache.org> 24 April 2018, 08:17:02 UTC
041aec4 [SPARK-23963][SQL] Properly handle large number of columns in query on text-based Hive table ## What changes were proposed in this pull request? TableReader would get disproportionately slower as the number of columns in the query increased. I fixed the way TableReader was looking up metadata for each column in the row. Previously, it had been looking up this data in linked lists, accessing each linked list by an index (column number). Now it looks up this data in arrays, where indexing by column number works better. ## How was this patch tested? Manual testing All sbt unit tests python sql tests Author: Bruce Robbins <bersprockets@gmail.com> Closes #21043 from bersprockets/tabreadfix. 18 April 2018, 16:50:13 UTC
a902323 [SPARK-24007][SQL] EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen. `EqualNullSafe` for `FloatType` and `DoubleType` might generate a wrong result by codegen. ```scala scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF() df: org.apache.spark.sql.DataFrame = [_1: double, _2: double] scala> df.show() +----+----+ | _1| _2| +----+----+ |-1.0|null| |null|-1.0| +----+----+ scala> df.filter("_1 <=> _2").show() +----+----+ | _1| _2| +----+----+ |-1.0|null| |null|-1.0| +----+----+ ``` The result should be empty but the result remains two rows. Added a test. Author: Takuya UESHIN <ueshin@databricks.com> Closes #21094 from ueshin/issues/SPARK-24007/equalnullsafe. (cherry picked from commit f09a9e9418c1697d198de18f340b1288f5eb025c) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 18 April 2018, 15:23:46 UTC
e957c4e [SPARK-23816][CORE] Killed tasks should ignore FetchFailures. SPARK-19276 ensured that FetchFailures do not get swallowed by other layers of exception handling, but it also meant that a killed task could look like a fetch failure. This is particularly a problem with speculative execution, where we expect to kill tasks as they are reading shuffle data. The fix is to ensure that we always check for killed tasks first. Added a new unit test which fails before the fix, ran it 1k times to check for flakiness. Full suite of tests on jenkins. Author: Imran Rashid <irashid@cloudera.com> Closes #20987 from squito/SPARK-23816. (cherry picked from commit 10f45bb8233e6ac838dd4f053052c8556f5b54bd) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 09 April 2018, 18:31:49 UTC
6b5f9c3 [SPARK-23788][SS] Fix race in StreamingQuerySuite ## What changes were proposed in this pull request? The serializability test uses the same MemoryStream instance for 3 different queries. If any of those queries ask it to commit before the others have run, the rest will see empty dataframes. This can fail the test if q3 is affected. We should use one instance per query instead. ## How was this patch tested? Existing unit test. If I move q2.processAllAvailable() before starting q3, the test always fails without the fix. Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #20896 from jose-torres/fixrace. 25 March 2018, 01:22:15 UTC
85ab72b [SPARK-23759][UI] Unable to bind Spark UI to specific host name / IP ## What changes were proposed in this pull request? Fixes SPARK-23759 by moving connector.start() after connector.setHost(). The problem was that the connector.setHost(hostName) call came after connector.start(). ## How was this patch tested? Patch was tested after build and deployment. This patch requires the SPARK_LOCAL_IP environment variable to be set in spark-env.sh Author: bag_of_tricks <falbani@hortonworks.com> Closes #20883 from felixalbani/SPARK-23759. (cherry picked from commit 8b56f16640fc4156aa7bd529c54469d27635b951) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 23 March 2018, 17:36:52 UTC
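The corrected ordering, shown as a hedged sketch against the plain Jetty API (not the actual JettyUtils code): the host must be configured before the connector is started, otherwise the setting is ignored.

```scala
import org.eclipse.jetty.server.{Server, ServerConnector}

// Sketch: configure the host before start(); starting first ignores the host setting.
val server = new Server()
val connector = new ServerConnector(server)
server.addConnector(connector)
connector.setPort(4040)
connector.setHost(sys.env.getOrElse("SPARK_LOCAL_IP", "localhost"))
server.start() // the connector starts with the host already set
```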
367a161 [SPARK-23649][SQL] Skipping chars disallowed in UTF-8 The mapping of a UTF-8 char's first byte to the char's size doesn't cover the whole range 0-255; it is defined only for 0-253: https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L60-L65 https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L190 If the first byte of a char is 253-255, an IndexOutOfBoundsException is thrown. Besides that, the values for 244-252 are not correct according to the recent Unicode standard for UTF-8: http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf As a consequence of the exception above, the length of an input string in UTF-8 encoding cannot be calculated if the string contains chars starting from code 253. On the user's side this shows up, for example, as schema inference crashing on a CSV file which contains such chars, although the file can be read if the schema is specified explicitly or if the mode is set to multiline. The proposed changes build a correct mapping of a UTF-8 char's first byte to its size (now it covers all cases) and skip disallowed chars (counting each as one octet). Added a test and a file with a char which is disallowed in UTF-8 - 0xFF. Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #20796 from MaxGekk/skip-wrong-utf8-chars. (cherry picked from commit 5e7bc2acef4a1e11d0d8056ef5c12cd5c8f220da) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 20 March 2018, 17:37:29 UTC
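A hedged sketch of the idea (not the actual UTF8String table): map every possible first byte (0-255) to a code-point width, counting invalid lead bytes as a single octet instead of throwing.

```scala
// Sketch: width of a UTF-8 code point from its first byte, defined for all 256 values.
def numBytesForFirstByte(b: Byte): Int = {
  val lead = b & 0xFF
  if (lead < 0x80) 1       // ASCII
  else if (lead < 0xC2) 1  // continuation or overlong lead byte: skip one octet
  else if (lead < 0xE0) 2
  else if (lead < 0xF0) 3
  else if (lead < 0xF5) 4
  else 1                   // 0xF5-0xFF are disallowed in UTF-8: count as one octet
}
```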
175b221 [SPARK-23525][BACKPORT][SQL] Support ALTER TABLE CHANGE COLUMN COMMENT for external hive table ## What changes were proposed in this pull request? The following query doesn't work as expected: ``` CREATE EXTERNAL TABLE ext_table(a STRING, b INT, c STRING) PARTITIONED BY (d STRING) LOCATION 'sql/core/spark-warehouse/ext_table'; ALTER TABLE ext_table CHANGE a a STRING COMMENT "new comment"; DESC ext_table; ``` The comment of column `a` is not updated, that's because `HiveExternalCatalog.doAlterTable` ignores table schema changes. To fix the issue, we should call `doAlterTableDataSchema` instead of `doAlterTable`. ## How was this patch tested? Updated `DDLSuite.testChangeColumn`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #20768 from jiangxb1987/SPARK-23525-2.2. 08 March 2018, 05:55:11 UTC
4864d21 [SPARK-23434][SQL][BRANCH-2.2] Spark should not warn `metadata directory` for a HDFS file path ## What changes were proposed in this pull request? In a kerberized cluster, when Spark reads a file path (e.g. `people.json`), it warns with a wrong warning message during looking up `people.json/_spark_metadata`. The root cause of this situation is the difference between `LocalFileSystem` and `DistributedFileSystem`. `LocalFileSystem.exists()` returns `false`, but `DistributedFileSystem.exists` raises `org.apache.hadoop.security.AccessControlException`. ```scala scala> spark.read.json("file:///usr/hdp/current/spark-client/examples/src/main/resources/people.json").show +----+-------+ | age| name| +----+-------+ |null|Michael| | 30| Andy| | 19| Justin| +----+-------+ scala> spark.read.json("hdfs:///tmp/people.json") 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory. 18/02/15 05:00:48 WARN streaming.FileStreamSink: Error while looking for metadata directory. ``` After this PR, ```scala scala> spark.read.json("hdfs:///tmp/people.json").show +----+-------+ | age| name| +----+-------+ |null|Michael| | 30| Andy| | 19| Justin| +----+-------+ ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #20715 from dongjoon-hyun/SPARK-23434-2.2. 05 March 2018, 22:29:04 UTC
9bd25c9 [SPARK-23508][CORE] Fix BlockManagerId in case blockManagerIdCache causes OOM ## What changes were proposed in this pull request? blockManagerIdCache in BlockManagerId will not remove old values, which may cause OOM: `val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()` Whenever we apply a new BlockManagerId, it is put into this map. This patch uses a Guava cache for blockManagerIdCache instead. A heap dump is shown in [SPARK-23508](https://issues.apache.org/jira/browse/SPARK-23508) ## How was this patch tested? Existing tests. Author: zhoukang <zhoukang199191@gmail.com> Closes #20667 from caneGuy/zhoukang/fix-history. (cherry picked from commit 6a8abe29ef3369b387d9bc2ee3459a6611246ab1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 28 February 2018, 15:17:06 UTC
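A hedged sketch of the approach using Guava's cache API (the key/value types and the size bound here are illustrative; Spark caches BlockManagerId instances): a size-bounded cache evicts old entries instead of growing without limit.

```scala
import java.util.concurrent.Callable
import com.google.common.cache.CacheBuilder

// Sketch: bounded cache with eviction, replacing an unbounded ConcurrentHashMap.
val idCache = CacheBuilder.newBuilder()
  .maximumSize(10000)
  .build[String, String]()

def getCachedId(id: String): String =
  idCache.get(id, new Callable[String] { override def call(): String = id })
```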
fa3667e [SPARK-23438][DSTREAMS] Fix DStreams data loss with WAL when driver crashes There is a race condition introduced in SPARK-11141 which could cause data loss. The problem is that the ReceivedBlockTracker.insertAllocatedBatch function assumes that all blocks from streamIdToUnallocatedBlockQueues are allocated to the batch and clears the queue. In this PR only the allocated blocks are removed from the queue, which prevents data loss. Tested with an additional unit test + manually. Author: Gabor Somogyi <gabor.g.somogyi@gmail.com> Closes #20620 from gaborgsomogyi/SPARK-23438. (cherry picked from commit b308182f233b8840dfe0e6b5736d2f2746f40757) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 26 February 2018, 16:56:57 UTC
1cc34f3 [SPARK-22700][ML] Bucketizer.transform incorrectly drops row containing NaN - for branch-2.2 ## What changes were proposed in this pull request? for branch-2.2 only drops the rows containing NaN in the input columns ## How was this patch tested? existing tests and added tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #20539 from zhengruifeng/bucketizer_nan_2.2. 22 February 2018, 01:26:33 UTC
a95c3e2 [SPARK-23230][SQL][BRANCH-2.2] When hive.default.fileformat is other kinds of file types, create textfile table cause a serde error When hive.default.fileformat is other kinds of file types, create textfile table cause a serde error. We should take the default type of textfile and sequencefile both as org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe. ``` set hive.default.fileformat=orc; create table tbl( i string ) stored as textfile; desc formatted tbl; Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat ``` Author: sychen <sychen@ctrip.com> Closes #20593 from cxzl25/default_serde_2.2. 14 February 2018, 04:59:31 UTC
73263b2 [SPARK-23053][CORE] taskBinarySerialization and task partitions calculate in DagScheduler.submitMissingTasks should keep the same RDD checkpoint status ## What changes were proposed in this pull request? When we run concurrent jobs using the same rdd which is marked to do checkpoint. If one job has finished running the job, and start the process of RDD.doCheckpoint, while another job is submitted, then submitStage and submitMissingTasks will be called. In [submitMissingTasks](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L961), will serialize taskBinaryBytes and calculate task partitions which are both affected by the status of checkpoint, if the former is calculated before doCheckpoint finished, while the latter is calculated after doCheckpoint finished, when run task, rdd.compute will be called, for some rdds with particular partition type such as [UnionRDD](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala) who will do partition type cast, will get a ClassCastException because the part params is actually a CheckpointRDDPartition. This error occurs because rdd.doCheckpoint occurs in the same thread that called sc.runJob, while the task serialization occurs in the DAGSchedulers event loop. ## How was this patch tested? the exist uts and also add a test case in DAGScheduerSuite to show the exception case. Author: huangtengfei <huangtengfei@huangtengfeideMacBook-Pro.local> Closes #20244 from ivoson/branch-taskpart-mistype. (cherry picked from commit 091a000d27f324de8c5c527880854ecfcf5de9a4) Signed-off-by: Imran Rashid <irashid@cloudera.com> 13 February 2018, 16:00:23 UTC
14b5dbf [SPARK-23391][CORE] It may lead to overflow for some integer multiplication In `getBlockData`, `blockId.reduceId` is of `Int` type; when it is greater than 2^28, `blockId.reduceId*8` will overflow. In `decompress0`, `len` and `unitSize` are also of `Int` type, so `len * unitSize` may overflow. N/A Author: liuxian <liu.xian3@zte.com.cn> Closes #20581 from 10110346/overflow2. (cherry picked from commit 4a4dd4f36f65410ef5c87f7b61a960373f044e61) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 February 2018, 14:52:39 UTC
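The overflow in a nutshell, as a small sketch: with `Int` arithmetic the product wraps around once `reduceId` exceeds 2^28, while widening one operand to `Long` first keeps it correct.

```scala
// Sketch: Int multiplication wraps; widen to Long before multiplying.
val reduceId: Int = (1 << 28) + 1
val wrong: Int  = reduceId * 8  // overflows: 8 * (2^28 + 1) > Int.MaxValue, result is negative
val right: Long = reduceId * 8L // correct: the multiplication happens in Long
```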
1694834 [SPARK-23376][SQL] creating UnsafeKVExternalSorter with BytesToBytesMap may fail This is a long-standing bug in `UnsafeKVExternalSorter` and was reported in the dev list multiple times. When creating `UnsafeKVExternalSorter` with `BytesToBytesMap`, we need to create a `UnsafeInMemorySorter` to sort the data in `BytesToBytesMap`. The data format of the sorter and the map is same, so no data movement is required. However, both the sorter and the map need a point array for some bookkeeping work. There is an optimization in `UnsafeKVExternalSorter`: reuse the point array between the sorter and the map, to avoid an extra memory allocation. This sounds like a reasonable optimization, the length of the `BytesToBytesMap` point array is at least 4 times larger than the number of keys(to avoid hash collision, the hash table size should be at least 2 times larger than the number of keys, and each key occupies 2 slots). `UnsafeInMemorySorter` needs the pointer array size to be 4 times of the number of entries, so we are safe to reuse the point array. However, the number of keys of the map doesn't equal to the number of entries in the map, because `BytesToBytesMap` supports duplicated keys. This breaks the assumption of the above optimization and we may run out of space when inserting data into the sorter, and hit error ``` java.lang.IllegalStateException: There is no space for new record at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:239) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:149) ... ``` This PR fixes this bug by creating a new point array if the existing one is not big enough. a new test Author: Wenchen Fan <wenchen@databricks.com> Closes #20561 from cloud-fan/bug. (cherry picked from commit 4bbd7443ebb005f81ed6bc39849940ac8db3b3cc) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 February 2018, 16:15:14 UTC
1b4c6ab [SPARK-23186][SQL][BRANCH-2.2] Initialize DriverManager first before loading JDBC Drivers ## What changes were proposed in this pull request? Since some JDBC Drivers have class initialization code to call `DriverManager`, we need to initialize `DriverManager` first in order to avoid potential executor-side **deadlock** situations like the following (or [STORM-2527](https://issues.apache.org/jira/browse/STORM-2527)). ``` Thread 9587: (state = BLOCKED) - sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor, java.lang.Object[]) bci=0 (Compiled frame; information may be imprecise) - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=85, line=62 (Compiled frame) - sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=5, line=45 (Compiled frame) - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) bci=79, line=423 (Compiled frame) - java.lang.Class.newInstance() bci=138, line=442 (Compiled frame) - java.util.ServiceLoader$LazyIterator.nextService() bci=119, line=380 (Interpreted frame) - java.util.ServiceLoader$LazyIterator.next() bci=11, line=404 (Interpreted frame) - java.util.ServiceLoader$1.next() bci=37, line=480 (Interpreted frame) - java.sql.DriverManager$2.run() bci=21, line=603 (Interpreted frame) - java.sql.DriverManager$2.run() bci=1, line=583 (Interpreted frame) - java.security.AccessController.doPrivileged(java.security.PrivilegedAction) bci=0 (Compiled frame) - java.sql.DriverManager.loadInitialDrivers() bci=27, line=583 (Interpreted frame) - java.sql.DriverManager.<clinit>() bci=32, line=101 (Interpreted frame) - org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String, java.lang.Integer, java.lang.String, java.util.Properties) bci=12, line=98 (Interpreted frame) - org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration, java.util.Properties) bci=22, line=57 (Interpreted frame) - org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext, org.apache.hadoop.conf.Configuration) bci=61, line=116 (Interpreted frame) - org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) bci=10, line=71 (Interpreted frame) - org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD, org.apache.spark.Partition, org.apache.spark.TaskContext) bci=233, line=156 (Interpreted frame) Thread 9170: (state = BLOCKED) - org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() bci=35, line=125 (Interpreted frame) - sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor, java.lang.Object[]) bci=0 (Compiled frame) - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=85, line=62 (Compiled frame) - sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) bci=5, line=45 (Compiled frame) - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) bci=79, line=423 (Compiled frame) - java.lang.Class.newInstance() bci=138, line=442 (Compiled frame) - org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String) bci=89, line=46 (Interpreted frame) - org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply() bci=7, line=53 (Interpreted frame) - org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply() 
bci=1, line=52 (Interpreted frame) - org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD, org.apache.spark.Partition, org.apache.spark.TaskContext) bci=81, line=347 (Interpreted frame) - org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition, org.apache.spark.TaskContext) bci=7, line=339 (Interpreted frame) ``` ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #20563 from dongjoon-hyun/SPARK-23186-2. 10 February 2018, 02:29:12 UTC
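A hedged sketch of the workaround (not the exact DriverRegistry code): force `java.sql.DriverManager` class initialization before any JDBC driver class is loaded, so driver static initializers that call back into DriverManager cannot deadlock.

```scala
import java.sql.DriverManager

object DriverRegistrySketch {
  // Touching DriverManager here runs its static initializer before any driver class loads.
  DriverManager.getDrivers()

  def register(className: String): Unit =
    Class.forName(className).newInstance() // the driver registers itself with DriverManager
}
```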
f65e653 [SPARK-23358][CORE] When the number of partitions is greater than 2^28, it will result in an error result ## What changes were proposed in this pull request? In `checkIndexAndDataFile`, `blocks` is of `Int` type; when it is greater than 2^28, `blocks*8` will overflow, and this produces a wrong result. In fact, `blocks` is the number of partitions. ## How was this patch tested? Manual test Author: liuxian <liu.xian3@zte.com.cn> Closes #20544 from 10110346/overflow. (cherry picked from commit f77270b8811bbd8956d0c08fa556265d2c5ee20e) Signed-off-by: Sean Owen <sowen@cloudera.com> 09 February 2018, 14:45:24 UTC
cb73ecd revert the removal of import in SPARK-23281 31 January 2018, 22:01:57 UTC
5273cc7 [SPARK-23281][SQL] Query produces results in incorrect order when a composite order by clause refers to both original columns and aliases ## What changes were proposed in this pull request? Here is the test snippet. ``` SQL scala> Seq[(Integer, Integer)]( | (1, 1), | (1, 3), | (2, 3), | (3, 3), | (4, null), | (5, null) | ).toDF("key", "value").createOrReplaceTempView("src") scala> sql( | """ | |SELECT MAX(value) as value, key as col2 | |FROM src | |GROUP BY key | |ORDER BY value desc, key | """.stripMargin).show +-----+----+ |value|col2| +-----+----+ | 3| 3| | 3| 2| | 3| 1| | null| 5| | null| 4| +-----+----+ ```SQL Here is the explain output : ```SQL == Parsed Logical Plan == 'Sort ['value DESC NULLS LAST, 'key ASC NULLS FIRST], true +- 'Aggregate ['key], ['MAX('value) AS value#9, 'key AS col2#10] +- 'UnresolvedRelation `src` == Analyzed Logical Plan == value: int, col2: int Project [value#9, col2#10] +- Sort [value#9 DESC NULLS LAST, col2#10 DESC NULLS LAST], true +- Aggregate [key#5], [max(value#6) AS value#9, key#5 AS col2#10] +- SubqueryAlias src +- Project [_1#2 AS key#5, _2#3 AS value#6] +- LocalRelation [_1#2, _2#3] ``` SQL The sort direction is being wrongly changed from ASC to DSC while resolving ```Sort``` in resolveAggregateFunctions. The above testcase models TPCDS-Q71 and thus we have the same issue in Q71 as well. ## How was this patch tested? A few tests are added in SQLQuerySuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #20453 from dilipbiswal/local_spark. 31 January 2018, 21:58:46 UTC
0e58fee [DOCS] change to dataset for java code in structured-streaming-kafka-integration document ## What changes were proposed in this pull request? In latest structured-streaming-kafka-integration document, Java code example for Kafka integration is using `DataFrame<Row>`, shouldn't it be changed to `DataSet<Row>`? ## How was this patch tested? manual test has been performed to test the updated example Java code in Spark 2.2.1 with Kafka 1.0 Author: brandonJY <brandonJY@users.noreply.github.com> Closes #20312 from brandonJY/patch-2. (cherry picked from commit 6121e91b7f5c9513d68674e4d5edbc3a4a5fd5fd) Signed-off-by: Sean Owen <sowen@cloudera.com> 19 January 2018, 00:58:06 UTC
d09eecc [SPARK-23095][SQL] Decorrelation of scalar subquery fails with java.util.NoSuchElementException ## What changes were proposed in this pull request? The following SQL involving a scalar correlated query returns a map exception. ``` SQL SELECT t1a FROM t1 WHERE t1a = (SELECT count(*) FROM t2 WHERE t2c = t1c HAVING count(*) >= 1) ``` ``` SQL key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:59) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430) at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426) ``` In this case, after evaluating the HAVING clause "count(*) > 1" statically against the binding of the aggregation result on empty input, we determine that this query will not have the count bug. We should simply return the evalSubqueryOnZeroTups with empty value. ## How was this patch tested? A new test was added in the Subquery bucket. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #20283 from dilipbiswal/scalar-count-defect. (cherry picked from commit 0c2ba427bc7323729e6ffb34f1f06a97f0bf0c1d) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 17 January 2018, 01:58:09 UTC
7022ef8 [SPARK-23038][TEST] Update docker/spark-test (JDK/OS) ## What changes were proposed in this pull request? This PR aims to update the followings in `docker/spark-test`. - JDK7 -> JDK8 Spark 2.2+ supports JDK8 only. - Ubuntu 12.04.5 LTS(precise) -> Ubuntu 16.04.3 LTS(xeniel) The end of life of `precise` was April 28, 2017. ## How was this patch tested? Manual. * Master ``` $ cd external/docker $ ./build $ export SPARK_HOME=... $ docker run -v $SPARK_HOME:/opt/spark spark-test-master CONTAINER_IP=172.17.0.3 ... 18/01/11 06:50:25 INFO MasterWebUI: Bound MasterWebUI to 172.17.0.3, and started at http://172.17.0.3:8080 18/01/11 06:50:25 INFO Utils: Successfully started service on port 6066. 18/01/11 06:50:25 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066 18/01/11 06:50:25 INFO Master: I have been elected leader! New state: ALIVE ``` * Slave ``` $ docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://172.17.0.3:7077 CONTAINER_IP=172.17.0.4 ... 18/01/11 06:51:54 INFO Worker: Successfully registered with master spark://172.17.0.3:7077 ``` After slave starts, master will show ``` 18/01/11 06:51:54 INFO Master: Registering worker 172.17.0.4:8888 with 4 cores, 1024.0 MB RAM ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #20230 from dongjoon-hyun/SPARK-23038. (cherry picked from commit 7a3d0aad2b89aef54f7dd580397302e9ff984d9d) Signed-off-by: Felix Cheung <felixcheung@apache.org> 14 January 2018, 07:30:45 UTC
105ae86 [SPARK-22975][SS] MetricsReporter should not throw exception when there was no progress reported ## What changes were proposed in this pull request? `MetricsReporter ` assumes that there has been some progress for the query, ie. `lastProgress` is not null. If this is not true, as it might happen in particular conditions, a `NullPointerException` can be thrown. The PR checks whether there is a `lastProgress` and if this is not true, it returns a default value for the metrics. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20189 from mgaido91/SPARK-22975. (cherry picked from commit 54277398afbde92a38ba2802f4a7a3e5910533de) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 12 January 2018, 19:25:59 UTC
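A hedged sketch of the guard, written against the public StreamingQuery API rather than the internal MetricsReporter fields: `lastProgress` is null until the first trigger completes, so fall back to a default value.

```scala
import org.apache.spark.sql.streaming.StreamingQuery

// Sketch: guard against a null lastProgress before any progress has been reported.
def inputRowsPerSecond(query: StreamingQuery): Double =
  Option(query.lastProgress).map(_.inputRowsPerSecond).getOrElse(0.0)
```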
20eea20 [SPARK-22982] Remove unsafe asynchronous close() call from FileDownloadChannel ## What changes were proposed in this pull request? This patch fixes a severe asynchronous IO bug in Spark's Netty-based file transfer code. At a high-level, the problem is that an unsafe asynchronous `close()` of a pipe's source channel creates a race condition where file transfer code closes a file descriptor then attempts to read from it. If the closed file descriptor's number has been reused by an `open()` call then this invalid read may cause unrelated file operations to return incorrect results. **One manifestation of this problem is incorrect query results.** For a high-level overview of how file download works, take a look at the control flow in `NettyRpcEnv.openChannel()`: this code creates a pipe to buffer results, then submits an asynchronous stream request to a lower-level TransportClient. The callback passes received data to the sink end of the pipe. The source end of the pipe is passed back to the caller of `openChannel()`. Thus `openChannel()` returns immediately and callers interact with the returned pipe source channel. Because the underlying stream request is asynchronous, errors may occur after `openChannel()` has returned and after that method's caller has started to `read()` from the returned channel. For example, if a client requests an invalid stream from a remote server then the "stream does not exist" error may not be received from the remote server until after `openChannel()` has returned. In order to be able to propagate the "stream does not exist" error to the file-fetching application thread, this code wraps the pipe's source channel in a special `FileDownloadChannel` which adds an `setError(t: Throwable)` method, then calls this `setError()` method in the FileDownloadCallback's `onFailure` method. It is possible for `FileDownloadChannel`'s `read()` and `setError()` methods to be called concurrently from different threads: the `setError()` method is called from within the Netty RPC system's stream callback handlers, while the `read()` methods are called from higher-level application code performing remote stream reads. The problem lies in `setError()`: the existing code closed the wrapped pipe source channel. Because `read()` and `setError()` occur in different threads, this means it is possible for one thread to be calling `source.read()` while another asynchronously calls `source.close()`. Java's IO libraries do not guarantee that this will be safe and, in fact, it's possible for these operations to interleave in such a way that a lower-level `read()` system call occurs right after a `close()` call. In the best-case, this fails as a read of a closed file descriptor; in the worst-case, the file descriptor number has been re-used by an intervening `open()` operation and the read corrupts the result of an unrelated file IO operation being performed by a different thread. The solution here is to remove the `stream.close()` call in `onError()`: the thread that is performing the `read()` calls is responsible for closing the stream in a `finally` block, so there's no need to close it here. If that thread is blocked in a `read()` then it will become unblocked when the sink end of the pipe is closed in `FileDownloadCallback.onFailure()`. After making this change, we also need to refine the `read()` method to always check for a `setError()` result, even if the underlying channel `read()` call has succeeded. 
This patch also makes a slight cleanup to a dodgy-looking `catch e: Exception` block to use a safer `try-finally` error handling idiom. This bug was introduced in SPARK-11956 / #9941 and is present in Spark 1.6.0+. ## How was this patch tested? This fix was tested manually against a workload which non-deterministically hit this bug. Author: Josh Rosen <joshrosen@databricks.com> Closes #20179 from JoshRosen/SPARK-22982-fix-unsafe-async-io-in-file-download-channel. (cherry picked from commit edf0a48c2ec696b92ed6a96dcee6eeb1a046b20b) Signed-off-by: Sean Owen <sowen@cloudera.com> 12 January 2018, 16:39:57 UTC
acab4e7 [SPARK-23001][SQL] Fix NullPointerException when DESC a database with NULL description ## What changes were proposed in this pull request? When users' DB description is NULL, users might hit `NullPointerException`. This PR is to fix the issue. ## How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #20215 from gatorsmile/SPARK-23001. (cherry picked from commit 87c98de8b23f0e978958fc83677fdc4c339b7e6a) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 11 January 2018, 10:18:21 UTC
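A hedged sketch of the defensive pattern (hypothetical helper, not the actual command implementation): wrap the possibly-null description in an Option so DESC DATABASE renders an empty string instead of throwing.

```scala
// Sketch: tolerate a null description coming from the catalog.
def describeDatabase(name: String, description: String): Seq[(String, String)] = Seq(
  "Database Name" -> name,
  "Description"   -> Option(description).getOrElse("")
)
```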
0d943d9 [SPARK-22972] Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.hive.orc ## What changes were proposed in this pull request? Fix the warning: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.hive.orc. This PR is for branch-2.2 and cherry-pick from https://github.com/apache/spark/commit/8032cf852fccd0ab8754f633affdc9ba8fc99e58 The old PR is https://github.com/apache/spark/pull/20165 ## How was this patch tested? Please see test("SPARK-22972: hive orc source") Author: xubo245 <601450868@qq.com> Closes #20195 from xubo245/HiveSerDeForBranch2.2. 10 January 2018, 15:27:45 UTC
24f1f2a [SPARK-22984] Fix incorrect bitmap copying and offset adjustment in GenerateUnsafeRowJoiner ## What changes were proposed in this pull request? This PR fixes a longstanding correctness bug in `GenerateUnsafeRowJoiner`. This class was introduced in https://github.com/apache/spark/pull/7821 (July 2015 / Spark 1.5.0+) and is used to combine pairs of UnsafeRows in TungstenAggregationIterator, CartesianProductExec, and AppendColumns. ### Bugs fixed by this patch 1. **Incorrect combining of null-tracking bitmaps**: when concatenating two UnsafeRows, the implementation "Concatenate the two bitsets together into a single one, taking padding into account". If one row has no columns then it has a bitset size of 0, but the code was incorrectly assuming that if the left row had a non-zero number of fields then the right row would also have at least one field, so it was copying invalid bytes and and treating them as part of the bitset. I'm not sure whether this bug was also present in the original implementation or whether it was introduced in https://github.com/apache/spark/pull/7892 (which fixed another bug in this code). 2. **Incorrect updating of data offsets for null variable-length fields**: after updating the bitsets and copying fixed-length and variable-length data, we need to perform adjustments to the offsets pointing the start of variable length fields's data. The existing code was _conditionally_ adding a fixed offset to correct for the new length of the combined row, but it is unsafe to do this if the variable-length field has a null value: we always represent nulls by storing `0` in the fixed-length slot, but this code was incorrectly incrementing those values. This bug was present since the original version of `GenerateUnsafeRowJoiner`. ### Why this bug remained latent for so long The PR which introduced `GenerateUnsafeRowJoiner` features several randomized tests, including tests of the cases where one side of the join has no fields and where string-valued fields are null. However, the existing assertions were too weak to uncover this bug: - If a null field has a non-zero value in its fixed-length data slot then this will not cause problems for field accesses because the null-tracking bitmap should still be correct and we will not try to use the incorrect offset for anything. - If the null tracking bitmap is corrupted by joining against a row with no fields then the corruption occurs in field numbers past the actual field numbers contained in the row. Thus valid `isNullAt()` calls will not read the incorrectly-set bits. The existing `GenerateUnsafeRowJoinerSuite` tests only exercised `.get()` and `isNullAt()`, but didn't actually check the UnsafeRows for bit-for-bit equality, preventing these bugs from failing assertions. It turns out that there was even a [GenerateUnsafeRowJoinerBitsetSuite](https://github.com/apache/spark/blob/03377d2522776267a07b7d6ae9bddf79a4e0f516/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoinerBitsetSuite.scala) but it looks like it also didn't catch this problem because it only tested the bitsets in an end-to-end fashion by accessing them through the `UnsafeRow` interface instead of actually comparing the bitsets' bytes. ### Impact of these bugs - This bug will cause `equals()` and `hashCode()` to be incorrect for these rows, which will be problematic in case`GenerateUnsafeRowJoiner`'s results are used as join or grouping keys. 
- Chained / repeated invocations of `GenerateUnsafeRowJoiner` may result in reads from invalid null bitmap positions causing fields to incorrectly become NULL (see the end-to-end example below). - It looks like this generally only happens in `CartesianProductExec`, which our query optimizer often avoids executing (usually we try to plan a `BroadcastNestedLoopJoin` instead). ### End-to-end test case demonstrating the problem The following query demonstrates how this bug may result in incorrect query results: ```sql set spark.sql.autoBroadcastJoinThreshold=-1; -- Needed to trigger CartesianProductExec create table a as select * from values 1; create table b as select * from values 2; SELECT t3.col1, t1.col1 FROM a t1 CROSS JOIN b t2 CROSS JOIN b t3 ``` This should return `(2, 1)` but instead was returning `(null, 1)`. Column pruning ends up trimming off all columns from `t2`, so when `t2` joins with another table this triggers the bitmap-copying bug. This incorrect bitmap is subsequently copied again when performing the final join, causing the final output to have an incorrectly-set null bit for the first field. ## How was this patch tested? Strengthened the assertions in existing tests in GenerateUnsafeRowJoinerSuite. Also verified that the end-to-end test case which uncovered this now passes. Author: Josh Rosen <joshrosen@databricks.com> Closes #20181 from JoshRosen/SPARK-22984-fix-generate-unsaferow-joiner-bitmap-bugs. (cherry picked from commit f20131dd35939734fe16b0005a086aa72400893b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 09 January 2018, 04:00:02 UTC
7c30ae3 [SPARK-22983] Don't push filters beneath aggregates with empty grouping expressions ## What changes were proposed in this pull request? The following SQL query should return zero rows, but in Spark it actually returns one row: ``` SELECT 1 from ( SELECT 1 AS z, MIN(a.x) FROM (select 1 as x) a WHERE false ) b where b.z != b.z ``` The problem stems from the `PushDownPredicate` rule: when this rule encounters a filter on top of an Aggregate operator, e.g. `Filter(Agg(...))`, it removes the original filter and adds a new filter onto Aggregate's child, e.g. `Agg(Filter(...))`. This is sometimes okay, but the case above is a counterexample: because there is no explicit `GROUP BY`, we are implicitly computing a global aggregate over the entire table so the original filter was not acting like a `HAVING` clause filtering the number of groups: if we push this filter then it fails to actually reduce the cardinality of the Aggregate output, leading to the wrong answer. In 2016 I fixed a similar problem involving invalid pushdowns of data-independent filters (filters which reference no columns of the filtered relation). There was additional discussion after my fix was merged which pointed out that my patch was an incomplete fix (see #15289), but it looks I must have either misunderstood the comment or forgot to follow up on the additional points raised there. This patch fixes the problem by choosing to never push down filters in cases where there are no grouping expressions. Since there are no grouping keys, the only columns are aggregate columns and we can't push filters defined over aggregate results, so this change won't cause us to miss out on any legitimate pushdown opportunities. ## How was this patch tested? New regression tests in `SQLQueryTestSuite` and `FilterPushdownSuite`. Author: Josh Rosen <joshrosen@databricks.com> Closes #20180 from JoshRosen/SPARK-22983-dont-push-filters-beneath-aggs-with-empty-grouping-expressions. (cherry picked from commit 2c73d2a948bdde798aaf0f87c18846281deb05fd) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 08 January 2018, 08:05:04 UTC
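A hedged sketch of the guard in the shape of a Catalyst rule (simplified; not the real PushDownPredicate): a filter over an Aggregate with no grouping expressions is left where it is, since pushing it down cannot reduce the single global group.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch only: keep filters on top of global aggregates (no grouping keys).
object KeepFilterOverGlobalAggregate extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    case f @ Filter(_, agg: Aggregate) if agg.groupingExpressions.isEmpty =>
      f // pushing the filter below a global aggregate would change the result
  }
}
```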
41f705a [SPARK-22889][SPARKR] Set overwrite=T when install SparkR in tests ## What changes were proposed in this pull request? Since all CRAN checks go through the same machine, if there is an older partial download or partial install of Spark left behind the tests fail. This PR overwrites the install files when running tests. This shouldn't affect Jenkins as `SPARK_HOME` is set when running Jenkins tests. ## How was this patch tested? Test manually by running `R CMD check --as-cran` Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #20060 from shivaram/sparkr-overwrite-cran. (cherry picked from commit 1219d7a4343e837749c56492d382b3c814b97271) Signed-off-by: Felix Cheung <felixcheung@apache.org> 23 December 2017, 18:27:35 UTC
7a97943 [SPARK-20694][EXAMPLES] Update SQLDataSourceExample.scala ## What changes were proposed in this pull request? Create table using the right DataFrame. peopleDF->usersDF peopleDF: +----+-------+ | age| name| +----+-------+ usersDF: +------+--------------+----------------+ | name|favorite_color|favorite_numbers| +------+--------------+----------------+ ## How was this patch tested? Manually tested. Author: CNRui <13266776177@163.com> Closes #20052 from CNRui/patch-2. (cherry picked from commit ea2642eb0ebcf02e8fba727a82c140c9f2284725) Signed-off-by: Sean Owen <sowen@cloudera.com> 23 December 2017, 14:18:16 UTC
1cf3e3a [SPARK-22862] Docs on lazy elimination of columns missing from an encoder This behavior has confused some users, so let's clarify it. Author: Michael Armbrust <michael@databricks.com> Closes #20048 from marmbrus/datasetAsDocs. (cherry picked from commit 8df1da396f64bb7fe76d73cd01498fdf3b8ed964) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 22 December 2017, 05:38:34 UTC
1e4cca0 [SPARK-22817][R] Use fixed testthat version for SparkR tests in AppVeyor ## What changes were proposed in this pull request? `testthat` 2.0.0 is released and AppVeyor now started to use it instead of 1.0.2. And then, we started to have R tests failed in AppVeyor. See - https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1967-master ``` Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : object 'run_tests' not found Calls: ::: -> get ``` This seems because we rely on internal `testthat:::run_tests` here: https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75 https://github.com/apache/spark/blob/dc4c351837879dab26ad8fb471dc51c06832a9e4/R/pkg/tests/run-all.R#L49-L52 However, seems it was removed out from 2.0.0. I tried few other exposed APIs like `test_dir` but I failed to make a good compatible fix. Seems we better fix the `testthat` version first to make the build passed. ## How was this patch tested? Manually tested and AppVeyor tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20003 from HyukjinKwon/SPARK-22817. (cherry picked from commit c2aeddf9eae2f8f72c244a4b16af264362d6cf5d) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 17 December 2017, 05:41:05 UTC
b4f4be3 [SPARK-22574][MESOS][SUBMIT] Check submission request parameters ## What changes were proposed in this pull request? PR closed with all the comments -> https://github.com/apache/spark/pull/19793 It solves the problem when submitting a wrong CreateSubmissionRequest to Spark Dispatcher was causing a bad state of Dispatcher and making it inactive as a Mesos framework. https://issues.apache.org/jira/browse/SPARK-22574 ## How was this patch tested? All spark test passed successfully. It was tested sending a wrong request (without appArgs) before and after the change. The point is easy, check if the value is null before being accessed. This was before the change, leaving the dispatcher inactive: ``` Exception in thread "Thread-22" java.lang.NullPointerException at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) ``` And after: ``` "message" : "Malformed request: org.apache.spark.deploy.rest.SubmitRestProtocolException: Validation of message CreateSubmissionRequest 
failed!\n\torg.apache.spark.deploy.rest.SubmitRestProtocolMessage.validate(SubmitRestProtocolMessage.scala:70)\n\torg.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:272)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:707)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\torg.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)\n\torg.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)\n\torg.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\torg.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\torg.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\torg.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\torg.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\torg.spark_project.jetty.server.Server.handle(Server.java:524)\n\torg.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)\n\torg.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)\n\torg.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\torg.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\torg.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\n\tjava.lang.Thread.run(Thread.java:745)" ``` Author: German Schiavon <germanschiavon@gmail.com> Closes #19966 from Gschiavon/fix-submission-request. (cherry picked from commit 0bdb4e516c425ea7bf941106ac6449b5a0a289e3) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 13 December 2017, 21:37:35 UTC
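A hedged sketch of the idea behind this fix (the field names and types are illustrative, not the actual MesosClusterScheduler or REST server code): reject a malformed submission up front instead of dereferencing a null field later and wedging the dispatcher.

```scala
case class SubmissionRequest(
    appResource: String,
    appArgs: Array[String],
    sparkProperties: Map[String, String])

// Returns an error message for a malformed request instead of letting a later
// null dereference take the dispatcher down.
def validate(req: SubmissionRequest): Either[String, SubmissionRequest] = {
  if (req.appArgs == null) {
    Left("Malformed request: 'appArgs' is missing")
  } else if (req.sparkProperties == null) {
    Left("Malformed request: 'sparkProperties' is missing")
  } else {
    Right(req)
  }
}
```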
0230515 Revert "[SPARK-22574][MESOS][SUBMIT] Check submission request parameters" This reverts commit 728a45e5a68a20bdd17227edc70e6a38d178af1c. 12 December 2017, 21:42:21 UTC
728a45e [SPARK-22574][MESOS][SUBMIT] Check submission request parameters ## What changes were proposed in this pull request? It solves the problem when submitting a wrong CreateSubmissionRequest to Spark Dispatcher was causing a bad state of Dispatcher and making it inactive as a Mesos framework. https://issues.apache.org/jira/browse/SPARK-22574 ## How was this patch tested? All spark test passed successfully. It was tested sending a wrong request (without appArgs) before and after the change. The point is easy, check if the value is null before being accessed. This was before the change, leaving the dispatcher inactive: ``` Exception in thread "Thread-22" java.lang.NullPointerException at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555) at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621) ``` And after: ``` "message" : "Malformed request: org.apache.spark.deploy.rest.SubmitRestProtocolException: Validation of message CreateSubmissionRequest 
failed!\n\torg.apache.spark.deploy.rest.SubmitRestProtocolMessage.validate(SubmitRestProtocolMessage.scala:70)\n\torg.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:272)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:707)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\torg.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)\n\torg.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)\n\torg.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\torg.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\torg.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\torg.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\torg.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\torg.spark_project.jetty.server.Server.handle(Server.java:524)\n\torg.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)\n\torg.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)\n\torg.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\torg.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\torg.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\n\tjava.lang.Thread.run(Thread.java:745)" ``` Author: German Schiavon <germanschiavon@gmail.com> Closes #19793 from Gschiavon/fix-submission-request. (cherry picked from commit 7a51e71355485bb176a1387d99ec430c5986cbec) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com> 12 December 2017, 19:47:11 UTC
00cdb38 [SPARK-22289][ML] Add JSON support for Matrix parameters (LR with coefficients bound) ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-22289 Add JSON encoding/decoding for Param[Matrix]. The issue was reported by Nic Eggert while saving an LR model with LowerBoundsOnCoefficients. As I see it, there are two ways to resolve this: 1. Support save/load on LogisticRegressionParams, and also adjust the save/load in LogisticRegression and LogisticRegressionModel. 2. Directly support Matrix in Param.jsonEncode, similar to what we have done for Vector. After some discussion in the JIRA, we prefer supporting Matrix as a valid Param type, for simplicity and convenience for other classes. Note that in the implementation I added a "class" field in the JSON object to select the right JSON converter when loading, for preciseness and future extension. ## How was this patch tested? New unit tests covering the LR case and JsonMatrixConverter Author: Yuhao Yang <yuhao.yang@intel.com> Closes #19525 from hhbyyh/lrsave. (cherry picked from commit 10c27a6559803797e89c28ced11c1087127b82eb) Signed-off-by: Yanbo Liang <ybliang8@gmail.com> 12 December 2017, 19:27:40 UTC
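For context, this is the kind of usage that triggered the issue; a sketch under assumed values (the bound matrix and save path are illustrative):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Matrices

// Setting a Matrix-typed param; before this fix, persisting an estimator with such a
// param set would fail because Param.jsonEncode had no JSON representation for Matrix.
val lr = new LogisticRegression()
  .setFitIntercept(false)
  .setLowerBoundsOnCoefficients(Matrices.dense(1, 3, Array(0.0, 0.0, 0.0)))

lr.write.overwrite().save("/tmp/lr-with-bounds")
```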
9e2d96d [SPARK-22688][SQL][HOTFIX] Upgrade Janino version to 3.0.8 ## What changes were proposed in this pull request? Hotfix inadvertent change to xmlbuilder dep when updating Janino. See backport of https://github.com/apache/spark/pull/19890 ## How was this patch tested? N/A Author: Sean Owen <sowen@cloudera.com> Closes #19922 from srowen/SPARK-22688.2. 07 December 2017, 23:08:37 UTC
2084675 [SPARK-22688][SQL] Upgrade Janino version to 3.0.8 This PR upgrade Janino version to 3.0.8. [Janino 3.0.8](https://janino-compiler.github.io/janino/changelog.html) includes an important fix to reduce the number of constant pool entries by using 'sipush' java bytecode. * SIPUSH bytecode is not used for short integer constant [#33](https://github.com/janino-compiler/janino/issues/33). Please see detail in [this discussion thread](https://github.com/apache/spark/pull/19518#issuecomment-346674976). Existing tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19890 from kiszk/SPARK-22688. (cherry picked from commit 8ae004b4602266d1f210e4c1564246d590412c06) Signed-off-by: Sean Owen <sowen@cloudera.com> 07 December 2017, 18:52:07 UTC
7fd6d53 [SPARK-22686][SQL] DROP TABLE IF EXISTS should not show AnalysisException ## What changes were proposed in this pull request? While fixing the view resolution issue in [SPARK-22488](https://github.com/apache/spark/pull/19713), a regression was introduced in the `2.2.1` and `master` branches, shown below. This PR fixes it. ```scala scala> spark.version res2: String = 2.2.1 scala> sql("DROP TABLE IF EXISTS t").show 17/12/04 21:01:06 WARN DropTableCommand: org.apache.spark.sql.AnalysisException: Table or view not found: t; org.apache.spark.sql.AnalysisException: Table or view not found: t; ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19888 from dongjoon-hyun/SPARK-22686. (cherry picked from commit 82183f7b57f2a93e646c56a9e37fac64b348ff0b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 06 December 2017, 02:53:07 UTC
5b63000 [SPARK-22162][BRANCH-2.2] Executors and the driver should use consistent JobIDs in the RDD commit protocol I have modified SparkHadoopMapReduceWriter so that executors and the driver always use consistent JobIds during the Hadoop commit. Before SPARK-18191, Spark always used the rddId; it just incorrectly named the variable stageId. After SPARK-18191, it used the rddId as the jobId on the driver's side, and the stageId as the jobId on the executors' side. With this change, executors and the driver consistently use the rddId as the jobId. Also with this change, during the Hadoop commit protocol Spark uses the actual stageId to check whether a stage can be committed, whereas before it used the executors' jobId for this check. In addition to the existing unit tests, a test has been added to check whether executors and the driver are using the same JobId. The test failed before this change and passed after applying this fix. Author: Reza Safi <rezasafi@cloudera.com> Closes #19886 from rezasafi/stagerdd22. 05 December 2017, 17:16:22 UTC
f3f8c87 [SPARK-22635][SQL][ORC] FileNotFoundException while reading ORC files containing special characters ## What changes were proposed in this pull request? SPARK-22146 fixed the FileNotFoundException issue only for the `inferSchema` method, i.e. only for schema inference, but it didn't fix the problem when actually reading the data, so nearly the same exception occurs when someone tries to use the data. This PR fixes the problem there as well. ## How was this patch tested? Enhanced UT Author: Marco Gaido <mgaido@hortonworks.com> Closes #19844 from mgaido91/SPARK-22635. 01 December 2017, 09:18:57 UTC
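An illustrative repro of the reported behaviour (the path is hypothetical and assumes an active `SparkSession` named `spark`):

```scala
// A directory name containing characters that need escaping could pass schema inference
// (after SPARK-22146) but still fail when the data was actually read.
val path = "/tmp/orc data/part=2017%2F12%2F01"
val df = spark.read.format("orc").load(path)
df.show()   // before this fix: FileNotFoundException at read time
```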
ba00bd9 [SPARK-22601][SQL] Data load is reported as successful for a non-existent non-local file path ## What changes were proposed in this pull request? When a user tries to load data from a non-existent HDFS file path, the system does not validate it and the load command reports success, which is misleading to the user. Validation already exists for a non-existent local file path; this PR adds the same validation for a non-existent HDFS file path. ## How was this patch tested? A UT has been added to verify the issue; snapshots were also added after verification in a Spark YARN cluster Author: sujith71955 <sujithchacko.2010@gmail.com> Closes #19823 from sujith71955/master_LoadComand_Issue. (cherry picked from commit 16adaf634bcca3074b448d95e72177eefdf50069) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 01 December 2017, 04:46:46 UTC
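An illustrative repro (the table name and path are hypothetical; assumes an active `SparkSession` named `spark` with Hive support):

```scala
spark.sql("CREATE TABLE IF NOT EXISTS load_target (a STRING)")

// Before this fix: the command reports success even though the path does not exist.
// After this fix: it fails with an error pointing at the invalid path.
spark.sql("LOAD DATA INPATH 'hdfs:///no/such/file.txt' INTO TABLE load_target")
```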
af8a692 [SPARK-22653] executorAddress registered in CoarseGrainedSchedulerBac… https://issues.apache.org/jira/browse/SPARK-22653 executorRef.address can be null; instead, pass the executorAddress, which accounts for it being null a few lines above the fix. Manually tested this patch. You can reproduce the issue by running a simple spark-shell in YARN client mode with dynamic allocation and requesting some executors up front. Let those executors idle-timeout, then take a heap dump. Without this fix, you will see that addressToExecutorId still contains the ids; with the fix, addressToExecutorId is properly cleaned up. Author: Thomas Graves <tgraves@oath.com> Closes #19850 from tgravescs/SPARK-22653. (cherry picked from commit dc365422bb337d19ef39739c7c3cf9e53ec85d09) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 01 December 2017, 02:57:25 UTC
0121ebc [SPARK-22373] Bump Janino dependency version to fix thread safety issue with Janino when compiling generated code. ## What changes were proposed in this pull request? Bump the Janino dependency version to fix a thread safety issue when compiling generated code. ## How was this patch tested? Check https://issues.apache.org/jira/browse/SPARK-22373 for details. Converted part of the code in CodeGenerator into a standalone application, so the issue can be consistently reproduced locally. Verified that changing the Janino dependency version resolved this issue. Author: Min Shen <mshen@linkedin.com> Closes #19839 from Victsm/SPARK-22373. (cherry picked from commit 7da1f5708cc96c18ddb3acd09542621275e71d83) Signed-off-by: Sean Owen <sowen@cloudera.com> 01 December 2017, 01:24:52 UTC
d7b1474 [SPARK-22654][TESTS] Retry Spark tarball download if failed in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? Adds a simple loop to retry download of Spark tarballs from different mirrors if the download fails. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #19851 from srowen/SPARK-22654. (cherry picked from commit 6eb203fae7bbc9940710da40f314b89ffb4dd324) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 30 November 2017, 16:22:06 UTC
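A minimal sketch of the retry idea (the mirror list, URL layout, and fetch helper are illustrative, not the actual HiveExternalCatalogVersionsSuite code):

```scala
// Try each mirror in turn; stop at the first successful download.
def downloadWithRetry(version: String, mirrors: Seq[String])(fetch: String => Boolean): Boolean = {
  mirrors.exists { mirror =>
    val url = s"$mirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
    try fetch(url) catch { case _: Exception => false }
  }
}

// Example wiring; the mirror URLs are placeholders.
val ok = downloadWithRetry("2.2.0", Seq(
  "https://archive.apache.org/dist",
  "https://downloads.apache.org"))(url => { println(s"fetching $url"); true })
```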
38a0532 [SPARK-22637][SQL] Only refresh a logical plan once. ## What changes were proposed in this pull request? `CatalogImpl.refreshTable` uses `foreach(..)` to refresh all tables in a view. This traverses all nodes in the subtree and calls `LogicalPlan.refresh()` on these nodes. However, `LogicalPlan.refresh()` also refreshes its children, so refreshing a large view can be quite expensive. This PR calls `LogicalPlan.refresh()` only on the top node. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #19837 from hvanhovell/SPARK-22637. (cherry picked from commit 475a29f11ef488e7cb19bf7e0696d9d099d77c92) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 29 November 2017, 00:04:02 UTC
eef72d3 [SPARK-22603][SQL] Fix 64KB JVM bytecode limit problem with FormatString ## What changes were proposed in this pull request? This PR changes `FormatString` code generation to place the generated code for the argument expressions into separate methods when it could be large. Variable arguments are passed using an `Object` array. ## How was this patch tested? Added new test cases into `StringExpressionSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19817 from kiszk/SPARK-22603. (cherry picked from commit 2dbe275b2d26035b610ed8385d88e3c9562eaf19) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 27 November 2017, 12:33:05 UTC
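This commit and the other 64KB-limit fixes further down in this list (cast, elt, GenerateUnsafeRowJoiner, concat_ws, concat, In, Coalesce/AtLeastNNonNulls, least/greatest) share one pattern: rather than inlining the code for every argument into a single giant generated method, the generator emits one small helper method per chunk of arguments and calls them in sequence. A hedged sketch of that idea (the method names and chunking are illustrative, not Spark's actual CodegenContext API):

```scala
// Splits a long list of per-argument code snippets into helper methods so that no
// single generated Java method exceeds the JVM's 64KB bytecode limit.
def splitIntoMethods(argCode: Seq[String], chunkSize: Int = 100): (String, Seq[String]) = {
  val helpers = argCode.grouped(chunkSize).zipWithIndex.map { case (chunk, i) =>
    s"""private void evalArgs_$i(InternalRow row, Object[] args) {
       |  ${chunk.mkString("\n  ")}
       |}""".stripMargin
  }.toSeq
  val callSite = helpers.indices.map(i => s"evalArgs_$i(row, args);").mkString("\n")
  (callSite, helpers)   // call sequence to inline, plus the helper method bodies
}
```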
2cd4898 [SPARK-22607][BUILD] Set large stack size consistently for tests to avoid StackOverflowError Set `-ea` and `-Xss4m` consistently for tests, to fix in particular: ``` OrderingSuite: ... - GenerateOrdering with ShortType *** RUN ABORTED *** java.lang.StackOverflowError: at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) ... ``` Existing tests. Manually verified it resolves the StackOverflowError this intends to resolve. Author: Sean Owen <sowen@cloudera.com> Closes #19820 from srowen/SPARK-22607. (cherry picked from commit fba63c1a7bc5c907b909bfd9247b85b36efba469) Signed-off-by: Sean Owen <sowen@cloudera.com> 26 November 2017, 13:45:38 UTC
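For reference, the sbt side of such a change looks roughly like the following (sbt 0.13 syntax; a sketch only, since the actual patch also touches the Maven test plugin configuration):

```scala
// Give test JVMs a deeper stack and enable assertions, so deeply recursive
// Janino flow analysis does not blow the default stack during tests.
javaOptions in Test ++= Seq("-ea", "-Xss4m")
```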
455cea6 Preparing development version 2.2.2-SNAPSHOT 24 November 2017, 21:11:41 UTC
e30e269 Preparing Spark release v2.2.1-rc2 24 November 2017, 21:11:35 UTC
c3b5df2 fix typo 24 November 2017, 20:06:57 UTC
b606cc2 [SPARK-22495] Fix setup of SPARK_HOME variable on Windows ## What changes were proposed in this pull request? This is a cherry-pick of the original PR 19370 onto branch-2.2, as suggested in https://github.com/apache/spark/pull/19370#issuecomment-346526920. It fixes the way `SPARK_HOME` is resolved on Windows. While the previous version worked with the built release download, the set of directories changed slightly for the PySpark `pip` or `conda` install. This had been reflected in the Linux files in `bin` but not in the Windows `cmd` files. The first fix improves how the `jars` directory is found, as this was stopping the Windows version of the `pip/conda` install from working; JARs were not found on Session/Context setup. The second fix adds a `find-spark-home.cmd` script, which uses the `find_spark_home.py` script, like the Linux version, to resolve `SPARK_HOME`. It is based on the `find-spark-home` bash script, though some operations are done in a different order due to `cmd` script language limitations. If the environment variable is already set, the Python script `find_spark_home.py` will not be run. The process can fail if Python is not installed, but this path is mostly used when PySpark is installed via `pip/conda`, in which case Python is present on the system. ## How was this patch tested? Tested on local installation. Author: Jakub Nowacki <j.s.nowacki@gmail.com> Closes #19807 from jsnowacki/fix_spark_cmds_2. 24 November 2017, 20:05:57 UTC
ad57141 [SPARK-22595][SQL] fix flaky test: CastSuite.SPARK-22500: cast for struct should not generate codes beyond 64KB This PR reduces the number of fields in the test case of `CastSuite` to fix an issue that is pointed at [here](https://github.com/apache/spark/pull/19800#issuecomment-346634950). ``` java.lang.OutOfMemoryError: GC overhead limit exceeded java.lang.OutOfMemoryError: GC overhead limit exceeded at org.codehaus.janino.UnitCompiler.findClass(UnitCompiler.java:10971) at org.codehaus.janino.UnitCompiler.findTypeByName(UnitCompiler.java:7607) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:5758) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5732) at org.codehaus.janino.UnitCompiler.access$13200(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5668) at org.codehaus.janino.UnitCompiler$18.visitReferenceType(UnitCompiler.java:5660) at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3356) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5660) at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2892) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2764) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) ... ``` Used existing test case Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19806 from kiszk/SPARK-22595. (cherry picked from commit 554adc77d24c411a6df6d38c596aa33cdf68f3c1) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 November 2017, 11:10:32 UTC
f4c457a [SPARK-22591][SQL] GenerateOrdering shouldn't change CodegenContext.INPUT_ROW ## What changes were proposed in this pull request? While working with codegen on another PR, I found that the value of `CodegenContext.INPUT_ROW` is not reliable. Under whole-stage codegen, it is first assigned null and then suddenly changes to `i`. The reason is that `GenerateOrdering` changes `CodegenContext.INPUT_ROW` but doesn't restore it. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19800 from viirya/SPARK-22591. (cherry picked from commit 62a826f17c549ed93300bdce562db56bddd5d959) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 24 November 2017, 10:47:10 UTC
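A minimal sketch of the shape of such a fix (the context class is a stand-in, not Spark's CodegenContext): save the shared mutable state, switch it for the local code generation, and always restore it afterwards.

```scala
class MockCodegenContext { var INPUT_ROW: String = null }

// Runs `body` with INPUT_ROW temporarily set to `row`, then restores the previous
// value so later code generation still sees what it expects.
def withInputRow[T](ctx: MockCodegenContext, row: String)(body: => T): T = {
  val saved = ctx.INPUT_ROW
  ctx.INPUT_ROW = row
  try body finally ctx.INPUT_ROW = saved
}
```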
f8e73d0 [SPARK-17920][SQL] [FOLLOWUP] Backport PR 19779 to branch-2.2 ## What changes were proposed in this pull request? A followup of https://github.com/apache/spark/pull/19795 , to simplify the file creation. ## How was this patch tested? Only test case is updated Author: vinodkc <vinod.kc.in@gmail.com> Closes #19809 from vinodkc/br_FollowupSPARK-17920_branch-2.2. 24 November 2017, 10:42:47 UTC
b17f406 [SPARK-17920][SPARK-19580][SPARK-19878][SQL] Backport PR 19779 to branch-2.2 - Support writing to Hive table which uses Avro schema url 'avro.schema.url' ## What changes were proposed in this pull request? > Backport https://github.com/apache/spark/pull/19779 to branch-2.2 SPARK-19580 Support for avro.schema.url while writing to hive table SPARK-19878 Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala SPARK-17920 HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url Support writing to Hive table which uses Avro schema url 'avro.schema.url' For ex: create external table avro_in (a string) stored as avro location '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc'); create external table avro_out (a string) stored as avro location '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc'); insert overwrite table avro_out select * from avro_in; // fails with java.lang.NullPointerException WARN AvroSerDe: Encountered exception determining schema. Returning signal schema to indicate problem java.lang.NullPointerException at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174) ## Changes proposed in this fix Currently 'null' value is passed to serializer, which causes NPE during insert operation, instead pass Hadoop configuration object ## How was this patch tested? Added new test case in VersionsSuite Author: vinodkc <vinod.kc.in@gmail.com> Closes #19795 from vinodkc/br_Fix_SPARK-17920_branch-2.2. 22 November 2017, 17:21:26 UTC
df9228b [SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source ## What changes were proposed in this pull request? Let’s say I have the nested AND expression shown below, where p2 cannot be pushed down: (p1 AND p2) OR p3. In the current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only and p2 is simply lost. This issue occurs with the JDBC data source and is similar to [SPARK-12218](https://github.com/apache/spark/pull/10362) for Parquet. When we have an AND nested below another expression, we should either push both legs or nothing. Note that: - The current Spark code always splits conjunctive predicates before it determines whether a predicate can be pushed down - If I have (p1 AND p2) AND p3, it will be split into p1, p2, p3; there won't be a nested AND expression - The current Spark code logic for OR is OK: it either pushes both legs or nothing. The same translation method is also called by Data Source V2. ## How was this patch tested? Added new unit test cases to JDBCSuite gatorsmile Author: Jia Li <jiali@us.ibm.com> Closes #19776 from jliwork/spark-22548. (cherry picked from commit 881c5c807304a305ef96e805d51afbde097f7f4f) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 22 November 2017, 01:30:15 UTC
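A hedged sketch of the corrected translation rule (the predicate ADT is simplified and not the actual org.apache.spark.sql.sources filter API): when an AND sits under another expression, push it only if both legs translate; otherwise drop the whole conjunct.

```scala
sealed trait Pred
case class And(left: Pred, right: Pred) extends Pred
case class Or(left: Pred, right: Pred) extends Pred
case class Leaf(sql: String, pushable: Boolean) extends Pred

// Returns the pushed-down SQL fragment, or None if the predicate cannot be pushed.
def translate(p: Pred): Option[String] = p match {
  case Leaf(sql, pushable) => if (pushable) Some(sql) else None
  case And(l, r) => for (ls <- translate(l); rs <- translate(r)) yield s"($ls AND $rs)"
  case Or(l, r)  => for (ls <- translate(l); rs <- translate(r)) yield s"($ls OR $rs)"
}
```

Because top-level conjuncts are still split before translation, requiring both legs here only affects ANDs nested under other expressions, such as the `(p1 AND p2) OR p3` case above.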
11a599b [SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem with cast This PR changes `cast` code generation to place the generated code for the expressions for the fields of a struct into separate methods when it could be large. Added new test cases into `CastSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19730 from kiszk/SPARK-22500. (cherry picked from commit ac10171bea2fc027d6691393b385b3fc0ef3293d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 November 2017, 21:45:40 UTC
94f9227 [SPARK-22550][SQL] Fix 64KB JVM bytecode limit problem with elt This PR changes `elt` code generation to place the generated code for the argument expressions into separate methods when it could be large. This resolves the case of `elt` with a lot of arguments. Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19778 from kiszk/SPARK-22550. (cherry picked from commit 9bdff0bcd83e730aba8dc1253da24a905ba07ae3) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 November 2017, 11:53:55 UTC
23eb4d7 [SPARK-22508][SQL] Fix 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create() ## What changes were proposed in this pull request? This PR changes `GenerateUnsafeRowJoiner.create()` code generation to place the generated code for the statements that manipulate the bitmap and offsets into separate methods when it could be large. ## How was this patch tested? Added a new test case into `GenerateUnsafeRowJoinerSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19737 from kiszk/SPARK-22508. (cherry picked from commit c9577148069d2215dc79cbf828a378591b4fba5d) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 November 2017, 11:17:06 UTC
ca02575 [SPARK-22549][SQL] Fix 64KB JVM bytecode limit problem with concat_ws ## What changes were proposed in this pull request? This PR changes `concat_ws` code generation to place the generated code for the argument expressions into separate methods when it could be large. This resolves the case of `concat_ws` with a lot of arguments. ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19777 from kiszk/SPARK-22549. (cherry picked from commit 41c6f36018eb086477f21574aacd71616513bd8e) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 21 November 2017, 00:42:23 UTC
710d618 [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat ## What changes were proposed in this pull request? This PR changes `concat` code generation to place the generated code for the argument expressions into separate methods when it could be large. This resolves the case of `concat` with a lot of arguments. ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19728 from kiszk/SPARK-22498. (cherry picked from commit d54bfec2e07f2eb934185402f915558fe27b9312) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 18 November 2017, 18:40:18 UTC
53a6076 [SPARK-22544][SS] FileStreamSource should use its own hadoop conf to call globPathIfNecessary ## What changes were proposed in this pull request? Pass the FileSystem created using the correct Hadoop conf into `globPathIfNecessary` so that it can pick up user's hadoop configurations, such as credentials. ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19771 from zsxwing/fix-file-stream-conf. (cherry picked from commit bf0c0ae2dcc7fd1ce92cd0fb4809bb3d65b2e309) Signed-off-by: Shixiong Zhu <zsxwing@gmail.com> 17 November 2017, 23:35:36 UTC
3bc37e5 [SPARK-22538][ML] SQLTransformer should not unpersist possibly cached input dataset ## What changes were proposed in this pull request? `SQLTransformer.transform` unpersists the input dataset when dropping its temporary view. We should not change the input dataset's cache status. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19772 from viirya/SPARK-22538. (cherry picked from commit fccb337f9d1e44a83cfcc00ce33eae1fad367695) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 17 November 2017, 16:44:05 UTC
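An illustrative check of the fixed behaviour (assumes an active `SparkSession` named `spark`; the data and statement are arbitrary):

```scala
import org.apache.spark.ml.feature.SQLTransformer
import spark.implicits._

val df = Seq((1, 2.0), (2, 3.0)).toDF("id", "v").cache()
df.count()   // materialize the cache

val st = new SQLTransformer().setStatement("SELECT *, v + 1 AS v2 FROM __THIS__")
st.transform(df).show()

// With the fix, the input's cache status is untouched; before it, dropping the
// temporary view could unpersist `df`.
println(df.storageLevel)
```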
ef7ccc1 [SPARK-22540][SQL] Ensure HighlyCompressedMapStatus calculates correct avgSize ## What changes were proposed in this pull request? Ensure HighlyCompressedMapStatus calculates correct avgSize ## How was this patch tested? New unit test added. Author: yucai <yucai.yu@intel.com> Closes #19765 from yucai/avgsize. (cherry picked from commit d00b55d4b25ba0bf92983ff1bb47d8528e943737) Signed-off-by: Sean Owen <sowen@cloudera.com> 17 November 2017, 13:54:08 UTC
be68f86 [SPARK-22535][PYSPARK] Sleep before killing the python worker in PythonRunner.MonitorThread (branch-2.2) ## What changes were proposed in this pull request? Backport #19762 to 2.2 ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19768 from zsxwing/SPARK-22535-2.2. 16 November 2017, 22:41:05 UTC
0b51fd3 [SPARK-22501][SQL] Fix 64KB JVM bytecode limit problem with in ## What changes were proposed in this pull request? This PR changes `In` code generation to place the generated code for the argument expressions into separate methods when it could be large. ## How was this patch tested? Added new test cases into `PredicateSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19733 from kiszk/SPARK-22501. (cherry picked from commit 7f2e62ee6b9d1f32772a18d626fb9fd907aa7733) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 November 2017, 17:25:01 UTC
52b05b6 [SPARK-22494][SQL] Fix 64KB limit exception with Coalesce and AtLeastNNonNulls ## What changes were proposed in this pull request? Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception when used with a lot of arguments and/or complex expressions. This PR splits their expressions in order to avoid the issue. ## How was this patch tested? Added UTs Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19720 from mgaido91/SPARK-22494. (cherry picked from commit 4e7f07e2550fa995cc37406173a937033135cf3b) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 November 2017, 17:19:26 UTC
17ba7b9 [SPARK-22499][SQL] Fix 64KB JVM bytecode limit problem with least and greatest ## What changes were proposed in this pull request? This PR changes `least` and `greatest` code generation to place the generated code for the argument expressions into separate methods when it could be large. This PR resolves two cases: * `least` with a lot of arguments * `greatest` with a lot of arguments ## How was this patch tested? Added a new test case into `ArithmeticExpressionsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19729 from kiszk/SPARK-22499. (cherry picked from commit ed885e7a6504c439ffb6730e6963efbd050d43dd) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 November 2017, 16:56:33 UTC
b17ba35 [SPARK-22479][SQL][BRANCH-2.2] Exclude credentials from SaveIntoDataSourceCommand.simpleString ## What changes were proposed in this pull request? Do not include JDBC properties, which may contain credentials, when logging a logical plan that contains `SaveIntoDataSourceCommand`. ## How was this patch tested? New tests Author: osatici <osatici@palantir.com> Closes #19761 from onursatici/os/redact-jdbc-creds-2.2. 16 November 2017, 11:49:29 UTC
3ae187b [SPARK-22469][SQL] Accuracy problem in comparison with string and numeric This fixes a problem caused by #15880: `select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive.` When comparing a string and a numeric value, cast both to double, as Hive does. Author: liutang123 <liutang123@yeah.net> Closes #19692 from liutang123/SPARK-22469. (cherry picked from commit bc0848b4c1ab84ccef047363a70fd11df240dbbf) Signed-off-by: Wenchen Fan <wenchen@databricks.com> 16 November 2017, 11:47:22 UTC
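Illustrative of the change in behaviour (assumes an active `SparkSession` named `spark`):

```scala
// Both operands of a string/numeric comparison are now cast to double, matching Hive.
spark.sql("SELECT '1.5' > 0.5").show()   // true after this fix; NULL before it
```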
3cefdde [SPARK-22490][DOC] Add PySpark doc for SparkSession.builder ## What changes were proposed in this pull request? In PySpark API Document, [SparkSession.build](http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html) is not documented and shows default value description. ``` SparkSession.builder = <pyspark.sql.session.Builder object ... ``` This PR adds the doc. ![screen](https://user-images.githubusercontent.com/9700541/32705514-1bdcafaa-c7ca-11e7-88bf-05566fea42de.png) The following is the diff of the generated result. ``` $ diff old.html new.html 95a96,101 > <dl class="attribute"> > <dt id="pyspark.sql.SparkSession.builder"> > <code class="descname">builder</code><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt> > <dd><p>A class attribute having a <a class="reference internal" href="#pyspark.sql.SparkSession.Builder" title="pyspark.sql.SparkSession.Builder"><code class="xref py py-class docutils literal"><span class="pre">Builder</span></code></a> to construct <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal"><span class="pre">SparkSession</span></code></a> instances</p> > </dd></dl> > 212,216d217 < <dt id="pyspark.sql.SparkSession.builder"> < <code class="descname">builder</code><em class="property"> = &lt;pyspark.sql.session.SparkSession.Builder object&gt;</em><a class="headerlink" href="#pyspark.sql.SparkSession.builder" title="Permalink to this definition">¶</a></dt> < <dd></dd></dl> < < <dl class="attribute"> ``` ## How was this patch tested? Manual. ``` cd python/docs make html open _build/html/pyspark.sql.html ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19726 from dongjoon-hyun/SPARK-22490. (cherry picked from commit aa88b8dbbb7e71b282f31ae775140c783e83b4d6) Signed-off-by: gatorsmile <gatorsmile@gmail.com> 15 November 2017, 17:00:39 UTC
210f292 [SPARK-22511][BUILD] Update maven central repo address Use repo.maven.apache.org repo address; use latest ASF parent POM version 18 Existing tests; no functional change Author: Sean Owen <sowen@cloudera.com> Closes #19742 from srowen/SPARK-22511. (cherry picked from commit b009722591d2635698233c84f6e7e6cde7177019) Signed-off-by: Sean Owen <sowen@cloudera.com> 14 November 2017, 23:59:37 UTC
3ea6fd0 [SPARK-22377][BUILD] Use /usr/sbin/lsof if lsof does not exist in release-build.sh ## What changes were proposed in this pull request? This PR proposes to use `/usr/sbin/lsof` if `lsof` is missing in the path, to fix nightly snapshot jenkins jobs. Please refer to https://github.com/apache/spark/pull/19359#issuecomment-340139557: > Looks like some of the snapshot builds are having lsof issues: > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console > >https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.2-maven-snapshots/134/console > >spark-build/dev/create-release/release-build.sh: line 344: lsof: command not found >usage: kill [ -s signal | -p ] [ -a ] pid ... >kill -l [ signal ] To my knowledge, the full path of `lsof` is required for non-root users on a few OSs. ## How was this patch tested? Manually tested as below: ```bash #!/usr/bin/env bash LSOF=lsof if ! hash $LSOF 2>/dev/null; then echo "a" LSOF=/usr/sbin/lsof fi $LSOF -P | grep "a" ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #19695 from HyukjinKwon/SPARK-22377. (cherry picked from commit c8b7f97b8a58bf4a9f6e3a07dd6e5b0f646d8d99) Signed-off-by: hyukjinkwon <gurwls223@gmail.com> 13 November 2017, 23:28:28 UTC
d905e85 [SPARK-22471][SQL] SQLListener consumes much memory causing OutOfMemoryError ## What changes were proposed in this pull request? This PR addresses the issue [SPARK-22471](https://issues.apache.org/jira/browse/SPARK-22471). The modified version of `SQLListener` respects the setting `spark.ui.retainedStages` and keeps the number of tracked stages within the specified limit. The hash map `_stageIdToStageMetrics` does not outgrow the limit, so overall memory consumption no longer grows over time. This is a 2.2-compatible fix; it may be incompatible with 2.3 due to #19681. ## How was this patch tested? A new unit test covers this fix - see `SQLListenerMemorySuite.scala`. Author: Arseniy Tashoyan <tashoyan@gmail.com> Closes #19711 from tashoyan/SPARK-22471-branch-2.2. 13 November 2017, 21:50:12 UTC
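For reference, the setting the patched listener honours is the standard Spark UI limit; an illustrative way to set it (the application name and value are examples only):

```scala
import org.apache.spark.sql.SparkSession

// Caps how many stages the listener keeps, bounding SQLListener memory on long-running apps.
val spark = SparkSession.builder()
  .appName("bounded-ui-history")
  .config("spark.ui.retainedStages", "500")
  .getOrCreate()
```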
af0b185 Preparing development version 2.2.2-SNAPSHOT 13 November 2017, 19:04:34 UTC